Basics of Natural Language Processing : These are the steps to apply before processing or working with any text data
- Tokenization : Process of separating a piece of text into smaller units called tokens
- Stemming : Process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form
- Lemmatization : In contrast to stemming, lemmatization looks beyond word reduction, and considers a language's full vocabulary to apply a morphological analysis to words.
- Removing Stop words : Stop words are a collection of very common words (for example "the", "is", "of") that do not add much meaning to a sentence. For many tasks these words can be ignored without sacrificing the meaning of the sentence.
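
As a rough illustration of the first and last steps above, here is a minimal sketch using only the Python standard library. The `tokenize` and `remove_stop_words` helpers and the tiny `STOP_WORDS` set are hypothetical names made up for this example; real projects usually rely on an NLP library with a much longer stop-word list.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "on", "in", "and", "to"}

def tokenize(text):
    """Split text into lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("The cat is sitting on the mat.")
filtered = remove_stop_words(tokens)
print(tokens)    # ['the', 'cat', 'is', 'sitting', 'on', 'the', 'mat']
print(filtered)  # ['cat', 'sitting', 'mat']
```

The regular expression is a simplification: it drops punctuation and does not handle things like hyphenated words or non-English scripts.
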
Stemming simply strips the last few characters of a word, which often produces incorrect meanings or spellings. Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma. The same word can have different lemmas depending on its context (for example, its part of speech).
- Lemmatizing the word 'Caring' returns 'Care', but a crude suffix-stripping stemmer would return 'Car'.
- Lemmatizing the word 'Stripes' in a verb context returns 'Strip', and in a noun context returns 'Stripe', whereas a crude stemmer would just return 'Strip' in both cases. The exact output depends on the stemming algorithm used.
- Lemmatization is computationally more expensive than stemming, because it needs vocabulary lookups and part-of-speech information.
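
The contrast in the examples above can be sketched in a few lines of plain Python. This is a toy, not a real algorithm: `naive_stem` just chops a fixed list of suffixes, and `lemmatize` looks words up in a tiny hand-written `LEMMAS` table keyed by word and part of speech. All of these names and the table contents are assumptions made for illustration; library stemmers and lemmatizers are more careful and may give different results.

```python
# Suffixes tried in order by the toy stemmer (an assumption for this sketch).
SUFFIXES = ("ing", "es", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary: (word, part of speech) -> lemma.
LEMMAS = {
    ("caring", "v"): "care",
    ("stripes", "v"): "strip",
    ("stripes", "n"): "stripe",
}

def lemmatize(word, pos):
    """Look up the lemma for a word in a given part of speech ('n' or 'v')."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)  # fall back to the word itself

print(naive_stem("Caring"))       # car
print(lemmatize("Caring", "v"))   # care
print(naive_stem("Stripes"))      # strip
print(lemmatize("Stripes", "v"))  # strip
print(lemmatize("Stripes", "n"))  # stripe
```

The stemmer needs no context and no dictionary, which is why it is cheap but error-prone; the lemmatizer needs both the part of speech and a vocabulary, which is where the extra cost comes from.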