The various steps used to normalise textual data are:
Sentence Segmentation/Sentence Tokenization
Sentence segmentation, also known as sentence tokenization, is the process of splitting the whole corpus into individual sentences. Each sentence is then broken down into smaller units known as tokens. Examples of tokens are words, characters, numbers, symbols or n-grams.
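As an illustration, the sketch below shows how a corpus might be split first into sentences and then into word tokens. It uses the NLTK library (chosen here only for illustration; this chapter does not prescribe any particular tool), assumes its "punkt" tokenizer data has already been downloaded, and runs on a made-up corpus.

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time setup (assumption: the 'punkt' tokenizer data is available):
# nltk.download("punkt")

# A made-up corpus used only for illustration
corpus = "Raj is a good boy. He plays cricket. He also likes reading."

sentences = sent_tokenize(corpus)                # split the corpus into sentences
tokens = [word_tokenize(s) for s in sentences]   # split each sentence into word tokens

print(sentences)   # ['Raj is a good boy.', 'He plays cricket.', 'He also likes reading.']
print(tokens[0])   # ['Raj', 'is', 'a', 'good', 'boy', '.']
```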
Elimination of Stopwords, Special Characters and Numbers
Once the whole corpus has been segmented into sentences and then into tokens, not every token is useful. In this step, the unnecessary tokens, namely stopwords, special characters, symbols and numbers, are removed from the token list.
Stopwords are words (and conventional symbols) that occur very frequently in the corpus but do not add any value to it. Humans use grammar to make their sentences meaningful, but grammatical words such as articles, prepositions, connectors, etc. do not add to the meaning that the statement is meant to convey; hence they come under the category of stopwords. Some examples of stopwords are: a, an, the, and, is, of, to, in, etc.
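A minimal sketch of this elimination step is given below, again using NLTK as an assumed library with its "stopwords" and "punkt" data already downloaded; the sample sentence and the printed output are illustrative only.

```python
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time setup (assumption: the data files are available):
# nltk.download("stopwords"); nltk.download("punkt")

tokens = word_tokenize("The weather is nice, so we went out to play in the park!")
stop_words = set(stopwords.words("english"))

# Keep only tokens that are not stopwords, punctuation/special characters or numbers
cleaned = [t for t in tokens
           if t.lower() not in stop_words
           and t not in string.punctuation
           and not t.isdigit()]

print(cleaned)   # ['weather', 'nice', 'went', 'play', 'park']
```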
Changing Letter Case
After eliminating stopwords from the tokens, we convert the whole text to lowercase to remove the issue of case-sensitivity, i.e. so that the machine does not treat the same word as two different words merely because it appears in different cases.
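Lowercasing needs no special library; the short sketch below uses Python's built-in str.lower() on a made-up token list.

```python
# An illustrative token list containing the same word in different cases
tokens = ["Machine", "Learning", "makes", "MACHINES", "learn"]

# str.lower() maps every token to lowercase, so case differences disappear
lowercased = [t.lower() for t in tokens]

print(lowercased)   # ['machine', 'learning', 'makes', 'machines', 'learn']
```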
• Stemming: Stemming is an elementary rule-based process that removes the affixes of words. For example, laughing, laughed, laughs and laugh all become laugh after the stemming process, as shown in the sketch below.
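A brief sketch of stemming using NLTK's PorterStemmer (one common rule-based stemmer, chosen here for illustration), applied to the example words from the text:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["laughing", "laughed", "laughs", "laugh"]

# The stemmer strips affixes by rule, so all four forms reduce to the same stem
stems = [stemmer.stem(w) for w in words]

print(stems)   # ['laugh', 'laugh', 'laugh', 'laugh']
```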