
The various steps used to normalise textual data are:

        Sentence Segmentation / Sentence Tokenization

        Sentence segmentation, also known as sentence tokenization, is the process of splitting the whole corpus
        into sentences, which are then broken down into smaller units known as tokens. Tokens can be words,
        characters, numbers, symbols, or n-grams.
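
        The sketch below shows how this step could look in Python, assuming the NLTK library is installed and its
        'punkt' tokenizer models have been downloaded; the sample corpus is made up purely for illustration.

        import nltk

        # One-time download of the sentence tokenizer models, if not already present:
        # nltk.download('punkt')

        corpus = "Humans communicate through language. Machines need the text in pieces!"

        # Sentence segmentation: split the corpus into individual sentences.
        sentences = nltk.sent_tokenize(corpus)
        print(sentences)

        # Tokenization: split each sentence into word-level tokens.
        tokens = [nltk.word_tokenize(s) for s in sentences]
        print(tokens)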
        Elimination of Stopwords, Special Characters and Numbers

        Once the whole corpus has been segmented into sentences and split into tokens, only some of those tokens
        are actually needed. In this step, the unnecessary tokens, namely stopwords, special characters and
        symbols, are removed from the token list.
        Stopwords and conventional symbols occur very frequently in the corpus but do not add any value to it.
        Humans use grammar to make their sentences meaningful, but grammatical words such as articles,
        prepositions, connectors, etc. do not add to the meaning that the statement is meant to convey; hence
        they fall under the category of stopwords. Some common examples of stopwords are: a, an, the, and, but,
        is, of, to, etc.
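
        A minimal sketch of this step in Python, assuming NLTK and its 'stopwords' corpus are available; the sample
        sentence is illustrative only.

        from nltk.corpus import stopwords        # requires nltk.download('stopwords') once
        from nltk.tokenize import word_tokenize

        sentence = "The sun rises in the east and sets in the west!"
        stop_words = set(stopwords.words('english'))

        tokens = word_tokenize(sentence)

        # Keep only alphabetic tokens (this drops numbers, symbols and special
        # characters) that are not in the stopword list.
        filtered = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]
        print(filtered)    # ['sun', 'rises', 'east', 'sets', 'west']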
        Changing Letter Case

        After eliminating stopwords from the token list, the letters of the whole text are converted to lowercase.
        This removes the issue of case-sensitivity, i.e. the machine no longer treats the same word as two
        different words just because of a difference in letter case.
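
        A small illustration of this step, using a made-up token list:

        tokens = ['Traffic', 'jam', 'Heavy', 'traffic']
        lowered = [t.lower() for t in tokens]
        print(lowered)    # ['traffic', 'jam', 'heavy', 'traffic']
        # 'Traffic' and 'traffic' are now recognised as the same word.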

         •   Stemming: Stemming is an elementary rule-based process that removes the affixes of words. For example,
             laughing, laughed, laughs and laugh will all become laugh after the stemming process, as illustrated
             in the sketch below.
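
             A short sketch of stemming using NLTK's Porter stemmer (one possible rule-based stemmer; others
             behave similarly):

             from nltk.stem import PorterStemmer

             stemmer = PorterStemmer()
             words = ['laughing', 'laughed', 'laughs', 'laugh']
             print([stemmer.stem(w) for w in words])    # ['laugh', 'laugh', 'laugh', 'laugh']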
