
              ‘Binary’ = 2,
              ‘code’ = 3,
              ‘is’ = 2,
              ‘an’ = 1,
              ‘a’ = 1,
              ‘efficient’ = 1,
              ‘language’ = 1,
              ‘used’ = 1,
              ‘in’ = 1,
              ‘digital’ = 1,
              ‘computers’ = 1

        Thus, we can say that the ‘BoW’ algorithm creates a vocabulary of all the unique words occurring in all the
        documents in the training set.
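
        A minimal Python sketch of this vocabulary-building step is given below. The sample documents reuse the
        example discussed later in this section, and the variable names are illustrative only:

            documents = [
                "Subin and Sohan are friends",
                "Subin went to school",
                "Sohan went to park",
            ]

            vocabulary = []
            for doc in documents:
                for token in doc.split():
                    if token not in vocabulary:   # keep each unique word only once
                        vocabulary.append(token)

            print(vocabulary)
            # ['Subin', 'and', 'Sohan', 'are', 'friends', 'went', 'to', 'school', 'park']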

              Knowledge Bot
          Bag of Words (BoW) is so named because it is analogous to a bag containing all the words in a text.

        Bag of Words Algorithm

        To implement the Bag of Words algorithm, follow the given steps (a Python sketch of the whole procedure
        appears after this list):
          Step I:  Text Normalisation
                   Segment the whole corpus into tokens and remove the stopwords.
          Step II:  Create Dictionary
                   Make a list of all the unique words occurring in the corpus.
          Step III:  Create Vectors for each document
                   Generate a vector for each document in the corpus by determining the frequency of each dictionary
                   word in that document.
          Step IV:  Create document vectors for all the documents
                   The last step is to repeat this process so that every document in the corpus has its own vector.
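
        The four steps can be sketched together in Python as shown below. This is a minimal illustration under our
        own naming, not a standard library API, and the stopword list is passed in by hand:

            # Minimal Bag of Words sketch; all names are illustrative.
            def normalise(document, stopwords):
                # Step I: segment the document into tokens and drop stopwords.
                return [t for t in document.split() if t not in stopwords]

            def bag_of_words(documents, stopwords=()):
                tokenised = [normalise(doc, stopwords) for doc in documents]

                # Step II: dictionary of all unique words in the corpus.
                dictionary = []
                for tokens in tokenised:
                    for token in tokens:
                        if token not in dictionary:
                            dictionary.append(token)

                # Steps III and IV: one word-frequency vector per document.
                vectors = [[tokens.count(word) for word in dictionary]
                           for tokens in tokenised]
                return dictionary, vectors

            dictionary, vectors = bag_of_words([
                "Subin and Sohan are friends",
                "Subin went to school",
                "Sohan went to park",
            ])
            print(dictionary)   # the corpus vocabulary
            print(vectors)      # one frequency vector per document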
        Example:
        Let us understand all the steps of the BoW algorithm with the help of the example given below:
        Step I: Text Normalisation

        Suppose a corpus contains three documents:

              Document 1: Subin and Sohan are friends.
              Document 2: Subin went to school.
              Document 3: Sohan went to park.
        Here, the three documents have one sentence each. After text normalisation, the text becomes:
              Document 1: [‘Subin’, ‘and’, ‘Sohan’, ‘are’, ‘friends’]
              Document 2: [‘Subin’, ‘went’, ‘to’, ‘school’]
              Document 3: [‘Sohan’, ‘went’, ‘to’, ‘park’]
        Here, you can see that no tokens have been removed in the stopword removal step because we have very little
        data and the frequency of each word is almost the same.
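
        The Step I output shown above can be reproduced with a simple split after stripping the full stop from each
        sentence. This is only a sketch; a fuller normalisation would usually also lowercase the tokens:

            documents = [
                "Subin and Sohan are friends.",
                "Subin went to school.",
                "Sohan went to park.",
            ]

            stopwords = []   # left empty here, so no tokens are removed
            for doc in documents:
                tokens = [t for t in doc.rstrip('.').split() if t not in stopwords]
                print(tokens)

            # ['Subin', 'and', 'Sohan', 'are', 'friends']
            # ['Subin', 'went', 'to', 'school']
            # ['Sohan', 'went', 'to', 'park']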
