Page 285 - Ai Book - 10
‘Binary’ = 2,
‘code’ = 3,
‘is’ = 2,
‘an’ = 1,
‘a’ = 1,
‘efficient’ = 1,
‘language’ = 1,
‘used’ = 1,
‘in’ = 1,
‘digital’ = 1,
‘computers’ = 1
Thus, the BoW algorithm creates a vocabulary of all the unique words occurring across all the documents
in the training set.
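As a minimal sketch of this idea, the Python snippet below counts how often each word occurs across a small training set. The two sample sentences and the function name `build_vocabulary` are illustrative assumptions, not taken from the text above.

```python
from collections import Counter

def build_vocabulary(documents):
    """Count how often each word occurs across all documents."""
    counts = Counter()
    for text in documents:
        # Plain whitespace tokenisation; real text needs proper normalisation.
        counts.update(text.split())
    return counts

# Hypothetical training set, for illustration only.
docs = ["code is fun", "binary code is compact"]
vocab = build_vocabulary(docs)
print(dict(vocab))
# prints {'code': 2, 'is': 2, 'fun': 1, 'binary': 1, 'compact': 1}
```

The keys of `vocab` form the vocabulary, and the values are the word frequencies.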
Knowledge Bot
Bag of Words (BoW) is so named because it is analogous to a bag containing all the words in a text.
Bag of Words Algorithm
To implement the Bag of Words algorithm, follow the given steps:
Step I: Text Normalisation
The whole corpus is segmented into tokens, and stopwords are removed.
Step II: Create Dictionary
Make a list of all unique words occurring in the corpus.
Step III: Create Vectors for Each Document
Generate a vector for a document by counting how many times each word in the dictionary occurs in that
document.
Step IV: Create Document Vectors for All the Documents
Repeat Step III until every document in the corpus has its own vector.
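The four steps can be sketched in Python as follows. This is only one possible arrangement: the function names `normalise` and `bag_of_words`, the choice to lowercase tokens, and the two-sentence corpus are assumptions made for illustration.

```python
def normalise(text, stopwords=frozenset()):
    # Step I: split into tokens, strip surrounding punctuation, lowercase
    # (lowercasing is an assumption here), and drop any stopwords.
    tokens = [word.strip(".,!?;:").lower() for word in text.split()]
    return [t for t in tokens if t and t not in stopwords]

def bag_of_words(corpus, stopwords=frozenset()):
    tokenised = [normalise(doc, stopwords) for doc in corpus]

    # Step II: dictionary of unique words, in order of first appearance.
    dictionary = []
    for tokens in tokenised:
        for token in tokens:
            if token not in dictionary:
                dictionary.append(token)

    # Steps III and IV: one frequency vector per document.
    vectors = [[tokens.count(word) for word in dictionary]
               for tokens in tokenised]
    return dictionary, vectors

# Hypothetical corpus, for illustration only.
corpus = ["The cat sat.", "The cat saw the dog."]
dictionary, vectors = bag_of_words(corpus)
print(dictionary)  # ['the', 'cat', 'sat', 'saw', 'dog']
print(vectors)     # [[1, 1, 1, 0, 0], [2, 1, 0, 1, 1]]
```

Each vector has one entry per dictionary word, so all document vectors have the same length regardless of how long the document is.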
Example:
Let us understand the steps of the BoW algorithm with the help of the example below:
Step I: Text Normalisation
Suppose a corpus contains the following three documents:
Document 1: Subin and Sohan are friends.
Document 2: Subin went to school.
Document 3: Sohan went to park.
Each of the three documents contains one sentence. After text normalisation, the text becomes:
Document 1: [‘Subin’, ‘and’, ‘Sohan’, ‘are’, ‘friends’]
Document 2: [‘Subin’, ‘went’, ‘to’, ‘school’]
Document 3: [‘Sohan’, ‘went’, ‘to’, ‘park’]
Note that no tokens were removed in the stopword removal step, because the corpus is very small and the
frequencies of the words are almost the same.
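The tokenisation in Step I can be reproduced with a short Python sketch. It assumes that normalisation here means only removing punctuation and splitting on spaces, with case preserved and no stopwords removed, to match the token lists shown above. The helper name `tokenise` is hypothetical.

```python
import string

documents = [
    "Subin and Sohan are friends.",
    "Subin went to school.",
    "Sohan went to park.",
]

def tokenise(sentence):
    # Delete punctuation characters, then split on whitespace.
    # Case is kept as-is, and no stopwords are removed.
    cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

tokens = [tokenise(doc) for doc in documents]
for number, doc_tokens in enumerate(tokens, start=1):
    print(f"Document {number}: {doc_tokens}")
# Document 1: ['Subin', 'and', 'Sohan', 'are', 'friends']
# Document 2: ['Subin', 'went', 'to', 'school']
# Document 3: ['Sohan', 'went', 'to', 'park']
```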