
              ‘Binary’ = 2,
              ‘code’ = 3,
              ‘is’ = 2,
              ‘an’ = 1,
              ‘a’ = 1,
              ‘efficient’ = 1,
              ‘language’ = 1,
              ‘used’ = 1,
              ‘in’ = 1,
              ‘digital’ = 1,
              ‘computers’ = 1

        Thus, we can say that the ‘BoW’ algorithm creates a vocabulary of all the unique words occurring in all the
        documents in the training set.
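
        A minimal Python sketch of this vocabulary-building step is given below. The sample documents reuse the
        example discussed later in this section, and the variable names are illustrative only:

            documents = [
                "Subin and Sohan are friends",
                "Subin went to school",
                "Sohan went to park",
            ]

            vocabulary = []
            for doc in documents:
                for token in doc.split():
                    if token not in vocabulary:   # keep each unique word only once
                        vocabulary.append(token)

            print(vocabulary)
            # ['Subin', 'and', 'Sohan', 'are', 'friends', 'went', 'to', 'school', 'park']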

              Knowledge Bot
          Bag of Words (BoW) is so named because it is analogous to a bag containing all the words in a text.

        Bag of Words Algorithm

        To implement the Bag of Words algorithm, follow the given steps (a Python sketch of the whole procedure
        appears after this list):
          Step I:  Text Normalisation
                   Segment the whole corpus into tokens and remove the stopwords.
          Step II:  Create Dictionary
                   Make a list of all the unique words occurring in the corpus.
          Step III:  Create Vectors for each document
                   Generate a vector for each document in the corpus by determining the frequency of each dictionary
                   word in that document.
          Step IV:  Create document vectors for all the documents
                   The last step is to repeat this process so that every document in the corpus has its own vector.
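
        The four steps can be sketched together in Python as shown below. This is a minimal illustration under our
        own naming, not a standard library API, and the stopword list is passed in by hand:

            # Minimal Bag of Words sketch; all names are illustrative.
            def normalise(document, stopwords):
                # Step I: segment the document into tokens and drop stopwords.
                return [t for t in document.split() if t not in stopwords]

            def bag_of_words(documents, stopwords=()):
                tokenised = [normalise(doc, stopwords) for doc in documents]

                # Step II: dictionary of all unique words in the corpus.
                dictionary = []
                for tokens in tokenised:
                    for token in tokens:
                        if token not in dictionary:
                            dictionary.append(token)

                # Steps III and IV: one word-frequency vector per document.
                vectors = [[tokens.count(word) for word in dictionary]
                           for tokens in tokenised]
                return dictionary, vectors

            dictionary, vectors = bag_of_words([
                "Subin and Sohan are friends",
                "Subin went to school",
                "Sohan went to park",
            ])
            print(dictionary)   # the corpus vocabulary
            print(vectors)      # one frequency vector per document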
        Example:
        Let us understand all the steps of the BoW algorithm with the help of the example given below:
        Step I: Text Normalisation

        Suppose a corpus contains three documents:

              Document 1: Subin and Sohan are friends.
              Document 2: Subin went to school.
              Document 3: Sohan went to park.
        Here, the three documents have one sentence each. After text normalisation, the text becomes:
              Document 1: [‘Subin’, ‘and’, ‘Sohan’, ‘are’, ‘friends’]
              Document 2: [‘Subin’, ‘went’, ‘to’, ‘school’]
              Document 3: [‘Sohan’, ‘went’, ‘to’, ‘park’]
        Here, you can see that no tokens have been removed in the stopword removal step because we have very little
        data and the frequency of each word is almost the same.
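
        The Step I output shown above can be reproduced with a simple split after stripping the full stop from each
        sentence. This is only a sketch; a fuller normalisation would usually also lowercase the tokens:

            documents = [
                "Subin and Sohan are friends.",
                "Subin went to school.",
                "Sohan went to park.",
            ]

            stopwords = []   # left empty here, so no tokens are removed
            for doc in documents:
                tokens = [t for t in doc.rstrip('.').split() if t not in stopwords]
                print(tokens)

            # ['Subin', 'and', 'Sohan', 'are', 'friends']
            # ['Subin', 'went', 'to', 'school']
            # ['Sohan', 'went', 'to', 'park']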
