
Step II: Create Dictionary

            In this step, we make a list of all the words that occur across the three documents:

                     Subin      and       Sohan       are     friends     went        to      school      park

            While creating this list, you should always remember that a word repeated in different documents
            must be written only once, as the list contains only unique words.
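
            The following Python sketch shows one way to build such a dictionary; the tokenised documents are
            reconstructed from the tables in this section.

```python
# A minimal sketch of Step II, assuming the three tokenised documents
# reconstructed from the tables in this section.
doc1 = ["Subin", "and", "Sohan", "are", "friends"]
doc2 = ["Subin", "went", "to", "school"]
doc3 = ["Sohan", "went", "to", "park"]

dictionary = []
for doc in (doc1, doc2, doc3):
    for word in doc:
        # A repeated word is added only once, so the list stays unique.
        if word not in dictionary:
            dictionary.append(word)

print(dictionary)
# ['Subin', 'and', 'Sohan', 'are', 'friends', 'went', 'to', 'school', 'park']
```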
            Step III: Create Document Vector

            In this step, we create a document vector by writing the vocabulary and the frequency of each word in the
            document, as shown below for Document 1:

                     Subin      and       Sohan       are     friends     went        to       school     park
                       1         1          1          1         1          0         0          0          0
            Here, the value ‘1’ is entered against the words that occur in Document 1 and ‘0’ against the words that do not.
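
            A minimal Python sketch of this step, again assuming the tokenised documents from the tables above:

```python
# A minimal sketch of Step III: count how often each dictionary word
# appears in Document 1 (0 when the word is absent).
doc1 = ["Subin", "and", "Sohan", "are", "friends"]
dictionary = ["Subin", "and", "Sohan", "are", "friends",
              "went", "to", "school", "park"]

vector_1 = [doc1.count(word) for word in dictionary]
print(vector_1)   # [1, 1, 1, 1, 1, 0, 0, 0, 0]
```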

            Step IV: Create Document Vector for all the Documents
            In this step, we create a document vector for each of the documents:

                    Subin       and      Sohan        are     friends     went        to       school     park
                       1         1          1          1         1          0          0         0          0
                       1         0          0          0         0          1          1         1          0
                       0         0          1          0         0          1          1         0          1
            Here, you can see that the header row contains the vocabulary of the corpus and the three rows correspond to
            the three different documents. This is the final document vector table for our corpus. However, the tokens have
            still not been converted into numbers that capture their importance. This leads us to the final step of our
            algorithm: TFIDF, a technique to extract important and relevant information from the corpus.
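
            A minimal Python sketch that reproduces the whole table:

```python
# A minimal sketch of Step IV: one frequency vector per document,
# giving the complete document vector table for the corpus.
corpus = [
    ["Subin", "and", "Sohan", "are", "friends"],
    ["Subin", "went", "to", "school"],
    ["Sohan", "went", "to", "park"],
]
dictionary = ["Subin", "and", "Sohan", "are", "friends",
              "went", "to", "school", "park"]

table = [[doc.count(word) for word in dictionary] for doc in corpus]
for row in table:
    print(row)
# [1, 1, 1, 1, 1, 0, 0, 0, 0]
# [1, 0, 0, 0, 0, 1, 1, 1, 0]
# [0, 0, 1, 0, 0, 1, 1, 0, 1]
```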
            TFIDF: Term Frequency & Inverse Document Frequency

            In the previous section, you learnt that the frequency of words in each document can be determined
            through an algorithm called ‘Bag of Words’. It suggests that the more often a word occurs in a document,
            the more valuable it is for that document. Let us understand this with the help of a simple example.
            Suppose we have a text document on the topic “Ozone Layer Depletion” in which the word ‘ozone’ is
            important and occurs many times. This word is valuable as it gives us some context about the document. On
            the other hand, suppose we have 10 documents and each of them talks about a different issue such as gender
            bias, poverty, unemployment and so on.
            In both cases, some words like a, an, the, this, is, it, etc. occur the most in almost all the documents. But these
            words are not valuable and must be removed at the text normalisation stage.
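
            A minimal sketch of stop word removal at the text normalisation stage; the stop word list here is a small
            illustrative sample, not a complete or standard one.

```python
# A minimal sketch of stop word removal; the set below is only an
# illustrative sample of common stop words.
stop_words = {"a", "an", "the", "this", "is", "it", "and", "are", "to"}

tokens = ["the", "ozone", "layer", "is", "depleting", "over", "the", "poles"]
filtered = [word for word in tokens if word not in stop_words]
print(filtered)   # ['ozone', 'layer', 'depleting', 'over', 'poles']
```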
            You should always remember the following points while determining whether a word is valuable or not:

             •   Stop words have negligible value, as these words have the highest occurrence in all the documents of a
                corpus.
             •   Frequent words occur fairly often in the corpus and mostly talk about the document’s subject.
             •   Rare or valuable words occur the least in the documents but add a lot of value to the corpus.
            In short, TFIDF (Term Frequency and Inverse Document Frequency) is a statistical measure used to identify the
            value of each word. As its name implies, it is made up of two terms: Term Frequency (TF) and Inverse Document
            Frequency (IDF).
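
            As a preview, here is a minimal Python sketch of one common formulation of TFIDF (term frequency
            multiplied by the logarithm of the total number of documents divided by the number of documents containing
            the word); the exact formula developed later in the text may differ in details such as the base of the logarithm.

```python
import math

# A minimal sketch of one common TFIDF formulation (an assumption,
# not necessarily the exact formula used later in this book):
#   TF(w, d)  = number of times w occurs in document d
#   IDF(w)    = log10(N / number of documents containing w)
#   TFIDF     = TF * IDF
corpus = [
    ["Subin", "and", "Sohan", "are", "friends"],
    ["Subin", "went", "to", "school"],
    ["Sohan", "went", "to", "park"],
]
N = len(corpus)

def tfidf(word, doc):
    tf = doc.count(word)
    df = sum(1 for d in corpus if word in d)
    return tf * math.log10(N / df)

# 'Subin' appears in two of the three documents, so it scores low;
# 'friends' appears in only one document, so it scores higher there.
print(round(tfidf("Subin", corpus[0]), 3))    # 0.176
print(round(tfidf("friends", corpus[0]), 3))  # 0.477
```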

