Page 287 - Ai Book - 10
P. 287

u   Term Frequency
         u   Inverse Document Frequency
        Let us understand each term one by one in detail.
        Term Frequency

        Term frequency is a technical term used for the frequency of a word in one document. It can be easily found
        from the document vector table.

            Subin       and        Sohan         are       friends      went         to        school       park
              1           1           1           1           1           0           0           0           0
              1           0           0           0           0           1           1           1           0
              0           0           1           0           0           1           1           0           1
        In this table, we have mentioned the number ‘1’ for the words that exist in the document and the number ‘0’ for
        unmatched words. These numbers are nothing but the term frequencies.

        Inverse Document Frequency

        Inverse  Document Frequency  is a measure  of the  rareness  of a term.  Conceptually,  we  start  by  measuring
        document frequency.
        Document Frequency is the number of documents in which the word occurs irrespective of how many times it
        has occurred in those documents. The document frequency for the above table would be:
        Document Frequency:

            Subin       and        Sohan         are       friends      went         to        school       park
              2           1           2           1           1           2           2           1           1
        To compute inverse document frequency, we divide the total number of documents by the document frequency
        as shown below:

        Inverse document frequency:
            Subin       and        Sohan         are       friends      went         to        school       park
             3/2         3/1         3/2         3/1         3/1         3/2         3/2         3/1         3/1

        Finally, the formula of TFIDF for any word W becomes: TFIDF(W) = TF(W) * log( IDF(W) )

        Where, W=words    TF=Term frequency     IDF=Inverse document frequency
        Here, log is to the base of 10. Use log function in the calculator to calculate the log values.
        Now, let’s multiply the IDF values to the TF values. Note that the TF values are for each document while the IDF
        values are for the whole corpus. Hence, we need to multiply the IDF values to each row of the document vector
        table.

            Subin       and        Sohan         are       friends      went         to        school       park
         1*log(3/2) 1*log(3/1) 1*log(3/2) 1*log(3/1) 1*log(3/1) 0*log(3/2) 0*log(3/2) 0*log(3/1) 0*log(3/1)
         1*log(3/2) 0*log(3/1) 0*log(3/2) 0*log(3/1) 0*log(3/1) 1*log(3/2) 1*log(3/2) 1*log(3/1) 0*log(3/1)
         0*log(3/2) 0*log(3/1) 1*log(3/2) 0*log(3/1) 0*log(3/1) 1*log(3/2) 1*log(3/2) 0*log(3/1) 1*log(3/1)

        After calculating values, the table look like as follows:
         Subin       and         Sohan       are         friends     went        To          school      park
         0.176       0.477       0.176       0.477       0.477       0           0           0           0


                                                                                                             161
                                                                                                             161
   282   283   284   285   286   287   288   289   290   291   292