Saturday, March 9, 2013

Creating Vectors from a Document Set


Matrix representation of the document vectors:




There are three different methods to express a text as a vector: Binary Frequency, Term Frequency, and Term Frequency–Inverse Document Frequency (Tf-Idf) weighting (Pilavcılar, 2007).
Binary Frequency
This method creates an indicator vector that records whether each word of a keyword dictionary occurs in a text or not. For example, let {inflation, the flu, the referee, medicine, fans, agricultural} be the keyword dictionary and let D1, ..., D6 be six texts. Their vectors might then look as follows (Pilavcılar, 2007).
D1 = (0,1,0,1,0,0)
D2 = (0,0,0,1,0,0)
D3 = (1,0,0,0,0,0)
D4 = (0,0,0,0,0,1)
D5 = (0,0,1,0,0,0)
D6 = (0,0,0,1,1,0)
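
The indicator vector can be sketched in a few lines of Python. This is only an illustration: the sample sentence is invented, the dictionary drops the articles ("the flu" becomes "flu"), and the names `VOCAB`, `tokenize` and `binary_vector` are hypothetical helpers, not taken from the cited thesis.

```python
import re

# Keyword dictionary from the example, with articles dropped (an assumption).
VOCAB = ["inflation", "flu", "referee", "medicine", "fans", "agricultural"]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def binary_vector(text, vocab=VOCAB):
    """Return 1 for each keyword that occurs in the text, else 0."""
    tokens = set(tokenize(text))
    return [1 if word in tokens else 0 for word in vocab]

# Hypothetical text, invented for illustration only.
sample = "The flu season is here, and flu medicine is selling fast."
print(binary_vector(sample))  # [0, 1, 0, 1, 0, 0]
```

The word "flu" occurs twice in the sample, but the binary vector only records that it is present.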
Term Frequency
This method counts how many times each keyword occurs in a text. For the same texts, the vectors might look as follows (Pilavcılar, 2007).

D1 = (0,2,0,1,0,0)
D2 = (0,0,0,1,0,0)
D3 = (1,0,0,0,0,0)
D4 = (0,0,0,0,0,2)
D5 = (0,0,2,0,0,0)
D6 = (0,0,0,1,2,0)
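
Counting occurrences instead of recording presence needs only a small change. The sketch below makes the same assumptions as the text: a keyword dictionary with the articles dropped and an invented sample sentence. The names `VOCAB` and `term_frequency_vector` are hypothetical.

```python
import re
from collections import Counter

# Keyword dictionary from the example, with articles dropped (an assumption).
VOCAB = ["inflation", "flu", "referee", "medicine", "fans", "agricultural"]

def term_frequency_vector(text, vocab=VOCAB):
    """Count how many times each keyword occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    # A Counter returns 0 for words that never occurred.
    return [counts[word] for word in vocab]

# Hypothetical text, invented for illustration only.
sample = "The flu season is here, and flu medicine is selling fast."
print(term_frequency_vector(sample))  # [0, 2, 0, 1, 0, 0]
```

Here "flu" contributes 2 rather than 1, which is the only difference from the binary vector of the same sentence.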
Term Frequency–Inverse Document Frequency (Tf-Idf) Weighting Method
A term's weight reflects its importance in a text. If a word occurs frequently in a text, it can be considered significant for that text. Since term frequency (tf) is an indicator of importance, it can be used as the term weight (Salton & Singhal, 1995). In a similar way, the inverse document frequency (idf) can be used as a weight. Sometimes a rare word may be more important than a word with a high frequency. For example, the term fish might not be very useful when we deal with a collection of articles on the economic resources of a Mediterranean country, but it might be very important when we deal with a collection of articles about fossils found in an Egyptian desert. Therefore, even a word that does not occur often may be decisive (Ilhan, Duru, Karagöz & Sagir, 2008). In information retrieval and text mining, tf–idf weights are often used to evaluate how significant a word is to a text (Spärck Jones, 1972).
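
As a rough illustration, the weighting can be applied to the term-frequency vectors D1 to D6 listed earlier. The post does not state an exact formula, so the sketch assumes one common variant: weight = tf × log(N / df), where N is the number of texts and df is the number of texts that contain the term. The function name `tfidf` is hypothetical.

```python
import math

# Term-frequency vectors D1..D6 from the Term Frequency example.
TF = [
    [0, 2, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 2],
    [0, 0, 2, 0, 0, 0],
    [0, 0, 0, 1, 2, 0],
]

def tfidf(tf_matrix):
    """Weight each count by log(N / df); this idf variant is an assumption."""
    n_docs = len(tf_matrix)
    n_terms = len(tf_matrix[0])
    # Document frequency: number of texts in which each term occurs.
    df = [sum(1 for row in tf_matrix if row[j] > 0) for j in range(n_terms)]
    # Terms that occur in no text get idf 0 to avoid division by zero.
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[tf * w for tf, w in zip(row, idf)] for row in tf_matrix]

weights = tfidf(TF)
print([round(w, 3) for w in weights[0]])
```

Under this assumed formula, the fourth keyword (medicine) occurs in three of the six texts and is weighted by log 2, while every other keyword occurs in a single text and is weighted by log 6. In D1 the weight of "the flu" therefore comes out several times larger than the weight of "medicine", even though the raw counts are 2 and 1.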



1- Pilavcılar, İ. F. (Yildiz Technical Univ. FBE, MA). (2007). Thesis: "Text Mining and Text Classification".
2- Salton, G. & Singhal, A. (Department of Computer Science, Cornell University). (1995). Automatic Text Browsing using Vector Space Model, May 1995.
3- Ilhan, S., Duru, N., Karagöz, S., & Sagir, M. (Faculty of Engineering, Department of Computer Engineering, Kocaeli University, T.R.). (2008). Text Mining with the Question Answering System.
