Matrix representation of the document vectors:
There are three methods for expressing a text as a vector: Binary Frequency, Term Frequency, and Term Frequency–Inverse Document Frequency (Tf-Idf) weighting (Pilavcılar, 2007).
Binary Frequency
This method creates an indicator vector that records whether each word in the keyword dictionary occurs in a text. For example, let {inflation, the flu, the referee, medicine, fans, agricultural} be the keyword dictionary for a set of texts D1 to D6. Then we could obtain vectors such as the following (Pilavcılar, 2007).
D1 = (0,1,0,1,0,0)
D2 = (0,0,0,1,0,0)
D3 = (1,0,0,0,0,0)
D4 = (0,0,0,0,0,1)
D5 = (0,0,1,0,0,0)
D6 = (0,0,0,1,1,0)
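The indicator vector described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `binary_vector`, the sample sentence, and the use of simple case-insensitive substring matching are assumptions made here, not part of the cited method.

```python
# Keyword dictionary from the example; its order fixes the vector positions.
KEYWORDS = ["inflation", "the flu", "the referee", "medicine", "fans", "agricultural"]


def binary_vector(text, keywords=KEYWORDS):
    """Return 1 for each keyword that occurs in the text, otherwise 0.

    Assumption: plain case-insensitive substring matching, no stemming
    or tokenization.
    """
    lowered = text.lower()
    return [1 if kw in lowered else 0 for kw in keywords]


# Hypothetical text, invented for illustration only.
sample = "The flu spread quickly, and the flu medicine ran out."
print(binary_vector(sample))
```

With this sample sentence, only the positions for "the flu" and "medicine" are set, which has the same shape as D1 in the example.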
Term Frequency
This method is based on how many times a word is used in a text. For the same texts, we could obtain vectors such as the following (Pilavcılar, 2007).
D1 = (0,2,0,1,0,0)
D2 = (0,0,0,1,0,0)
D3 = (1,0,0,0,0,0)
D4 = (0,0,0,0,0,2)
D5 = (0,0,2,0,0,0)
D6 = (0,0,0,1,2,0)
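Counting occurrences instead of recording presence gives the term frequency vector. The sketch below is a rough illustration under stated assumptions: the function name `term_frequency_vector`, the sample sentence, and the case-insensitive substring counting are all hypothetical choices, not taken from the cited source.

```python
# Keyword dictionary from the example; its order fixes the vector positions.
KEYWORDS = ["inflation", "the flu", "the referee", "medicine", "fans", "agricultural"]


def term_frequency_vector(text, keywords=KEYWORDS):
    """Count how many times each keyword occurs in the text.

    Assumption: case-insensitive, non-overlapping substring counts
    via str.count, with no stemming or tokenization.
    """
    lowered = text.lower()
    return [lowered.count(kw) for kw in keywords]


# Hypothetical text, invented for illustration only.
sample = "The flu spread quickly, and the flu medicine ran out."
print(term_frequency_vector(sample))
```

Here "the flu" occurs twice and "medicine" once, so the result has the same shape as D1 in the term frequency example.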
Term Frequency–Inverse Document Frequency (Tf-Idf) Weighting Method
The weight of a term reflects its importance in a text. If a word occurs frequently in a text, it can be said that the word is significant for that text. Since term frequency (tf) is an indicator of importance, we can use it as the term weight (Salton & Singhal, 1995). In a similar way, the Inverse Document Frequency (idf) can be used as a weight. Sometimes a rare word in a text may be more important than a word with high frequency. For example, the term fish might not be very useful when we deal with a collection of articles on the economic sources of a Mediterranean country; however, it might be very important when we deal with a collection of articles about fossils found in an Egyptian desert. Therefore, even if a word does not occur often, it may be considered a decisive word (Ilhan, Duru, Karagöz & Sagir, 2008). In information retrieval and text mining, tf–idf weights are often used to evaluate how significant a word is to a text (Karen, 1972).
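The text does not give a formula for the weights, so the following sketch assumes one common variant: the weight of a term in a text is tf × log(N / df), where N is the number of texts and df is the number of texts containing the term. The function name `tf_idf` is hypothetical; the input rows are the D1 to D6 vectors from the Term Frequency example.

```python
import math

# Term frequency vectors D1..D6 from the Term Frequency example.
DOCS = [
    [0, 2, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 2],
    [0, 0, 2, 0, 0, 0],
    [0, 0, 0, 1, 2, 0],
]


def tf_idf(docs):
    """Weight each term count by log(N / df), an assumed idf variant."""
    n_docs = len(docs)
    n_terms = len(docs[0])
    # Document frequency: number of texts in which each term occurs.
    df = [sum(1 for d in docs if d[j] > 0) for j in range(n_terms)]
    # Terms that occur in no text get idf 0 to avoid division by zero.
    idf = [math.log(n_docs / f) if f else 0.0 for f in df]
    return [[tf * w for tf, w in zip(d, idf)] for d in docs]


weights = tf_idf(DOCS)
for i, row in enumerate(weights, start=1):
    print(f"D{i}", [round(w, 3) for w in row])
```

Under this assumed formula, "medicine" occurs in three of the six texts and so receives a lower idf than "the flu", which occurs in only one. Other tf-idf variants (smoothing, different log bases, normalization) would give different numbers.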
1- Pilavcılar, İ.F. (2007). "Text Mining and Text Classification". MA thesis, Yildiz Technical University, FBE.
2- Salton, G. & Singhal, A. (1995). Automatic Text Browsing Using Vector Space Model. Department of Computer Science, Cornell University, May 1995.
3- Ilhan, S., Duru, N., Karagöz, S., & Sagir, M. (2008). Text Mining with the Question Answering System. Faculty of Engineering, Department of Computer Engineering, Kocaeli University, T.R.