Analytics and Visualization of Big Data: Vector Creating From A Document Set

Matrix representation of the document vectors:

There are three different methods to express a text as a vector: Binary Frequency, Term Frequency, and the Term Frequency–Inverse Document Frequency (Tf-Idf) Weighting Method (Pilavcılar, 2007).

Binary Frequency

This method creates an indicator vector that records whether each word in the keyword dictionary occurs in a text or not. For example, let {inflation, the flu, the referee, medicine, fans, agricultural} be the keyword dictionary and let D1, ..., D6 be six texts. Then the texts could be represented by vectors such as the following (Pilavcılar, 2007).

D1 = (0,1,0,1,0,0)

D2 = (0,0,0,1,0,0)

D3 = (1,0,0,0,0,0)

D4 = (0,0,0,0,0,1)

D5 = (0,0,1,0,0,0)

D6 = (0,0,0,1,1,0)
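
As a minimal sketch of how such an indicator vector could be built, the Python snippet below checks each dictionary term against the words of one text. The function name binary_vector and the sample token list are hypothetical and chosen only for illustration; the dictionary terms are also simplified to single words (for example "flu" instead of "the flu"), which is an assumption and not part of the original example.

```python
def binary_vector(tokens, dictionary):
    """Return 1 for each dictionary term that occurs in tokens, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in dictionary]


# Dictionary from the example, simplified to single-word terms (assumption).
dictionary = ["inflation", "flu", "referee", "medicine", "fans", "agricultural"]

# Hypothetical tokenized text that mentions "flu" twice and "medicine" once.
d1_tokens = ["flu", "medicine", "flu"]

d1_binary = binary_vector(d1_tokens, dictionary)
print(d1_binary)
```

With this sample input the result has a 1 in the positions for "flu" and "medicine" only, no matter how often each word occurs.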

Term Frequency

This method is based on how many times each word is used in a text. For the same texts and the same keyword dictionary, the vectors could look as follows (Pilavcılar, 2007).

D1 = (0,2,0,1,0,0)

D2 = (0,0,0,1,0,0)

D3 = (1,0,0,0,0,0)

D4 = (0,0,0,0,0,2)

D5 = (0,0,2,0,0,0)

D6 = (0,0,0,1,2,0)
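
The counting step can be sketched in Python with collections.Counter. As before, the function name term_frequency_vector and the sample tokens are hypothetical, and the dictionary terms are simplified to single words as an assumption.

```python
from collections import Counter


def term_frequency_vector(tokens, dictionary):
    """Return how many times each dictionary term occurs in tokens."""
    counts = Counter(tokens)  # missing terms count as 0
    return [counts[term] for term in dictionary]


# Dictionary from the example, simplified to single-word terms (assumption).
dictionary = ["inflation", "flu", "referee", "medicine", "fans", "agricultural"]

# Hypothetical tokenized text that mentions "flu" twice and "medicine" once.
d1_tokens = ["flu", "medicine", "flu"]

d1_tf = term_frequency_vector(d1_tokens, dictionary)
print(d1_tf)
```

Unlike the binary vector, the entry for "flu" is now 2 because the word occurs twice in the sample text.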

Term Frequency–Inverse Document Frequency (Tf-Idf) Weighting Method

A term's weight reflects its importance in a text. If a word occurs frequently in a text, then it can be said that the word is significant for the text. Since term frequency (tf) is an importance indicator, we can use this measure as the term weight (Salton & Singhal, 1995). In a similar way, the Inverse Document Frequency (idf) can be used as a weight. Sometimes a rare word in a text may be more important than a word with high frequency. For example, the term fish might not be very useful when we deal with a collection of articles on Economic Sources of a Mediterranean Country, but it might be very important when we deal with a collection of articles about the fossils found in an Egyptian desert. Therefore, even if a word does not occur very often, it may be considered a decisive word (Ilhan, Duru, Karagöz & Sagir, 2008). In information retrieval and text mining, the tf–idf weights are often used to evaluate how significant a word is to a text (Spärck Jones, 1972).
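
The post does not give an explicit formula, so the sketch below assumes one common variant: the weight of a term in a text is tf × log(N / df), where N is the number of texts and df is the number of texts that contain the term. The function name tf_idf_matrix is hypothetical, and the input rows are the term frequency vectors D1 to D6 from the example above.

```python
import math


def tf_idf_matrix(tf_rows):
    """Weight each count by log(N / df); this variant is an assumption."""
    n_docs = len(tf_rows)
    n_terms = len(tf_rows[0])
    # Document frequency: number of texts in which each term occurs.
    df = [sum(1 for row in tf_rows if row[j] > 0) for j in range(n_terms)]
    # Terms that occur in no text get an idf of 0 to avoid division by zero.
    idf = [math.log(n_docs / d) if d > 0 else 0.0 for d in df]
    return [[tf * w for tf, w in zip(row, idf)] for row in tf_rows]


# Term frequency vectors D1..D6 from the example.
tf_rows = [
    [0, 2, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 2],
    [0, 0, 2, 0, 0, 0],
    [0, 0, 0, 1, 2, 0],
]

weights = tf_idf_matrix(tf_rows)
for row in weights:
    print([round(w, 3) for w in row])
```

Under this assumption, the fourth term (medicine) occurs in three of the six texts and so receives a smaller idf than terms that occur in only one text, which is the effect the paragraph above describes.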

1- Pilavcılar, İ. F. (Yildiz Technical Univ. FBE, MA). (2007). Thesis: "Text Mining and Text Classification".

2- Salton, G. & Singhal, A. (Department of Computer Science, Cornell University). (1995). Automatic Text Browsing using Vector Space Model, May 1995.

3- Ilhan, S., Duru, N., Karagöz, S., & Sagir, M. (Faculty of Engineering, Department of Computer Engineering, Kocaeli University, T.R.). (2008). Text Mining with the Question Answering System.

4- Spärck Jones, K. (1972). "A statistical interpretation of term specificity and its application in retrieval".

Saturday, March 9, 2013
