Thursday, February 28, 2013

Feature Selection


Feature selection is applied to eliminate insignificant and irrelevant features from texts and to create subsets consisting of significant features. The existing features are thereby simplified and the dimensionality of the texts is reduced; in other words, the features are transformed into a lower-dimensional space, which can significantly improve comprehensibility (Feldman & Sanger, 2007). In short, feature selection aims to produce clearer data that are easier to analyze and to reveal important but hidden patterns (Kim, Street, & Menczer, 2003).

There are many filters for evaluating and eliminating features. To perform the filtering, a measure of feature relevance, such as document frequency, must be computed. Typical filters are based on interclass distance, statistical dependence, or information-theoretic measures. Some of these filters are very aggressive and may remove 90-99% of all features (Feldman & Sanger, 2007).

The document frequency of a word is the number of documents in the collection in which that word appears. The researcher chooses a frequency threshold, and words whose document frequency falls below it are removed. This method rests on the idea that words appearing in fewer documents than a certain threshold lack the power to discriminate between categories. There are also other useful measures of feature relevance that take the relations between features and categories into account, for example information gain and chi-square (Feldman & Sanger, 2007). A small illustration of both kinds of filter follows.
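To make this concrete, here is a minimal sketch in Python, assuming scikit-learn is available. The toy corpus, the labels, the min_df threshold, and k are illustrative choices, not values taken from the sources cited above.

# A minimal sketch of two feature-selection filters, assuming
# scikit-learn; the corpus, labels, and thresholds are illustrative.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

corpus = [
    "cheap replica watches buy now",
    "meeting agenda attached for review",
    "buy cheap pills online now",
    "please review the attached report",
]
labels = np.array([1, 0, 1, 0])  # 1 = spam, 0 = not spam

# Step 1: document-frequency filter. min_df=2 keeps only terms that
# occur in at least two documents; rarer terms are assumed to lack
# the power to discriminate between categories.
vectorizer = CountVectorizer(min_df=2)
X = vectorizer.fit_transform(corpus)
print("after DF filter:", vectorizer.get_feature_names_out())

# Step 2: a category-aware filter. Chi-square scores each surviving
# term against the class labels and keeps the k most dependent ones.
selector = SelectKBest(chi2, k=3)
X_selected = selector.fit_transform(X, labels)
kept = selector.get_support()
print("after chi-square:", vectorizer.get_feature_names_out()[kept])

In practice the threshold (min_df) and the number of retained features (k) are tuned to the collection at hand; as noted above, an aggressive setting can remove the large majority of the original features.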



1- Feldman, R., & Sanger, J. (2007). The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.
2- Kim, Y. S., Street, W. N., & Menczer, F. (2003). Feature Selection in Data Mining. University of Iowa.
