Feature selection is applied to eliminate insignificant and irrelevant features from texts and to create subsets consisting of the significant ones. In this way the existing features are simplified and the dimensionality of the texts is reduced; in other words, the features are transformed into a lower-dimensional space, which can significantly improve comprehensibility (Feldman & Sanger, 2007). In short, feature selection aims to create data that are more suitable and clearer, so that they can be analyzed easily and important but hidden points become visible (Kim, Street, & Menczer, 2003).
There are many filters for evaluating and eliminating features. To perform the filtering, the relevance of each feature is calculated as a measure, such as its document frequency. Typical examples of filters are interclass distance, statistical dependence, and information-theoretic measures. Some of them are very strong and may remove almost 90-99% of all features (Feldman & Sanger, 2007).
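As a concrete illustration, here is a minimal sketch of a document-frequency filter in Python. The toy corpus and the min_df threshold are purely illustrative assumptions, not taken from the sources cited here; the measure itself is explained in more detail below.

    from collections import Counter

    # Toy corpus of pre-tokenized documents (illustrative only).
    corpus = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "barked", "at", "the", "cat"],
        ["stocks", "fell", "as", "the", "market", "closed"],
    ]

    # Document frequency: the number of documents containing each word.
    df = Counter(word for doc in corpus for word in set(doc))

    # Keep only the words that appear in at least min_df documents;
    # everything below the threshold is filtered out.
    min_df = 2
    selected = {word for word, count in df.items() if count >= min_df}
    print(selected)  # {'the', 'cat'}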
The document frequency of a word is the number of documents in which that word appears. A researcher chooses a frequency threshold, and the words whose frequencies fall below it are removed. This method rests on the idea that words appearing less often than a certain threshold have no decisive role in, or ability to help with, identifying categories. There are also other useful measures of feature relevance that take the relations between features and categories into account, for example information gain and chi-square (Feldman & Sanger, 2007).
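For those category-aware measures, the following sketch uses scikit-learn (a library choice of mine, not prescribed by the sources); the corpus, the labels, and the value of k are illustrative assumptions.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    # Illustrative two-category corpus (0 = finance, 1 = sports).
    docs = [
        "the market fell sharply today",
        "stocks closed lower on weak earnings",
        "the team won the final match",
        "the striker scored twice in the game",
    ]
    labels = [0, 0, 1, 1]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)

    # Rank features by their chi-square dependence on the labels
    # and keep only the top 5 (k is an arbitrary choice here).
    selector = SelectKBest(chi2, k=5).fit(X, labels)
    kept = vectorizer.get_feature_names_out()[selector.get_support()]
    print(kept)

For an information-gain-style measure, mutual_info_classif from the same sklearn.feature_selection module can be substituted for chi2.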
1- Feldman, R., & Sanger, J. (2007). The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge: Cambridge University Press.
2- Kim, Y. S., Street, W. N., & Menczer, F. (2003). Feature Selection in Data Mining. University of Iowa.