Text categorization/classification (TC) is the assignment of a text to one of
two or more classes (Mahinovs & Tiwari, 2007). For example, news articles can be
classified into “local” and “global”, e-mails into “spam” and “others”, and
customer feedback into “positive” and “negative”. Categorization is a
significant method because it reduces the time needed to reach information, and
this is one of the most important motivations for TC. An example use of TC is
dealing with spam e-mails.
However, because a large portion of the text data on the internet is written in
natural language, categorizing texts in this format is very difficult. To
overcome this problem, texts written in natural language should be transformed
into digital texts (Asyali & Yildirim, 2004).
In addition, there should of course be a pre-processing stage that prepares the
text to be categorized. For example, specific tags such as XML/HTML are
identified as blocks of text for section searching (Oracle®, 2003), non-letter
characters are replaced by spaces, single-letter words are deleted, and all
characters are converted to lower case (Tonta, Bitirim & Sever, 2002). Several
steps come before text categorization itself: tokenization, stop word removal,
stemming, feature extraction, and building the vector space model (Mahinovs &
Tiwari, 2007).
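The pre-processing described above can be sketched in a few lines of Python. This is only a minimal illustration under several assumptions, not the method of any of the cited papers: the stop word list is a tiny made-up one, tags are simply dropped rather than kept as searchable sections, only ASCII letters are treated as letters (so it would not suit Turkish text as written), stemming is left out, and the function names `preprocess` and `to_vector` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"the", "and", "is", "of", "to", "in", "for"}


def preprocess(text):
    """Clean raw text and return a list of tokens."""
    text = re.sub(r"<[^>]+>", " ", text)     # drop XML/HTML tags (a simplification)
    text = re.sub(r"[^A-Za-z]+", " ", text)  # non-letter characters become spaces
    tokens = text.lower().split()            # lower-case, then split on whitespace
    tokens = [t for t in tokens if len(t) > 1]         # delete single-letter words
    return [t for t in tokens if t not in STOP_WORDS]  # stop word removal


def to_vector(tokens, vocabulary):
    """Count how often each vocabulary term occurs (a simple vector space model)."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


docs = ["<p>Win a FREE prize now!</p>", "Meeting notes for the project team"]
tokenized = [preprocess(d) for d in docs]
vocabulary = sorted({t for tokens in tokenized for t in tokens})
vectors = [to_vector(tokens, vocabulary) for tokens in tokenized]
```

Each document ends up as a list of term counts over a shared vocabulary, which is the kind of numeric input a classifier can work with. Raw counts are used here only for brevity; a weighting scheme could be substituted at the `to_vector` step.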
1- Mahinovs, A. & Tiwari, A. (Cranfield University). (2007). Text Classification Method Review, April 2007.
2- Asyali, M.F. (Computer Engineering, YTU, IST., T.R.) & Yildirim, T. (Electronic Engineering, YTU, IST., T.R.). (2004). Auto Text Categorization of News Articles.
3- Oracle®. (2003). Text Application Developer's Guide 10g Release 1.
4- Tonta, Y. (Hacettepe University, ANK., T.R.), Bitirim, Y. (Eastern Mediterranean University, North Cyprus, T.R.) & Sever, H. (Massachusetts University, Shrewsbury, MA). (2002). Turkish Search Engine Performance Evaluation.
Ahmet,
Can you please edit the post so that the references are shown below (i.e. put the full citations of the references at the end of the document)?
This problem is closely related to our discussion of clustering in class. Thank you for sharing information on these topics.
It would be good if someone could make a tutorial based on this using RapidMiner instead of Statistica. The steps are essentially laid out in the last paragraph.
Fadel