Friday, February 22, 2013

TEXT CATEGORIZATION / CLASSIFICATION

Text categorization/classification (TC) is the grouping of texts into two or more classes (Mahinovs & Tiwari, 2007). For example, news articles can be classified into “local” and “global”, e-mails into “spam” and “others”, and customer feedback into “positive” and “negative”. Categorization is a significant method that reduces the time needed to reach information, and this is one of the most important motivations for TC. An example use of TC is dealing with spam e-mails.
However, because a large portion of the text data on the internet is written in natural language, categorizing texts in this format is very difficult. To overcome this problem, texts written in natural language should be transformed into a digital form that a computer can process (Asyali & Yildirim, 2004).
In addition, a pre-processing stage is needed to prepare the text for categorization. For example, specific tags such as xml/html are identified as blocks of text for section searching (Oracle®, 2003), non-letter characters are replaced by spaces, single-letter words are deleted, and all characters are converted to lower case (Tonta, Bitirim & Sever, 2002). Several steps precede text categorization itself: tokenization, stop word removal, stemming, feature extraction, and building the vector space model (Mahinovs & Tiwari, 2007).
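
The pre-processing and the steps listed above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method of any of the cited papers: the stop word list is a tiny made-up sample, the stemmer is a naive suffix stripper standing in for a real one, tags are simply dropped rather than kept as section markers, and all function names (preprocess, tokenize, stem, to_vectors) are hypothetical. Feature extraction here is plain term counting.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "and", "is", "are", "of", "to", "in", "on"}


def preprocess(text):
    """Drop xml/html tags, replace non-letter characters with spaces, lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)      # simplification: tags are discarded
    text = re.sub(r"[^A-Za-z]+", " ", text)   # non-letter characters become spaces
    return text.lower()


def tokenize(text):
    """Split on whitespace and delete single-letter words."""
    return [token for token in text.split() if len(token) > 1]


def remove_stop_words(tokens):
    return [token for token in tokens if token not in STOP_WORDS]


def stem(token):
    """Naive suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def to_vectors(documents):
    """Return (vocabulary, vectors): one term-count vector per document."""
    processed = [
        [stem(t) for t in remove_stop_words(tokenize(preprocess(doc)))]
        for doc in documents
    ]
    # Feature extraction: every distinct stemmed term is one dimension.
    vocabulary = sorted({term for doc in processed for term in doc})
    vectors = []
    for doc in processed:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


if __name__ == "__main__":
    docs = ["<p>Win a FREE prize!</p>", "The meeting is on Friday."]
    vocabulary, vectors = to_vectors(docs)
    print(vocabulary)
    print(vectors)
```

Each document ends up as a vector of term counts over a shared vocabulary, which is the form a classifier would take as input. A real system would normally weight the terms (for example with TF-IDF) instead of using raw counts.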

REFERENCES

1-       Mahinovs, A. & Tiwari, A. (2007). Text Classification Method Review. Cranfield University, April 2007.
2-       Asyali, M. F. & Yildirim, T. (2004). Auto Text Categorization of News Articles. Computer Engineering and Electronic Engineering, YTU, Istanbul, Turkey.
3-       Oracle®. (2003). Text Application Developer's Guide 10g Release 1.
4-       Tonta, Y., Bitirim, Y. & Sever, H. (2002). Turkish Search Engine Performance Evaluation. Hacettepe University, Ankara; Eastern Mediterranean University, North Cyprus; Massachusetts University, Shrewsbury, MA.

1 comment:

  1. Ahmet,

    Can you please edit the post so that the references are shown below (i.e., put the full citations of the references at the end of the document)?

    This problem is closely related to our discussion of clustering in class. Thank you for sharing information regarding these topics.

    It would be good if someone could make a tutorial based on this using RapidMiner instead of Statistica. The steps are essentially laid out in the last paragraph.


    Fadel
