Sunday, February 24, 2013

TOKENIZATION



A token is a single instance of a type. In text mining, tokenization is the process of breaking a stream of text up into tokens, and it is the very first step in preparing natural language text for categorization.

First, the text is broken up into meaningful components. For example, a text can be broken up into chapters, sections, sentences, words, phrases, symbols, or other meaningful elements called tokens (Feldman & Sanger, 2007).
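For instance, a naive first pass might split a text into sentences before looking at individual words. A minimal Python sketch (a toy illustration of mine, not taken from the cited sources):

    import re

    def split_sentences(text):
        # Naive rule: a sentence ends at '.', '!', or '?' followed by whitespace.
        # This fails on abbreviations such as "Dr." -- see the discussion below.
        return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

    print(split_sentences("Tokenization comes first. Then we categorize!"))
    # ['Tokenization comes first.', 'Then we categorize!']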

Generally, words are surrounded by whitespace and may be followed by punctuation, parentheses, or quotes. So a simple tokenization rule can be stated as follows: break the character sequence up at whitespace positions and cut off punctuation, parentheses, and quotes at both ends of the fragments to get the sequence of tokens (Sherpa & Choejey, 2008). However, when we follow this rule we encounter some problems that call for further study.
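This rule is easy to sketch in Python. The function name and the exact set of stripped characters below are my own assumptions, not taken from the cited sources:

    PUNCT = '.,;:!?()[]{}"\''

    def simple_tokenize(text):
        # Break the character sequence up at whitespace positions...
        fragments = text.split()
        # ...and cut off punctuation, parentheses, and quotes at both ends.
        return [f.strip(PUNCT) for f in fragments if f.strip(PUNCT)]

    print(simple_tokenize('He said: "Tokenization (usually) comes first."'))
    # ['He', 'said', 'Tokenization', 'usually', 'comes', 'first']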

First of all, not all periods are punctuation. They may be markers for abbreviations such as “U.K.”, “Mr.”, “Dr.”, and so on. Only the periods which are sentence markers should be removed to get separate tokens (Feldman & Sanger, 2007). The sentence markers “!” and “?” are generally unambiguous punctuation. The most difficult symbols to handle are colons “:” and semicolons “;”: distinguishing their different uses is very hard without analyzing the whole sentence.

Another problem arises with ordinal numbers. In Turkish and other European languages, ordinal numbers are written with a trailing period after the number; for example, “13th” is written “13.”. These ordinal numbers cause the same problem as abbreviations: a number followed by a period may be an ordinal number, a cardinal number in sentence-final position, or an ordinal number in sentence-final position, and distinguishing these cases is not possible without contextual information.

Above we said that tokens do not contain whitespace. However, there are multiword expressions, such as complex prepositions like “because of”, conjunctions like “so that”, adverbs like “at all”, date expressions like “Jan. 16, 1986”, time expressions like “2:30 am”, and proper names like “AC Milan”, and it is better to accept them as single tokens. The opposite is done for contractions like “we’ll” or “aren’t”: these words are separated into two words, “we will” and “are not”.

The last problem that might be encountered in tokenization is missing whitespace. Sometimes there is no blank after a punctuation mark, as in “hours.The” or “however,when”, which should each be broken up into three tokens (Hassler & Fliedl, 2006; Ben-Hur & Weston). Here is a step-by-step example of the tokenization and typing of tokens (Hassler & Fliedl, 2006), with a rough code sketch after the list:
1-       identify single-tokens
2-       type single-tokens
3-       split sentence end markers
4-       reinterpret single-token types
5-       merge and split tokens recursively
6-       reinterpret all token types
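As a rough illustration of how the splitting and merging steps might look in code, here is a Python sketch. The abbreviation set, the contraction table, and the function name are minimal assumptions of mine, not taken from Hassler & Fliedl, and the typing steps (2, 4, and 6) are omitted entirely:

    import re

    # Assumed, minimal resources; a real tokenizer needs much larger lists.
    ABBREVIATIONS = {"Mr.", "Dr.", "U.K.", "Jan."}
    CONTRACTIONS = {"we'll": ["we", "will"], "aren't": ["are", "not"]}

    def tokenize(text):
        tokens = []
        for fragment in text.split():               # step 1: identify single-tokens
            # step 5 (split): repair missing whitespace such as "hours.The" or
            # "however,when"; the capturing group keeps the mark as its own token.
            pieces = [p for p in
                      re.split(r"(?<=[a-z])([.,])(?=[A-Za-z])", fragment) if p]
            for piece in pieces:
                # step 3: split sentence-end markers, but keep abbreviation periods
                if len(piece) > 1 and piece[-1] in ".!?" and piece not in ABBREVIATIONS:
                    tokens.extend([piece[:-1], piece[-1]])
                # step 5 (merge/split): expand contractions into two words
                elif piece.lower() in CONTRACTIONS:
                    tokens.extend(CONTRACTIONS[piece.lower()])
                else:
                    tokens.append(piece)
        return tokens

    print(tokenize("Dr. Smith worked hours.The team aren't done."))
    # ['Dr.', 'Smith', 'worked', 'hours', '.', 'The', 'team', 'are', 'not', 'done', '.']

Merging multiword expressions such as “because of” or “AC Milan” into single tokens would additionally require a dictionary lookup over adjacent tokens, which this sketch leaves out.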




1-       Feldman, R. & Sanger, J. (2007). The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.
2-       Sherpa, U. & Choejey, P. (2008). Dzongkha Text Normalization Algorithm. Department of Information Technology, Bhutan.
3-       Hassler, M. & Fliedl, G. (2006). Text Preparation through Extended Tokenization.
