A token is a single instance of a type. In text mining, tokenization is the process of breaking a stream of text up into tokens, and it is the very first step in preparing natural language text for categorization.
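To make the token/type distinction concrete, here is a minimal sketch in plain Python (our own illustration, not from the cited sources): the sentence “to be or not to be” contains six tokens but only four types.

    # A minimal illustration of the token/type distinction:
    # a token is one occurrence in the text, a type is a distinct word form.
    text = "to be or not to be"

    tokens = text.split()   # every occurrence counts as a token
    types = set(tokens)     # duplicate occurrences collapse into one type

    print(tokens)           # ['to', 'be', 'or', 'not', 'to', 'be'] -> 6 tokens
    print(sorted(types))    # ['be', 'not', 'or', 'to']             -> 4 types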
First, the text is broken up into meaningful components. For example, texts can be broken up into chapters, sections, sentences, words, phrases, symbols, or other meaningful elements called tokens (Feldman & Sanger, 2007).
Generally, words are surrounded by whitespace and may be followed by punctuation, parentheses, or quotes. So a simple tokenization rule can be stated as follows: break the character sequence at whitespace positions and cut off punctuation, parentheses, and quotes at both ends of the fragments to get the sequence of tokens (Sherpa & Choejey, 2008). However, when we follow this rule we encounter some problems that need extra attention.
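A minimal Python sketch of this rule (the set of edge characters is our own illustrative choice, not a complete list):

    import string

    # Characters to cut off at both ends of each fragment: punctuation,
    # parentheses, and quotes (an illustrative set, not exhaustive).
    EDGE_CHARS = string.punctuation + "\u201c\u201d\u2018\u2019"

    def naive_tokenize(text):
        """Break at whitespace, then cut punctuation off both ends."""
        tokens = []
        for fragment in text.split():
            token = fragment.strip(EDGE_CHARS)
            if token:
                tokens.append(token)
        return tokens

    print(naive_tokenize('He said, "Hello (world)!"'))
    # -> ['He', 'said', 'Hello', 'world']

Note that this simple rule already misbehaves on abbreviations such as “U.K.”, whose trailing period gets cut off; that is exactly the first problem discussed below.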
First of all, not all periods are punctuation marks. They may be markers for abbreviations such as “U.K.”, “Mr.”, “Dr.”, and so on. Only periods that are sentence markers should be split off to get separate tokens (Feldman & Sanger, 2007). The sentence markers “!” and “?” are generally unambiguous punctuation. The most difficult symbols to handle are colons “:” and semicolons “;”: distinguishing their different uses is very hard without analyzing the whole sentence. Another problem is ordinal numbers. In Turkish and some other European languages, ordinal numbers are written with a trailing period after the number; for example, “13th” is written “13.”. These ordinal numbers cause the same problem as abbreviations: a number followed by a period may be an ordinal number, a cardinal number in sentence-final position, or an ordinal number in sentence-final position, and distinguishing these cases is not possible without contextual information.
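A common workaround is a lookup list of known abbreviations plus a rough check of the following context. The sketch below only illustrates that idea; the abbreviation list and the lowercase-next-word cue are our own assumptions, not a complete solution:

    import re

    # Illustrative abbreviation list; a real system would use a much larger one.
    ABBREVIATIONS = {"U.K.", "Mr.", "Dr.", "Prof.", "Jan.", "etc."}

    def is_sentence_period(token, next_token):
        """Heuristic: a trailing period ends a sentence unless the token is a
        known abbreviation, or a number whose following word is lowercase
        (suggesting an ordinal, as in Turkish '13. madde' = '13th article')."""
        if token in ABBREVIATIONS:
            return False
        if re.fullmatch(r"\d+\.", token):
            # Number + period: guess "ordinal" if a lowercase word follows,
            # otherwise assume the period ends the sentence. This is only a
            # rough contextual cue, as explained above.
            return next_token is None or not next_token[:1].islower()
        return token.endswith(".")

    print(is_sentence_period("Dr.", "Smith"))   # False: abbreviation
    print(is_sentence_period("13.", "madde"))   # False: likely ordinal
    print(is_sentence_period("ended.", None))   # True: sentence boundary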
Earlier we said that tokens do not contain whitespace. However, there are multiword expressions such as complex prepositions like “because of”, conjunctions like “so that”, adverbs like “at all”, date expressions like “Jan. 16, 1986”, time expressions like “2:30 am”, and proper names like “AC Milan”, and it is better to treat them as single tokens.
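One simple way to do this is a dictionary-driven merge pass over the token sequence; the expression list below is a small illustrative sample:

    # Illustrative multiword expressions to re-merge after whitespace splitting.
    MWES = {("because", "of"), ("so", "that"), ("at", "all"), ("AC", "Milan")}

    def merge_mwes(tokens):
        """Greedily merge adjacent tokens that form a known multiword expression."""
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in MWES:
                merged.append(tokens[i] + " " + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        return merged

    print(merge_mwes(["He", "failed", "because", "of", "rain"]))
    # -> ['He', 'failed', 'because of', 'rain']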
The opposite job is done for contractions like “we’ll” or “aren’t”: these words are separated into two words, “we will” and “are not”.
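This step can again be table-driven; the small mapping below is illustrative only, since real contraction lists are much longer and some forms are ambiguous:

    # Illustrative contraction table; real lists must also handle ambiguity
    # with context (e.g. "he's" can be "he is" or "he has").
    CONTRACTIONS = {"we'll": ["we", "will"], "aren't": ["are", "not"]}

    def expand_contractions(tokens):
        """Replace each known contraction with its component words
        (this sketch lowercases for the lookup, losing capitalization)."""
        expanded = []
        for token in tokens:
            expanded.extend(CONTRACTIONS.get(token.lower(), [token]))
        return expanded

    print(expand_contractions(["We'll", "see"]))  # -> ['we', 'will', 'see']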
The last problem that might be encountered in tokenization is missing whitespace. Sometimes there is no blank after a punctuation mark, as in “hours.The” or “however,when”; such strings should be broken up into three tokens (Hassler & Fliedl, 2006; Ben-Hur & Weston).
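A regular expression that recognizes a punctuation mark glued between two words is a simple fix for this case; a minimal sketch:

    import re

    def split_missing_whitespace(fragment):
        """Split strings like 'hours.The' or 'however,when' into three
        tokens: word, punctuation mark, word."""
        match = re.fullmatch(r"([A-Za-z]+)([.,;:!?])([A-Za-z]+)", fragment)
        if match:
            return list(match.groups())
        return [fragment]

    print(split_missing_whitespace("hours.The"))     # ['hours', '.', 'The']
    print(split_missing_whitespace("however,when"))  # ['however', ',', 'when']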
Here is a step-by-step example of tokenization and typing of tokens (Hassler & Fliedl, 2006), followed by a rough sketch in code:
1- identify single-tokens
2- type single-tokens
3- split sentence end markers
4- reinterpret single-token types
5- merge and split tokens recursively
6- reinterpret all token types
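The sketch below loosely walks through these six steps on a small example; it is a simplification for illustration, not Hassler & Fliedl’s actual implementation:

    import re

    def tokenize_pipeline(text):
        """A loose, illustrative walk through the six steps above."""
        # 1- identify single-tokens: split at whitespace.
        tokens = text.split()

        # 2- type single-tokens: attach a rough type to each token.
        def token_type(tok):
            if re.fullmatch(r"\d+[.,:]?\d*", tok):
                return "NUM"
            if tok[-1] in ".!?":
                return "WORD+END"
            return "WORD"
        typed = [(tok, token_type(tok)) for tok in tokens]

        # 3- split sentence end markers: separate a trailing '.', '!' or '?'.
        split = []
        for tok, typ in typed:
            if typ == "WORD+END" and len(tok) > 1:
                split.append((tok[:-1], "WORD"))
                split.append((tok[-1], "END"))
            else:
                split.append((tok, typ))

        # 4- reinterpret single-token types: e.g. 'Mr' before '.' is really an
        #    abbreviation, so steps 5 and 6 could merge the pair back together.
        # 5- merge and split tokens recursively, 6- reinterpret all token types:
        #    omitted here; the earlier sketches show merging and splitting.
        return split

    print(tokenize_pipeline("Hello world! It is 2:30 am."))
    # -> [('Hello', 'WORD'), ('world', 'WORD'), ('!', 'END'), ('It', 'WORD'),
    #     ('is', 'WORD'), ('2:30', 'NUM'), ('am', 'WORD'), ('.', 'END')]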
References:
1- Feldman, R. & Sanger, J. (2007). The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.
2- Sherpa, U. & Choejey, P. (2008). Dzongkha Text Normalization Algorithm. Department of Information Technology, Bhutan.
3- Hassler, M. & Fliedl, G. (2006). Text Preparation through Extended Tokenization.