Saturday, March 16, 2013

A data mining application for Chinese input

Chinese is not composed of letters, so its input method is different from that of other languages, such as English. I will briefly introduce the input methods, and then cite an example from the internet of data mining applied to Chinese input.

The most popular Chinese input method is called "pinyin". Each character has a phonetic spelling in Latin letters that shows its pronunciation, much like the transcription [prə,nʌnsɪ'eɪʃ(ə)n] for "pronunciation" in English. When I type a combination of letters, the computer shows a list of characters that are pronounced that way and asks me to choose one of them. That completes the input of one character. For example, suppose I want to input the Chinese word for "data". First I type "shu" and the computer shows a list of characters; I choose one of them, and after this I need to type "ju" to input the second character.
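The character-by-character process can be sketched in a few lines of Python. This is only a toy illustration: the table `CANDIDATES` and the helper names are made up for this post, and a real input method uses a far larger table.

```python
# Toy pinyin-to-character table (toneless pinyin); hypothetical and tiny.
CANDIDATES = {
    "shu": ["书", "数", "树", "输"],
    "ju": ["句", "据", "局", "举"],
}

def candidates(pinyin):
    """Return the characters the user can choose from for one syllable."""
    return CANDIDATES.get(pinyin, [])

def type_character(pinyin, choice):
    """Simulate typing one syllable and picking the candidate at index `choice`."""
    return candidates(pinyin)[choice]

# Typing "data" takes two rounds: one for "shu" and one for "ju".
word = type_character("shu", 1) + type_character("ju", 1)
print(word)  # 数据
```

Each character costs one round of typing plus one round of choosing, which is why long texts are slow to enter this way.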

The problem is that many Chinese characters share the same pronunciation, so choosing characters one by one is too slow. There are two ways to improve the input speed: the input software can list the most popular characters first, or it can allow users to type the letter combination for a whole word rather than a single character. In the example from the last paragraph, I can type "shuju" in one go and choose the whole word.
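Both speed-ups can be sketched with a small frequency table. The counts below are invented for illustration, and the names `CHAR_FREQ`, `WORD_FREQ` and `ranked` are my own; real software would learn such numbers from actual usage.

```python
# Made-up usage counts, keyed by toneless pinyin.
CHAR_FREQ = {"shu": {"书": 120, "数": 300, "树": 80}}
WORD_FREQ = {"shuju": {"数据": 500, "书局": 40}}

def ranked(table, pinyin):
    """Return the candidates for `pinyin`, most frequently used first."""
    entries = table.get(pinyin, {})
    return sorted(entries, key=entries.get, reverse=True)

# Way 1: list the most popular character first.
char_list = ranked(CHAR_FREQ, "shu")
print(char_list)

# Way 2: type the whole word at once and choose from word candidates.
word_list = ranked(WORD_FREQ, "shuju")
print(word_list)
```

With word-level lookup, "shuju" needs one choice instead of two, and the intended word is likely to be at the top of the list.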

Word-level input only works for words that the software already has in its dictionary. Hence, a way of discovering new words is in high demand.

I found an example on the internet: a patent from Microsoft. It describes a method for discovering new words by data mining from query logs. The figure below (taken from the patent) shows the mining process.

After a new word is discovered or summarized, it is added to the dictionary of the MS input software. When users type this combination of letters in the future, the software will show the new word as a candidate for them to choose. This process is shown in the picture below.
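To make the idea concrete, here is a rough sketch of the general approach. It is not the method from the patent: as a simple stand-in, it assumes each query is one candidate word and keeps those that appear at least `min_count` times and are not yet in the dictionary. The query log, the `CHAR_PINYIN` table and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical character-to-pinyin table (toneless), just enough for this example.
CHAR_PINYIN = {"数": "shu", "据": "ju", "微": "wei", "博": "bo", "天": "tian", "气": "qi"}

def to_pinyin(word):
    """Spell a word as the letter combination a user would type."""
    return "".join(CHAR_PINYIN[ch] for ch in word)

def mine_new_words(query_log, dictionary, min_count=3):
    """Return frequent queries that the dictionary does not contain yet."""
    known = {w for words in dictionary.values() for w in words}
    counts = Counter(query_log)
    return [q for q, n in counts.most_common() if n >= min_count and q not in known]

def add_words(dictionary, words):
    """Add each new word under its pinyin so it shows up as a candidate."""
    for w in words:
        dictionary.setdefault(to_pinyin(w), []).append(w)

dictionary = {"shuju": ["数据"]}
query_log = ["微博", "数据", "微博", "天气", "微博", "数据", "数据"]

new_words = mine_new_words(query_log, dictionary)
add_words(dictionary, new_words)
print(new_words)            # ['微博']
print(dictionary["weibo"])  # ['微博']
```

After this step, typing "weibo" offers the new word as a candidate, while the rare query and the already known word are left alone.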



ref: http://www.faqs.org/patents/app/20100088303
