Analytics and Visualization of Big Data: Text Mining and Predictive Analytics

Text Mining and Predictive Models

Advances in storage capacity, ever-larger data collections, and easy access to data have left people facing an immense pool of data. One of the most important ways to deal with this problem is data mining, the process of discovering knowledge in a database. However, according to research by Merrill Lynch and Gartner, 85-90% of the world's data is stored in unstructured form (McKnight, 2005), so data mining algorithms alone are not enough. This is where text mining plays an important role.

Text mining is the process of exploring a collection of unstructured data and extracting structured, useful information from it. Text mining methods can be applied in many different areas, including business documents, customer reviews, web pages, e-mails, and other sources.

One of the most popular text mining techniques is predictive modeling. Decision trees, neural networks, and boosted trees are different types of predictive models. Predictive models are used to determine which class a set of data belongs to. For example, a technology company can apply predictive modeling algorithms to target specific customers before it introduces a new model.
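To make the idea of assigning text to a class concrete, here is a minimal sketch of a decision stump, that is, a decision tree with a single split, trained on word presence. It is only an illustration under stated assumptions: the function names (`train_stump`, `predict`), the tiny review data set, and the labels are all hypothetical, and a real predictive model would use a proper library and far more data.

```python
from collections import Counter

def words(text):
    """Split a text into a set of lowercase words."""
    return set(text.lower().split())

def majority(labels, default):
    """Most common label in the list, or the default if the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def train_stump(examples):
    """Train a one-split decision tree on a non-empty list of (text, label).

    The split is "does the text contain this word?"; the word chosen is the
    one whose split misclassifies the fewest training examples.
    """
    overall = majority([label for _, label in examples], None)
    vocabulary = sorted(set().union(*(words(text) for text, _ in examples)))
    best = None
    for word in vocabulary:
        with_word = [l for t, l in examples if word in words(t)]
        without = [l for t, l in examples if word not in words(t)]
        yes = majority(with_word, overall)
        no = majority(without, overall)
        errors = sum(l != yes for l in with_word) + sum(l != no for l in without)
        if best is None or errors < best[0]:
            best = (errors, word, yes, no)
    _, word, yes, no = best
    return word, yes, no

def predict(stump, text):
    """Classify a new text with a trained stump."""
    word, yes, no = stump
    return yes if word in words(text) else no

# Hypothetical training data: customer reviews with known classes.
training = [
    ("great phone love it", "positive"),
    ("love the camera", "positive"),
    ("terrible battery", "negative"),
    ("screen broke terrible", "negative"),
]
stump = train_stump(training)
print(stump)
print(predict(stump, "I love this screen"))
```

Full decision trees repeat this kind of split recursively, and boosted trees combine many small trees, but the basic step of choosing the split that best separates the known classes is the same.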

Figure-1: Text mining Process

Using data mining methods to pick out regular, structured data from piles of scattered, unstructured data is becoming very important. Text mining and data mining are similar in that both try to obtain information from massive, unstructured sources.

However, text mining is based on text sources (Chang, Healey, McHugh, & Wang, 2001; Kroeze & Bothma, 2007). Regular, structured data are extracted from unstructured data (text), and hidden information is thereby discovered. This is done with a variety of text mining techniques (Kroeze & Bothma, 2007).
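As a rough illustration of extracting regular, structured data from unstructured text, the sketch below turns free-form sentences into a table of term counts (often called a bag-of-words representation). The function names and the two sample sentences are assumptions made for this example and do not come from the cited sources.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def term_count_table(documents):
    """Return (vocabulary, rows) with one row of term counts per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

# Hypothetical unstructured input: two short customer comments.
docs = [
    "The battery life is great",
    "Battery died after one day",
]
vocab, rows = term_count_table(docs)
print(vocab)
for row in rows:
    print(row)
```

Each document becomes a fixed-length row of numbers, which is the kind of regular structure that ordinary data mining algorithms can then work with.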

Natural language processing (NLP) is a sub-discipline of computer science and linguistics. NLP studies the computer processing of natural-language text and/or speech. Modern statistical NLP algorithms therefore draw on linguistics, computer science, and statistics (Charniak, 1984). The programming languages used around the world all have specific structures, rules, and a fixed standard. Natural languages cannot be described so easily. Around the globe there are hundreds of official or known languages, and each has more than 100,000 words. In addition, languages are always changing and expanding, with a great deal of uncertainty, and each language has its own unique grammar. For this reason, it is impossible for text mining software to interpret a language 100% correctly (Erol, 2009).

The text mining process (see Figure-1) (Stavrianou, Andritsos, & Nicoloyannis, 2007) is divided into four main categories: text classification, also called text categorization (TC), association analysis, clustering, and information extraction (IE). TC assigns objects to categories or classes that are known in advance. Association analysis identifies words that are often associated with each other, and helps make sets of documents or their contents easier to understand. IE techniques find the useful data in documents or statements. Cluster analysis discovers the underlying structure of a document set.
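Of the four categories above, association analysis is perhaps the easiest to sketch in a few lines. The example below counts how often pairs of words appear in the same document and keeps the pairs that reach a minimum support. The function name, the `min_support` parameter, and the sample documents are hypothetical, and real association analysis would typically also filter stop words and compute measures such as confidence.

```python
from collections import Counter
from itertools import combinations

def cooccurring_pairs(documents, min_support=2):
    """Return word pairs that appear together in at least min_support documents."""
    pair_counts = Counter()
    for doc in documents:
        # Use each word once per document and sort so pairs are ordered consistently.
        unique_words = sorted(set(doc.lower().split()))
        pair_counts.update(combinations(unique_words, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

# Hypothetical document set.
docs = [
    "battery life short",
    "short battery life",
    "great camera",
    "camera battery great",
]
pairs = cooccurring_pairs(docs)
for pair, n in sorted(pairs.items()):
    print(pair, n)
```

Here "battery" and "life" are reported as associated because they occur together in two documents, while pairs seen only once are dropped.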

1- Chang, G., Healey, M.J., McHugh, J.A. & Wang, J.T.L. (2001). Mining the World Wide Web: An Information Search Approach.

2- Kroeze, J.H. & Bothma, T.J.D. (Department of Informatics, M.C. Matthee, Department of Informatics, Department of Information Science, University of Pretoria, Pretoria). (2007). Differentiating between data-mining and text-mining terminology.

3- Charniak, E. (1984). Introduction to Artificial Intelligence, page 2. Addison-Wesley.

4- Erol, U. (2009). "What is Text Mining?". From http://www.metinmadenciligi.com

5- Stavrianou, A., Andritsos, P. & Nicoloyannis, N. (2007). Overview and semantic issues of text mining. ACM SIGMOD Record, 36(3):23–34.

6- McKnight, Radicati, S. & Hoang, Q. (2011). Email Statistics Report 2011-2015. The Radicati Group, Inc. A Technology Market Research Firm. From http://www.radicati.com/?p=7261

Thursday, January 24, 2013
