Analytics and Visualization of Big Data: KNIME Software

KNIME, the Konstanz Information Miner, is an open source data analytics, reporting and integration platform. It was developed by the Visual Data Mining research group at the University of Konstanz and is based on the Eclipse Rich Client Platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining concept. Its graphical user interface lets users assemble nodes for data preprocessing (ETL: Extraction, Transformation, Loading), for modeling, and for data analysis and visualization. KNIME also provides a software development kit so that users can create their own modules.

Figure 1- KNIME screen figure

Figure 2- Data Source Adding Objects

KNIME can be used on every operating system supported by the Eclipse platform. KNIME includes its own JRE (Java Runtime Environment), so Java does not need to be installed separately on the operating system. Since 2006, KNIME has been used mainly in pharmaceutical research, but it is also used in other areas such as CRM (customer data analysis), business intelligence and financial data analysis.

Data Sources

KNIME can import data from text files (TXT), attribute-relation file format (ARFF) files, and TABLE format files.
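To give a feel for what an ARFF file looks like, here is a minimal Python sketch that reads a tiny hand-written ARFF text. It is not how KNIME reads ARFF; the sample data and the `read_arff` helper are made up for illustration, and the parser assumes a simple subset of the format (no quoted values, no sparse rows).

```python
import io

# Hand-written sample, not taken from any real data set.
ARFF_TEXT = """\
% lines starting with a percent sign are comments
@relation weather
@attribute temperature numeric
@attribute humidity numeric
@attribute play {yes,no}
@data
85,85,no
70,96,yes
"""


def read_arff(stream):
    """Parse a small subset of ARFF: header declarations followed by comma-separated rows."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in stream:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            rows.append(line.split(","))
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True  # everything after this line is data
    return relation, attributes, rows


relation, attributes, rows = read_arff(io.StringIO(ARFF_TEXT))
print(relation)
print(attributes)
print(rows)
```

The header names the relation and declares each attribute with its type, and everything after `@data` is one row per line.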

The cool thing about importing data with KNIME is that it lets the user define how much data is kept in memory and how much is kept on the hard disk. This feature reduces the chance of running out of memory when working on large data sets.
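The general idea of holding only part of a data set in memory can be sketched in Python as follows. This is an illustration of the concept, not KNIME's actual mechanism: the `iter_chunks` and `column_mean` helpers and the `rows_in_memory` parameter are hypothetical names, and an in-memory text stream stands in for a large file on disk.

```python
import csv
import io


def iter_chunks(stream, rows_in_memory):
    """Yield (header, rows) pairs, holding at most rows_in_memory rows at a time."""
    reader = csv.reader(stream)
    header = next(reader)
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == rows_in_memory:
            yield header, chunk
            chunk = []  # release the rows that were just processed
    if chunk:
        yield header, chunk  # final, possibly shorter, chunk


def column_mean(stream, column, rows_in_memory=2):
    """Compute a column mean without loading the whole table into memory."""
    total, count = 0.0, 0
    for header, chunk in iter_chunks(stream, rows_in_memory):
        idx = header.index(column)
        for row in chunk:
            total += float(row[idx])
            count += 1
    if count == 0:
        raise ValueError("no data rows")
    return total / count


# Stand-in for a large file on disk (assumption: comma-separated with a header row).
DATA = "id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n"

chunk_sizes = [len(rows) for _, rows in iter_chunks(io.StringIO(DATA), 2)]
mean_value = column_mean(io.StringIO(DATA), "value")
print(chunk_sizes, mean_value)
```

A smaller `rows_in_memory` lowers the memory footprint at the cost of more read steps, which is the same trade-off the memory setting exposes to the user.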

Furthermore, it supports importing data using SQL and reading models described in the Predictive Model Markup Language (PMML), which is based on XML.
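Because PMML is plain XML, any XML parser can inspect it. The sketch below uses Python's standard library to list the fields declared in a small hand-written PMML fragment. The fragment and the `list_data_fields` helper are made up for illustration, the namespace assumes PMML version 4.1, and nothing here reflects how KNIME itself processes PMML.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration; a real PMML file would also contain a model element.
PMML_TEXT = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="hand-written example"/>
  <DataDictionary numberOfFields="2">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="segment" optype="categorical" dataType="string"/>
  </DataDictionary>
</PMML>
"""

# Assumption: the document uses the PMML 4.1 namespace.
NS = {"pmml": "http://www.dmg.org/PMML-4_1"}


def list_data_fields(pmml_text):
    """Return (name, optype, dataType) for every field in the data dictionary."""
    root = ET.fromstring(pmml_text)
    return [
        (field.get("name"), field.get("optype"), field.get("dataType"))
        for field in root.findall("pmml:DataDictionary/pmml:DataField", NS)
    ]


fields = list_data_fields(PMML_TEXT)
print(fields)
```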

In addition to importing data, KNIME has data writer components for exporting data.

Data Preprocessing

KNIME does not have a dedicated component for preprocessing, but some of its algorithms can be used for data preprocessing.

Data Mining Algorithms

KNIME has most of the algorithms found in the data mining literature, such as Support Vector Machines, Bayes and Multidimensional Scaling. In addition to these advanced algorithms, KNIME also supports statistical methods such as regression, correlation, and correlation filter in the data stream design.
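As a rough idea of what a correlation filter does, the following Python sketch drops any column that is highly correlated with a column already kept. It is one common way to do this and is not necessarily the rule KNIME's node applies; the `pearson` and `correlation_filter` helpers, the threshold of 0.9, and the sample columns are all assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_filter(columns, threshold=0.9):
    """Keep a column only if it is not strongly correlated with any column kept so far."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up sample: the two height columns carry almost the same information.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "shoe_size": [40.0, 37.0, 43.0, 39.0],
}

kept_columns = correlation_filter(columns)
print(kept_columns)
```

Here `height_in` is dropped because it is nearly a rescaled copy of `height_cm`, while `shoe_size` is kept.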

Figure 3- KNIME Panel for the selection of the data mining algorithm

Figure 4- KNIME visualization tools

Data Streaming Design

Objects are placed in KNIME by dragging them from the “node repository” panel onto the canvas. To connect two objects, the user clicks one object and then the other, which joins them with a binding line. The data stream diagram is built up by running each node separately. The green light at the bottom of a node turns on if that node runs without any error. After checking the nodes, the next step is to set up the configuration, and then the model can be run. Note that a node cannot run unless the green light on the previous node is on.
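The rule that a node can only run after the previous node has finished successfully can be sketched in Python as below. This is a toy model of the behavior described above, not KNIME's API: the `Node` class, its `executed` flag (standing in for the green light), and the two example nodes are hypothetical.

```python
class Node:
    """Toy workflow node; `executed` plays the role of the green light."""

    def __init__(self, name, func, predecessor=None):
        self.name = name
        self.func = func
        self.predecessor = predecessor
        self.executed = False
        self.output = None

    def run(self, source=None):
        if self.predecessor is not None:
            if not self.predecessor.executed:
                raise RuntimeError(
                    f"{self.name} cannot run: {self.predecessor.name} has not run successfully"
                )
            data = self.predecessor.output
        else:
            data = source
        # If func raises, `executed` stays False and downstream nodes remain blocked.
        self.output = self.func(data)
        self.executed = True
        return self.output


reader = Node("reader", lambda _: [3, 1, 2])
sorter = Node("sorter", sorted, predecessor=reader)

try:
    sorter.run()  # the reader has not run yet, so this is refused
    blocked = False
except RuntimeError:
    blocked = True

reader.run()
result = sorter.run()
print(blocked, result)
```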

Figure 5- Data Stream Diagram

Visualization

Compared with other data mining software, KNIME is one of the richest in visualization. In addition to many visualization tools such as scatter plot, parallel coordinates, box plot and histogram, it also provides very detailed Java-based visualization tools built on JFreeChart.

Figure 6- Scatter Plot Graph after running the K-means on KNIME

Figure 7- Result Table after running the K-means on KNIME


Wednesday, February 27, 2013
