Analytics and Visualization of Big Data: DATA CLEANING IN STATISTICA

Sunday, March 3, 2013

DATA CLEANING IN STATISTICA

For a while now, we have been discussing data mining concepts in class. I think preprocessing is one of the most critical steps for mining data accurately and extracting sound information from it. Since cleaning is an essential part of preprocessing, learning how to clean data can be very useful for data miners.

Data Cleaning

Cleaning refers to the process of removing invalid data points from a data set. Many statistical analyses try to find a pattern in a data series, based on a hypothesis or assumption about the nature of the data. Cleaning is the process of removing those data points which are either (a) obviously disconnected from the effect or assumption we are trying to isolate, due to some other factor which applies only to those particular data points, or (b) obviously erroneous, i.e. some external error is reflected in that particular data point, due to a mistake during data collection, reporting, etc.

Cleaning frequently involves human judgement to decide which points are valid and which are not, so there is a chance that valid data points get removed when they are caused by some effect not sufficiently accounted for in the hypothesis or assumption behind the analytical method applied.

The points to be cleaned are generally extreme outliers. Outliers are those points which stand out because they do not follow a pattern that is generally visible in the data. One way of detecting outliers is to plot the data points (if possible) and visually inspect the resulting plot for points which lie far outside the general distribution. Another way is to run the analysis on the entire data set, eliminate those points which do not meet mathematical control limits for variability from a trend, and then repeat the analysis on the remaining data.
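The second approach (analyze, eliminate points outside control limits, repeat) can be sketched in a few lines of Python. This is only an illustration and not how STATISTICA does it: the function name `trim_outliers`, the choice of "mean plus or minus k standard deviations" as the control limit, the value of `k`, and the sample readings are all assumptions made for this example.

```python
from statistics import mean, stdev


def trim_outliers(values, k=2.0, max_rounds=10):
    """Repeatedly drop points farther than k standard deviations from the mean.

    Hypothetical helper: the control limit (mean +/- k * stdev) is an
    assumption chosen for illustration, not a rule taken from the post.
    Returns (kept, removed).
    """
    kept = list(values)
    removed = []
    for _ in range(max_rounds):
        if len(kept) < 3:
            break  # too few points left to estimate variability
        center = mean(kept)
        spread = stdev(kept)
        if spread == 0:
            break  # all remaining points are identical
        outside = [v for v in kept if abs(v - center) > k * spread]
        if not outside:
            break  # every remaining point is within the limits
        removed.extend(outside)
        kept = [v for v in kept if abs(v - center) <= k * spread]
    return kept, removed


# Made-up sample readings with one obviously suspicious value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 55.0, 10.1, 9.7, 10.0]
kept, removed = trim_outliers(readings, k=2.0)
print("removed:", removed)
print("kept:", kept)
```

Note that with very small samples a single extreme value inflates the standard deviation so much that a wide limit (such as 3 standard deviations) may never flag it, which is one reason the choice of limit still needs human judgement.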

Data quality pertains to issues such as:

• Accuracy

• Integrity

• Cleanliness

• Correctness

• Completeness

• Consistency
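Some of these issues can be checked mechanically before any cleaning decision is made. The sketch below is a minimal example under stated assumptions: the function `quality_report`, the field names (`id`, `age`, `city`), the valid range for `age`, and the sample records are all hypothetical. It counts rows with missing values (completeness), values outside an allowed range (accuracy/correctness), and repeated keys (consistency).

```python
def quality_report(records, required, ranges, key="id"):
    """Count simple data quality problems in a list of dict records.

    Hypothetical helper for illustration only:
      required - fields that must not be None (completeness)
      ranges   - {field: (low, high)} allowed numeric ranges (correctness)
      key      - field expected to be unique across records (consistency)
    """
    incomplete = sum(
        1 for r in records if any(r.get(f) is None for f in required)
    )
    out_of_range = sum(
        1
        for r in records
        if any(
            r.get(f) is not None and not (low <= r[f] <= high)
            for f, (low, high) in ranges.items()
        )
    )
    keys = [r.get(key) for r in records]
    duplicate_keys = len(keys) - len(set(keys))
    return {
        "rows": len(records),
        "incomplete": incomplete,
        "out_of_range": out_of_range,
        "duplicate_keys": duplicate_keys,
    }


# Made-up records: one missing age, one impossible age, one repeated id.
rows = [
    {"id": 1, "age": 34, "city": "Austin"},
    {"id": 2, "age": None, "city": "Dallas"},
    {"id": 3, "age": 212, "city": "Houston"},
    {"id": 3, "age": 41, "city": "Houston"},
]
report = quality_report(rows, required=["id", "age", "city"], ranges={"age": (0, 120)})
print(report)
```

A report like this only flags candidates; deciding whether a flagged row is an error or a valid but unusual record remains a judgement call.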

http://www.information-management.com/infodirect/20041029/1012952-1.html
http://en.wikibooks.org/wiki/Statistics/Data_Analysis/Data_Cleaning

1 comment:

Unknown, March 3, 2013 at 9:15 PM
Data cleansing is the very first step of a data mining process. It is the process of uncovering and correcting inconsistent records in a table, a set, or a database. It also refers to checking and correcting data through automatic and manual processes. Verification is all about finding and correcting the errors. It is used mainly in databases to identify imperfect, incorrect, erroneous, and irrelevant parts of the data and then modify, replace, or delete the incorrect data.