For a while, we have been talking about data mining concepts in class. I think
that preprocessing is one of the most critical steps for mining data accurately
and extracting sound information from it. Since cleaning is an essential part of
preprocessing, learning how to clean data might be very useful for data
miners.
Data Cleaning
Cleaning refers to the process of removing invalid data
points from a data set. Many statistical analyses try to find a pattern in a
data series, based on a hypothesis or assumption about the nature of the
data. Cleaning is the process of removing those data points which are either
(a) obviously disconnected from the effect or assumption which we are trying to
isolate, due to some other factor which applies only to those particular data
points, or
(b) obviously erroneous, i.e. some external error is reflected in that
particular data point, due to a mistake during data collection,
reporting, etc.
Cleaning frequently involves human judgement to decide which points are valid
and which are not, and there is a chance that points removed as invalid are in
fact valid data points caused by some effect not sufficiently accounted for in
the hypothesis/assumption behind the analytical method applied.
The points to be cleaned are generally extreme outliers. Outliers are those
points which stand out for not following a pattern which is generally visible in
the data. One way of detecting outliers is to plot the data points (if possible)
and visually inspect the resulting plot for points which lie far outside the
general distribution. Another way is to run the analysis on the entire data
set, eliminate those points which do not meet mathematical control
limits for variability from a trend, and then repeat the analysis on the
remaining data.
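The second approach can be sketched in a few lines of Python. This is only a minimal illustration, not a standard method: the function name `clean_outliers`, the sample `readings`, and the choice of "mean plus or minus k standard deviations" as the control limit are all assumptions made for this example. A real analysis would use whatever trend and limits fit its own hypothesis.

```python
from statistics import mean, stdev

def clean_outliers(values, k=2.0, max_rounds=10):
    """Drop points more than k standard deviations from the mean,
    then repeat the analysis on the remaining data until nothing
    else is removed (or max_rounds is reached)."""
    kept = list(values)
    removed = []
    for _ in range(max_rounds):
        if len(kept) < 3:
            break  # too few points left to estimate variability
        center = mean(kept)
        spread = stdev(kept)
        if spread == 0:
            break  # all remaining points are identical
        outside = [v for v in kept if abs(v - center) > k * spread]
        if not outside:
            break  # every remaining point is within the control limits
        removed.extend(outside)
        kept = [v for v in kept if abs(v - center) <= k * spread]
    return kept, removed

# Hypothetical sensor readings with one suspicious value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 55.0, 10.1, 9.7, 10.0]
kept, removed = clean_outliers(readings, k=2.0)
print("kept:", kept)
print("removed:", removed)
```

The limit `k=2.0` is deliberately tight because the sample is tiny. With only ten points, no single value can lie three sample standard deviations from the mean, so a 3-sigma rule would flag nothing here. The removed points should still be reviewed by a person, since some of them may be valid data.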
Data quality pertains to issues such as:
• Accuracy
• Integrity
• Cleanliness
• Correctness
• Completeness
• Consistency
http://www.information-management.com/infodirect/20041029/1012952-1.html
http://en.wikibooks.org/wiki/Statistics/Data_Analysis/Data_Cleaning
Data cleansing is the very first step of a data mining process. It is the process of uncovering and correcting inconsistent records in a table, a data set, or a database. It also refers to checking and correcting data by automatic and manual processes. Verification consists of finding and correcting the errors. It is used mainly in databases to identify imperfect, incorrect, erroneous, and irrelevant parts of the data and then modify, replace, or delete them.