Tuesday, April 23, 2013

Weeding out the noise


While studying big data, it is easy to misinterpret how data mining works. You must first understand that information does not equal insight: insight always entails information, but information does not always entail insight. Dr. Michael Wu describes three criteria that information must meet to provide valuable insight. 
1. Interpretability. Because big data can be so unstructured and diverse, a large amount of it cannot be interpreted. 
For example, consider this sequence of numbers: 123, 243, 187, 89, and 156. This data could mean any number of things (street addresses, the total minutes it takes to write a blog post, the number of candies in a bag). The point Dr. Wu is making with this criterion is that, without metadata to describe the data further, you cannot interpret it and therefore cannot gain any insight from it. 
2. Relevance. Information must be relevant in order to be of any use. Relevant information is sometimes referred to as signal, whereas irrelevant information is referred to as noise. But relevance is a relative term. "Information that is relevant to me may be completely irrelevant to you, and vice versa. Relevance is not only subjective, it is also contextual. If I’m visiting NYC next week, then NYC traffic will suddenly become very relevant to me. But after I return to Alabama, the same information will instantly become irrelevant again."
3. Novelty. Information must be novel, meaning that it is new and does not tell you something you already know.
Clearly this criterion is also very relative. Something that is old news to me might be new to you, and something that I find insightful you might not. 
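
To make the interpretability point concrete, here is a minimal Python sketch using the sequence of numbers from the first criterion. The metadata dictionary and the describe function are invented for illustration and do not come from Dr. Wu; the sketch only shows that the same numbers produce a meaningful statement once a description is attached to them.

```python
# The raw sequence from the interpretability example above.
raw_values = [123, 243, 187, 89, 156]

# Assumed metadata, made up for this illustration only.
metadata = {"field": "minutes_to_write_blog_post", "unit": "minutes"}


def describe(values, meta=None):
    """Summarize the values; without metadata the average has no meaning attached."""
    mean = sum(values) / len(values)
    if meta is None:
        return f"mean = {mean:.1f} (meaning unknown)"
    return f"mean {meta['field']} = {mean:.1f} {meta['unit']}"


print(describe(raw_values))            # a bare number, no insight
print(describe(raw_values, metadata))  # the same number, now interpretable
```

The arithmetic is identical in both calls; only the second result can be read as a statement about something.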

1 comment:

  1. The three criteria from Dr. Wu that you discussed are closely related to the four criteria we have covered in class: patterns and models should be valid, useful, unexpected, and understandable in order to make an impact.

    Interpretability relates to the data being understandable. We want to be able to understand the data in a way that we were not able to before. Visualizations are especially helpful in understanding large datasets that have been mined.

    Relevance relates to the data being useful. We only want to work with data that is something we will be able to use to help us accomplish something.

    Novelty relates to the data being unexpected. There is no point in data mining information that is old news. The purpose of data mining is finding new information.

    Dr. Wu does not discuss a criterion that is directly related to the validity of data. However, we need to have some faith in our data so that we know that our analyses are worth something.

    As we take on data mining projects, we need to keep the criteria that have been discussed in mind to make sure that what we are doing has some sort of purpose.
