Wednesday, February 20, 2013

Top five challenges in text mining


Here’s a quick look at five challenges users said they most often run into with text analytics.
  1. Data access. Often, companies will want to use more than one source of unstructured data for analysis, but gaining access to this data can be challenging. This is more than getting hold of the Twitter firehose for customer intelligence analysis. It is about the right to use internal or cross-company data stores, like institutional document repositories, in the face of corporate politics or delays due to operational procedures, such as having to make formal requests to IT for the data.
  2. Managing expectations. In some organizations, text analytics can leave management with the idea that you can simply plug in the software, feed it text data, and have it automatically give you the answers. While you may be able to get some high-level answers this way using tools tuned for social media, the reality is that most of the time you’ll have to interact with the software, especially when it comes to building a taxonomy (see No. 5). Text analytics tends to be more semi-automatic than automatic.
  3. Trusting the data. On the flip side of managing inflated expectations is the need to establish trust in the data. This challenge can manifest itself in terms of data quality and as a cultural issue.
    Determining data quality for unstructured data is hard for many reasons, including the fact that words have multiple meanings and unstructured text can be noisy with typos, colloquialisms, and so on. Often, with text data you’re going to get about 70 percent to 80 percent accuracy. That level of accuracy can be hard for some people to accept.
    Using text analytics in decision-making also requires a cultural change, which can be difficult. For example, in organizations that are used to classifying content manually, moving to a semi-automated approach can be a big shift and people might not believe the classification schemes. They'll be skeptical -- sometimes because of the way the analysis is presented. For instance, structured data might indicate that people are buying a wireless company's phones. Because sales are up, executives might not believe that the unstructured data in call center notes or on the Web shows negative sentiment about the phones -- that they're buying them only because their choices are limited. You need to be able to tell the story and make people understand the kind of analysis you can do with this new source of data. This can take time.
  4. Building the skills. The skills you’re going to need to analyze text will vary depending on the problem you’re trying to solve. Some people claim that you need to understand your industry. Others say being analytical is enough. If the goal is using a social media analytics tool to do some high-level analysis on brand reputation, you'd likely need only a small amount of training. But if you’re trying to combine structured and unstructured data to increase the lift of a predictive model, then you'll need deeper skills development. Regardless of the issue you’re looking to address, text analytics involves dealing with a new form of data and there is going to be a learning curve involved in knowing what to do and how to apply it to the business. You’ll also have to know how to ask the right kind of questions. This is a learning process.
  5. Taxonomy issues. A taxonomy is a method for organizing information, or sometimes categories, into hierarchical relationships. Because a taxonomy defines the relationships between the terms a company uses, it makes it easier to find and then analyze text. Some organizations hire people skilled in taxonomy development to build it. Some vendors provide out-of-the-box taxonomies for certain industries. Even so, you’re going to have to deal with the vagaries of the terminology in your industry, and there is going to be upfront work to specify this terminology. Many end users find the necessary taxonomy development, or the refining of categories (if that is the way you’re ultimately building a taxonomy), difficult. It can take more than one try. Companies need to plan for this.
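
To make the taxonomy idea in No. 5 concrete, here is a minimal sketch in Python. It is only an illustration: the category names, the terms, and the helper functions classify and unmatched are hypothetical, and commercial text analytics tools use far richer matching than simple keyword lookup. It assumes a two-level taxonomy (category, subcategory) for a wireless company's call center notes.

```python
import re

# Hypothetical two-level taxonomy: category -> subcategory -> terms.
# These names and terms are invented for illustration only.
TAXONOMY = {
    "Device": {
        "Battery": ["battery", "charge", "charging"],
        "Screen": ["screen", "display", "cracked"],
    },
    "Service": {
        "Coverage": ["signal", "coverage", "dropped call"],
        "Billing": ["bill", "invoice", "overcharged"],
    },
}


def classify(text, taxonomy=TAXONOMY):
    """Return the sorted (category, subcategory) paths whose terms occur in text."""
    normalized = text.lower()
    matches = set()
    for category, subcategories in taxonomy.items():
        for subcategory, terms in subcategories.items():
            for term in terms:
                # Whole-word match, so "bill" does not fire on "billboard".
                if re.search(r"\b" + re.escape(term) + r"\b", normalized):
                    matches.add((category, subcategory))
                    break  # one matching term is enough for this subcategory
    return sorted(matches)


def unmatched(notes, taxonomy=TAXONOMY):
    """Return the notes that hit no category at all."""
    return [note for note in notes if not classify(note, taxonomy)]


if __name__ == "__main__":
    notes = [
        "The battery dies fast and my signal is weak",
        "I was overcharged on my bill",
        "Phone reboots randomly",
    ]
    for note in notes:
        print(note, "->", classify(note))
    print("Needs taxonomy review:", unmatched(notes))
```

The unmatched helper reflects the point that taxonomy building can take more than one try: notes that fall outside every category are the natural input for the next round of refinement.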



References:
1. www.datamining.typepad.com
2. http://www.kdnuggets.com/
