Monday, April 8, 2013

Hotter than Hadoop: MongoDB




Someone may ask why I compare Hadoop and MongoDB. First, I have to say that Hadoop is neither a database nor a single program; it is a set of projects integrated to achieve a common goal. MongoDB is a scalable, high-performance, open-source NoSQL database that has been downloaded more than 4 million times so far. Hadoop cannot stand alone: it needs a database to manage the massive data that arrives every month, every day, even every second.

Here are some successful cases using MongoDB. Shutterfly, Inc. is a family of brands that offers personalized photo products and services. The business needed something more flexible than Oracle. By switching to MongoDB, Shutterfly reported a 500% cost reduction and a 900% performance improvement compared to Oracle. Wordnik.com is an online dictionary and language resource that provides dictionary and thesaurus content. Its problem was that MySQL could not scale to handle its 5 billion documents, so it switched to MongoDB. The results showed a 2000% performance improvement over MySQL.

The main advantage of using MongoDB is complexity reduction: it gets rid of schema migrations and rigid relationships, and it reduces the number of database requests. Not to mention it has a built-in map-reduce function! MongoDB is one of the best tools in the Big Data area.
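To make the "no migrations" point concrete, here is a minimal sketch (not from the original post) using the pymongo driver; it assumes a MongoDB instance running locally, and the database, collection, and field names are invented for illustration:

```python
# Minimal sketch: schema-less documents and a grouping query with pymongo.
# Assumes a local MongoDB instance; names below are made up for illustration.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
orders = client.demo_db.orders

# Documents in the same collection need not share a schema, so adding a new
# field does not require a migration.
orders.insert_one({"customer": "alice", "total": 25.0, "items": ["mug"]})
orders.insert_one({"customer": "bob", "total": 40.0, "coupon": "SPRING10"})

# A simple aggregation that groups orders by customer and sums totals;
# MongoDB also exposes a JavaScript mapReduce command for similar jobs.
pipeline = [{"$group": {"_id": "$customer", "spent": {"$sum": "$total"}}}]
for row in orders.aggregate(pipeline):
    print(row["_id"], row["spent"])
```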


References:

Big Data and Television.

How Nielsen Ratings Have Become Irrelevant

All your favorite shows are terrible performers in the ratings game. Mad Men, Girls, Breaking Bad, 30 Rock, Game of Thrones, Community, and Parks and Rec all perform relatively terribly when measured using the standard Nielsen ratings. You may ask why they are still on the air; the answer is that they are insanely popular. In fact, it is shows like these that illustrate a key flaw in the Nielsen rating system.

Nielsen ratings are generated by monitoring 25,000 households across the country with a combination of television monitoring devices and self-reported diaries. Nielsen then makes statistical inferences based on this sample of the population. The major flaw with this approach, which these shows illustrate, is the lack of consideration for DVRs and newer online streaming services like Hulu and Netflix. While Nielsen recently started offering ratings that include DVR viewers within three and seven days of the original air date, there is no real explanation of why those are the only timeframes offered or why they are not the default ratings used. If you include DVR viewers, Mad Men's viewership increases by 127% and Breaking Bad's by 130%.

Nielsen ratings also do not account for audience captivation or buzz generation, both of which are of major importance to TV executives and advertising agencies alike. Social media shows why: during peak usage, at least 40% of Twitter's traffic is about television. Startups see this gap in the market and are beginning to fill it. Two major players in this field are Trendrr and Bluefin Labs, both of which gather data from social networks and offer slightly different perspectives on it. Trendrr focuses on real-time analysis, while Bluefin tracks user engagement with ads. Both of these new rating systems point out another thing that Nielsen does not track: piracy.

Piracy may not be a good selling point to advertisers, but it should be to network executives. With more popular shows being illegally downloaded millions of times per week, there is a significant opportunity for executives to increase viewership by figuring out how to convince pirates to view these shows legally.

So what does this mean in regards to big data? What it really shows is an opportunity. Big data analysis can, and likely will, change the way that shows are judged. With Trendrr and Bluefin Labs leading one facet of the possible analysis, these changes have already begun. But there is a much bigger way ratings could be improved. Now, 25,000 families' viewing patterns may sound like a lot, but what if you could increase that number by orders of magnitude? Netflix has 29.4 million subscribers whose viewing patterns it is analyzing, and Hulu clocks in at around 38 million; suddenly 25,000 seems like child's play. And with taste becoming more and more fragmented and niche, how certain can you be that a sample of less than 0.03% of the population generates an accurate representation of the remaining 99.97%?

Sources:
http://www.wired.com/underwire/2013/03/nielsen-family-is-dead/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+wired%2Findex+%28Wired%3A+Top+Stories%29
http://en.wikipedia.org/wiki/Nielsen_ratings
http://www.statisticbrain.com/hulu-statistics/

http://en.wikipedia.org/wiki/Netflix

The Right to Know Act

California lawmaker introduces ‘Right to Know Act’ to give citizens access to their data
     Bonnie Lowenthal, a California State Assemblymember, has introduced the "Right to Know Act 2013", an act that would give consumers the right to request and obtain their personal data from businesses, and would also require businesses to tell consumers how, and with whom, they are sharing that data.
     According to the act, businesses would have to provide such personal information, as well as the names and contact information of the third parties with which they have shared it within the past 12 months, at no charge and within 30 days of the consumer's original request.
     Consumer privacy is a huge issue today: many consumers do not know what information is being tracked about them or how to try to stop the sharing of that data. This act could improve transparency by making businesses inform consumers how their personal information is being used to aid sales. However, even though the article states that the act is not meant to restrict data sales, there is a concern that the bill's effect would do exactly that: if consumers know which companies are using their data and where, the bill could enable them, through requests to those companies, to restrict the use and sale of their data. The article notes that companies use the "big data" of consumer information to "optimize their business strategies, create revenue streams, and attract advertisers"; chances are, therefore, that tech companies are not thinking too fondly of this new bill, since it could restrict their main source of revenue.
     Aside from the potential revenue hit companies may take, passage of the bill would help bring some peace of mind to the billions of people who generate personal data on the internet every day.

Look Who's Talking!!



About three years ago, Ford began offering a system to dealerships that reads the dealers' inventory, checks supplies of vehicles, and makes recommendations about which vehicle types and models to stock in order to maximize profits. Ford Motor Co. has achieved a great deal by mashing together large databases and analytical algorithms to create what is probably the next big thing.
Your car knows a lot about you, and it is talking!
Automakers are exploring ways to use information from cars on the road to improve the driving experience, car design, fuel efficiency, and financing, among many other things. Some of the key aspects of this development are discussed below.
Data collected from cars as they are driven in different topographical zones can be used to interpret road conditions and warn other drivers about potential hazards. We live in an era of instantaneous communication, so this is easy to implement.
Cars can also send their location to a traffic management system that changes signals in real time based on the current situation, rather than running the same routine on a loop all day. The timing could vary based on rush hour, accidents, an ambulance on the road, and so on.
Payment data collected by dealerships, banks, credit unions, and others can be linked to current car values, allowing dealers to reach out to owners at the best time for them to trade in their vehicles. The car could even display on the dash the current going rates and the best available deals on future models.
Data on driving patterns, which the car produces after a considerable amount of driving, is used to help car companies make better decisions about battery sizes and the handover between electric and internal-combustion driving in hybrids. For example, the current Ford Taurus hybrid has a leaf on the dashboard that changes color (green and red, ironically) to show how environmentally friendly the driver's driving is. This data can also be used to detect charging patterns for electric cars and help automate charging cycles to ensure customers get the best energy rates.
A lot is already being done by automakers, and all of this requires tremendous knowledge of data mining; key capabilities such as cloud storage, data cleaning, and visualization techniques are constantly being applied in order to fully interpret the data gathered from millions of car users across the globe. This can not only help automakers operate profitably but also make the planet safer and push motoring in a greener direction.
References: The Wall Street Journal, Big Data edition

Sunday, April 7, 2013

A confusion: Knowledge Discovery or Data Mining?

Although many practical methods have been developed and used in data mining, the distinction between the concepts of data mining and knowledge discovery is not yet clear.

The most useful starting point for resolving this confusion is to summarize the basic concepts of data mining and knowledge discovery.

Knowledge discovery is a non-trivial process for identifying valid, new, potentially useful, and ultimately understandable patterns in data. It consists of nine steps, and data mining is the seventh of those steps.

The above-mentioned nine steps are as follows (a short code sketch after the list illustrates several of them):

1. Development and understanding of the application domain

2. Creating a target data set: selecting the data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing: transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking certain behaviors or trends, and is likely to contain many errors.
4. Data reduction and projection: finding useful features to represent the data depending on the purpose of the task. Through dimensionality reduction or transformation methods, the effective number of variables under consideration may be reduced, or invariant representations of the data may be found.
5. Matching the goals of the KDD process to a particular data mining method: for example, summarization, classification, regression, clustering, or others.
6. Modeling, exploratory analysis, and hypothesis selection: choosing the data mining algorithm(s) and selecting the method or methods to be used in searching for patterns in the data. This includes deciding which models and parameters may be appropriate.
7. Data mining: the search for patterns of interest in a particular representational form, or a set of such representations, including classification rules or trees, regression, and clustering. The user can significantly aid the data mining method by properly carrying out the preceding steps.
8. Interpreting the mined patterns, possibly returning to some of steps 1 through 7 for additional iterations. This step may also involve visualization of the extracted patterns and models, or visualization of the data given the extracted models.
9. Acting on the discovered knowledge: using the knowledge directly, incorporating it into another system for further action, or simply documenting it and reporting it to stakeholders.
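As a rough illustration only (none of this code comes from the sources below), the toy sketch here walks through steps 3, 4, 7, and 8 with scikit-learn, using synthetic data and arbitrary parameter choices; clustering stands in for the mining method chosen in steps 5 and 6:

```python
# Illustrative sketch of a tiny KDD pass: preprocessing, reduction,
# mining by clustering, and a trivial interpretation step.
import numpy as np
from sklearn.preprocessing import StandardScaler      # step 3: cleaning/preprocessing
from sklearn.decomposition import PCA                 # step 4: data reduction and projection
from sklearn.cluster import KMeans                    # step 7: data mining (clustering chosen in steps 5-6)

rng = np.random.default_rng(0)
raw = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(5, 1, (100, 5))])

X = StandardScaler().fit_transform(raw)               # scale the raw variables
X_reduced = PCA(n_components=2).fit_transform(X)      # project to fewer variables

model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X_reduced)                 # search for patterns

# step 8: interpret the mined pattern, e.g. inspect the cluster sizes
print(np.bincount(labels))
```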

http://books.google.com/books?id=alHIsT6LBl0C&pg=PA1161&lpg=PA1161&dq=what+is+the+difference+between+data+mining+and+knowledge+discovery&source=bl&ots=pqHBwbAOjv&sig=RkfNlkC8sqoJDfjoFOo-SfdG_kE&hl=en&sa=X&ei=EDRiUZvdBoi88AT05oH4CQ&ved=0CGAQ6AEwBQ#v=onepage&q=what%20is%20the%20difference%20between%20data%20mining%20and%20knowledge%20discovery&f=false

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1547224

http://smartdatacollective.com/josueoteiza/38043/difference-between-knowledge-discovery-and-data-mining




A Web-based Movie Forecast Tool by Using Data Mining Techniques

While we are discussing data mining concepts, I wanted to share an interesting web-based DSS tool that was built to help Hollywood film makers make better decisions about important movie characteristics before they spend a lot of money.

As we might guess, prediction of financial success of a movie is arguably the most important piece of information needed by decision makers in the motion picture industry.


Any information about the critical factors affecting a movie's success would be of great use in planning future films and related decisions in an industry where the results of managerial decisions are measured in millions of dollars.

In 2007, three OSU professors, recognizing the need for a tool that would help movie industry decision makers, came up with a web-based tool that predicts roughly how well a film will do in the marketplace before it reaches the box office. The tool is designed and developed to take full advantage of the latest Internet and DSS technologies.



They built four types of models, namely neural networks, decision trees, ordinal logistic regression, and discriminant analysis. They also used an information fusion meta-model that combines the outputs of these individual models.

With box-office gross revenue as the dependent variable, they included several potentially important factors (independent variables) in the models.
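As a hedged sketch of the general idea, not the authors' actual tool, the example below trains four model families similar to those named above and combines them with a simple majority vote standing in for the information-fusion meta-model; the features, data, and revenue classes are invented for illustration:

```python
# Toy ensemble sketch: four model families fused by majority voting.
# All data, feature names, and class labels here are hypothetical.
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
# Hypothetical pre-release features: budget, star power, screen count, sequel flag
X = rng.random((300, 4))
# Hypothetical revenue classes: 0 = "flop", 1 = "average", 2 = "blockbuster"
y = rng.integers(0, 3, 300)

fusion = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=4)),
        ("logit", LogisticRegression(max_iter=1000)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000)),
        ("lda", LinearDiscriminantAnalysis()),
    ],
    voting="hard",  # majority vote stands in for the fusion meta-model
)
fusion.fit(X, y)
print(fusion.predict(X[:5]))  # predicted revenue class for the first few films
```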

The results that the tool provides are fairly promising for Hollywood film makers.

It might also be a strong source of inspiration for people working in other industries where many predictions have to be made.



http://www.sciencedirect.com/science/article/pii/S0167923605001053


Verizon Precision Market Insights


I have talked before about data mining for marketing; one of the main problems is getting the data. Now Verizon offers merchants a useful service called Precision Market Insights.




Today, people use their cell phones to do more and more things, for example booking airline tickets, sharing photos, and browsing the web. All of these activities can reveal their interest in certain products or services.

The service collects users' data and analyzes it; after analysis, the results are sold to merchants. The data includes:

- User locations, including tracking of a user's movements over time.
- Age, marital status, and other demographic information.
- Mobile usage, including websites visited, downloads, and other activity.
- Communication information, such as feedback and other responses to a merchant.

This service generally targets: 

- Media Owners
- Advertisers
- Venue Owners
- Sponsors
- Retailers

All of these businesses need information about consumers. By using this service, they can track their customers, gather feedback, and predict customer behavior.

Customers' main concern is privacy: almost every aspect of their lives is exposed to the merchants. To address this, Verizon has published a statement and gives customers a choice. The program is not mandatory; users can choose to participate, which means agreeing to let Verizon provide their personal information to others, and they can quit at any time. Some say this does not break any law, while others think it is illegal.

In my opinion, if users' information is being used to make money, the users who choose to participate in this program should be given some discounts.

Ref
http://business.verizonwireless.com/content/b2b/en/precision/our-measurement-solutions.html
http://business.verizonwireless.com/content/b2b/en/precision/precision-market-insights.html#
http://www.fiercemobilecontent.com/story/verizon-app-usage-monitoring-raises-consumer-privacy-fears/2012-10-16

Earthquake prediction by data mining and visualization.



New techniques based on cluster analysis of the multi-resolution structure of earthquake patterns have been developed and applied to observed and synthetic seismic catalogs. The synthetic data were generated by numerical simulations for various cases. At the highest resolution, the local cluster structure in the data space of seismic events is analyzed for the two types of catalogs using an agglomerative clustering algorithm. Seismic events, quantized in space and time, generate a multi-dimensional feature space of earthquake parameters. Using a non-hierarchical clustering algorithm and multi-dimensional scaling, the multitude of earthquakes is explored through real-time 3D visualization and inspection of multivariate clusters. At resolutions characteristic of the earthquake parameters, all of the ongoing seismic activity before and after the largest events accumulates into a global structure consisting of a few separate clusters in the feature space. By combining the clustering results from low- and high-resolution spaces, we can recognize precursory events more precisely and decode vital information that cannot be discerned at a single level of resolution.
Understanding earthquake dynamics and developing forecasting algorithms require knowledge and skill in both measurement and analysis covering various types of data, such as seismic, electromagnetic, gravitational, geodetic, and geochemical. The Gutenberg-Richter power-law distribution of earthquake sizes implies that the largest events are surrounded (in space and time) by a large number of small events. The multi-dimensional and multi-resolution structure of this global cluster depends strongly on geological and geophysical conditions. Past seismic activity is closely associated with other events (e.g., volcanic eruptions), and the time sequence of earthquakes forms isolated events, patches, swarms, etc. Investigations of earthquake prediction are based on the assumption that all of the regional factors can be filtered out and general information about earthquake precursory patterns can be extracted. This extraction is usually performed using classical statistical or pattern recognition methodology. Feature extraction involves a pre-selection of various statistical properties of the data and the generation of a set of seismic parameters, which correspond to linearly independent coordinates in the feature space. The seismic parameters, in the form of time series, can be analyzed using various pattern recognition techniques ranging from fuzzy set theory and expert systems to multi-dimensional wavelets and neural networks. The prediction of earthquakes is a very difficult and challenging task, and we cannot operate at only one level of resolution. Coarse graining of the original data can destroy the local dependences between the events and the isolated earthquakes by neglecting their spatial localization. In this manner, the subtle correlations between earthquakes and the preceding patches of events can be dissolved in the background of uncorrelated, noisy data.
We can extract local spatio-temporal clusters of low-magnitude events and identify correlations between these clusters and the earthquakes. The clusters can clearly reflect short-term trends in seismic activity followed by isolated large events. However, local clustering of seismic events is not able to extract an overall picture of the precursory patterns. Data mining techniques therefore include not only various clustering algorithms but also feature extraction and visualization techniques. Multi-dimensional scaling procedures are used to visualize multi-dimensional events in 3D space. This visual analysis helps greatly in detecting subtle structures that escape classical clustering techniques.
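A minimal sketch of that combination is below. It is not the authors' pipeline; it assumes a synthetic catalog with latitude, longitude, depth, time, and magnitude features and uses arbitrary parameter choices, pairing agglomerative clustering with multi-dimensional scaling for a low-dimensional view:

```python
# Toy sketch: cluster a synthetic seismic catalog and project it to 3D with MDS.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from sklearn.manifold import MDS

rng = np.random.default_rng(42)
# Columns (all synthetic): lat, lon, depth (km), time (days), magnitude
catalog = rng.random((200, 5)) * [2.0, 2.0, 30.0, 365.0, 6.0]

X = StandardScaler().fit_transform(catalog)

# High-resolution view: agglomerative clustering of individual events
labels = AgglomerativeClustering(n_clusters=5).fit_predict(X)

# Project the multi-dimensional feature space into 3D for visual inspection
embedding = MDS(n_components=3, random_state=0).fit_transform(X)
print(embedding.shape, np.bincount(labels))
```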

Earth and Planetary Sci. Letters, August, 2003 - Earthquakes over Space, Time and Feature Space  by Cluster Analysis, Data-Mining and Multi dimensional Visualization.

Using Bloom Filters to Lower Cost of Large Join Jobs



Data management company LiveRamp (http://liveramp.com/) recently began open-sourcing some of its internal data analysis and management tools. In the process it released a new tool for reducing the cost of MapReduce join jobs, BloomJoin. BloomJoin is useful when you are joining two datasets where one is very large, the other is significantly smaller, and only a small proportion of the larger set's records actually match. To complete such a job normally, a user would first sort both sets of data and then reduce them together. This works fairly well but is inefficient with regard to sorting the larger dataset. To alleviate this, BloomJoin first applies a Bloom filter built from the smaller, target dataset to the larger dataset. A Bloom filter is a probabilistic representation of a set: given a target set of objects, it rejects objects that are definitely not in the target. Bloom filters hash the dataset's keys and record the resulting positions in a bit array. A Bloom filter never produces false negatives, but it can produce false positives, which is why it does not completely eliminate the need to sort and join the datasets. By applying a Bloom filter to the data before sorting, the amount of data that must be sorted can be lowered dramatically. On LiveRamp's test dataset, the BloomJoin job used only 49.3% of the CPU time of a standard CoGroup job.
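To illustrate the mechanics (this is a toy sketch, not LiveRamp's BloomJoin code), the Python below builds a small Bloom filter from a target key set and uses it to discard non-matching keys before the expensive sort/join:

```python
# Toy Bloom filter: a small bit array plus a few salted hashes that can reject
# keys definitely not in the target set, with occasional false positives.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key):
        # Derive several hash positions by salting the key
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "probably present"
        return all(self.bits[pos] for pos in self._positions(key))

# Filter the large dataset down to candidates before the expensive join
target_keys = {"user42", "user77"}
bf = BloomFilter()
for k in target_keys:
    bf.add(k)

big_dataset = [f"user{i}" for i in range(1000)]
candidates = [k for k in big_dataset if bf.might_contain(k)]
print(len(candidates))  # far fewer rows now need the full sort/join
```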
This is significant because it allows larger join tasks to be run at reduced cost in both time and dollars, and it is available to the open source community to improve upon further.

Sources: