Tuesday, April 9, 2013

Mending Our Streets with Big Data




            While researching articles on the use of big data in vehicles, I stumbled across a very interesting Wall Street Journal article about using big data to help control traffic in Woodbridge, N.J., and Boston, MA. Both cities use some ingenious methods to gather data and interpret it into something meaningful, resulting in substantial savings for the local municipalities.
                 The article discusses Woodbridge first. New Jersey put big data to work by hiring a startup, Inrix Inc. of Washington state, to try to solve the traffic problems on the state's 6,800 miles of roadways. Inrix's idea was to gather cell phone and GPS signals to determine traffic speed, weather, and events, then use that data to populate a map of New Jersey's roadways showing traffic flow in nearly real time. The heart of the system is housed in Woodbridge and is monitored on a 22-foot-tall screen. The article describes an event that shows the system's potential: operators saw one of the major thoroughfares turn red, indicating stopped traffic, quickly realized it was due to an accident, and sent crews out to resolve the problem. The road was only disrupted for 30 minutes versus the hours it would have taken before.
            In Boston the battle has been against potholes. Detecting them takes a huge amount of time and costs in excess of $200,000 a year. The city's idea was to create an app that could register where potholes were and relay that information back to the department of transportation. The app uses the accelerometer built into smartphones to determine when a vehicle hits a pothole, then stores the location. It will replace an extremely outdated procedure of dragging chains over all the roads in Boston and measuring the vibrations. The app is reported to cost $80,000, replacing an outdated procedure that costs $200,000 a year.
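The article doesn't spell out the app's detection logic, but the general technique, flagging a candidate pothole whenever the phone's vertical acceleration spikes past a threshold and recording the location, can be sketched in a few lines. This is a minimal illustration with an assumed threshold and data format, not Boston's actual algorithm:

```python
# Hypothetical sketch of accelerometer-based pothole detection.
# The 2.5 g threshold and the reading format are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Reading:
    vertical_g: float   # vertical acceleration in g's
    lat: float          # GPS latitude at the time of the reading
    lon: float          # GPS longitude at the time of the reading

SPIKE_THRESHOLD_G = 2.5  # assumed: a jolt this strong suggests a pothole

def detect_potholes(readings):
    """Return the GPS locations of readings whose vertical jolt
    exceeds the threshold, as candidate pothole reports."""
    return [(r.lat, r.lon) for r in readings if abs(r.vertical_g) > SPIKE_THRESHOLD_G]

if __name__ == "__main__":
    samples = [
        Reading(1.0, 42.3601, -71.0589),   # normal driving
        Reading(3.2, 42.3610, -71.0570),   # hard jolt -> candidate pothole
    ]
    print(detect_potholes(samples))        # [(42.361, -71.057)]
```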
            In my opinion these are two ingenious examples of solving very complex problems, and I believe big data analytics can enable even more progress in our nation's infrastructure in the future. Local municipalities should continue to allow contractors and government employees to think outside the box to create more efficient and cost-effective ways to fix complex infrastructure problems.
For the complete article please follow the link below.
Sources:

Big Data and Management




                With the explosion of Big Data, many managers are wondering how they can get in on a piece of the action. The article says Big Data is important to management because it lets managers measure business performance and make adjustments: you have to be able to measure before you can fix.
                Booksellers in physical stores could always track which books sold and which did not. With online sales, however, booksellers can get a much better understanding of their customers and purchasing habits. They can track not only what customers bought, but what they looked at, how much they were influenced by page layouts and promotions, and how they were swayed by reviews. Traditional retailers that could not keep up with these new analytics were put out of business.
                This new online retail standard has set the bar high for new online companies. People expect new digital companies to accomplish things that were unheard of a decade ago. The article says these powerful tools can be used for management decisions: "We can make better predictions and smarter decisions. We can target more-effective interventions, and can do so in areas that, so far, have been dominated by gut and intuition rather than by data and rigor."
                The article separates Big Data from traditional analytics by highlighting two key differences between them: volume and velocity.

Volume: As of 2012, about 2.5 exabytes of data are created each day, and that number is doubling every 40 months or so. More data cross the internet every second than were stored in the entire internet just 20 years ago. This gives companies an opportunity to work with many petabytes of data in a single data set, and not just from the internet. For instance, it is estimated that Walmart collects more than 2.5 petabytes of data every hour from its customer transactions. A petabyte is one quadrillion bytes, or the equivalent of about 20 million filing cabinets' worth of text. An exabyte is 1,000 times that amount, or one billion gigabytes.
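As a quick sanity check on those units, here is the same arithmetic restated in code (nothing below is new data; the figures come straight from the passage above):

```python
# Quick sanity check of the storage units quoted above.
BYTES_PER_PB = 10**15          # a petabyte is one quadrillion bytes
BYTES_PER_EB = 10**18          # an exabyte is 1,000 petabytes (one billion GB)

walmart_pb_per_hour = 2.5      # Walmart figure quoted in the article
daily_eb_created = 2.5         # ~2.5 exabytes created per day (as of 2012)

print(BYTES_PER_EB / BYTES_PER_PB)   # 1000.0 petabytes per exabyte
print(BYTES_PER_EB / 10**9)          # 1e+09, i.e. one billion gigabytes
print(walmart_pb_per_hour * 24)      # ~60 PB of transaction data per day
```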

Velocity: For many applications, the speed of data creation is even more important than the volume. Real-time or nearly real-time information makes it possible for a company to be much more agile than its competitors. For instance, our colleague Alex "Sandy" Pentland and his group at the MIT Media Lab used location data from mobile phones to infer how many people were in Macy's parking lots on Black Friday, the start of the Christmas shopping season in the United States. This made it possible to estimate the retailer's sales on that critical day even before Macy's itself had recorded those sales. Rapid insights like that can provide an obvious competitive advantage to Wall Street analysts and Main Street managers.
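The article doesn't say how the MIT group processed the location data, but the underlying idea, counting distinct phones whose coordinates fall inside a store's parking lot, can be sketched roughly as follows. The bounding box and pings are made-up illustrations, not the Media Lab's actual method:

```python
# Illustrative sketch: estimate parking-lot occupancy from phone location pings.
# The lot boundaries and ping data are hypothetical.

LOT = {"lat_min": 40.7500, "lat_max": 40.7512,    # assumed lot boundaries
       "lon_min": -73.9890, "lon_max": -73.9878}

def phones_in_lot(pings):
    """Count distinct device IDs whose (lat, lon) falls inside the lot."""
    seen = set()
    for device_id, lat, lon in pings:
        if (LOT["lat_min"] <= lat <= LOT["lat_max"]
                and LOT["lon_min"] <= lon <= LOT["lon_max"]):
            seen.add(device_id)
    return len(seen)

pings = [
    ("a1", 40.7505, -73.9885),   # inside the lot
    ("a1", 40.7506, -73.9884),   # same phone again -> counted once
    ("b2", 40.7400, -73.9900),   # elsewhere in the city
]
print(phones_in_lot(pings))      # 1
```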



The Hidden Biases in Big Data

Data and data sets are not objective; they are creations of human design. We give numbers their voice, draw inferences from them, and define their meaning through our interpretations. Hidden biases in both the collection and analysis stages present considerable risks, and are as important to the big-data equation as the numbers themselves.
For example, consider the Twitter data generated by Hurricane Sandy, more than 20 million tweets between October 27 and November 1. A fascinating study combining Sandy-related Twitter and Foursquare data produced some expected findings (grocery shopping peaks the night before the storm) and some surprising ones (nightlife picked up the day after — presumably when cabin fever strikes). But these data don't represent the whole picture. The greatest number of tweets about Sandy came from Manhattan. This makes sense given the city's high level of smartphone ownership and Twitter use, but it creates the illusion that Manhattan was the hub of the disaster. Very few messages originated from more severely affected locations, such as Breezy Point, Coney Island and Rockaway. As extended power blackouts drained batteries and limited cellular access, even fewer tweets came from the worst hit areas. In fact, there was much more going on outside the privileged, urban experience of Sandy that Twitter data failed to convey, especially in aggregate. We can think of this as a "signal problem": Data are assumed to accurately reflect the social world, but there are significant gaps, with little or no signal coming from particular communities.
While massive datasets may feel very abstract, they are intricately linked to physical place and human culture. And places, like people, have their own individual character and grain. For example, Boston has a problem with potholes, patching approximately 20,000 every year. To help allocate its resources efficiently, the City of Boston released the excellent StreetBump smartphone app, which draws on accelerometer and GPS data to help passively detect potholes, instantly reporting them to the city. While certainly a clever approach, StreetBump has a signal problem. People in lower income groups in the US are less likely to have smartphones, and this is particularly true of older residents, where smartphone penetration can be as low as 16%. For cities like Boston, this means that smartphone data sets are missing inputs from significant parts of the population, often those who have the fewest resources.
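One common way to cope with this kind of signal problem is to reweight raw counts by each group's likelihood of generating data at all. Here is a toy sketch with entirely invented neighborhood numbers (only the 16% penetration figure comes from the article) showing how raw report counts and penetration-adjusted counts can tell different stories:

```python
# Toy illustration of the "signal problem": raw report counts vs. counts
# adjusted for smartphone penetration. All numbers are invented.

neighborhoods = {
    # name: (raw pothole reports, estimated smartphone penetration)
    "Downtown":     (120, 0.80),
    "Outer suburb": ( 25, 0.16),   # low penetration, per the article's 16% figure
}

for name, (reports, penetration) in neighborhoods.items():
    adjusted = reports / penetration   # estimate if everyone carried a smartphone
    print(f"{name}: raw={reports}, penetration-adjusted={adjusted:.0f}")

# Downtown: raw=120, penetration-adjusted=150
# Outer suburb: raw=25, penetration-adjusted=156  -> similar need, weak signal
```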
Fortunately, Boston's Office of New Urban Mechanics is aware of this problem and works with a range of academics to take into account issues of equitable access and digital divides. But as we increasingly rely on big data's numbers to speak for themselves, we risk misunderstanding the results and in turn misallocating important public resources. This could well have been the case had public health officials relied exclusively on Google Flu Trends, which mistakenly estimated that peak flu levels reached 11% of the US public this flu season, almost double the CDC's estimate of about 6%. While Google will not comment on the reason for the overestimation, it seems likely that it was caused by the extensive media coverage of the flu season, creating a spike in search queries. Similarly, we can imagine the substantial problems if FEMA had relied solely upon tweets about Sandy to allocate disaster relief aid.
This points to the next frontier: how to address these weaknesses in big data science. In the near term, data scientists should take a page from social scientists, who have a long history of asking where the data they're working with comes from, what methods were used to gather and analyze it, and what cognitive biases they might bring to its interpretation.
We get a much richer sense of the world when we ask people the why and the how, not just the "how many". This goes beyond merely conducting focus groups to confirm what you already want to see in a big data set. It means complementing data sources with rigorous qualitative research. Social science methodologies may make the challenge of understanding big data more complex, but they also bring context-awareness to our research to address serious signal problems. Then we can move from a focus on merely "big" data towards something more three-dimensional: data with depth.

Source: http://blogs.hbr.org/cs/2013/04/the_hidden_biases_in_big_data.html

Do You Wanna Save Some Moneys??




Do you spend way too much money on your car insurance? Do you want to save some of that hard-earned money for a koala bear, or for a visit to the salon to cut off that rat tail you've had since the 3rd grade? If so, you are in luck: big data is here to help. Progressive and Allstate are using big data analysis to lower car insurance premiums.
Progressive's Snapshot and Allstate's Drive Wise are the latest attempts to harness the power big data holds. Both devices plug into your vehicle's OBD-II port and record data directly from your vehicle's on-board computer. This data is then transmitted wirelessly via cell phone towers to the company's data collection center. The devices can store many different parameters, allowing the insurance company to create a vivid digital representation of your driving style. They tend to record the frequency of hard braking, your speed, when you drive, and what you drive. Progressive says Snapshot does not record your position or whether you are speeding. But since the data is sent to Progressive via cell phone towers, they likely know your general area at a bare minimum.
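Neither company publishes its exact algorithm, but a hard-braking event is generally a rapid drop in speed over a short interval. A minimal sketch under that assumption (the 7 mph-per-second threshold and one-second sampling are illustrative, not Progressive's or Allstate's actual criteria):

```python
# Illustrative sketch of counting hard-braking events from OBD-II speed samples.
# The threshold is an assumption; insurers don't publish their exact criteria.

HARD_BRAKE_MPH_PER_SEC = 7.0   # assumed deceleration threshold

def count_hard_brakes(speeds_mph, sample_interval_sec=1.0):
    """Count events where speed drops faster than the threshold
    between consecutive samples."""
    events = 0
    for prev, curr in zip(speeds_mph, speeds_mph[1:]):
        decel = (prev - curr) / sample_interval_sec
        if decel > HARD_BRAKE_MPH_PER_SEC:
            events += 1
    return events

trip = [35, 34, 33, 25, 24, 23]    # one 8 mph drop in a single second
print(count_hard_brakes(trip))      # 1
```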
There are some limitations to this product, though, that potential customers should be aware of. These devices reward drivers who spend little time on the road between midnight and 4 a.m., and drivers who mostly stay off freeways and highways, since surface streets have much lower speed limits. If you have a long commute to work every day during rush hour, this product is most likely not for you: you get a less desirable rating the harder and more frequently you brake, the faster you drive, and depending on when you drive. My suggestion is that if you fall into that category, give up on Snapshot. If you drive seldom, or just short distances during non-peak hours, then this device is a perfect match for you and could save you some money for that well-deserved haircut. One thing that must also be considered is what they are going to do with your digital personality.
Will they sell your data to other companies? Will they keep this data forever? Here is what Progressive has to say about it: "We won't share Snapshot information unless it's required to service your insurance policy, prevent fraud, perform research or comply with the law. We also won't use Snapshot information to resolve a claim unless you or the registered vehicle owner permits us to do so."(1) They also say this about how long they will keep your data: "It varies depending on where you live. At minimum, we'll follow the rules established by your state's department of insurance."(1) I feel the first response is valid and not too troubling, but I am not as comfortable with the second answer. I would feel more comfortable if they were more honest and upfront about their intentions for all of this very personal data. Whether they plan on keeping it for the life of your plan, 10 years, or indefinitely (most likely, I would assume), it should be mandatory for them to explicitly tell their customers their intentions so the customers can decide for themselves.
Big data in the car insurance arena is here to stay. For that matter, it is here to stay across the entire insurance market, not just the auto side of the business. Potential customers should be wary of how their data is shared and which driving habits they would need to maintain in their daily routine in order to save money. But people who fit into the categories being monitored are very likely to see a reduction in the monthly premiums they pay.

Sources:
1. http://www.progressive.com/auto/snapshot-common-questions/
2. http://www.allstate.com/drive-wise.aspx
3. http://www.jeffkramer.com/2012/01/31/the-quantified-car-progressive-snapshot/
 

Using BigML to Decide Credit Card Approval


                With the current economic situation, credit card companies are using big data analytics now more than ever. If you have ever wondered how you can be approved or denied for a credit card within a few minutes, if not seconds, based on just a few numbers you input, then this blog is for you. Credit card companies have collected years of data and have set up predictive metrics to figure out whether a person applying for a credit card should be approved. I found a great program for this type of analysis called BigML. It is another free program that is incredibly powerful.

                In order to use BigML you must first create a username and password. You will receive an email with a confirmation link once you have input your information; click on the link and you are ready to start working with this amazing product. BigML has a few datasets already listed for you to learn with, and one of those is the "Credit Application" dataset. I used that data to set up diagrams and a prediction sheet.

                Once you have input the data, your screen will change to a breakdown of the different categories within the provided dataset, as shown in Figure 1 below. From that screen, you can click on "view dataset" and it will change to show miniature graphs of each category on the right side of the screen, as shown in Figure 2 below.
             
Figure 1: Data Category Breakdown
Figure 2: Display of Miniature Category Graphs
                Once you are done with that feature, you can click on the Models tab at the top. This takes you to a new screen where you click on the model you wish to look at. In my opinion this is the best part of the program, because it gives you an overall view of how each path can be traced to see which applicants would be approved or denied. This could be useful to anyone thinking about getting a credit card: they could look at the branch they fit into and get a good idea of whether they would be approved. Figure 3 below shows a good path with 64.61% confidence based on the answers provided on the right side of the screen, and Figure 4 shows a bad applicant with 52.30% confidence based on the information on the right of the screen.
 

Figure 3: Good Applicant 
 
Figure 4: Bad Applicant 
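BigML builds these models as decision trees. For readers who would rather see the idea in code than through BigML's interface, here is a minimal stand-in using scikit-learn's decision tree on invented applicant features; this is not BigML's API or the real credit dataset, just the same technique in miniature:

```python
# Minimal stand-in for BigML's decision-tree workflow, using scikit-learn.
# Features and labels are invented for illustration; not the real dataset.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [income ($k), years employed, prior defaults]
X = [[55, 4, 0], [22, 1, 2], [78, 9, 0], [30, 2, 1], [95, 12, 0], [18, 0, 3]]
y = [1, 0, 1, 0, 1, 0]   # 1 = approved, 0 = denied

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Trace the learned branches, much like BigML's model view
print(export_text(model, feature_names=["income", "years_employed", "defaults"]))

# Predict for a new applicant, like BigML's prediction screen
print(model.predict([[40, 3, 0]]))         # [1] -> approved on this toy data
print(model.predict_proba([[40, 3, 0]]))   # confidence, analogous to the 64.61%
```

Here `export_text` prints the branches an applicant can be traced down, much like the model view in Figures 3 and 4, and `predict_proba` plays the role of the confidence percentages.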
 
                Once you are done looking at that portion of the program, you can click on the Predictions section. This section is probably the most useful for the credit card company: you input the applicant's information and it tells you whether the applicant is a good or bad candidate for credit approval, allowing quick decisions on retail credit cards. Figure 5 below shows the prediction screen and some of the input sections.

                

Figure 5: Prediction Screen
 
If you are interested in trying out this program please go to:


How Big Data Improves Marketing


Predictive analytics research and 'Big Data' are helping companies improve their marketing: by accessing multiple sources of data and employing specialist staff to 'understand the data', they produce actionable information that marketers and decision-makers can use to improve performance.
The 'Big Data for Marketing' research by IBM and the Aberdeen Group set out to identify the 'Best in Class' practices of companies that provide accurate information efficiently to support key decision-makers and their planning processes.
The Infographic below summarises the top strategies, results and capabilities:
[Infographic: Big Data survey by IBM and the Aberdeen Group]
The infographic highlights that big data has helped companies become more efficient and improve their decision-making and overall performance through actionable, high-quality data analysis. The detailed report draws out the technologies and business capabilities used by the 'Best in Class' companies.
The analyst, Rowe, says that 'customer analytics has become the driving force behind big data developments,' so it is little surprise that many of the 'best-in-class' enterprises in the report came from the fields of retail and telecommunications. Rowe also said specific use cases are popping up as having big, quick returns, like fraud detection in financial services or the analysis of sensor data in utilities.
They surveyed 125 organisations across the globe, with an on-line survey to identify:
·         Data collection sources
·         Efficiency to manipulate and analyse data
·         Accuracy and quality of data
·         Tools and resources to support this.
Some of the findings concluded:
·         93% said they could rely on their data quality
·         35% realised a year-on-year increase in 'accessible data'
·         successful companies up-skilled staff in analytics, with training and external analysts to support them
·         those 'Best in Class' were 5x more likely to have a 'Data Scientist'
·         companies experienced improved data agility through analytical exploration and drill-down tools.
Reference:




Big Data - A new weapon for the ATF


While some law enforcement agencies are seeking technology to find patterns in existing data, others are clamoring for access to even more data on the web. The Bureau of Alcohol, Tobacco, Firearms and Explosives (ATF) has faced mounting criticism for using what many consider to be antiquated technology; critics have called the agency's technology extremely outdated and the agency itself a national embarrassment. The agency is prohibited from creating a federal registry of gun transactions. Instead, when federal agents want to trace the source of gun sales, they must search records on microfilm, a process that can take as long as five days.
The ATF is now seeking proposals for "a massive online data repository system" that could allow their agents to make faster connections between suspects' names, social security numbers, telephone numbers and utility bills, according to a request issued last month. The new database would not be used to analyze gun purchases, but instead would be used to gather publicly-available data without requiring agents to go to multiple sources.
The ATF's current data is analyzed largely by hand, resulting in long turnaround times on important information and on intelligence research and analysis requests. Mark Tanner, president of Law Enforcement and Intelligence consulting, noted that it is difficult for law enforcement to use the information effectively because it is not connected in a single database. Computing power would dramatically reduce the amount of time it takes federal agents to link pieces of information on suspects.
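The request doesn't describe the repository's internal design, but the core technique it implies, indexing records from separate sources by a shared identifier such as a phone number so one query pulls everything together, can be sketched simply. The record layout and data below are invented for illustration:

```python
# Toy sketch of linking suspect records across separate data sources by
# shared identifiers. Records and layout are invented for illustration.

from collections import defaultdict

sources = {
    "gun_trace":  [{"name": "J. Doe",   "phone": "555-0101"}],
    "utility":    [{"name": "John Doe", "phone": "555-0101", "addr": "12 Elm St"}],
    "phone_recs": [{"phone": "555-0101", "calls": 42}],
}

# Index every record by its phone number so one query reaches all sources
index = defaultdict(list)
for source_name, records in sources.items():
    for rec in records:
        index[rec["phone"]].append((source_name, rec))

# One lookup replaces searching each source by hand
for source_name, rec in index["555-0101"]:
    print(source_name, rec)
```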
The blog posts "'Minority Report' - Not that distant future?" by Jason Buckner and "IBM's tool to tackle crime" by Anto Jeson Raj throw light on how police departments across the country are using data analytics to predict where a crime is likely to occur and to deploy resources to those areas. The FBI is creating a database that will connect suspects to crimes using not just fingerprints, but also palm prints, iris scans and images of faces.

Monday, April 8, 2013

BIG Data Problems in the US Armed Forces


In terms of the military, most people only associate big data with upper-level intelligence, battalion level and higher. This is, or at least is quickly becoming, a misconception. One obstacle to achieving intelligence flow down to the platoon and even squad level is current technology: the technology currently in use on the ground in the United States Army is surprisingly inefficient. The article mentions how soldiers come back from deployment to cellular devices with more advanced communications and global positioning technology than anything they used during their 10-to-14-month deployments. Current testing programs are trying to remedy this, but much-improved technology will most likely not be available on a large scale for the foreseeable future. One proposal is the "rapid adoption of commercial wireless networks."

The DoD has been so intently focused on methods to collect data, through programs such as Gorgon Stare, Blue Devil 2 and ARGUS, that a key aspect of making this data useful has been overlooked. These programs produce data at sometimes a petabyte a day, far beyond the amount that can be analyzed; currently only about a third of the data is analyzed. While we won't be able to move that kind of data in the foreseeable future, we are creating technology that will pick out the more useful information.

Three different approaches are considered (a rough sketch of the first follows the list):
·         Increase on-board processing of data
·         Integrate analytics into data storage
·         Automated tiering of data storage
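
As a rough illustration of the first approach, on-board processing means discarding uninteresting sensor data before it is ever transmitted, so analysts downstream see a fraction of the raw volume. The "interest" test below is an invented stand-in for whatever real change-detection a sensor platform would use:

```python
# Toy sketch of the first approach: on-board filtering so only "interesting"
# sensor frames are transmitted. The interest test is an invented stand-in.

def frame_is_interesting(frame, previous_frame, threshold=0.1):
    """Assumed heuristic: keep a frame only if it differs enough from
    the previous one (e.g., something in the scene moved)."""
    changed = sum(1 for a, b in zip(frame, previous_frame) if a != b)
    return changed / len(frame) > threshold

def filter_on_board(frames):
    """Transmit only frames that pass the interest test, cutting the
    volume that analysts downstream must handle."""
    kept, prev = [], frames[0]
    for frame in frames[1:]:
        if frame_is_interesting(frame, prev):
            kept.append(frame)
        prev = frame
    return kept

# Hypothetical 8-pixel "frames": only the changed one is transmitted
frames = [[0]*8, [0]*8, [0, 1, 1, 0, 0, 0, 0, 0], [0, 1, 1, 0, 0, 0, 0, 0]]
print(len(filter_on_board(frames)))   # 1
```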
 

Big Brother at School?




One would like to think that school is the safest place for children. Unfortunately, that idea has been proven wrong more and more over the years. Beyond terrorists coming into school buildings with firearms, there is now another reason to be worried about the safety and privacy of school children.

Teachers spend plenty of time with their students throughout the school year. Not only are teachers teaching, they are also learning about their pupils. Teachers know certain facts about every child in their classroom: sex, age, address, exam results, special needs, behavior habits, and absenteeism. A new computer network, known as the "One System", now houses the personal information of around eight million children across the UK.

Information like that listed above is gathered by teachers and then submitted to the One System up to six times a day. This steady flow of information is said to provide a "golden thread of data". The scariest thing to me is that the firm hires photographers to take pictures of the school children; these images are then offered for sale to parents before being uploaded to the database. Basically, the company is taking advantage of the information it gathers, and I wonder how it is legal for the company to essentially bribe parents.

The information gathered about these millions of children can be shared with numerous agencies. The police, NHS child protection units and charities could all access the information stored in these massive databases, and they can get their hands on this confidential information without the parents ever consenting to it.

School management databases for local councils have been around for several years, and these councils are now able to upload their existing data to the One System. In Swindon, England, 48,000 pupils are already recorded in the One System database, and their records have been shared with health officials at NHS hospitals and with teams that work with young offenders. While sharing this information with different agencies is clearly possible, there are still some kinks. Privacy advocates claim that one of the main problems with the One System is that it is inconsistent among schools because it is not a centralized government system; because the information is compartmentalized, it is hard to share effectively and quickly, even though it can be done.

There are two things that have been said by Nick Pickles, privacy advocate of Big Brother Watch, to consider when thinking about One System’s large database of child information:
1. “Child protection can not be delegated to an algorithm without local or individual knowledge of that child. Databases and computers remove human judgment.”
2. Once this data is on a database it “may be lost, stolen or misused.”

We can only hope that this “Big Brother” system does not spread to America’s children. There are already so many worries that parents have when sending their children to school. This is something that we should not have to worry about. School should be a safe place for children, and if anyone can be trusted it should be teachers—the ones who watch over school children for the majority of the week. 

Sources:
http://rt.com/news/children-at-risk-data-522/