Wednesday, April 3, 2013

Computer-aided design document retrieval by data mining techniques



The advancement and widespread application of information technologies have been generating vast amounts of electronic engineering documents. Among them, computer-aided design (CAD) documents may be the most numerous. CAD documents are generated by engineers or architects while performing engineering tasks such as conceptual planning, basic design, detailed design, and construction supervision. The quantity of CAD documents generated usually depends on the type and size of the project: a typical five-floor residential building may require fewer than 100 CAD drawings, while a mass transportation project may generate more than 200,000 CAD documents.

Due to the massive growth of CAD documents, construction organizations are facing increasing management costs for both storage and retrieval of electronic CAD documents. The importance of a CAD document can be viewed from three aspects:
1. It provides an effective communication medium to illustrate the design concept of an engineering product, so that engineers and architects can visualize their ideas.
2. It is a legal document that provides a basis for the performance, management, and closure of a contract.
3. It provides a useful library from which engineers and architects can reuse previous design models to accomplish their designs efficiently.
In construction practice, cost items not included in the CAD documents are considered extra work that must be handled with change orders. Moreover, when integrated with the construction schedule, CAD documents further help the construction planner with progress control and dynamic resource allocation.

A computer-aided design (CAD) document provides an effective communication medium, a legal contract document, and a reusable design case for a construction project. Due to technological advancements in the CAD industry, the volume of CAD documents in the databases of construction organizations has increased dramatically. Traditional retrieval methods rely on textual naming and indexing schemes that require designers to memorize in detail the meta-information used to characterize the drawings. Such approaches easily overwhelm a user's memory and thus lead to low reusability of CAD documents. A content-based text mining technique can instead be adopted to extract the textual content of a CAD document into a characteristic document, which can be retrieved by similarity matching with a Vector Space Model, so that automated and expedited retrieval of CAD documents from vast CAD databases becomes possible. A prototype system, the content-based CAD document Retrieval System, was developed to implement the proposed method. After preliminary testing with a CAD database and a public engineering drawing database, the proposed content-based CAD document retrieval system was shown to retrieve all relevant CAD documents with relatively high precision.
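To make the similarity-matching step concrete, here is a minimal sketch of the Vector Space Model idea using TF-IDF weights and cosine similarity. It is an illustration rather than the authors' actual system, and it assumes the textual content (annotations, title blocks, layer names) has already been extracted from each drawing into a characteristic document; the file names and query below are hypothetical.

```python
# Minimal Vector Space Model sketch for CAD document retrieval.
# Assumes text has already been extracted from each drawing; names and query are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cad_texts = {
    "bridge_pier_detail.dwg": "pier reinforcement rebar concrete footing elevation",
    "residential_floor_plan.dwg": "floor plan living room kitchen dimensions wall",
    "drainage_section.dwg": "drainage pipe slope manhole invert elevation",
}

names = list(cad_texts.keys())
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(cad_texts.values())  # one TF-IDF vector per drawing

query = "concrete pier reinforcement"
query_vec = vectorizer.transform([query])

# Rank drawings by cosine similarity between the query and each characteristic document
scores = cosine_similarity(query_vec, doc_matrix).ravel()
for name, score in sorted(zip(names, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {name}")
```

In a real CAD database the dictionary above would be replaced by a batch extraction step over thousands of drawings, but the ranking logic stays the same.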

References:
Department of Construction Management, Chung Hua University, Hsinchu 300, Taiwan, ROC

Disney and Big Data


Disney has started implementing big data techniques in their environment. They use a Hadoop cluster to improve information sharing and communication between Disney's departments. By collecting data from different departments, Disney can now analyze customer behavior such as attendance at the theme parks, purchases, and viewership of Disney TV programs. This was very exciting news for Disney because the cost of introducing Hadoop was quite low. Disney estimated that a Hadoop project only costs $300,000 to $500,000, which is a real bargain for a company earning billions of dollars.



And this year, Disney can track customer behavior even more conveniently. This spring, the MagicBand was introduced. The MagicBand is a wristband containing an RFID chip, to be worn by theme park visitors. The chip not only encodes personal information, preferences, and credit card information, but also tracks the wearer's behavior. Disney characters can now call you by name when you meet them, and the band can even serve as your hotel key when you return to a hotel at a Disney theme park. Since this is so new, we still don't know whether people will like it. One question I have: if the band is stolen, what should the visitor do?
 





References:
1.     Disney case study summarized from PricewaterhouseCoopers, Technology Forecast, Big Data Issue 2010.
2.     http://www.davidajacobs.com/2013/01/disney-goes-big-data-with-magic-band/

To Buy or Not to Buy: Mining Airfare Data to Minimize Ticket Purchase

Retrieving and analyzing data from a flight data recorder after a typical flight is not new. Airlines often check a quick-access recorder that operates in parallel with the flight data recorder, examining certain parameters to improve operations and safety. But current tools are limited to looking for known issues, and the amount of data can be staggering. MIT professor John Hansman says the key is developing analysis tools that can effectively utilize all the information.

Commercial airlines in the United States are not required to implement a flight-data monitoring program. But the Federal Aviation Administration has a flight-operations quality-assurance program that includes guidelines airlines can follow on a voluntary basis.

Airlines typically monitor known parameters that have helped identify issues in the past. Parameters such as engine thrust and aircraft speeds, as well as flight control positions such as elevator and rudder inputs, are among the things studied at the end of a day's flying or when flight data is analyzed after a crash.

Professor John Hansman says that “it’s a classic data-mining problem.”
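As a rough illustration of what mining such flight-parameter data might look like, here is a minimal sketch that screens flights for anomalies with an off-the-shelf detector. The CSV file and column names are hypothetical, and this is not the tooling airlines actually use.

```python
# Hedged sketch: flag unusual flights from recorded parameters with an
# off-the-shelf anomaly detector. The file and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import IsolationForest

# One row per flight, with columns such as engine thrust, airspeed, and control inputs
flights = pd.read_csv("flight_parameters.csv")
features = flights[["engine_thrust", "airspeed", "elevator_input", "rudder_input"]]

# IsolationForest labels points that are easy to isolate as anomalies (-1)
model = IsolationForest(contamination=0.01, random_state=0)
flights["anomaly"] = model.fit_predict(features)

print(flights[flights["anomaly"] == -1])  # flights worth a closer look by analysts
```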

A group of researchers at the University of Washington developed a very interesting data mining technique to predict when to buy a flight ticket at the lowest price. You can see the full version of the paper at the link below. Here is an interesting part that I would like to share with you.

“Corporations often use complex policies to vary product prices over time. The airline industry is one of the most sophisticated in its use of dynamic pricing strategies in an attempt to maximize its revenue. Airlines have many fare classes for seats on the same flight, use different sales channels (e.g., travel agents, priceline.com, consolidators), and frequently vary the price per seat over time based on a slew of factors including seasonality, availability of seats, competitive moves by other airlines, and more. The airlines are said to use proprietary software to compute ticket prices on any given day, but the algorithms used are jealously guarded trade secrets.”

“As product prices become increasingly available on the World Wide Web, consumers have the opportunity to become more sophisticated shoppers. They are able to comparison shop efficiently and to track prices over time; they can attempt to identify pricing patterns and rush or delay purchases based on anticipated price changes (e.g., "I'll wait to buy because they always have a big sale in the spring...").”
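The paper itself builds a learned model from large amounts of price-history data; as a much simpler, hedged sketch of the buy-or-wait idea, the snippet below applies a naive rule that compares today's fare to the recent history for the same route. The numbers and threshold are made up for illustration and are not the authors' algorithm.

```python
# Naive buy-or-wait rule based on recent price history for one route.
# Illustration only; the paper's model is far more sophisticated.
import statistics

# Hypothetical daily fares observed for the same itinerary over the past two weeks
price_history = [412, 405, 398, 430, 441, 425, 418, 409, 400, 395, 402, 415, 420, 410]
current_price = 389

mean_price = statistics.mean(price_history)
stdev_price = statistics.stdev(price_history)

# Buy if today's fare is clearly below the recent average, otherwise wait
if current_price < mean_price - 0.5 * stdev_price:
    print(f"BUY: {current_price} is well below the recent average of {mean_price:.0f}")
else:
    print(f"WAIT: {current_price} is not unusually cheap (recent average {mean_price:.0f})")
```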



References:
1. Etzioni, O. "To Buy or Not to Buy: Mining Airfare Data to Minimize Ticket Purchase Price." Dept. of Computer Science, University of Washington, Seattle, Washington 98195. etzioni@cs.washington.edu.

Utilizing Big Data with user verification

Even though everybody says that now is the time of Big Data, many companies are worried about the glut of information even as they try to benefit from it. The solution may be to create an environment for securely accessing data in order to provide better customer experiences. Companies can provide exclusive and authoritative data and increase their margins by decreasing fraud.

Here are some examples:

  • Self-storage facility owners cannot auction off the belongings of a customer who is past due on payments if that customer is on active duty. They can check whether a customer is on active duty by accessing big data on military personnel.
  • By using Starfish EARLY ALERT, at-risk or low-performing students can be identified.
  • New apps like FluNearYou and Flu Trends from Google help monitor epidemics and stop them from spreading.
These kinds of efforts have recently made a great deal of progress, but demand is still growing.

Tuesday, April 2, 2013

Big Data Ethics

Throughout this course we have all seen some amazing visualizations and great insights using big data. Big data has been around for a long time, but data can now be stored in unprecedented amounts, especially personal data. Many of the visualizations using Twitter data that we have seen have been very useful and interesting, but what is in place to make sure these studies respect privacy? There are currently very weak regulations for collecting consumer data, and privacy settings on web browsers are not legally binding. Regulation of data collection and use is a "Big Data" problem in itself. Many people are afraid that along with these insights from personal data could come profiling and discrimination. Last year the European Union came out with the Data Protection Directive. This directive has a very broad scope in protecting personally identifiable data and holding controllers responsible. The United States does not have anything that compares to this. So where do we go from here? Be conscious of what you do with your personal data and use good judgement when attempting to gain insight from any data that could be related back to individuals. I added some links below that touch on privacy regarding big data.

<http://searchcloudapplications.techtarget.com/feature/Big-data-collection-efforts-spark-an-information-ethics-debate>

<http://www.stanfordlawreview.org/online/privacy-paradox/big-data>


WEB DATA SOURCES FOR SPORTS (1)


In this article I will share web data sources for several major sports, for people who are interested in getting data from them. As we all know, along with the development of internet technology, there are many more web data sources than there used to be. Many of these data sources originate from the respective sport's official governing body; however, there are also a number of third-party sources that offer useful data as well. The following are the sources I have gathered:

Baseball
MLB.com
The official site of Major League Baseball's governing body contains a wealth of sortable data and a variety of colorful and easy-to-understand graphical depictions of player performance.
Retrosheet.org
Retrosheet.org is a historical game data website with complete and continuous boxscore data since 1952, textual narratives of game play for nearly every major league game of record, player transaction data, standings, umpire information, coaching records, and ejections of players and managers alike.
Baseball-reference.com
baseball1.com
The Baseball Archive started in 1995 as a personal data collection and soon grew into an amalgam of multiple baseball data sources that can be freely queried by any user.

Basketball
NBA.com
This data source ranges from basic statistical rankings for both players and teams to more sophisticated plus/minus ratings and interactive graphics of player shooting.
Basketball-reference.com
This site attempts to be comprehensive, well-organized, and responsive to data requests. The basketball data is relatively straight-forward and easy to navigate.

Cricket
Cricinfo.com
ESPN’s cricinfo.com bills itself as the top cricket website that includes cricket news, analysis, historical data as well as real-time matchups.
Howstat.com
Howstat.com is another Cricket data repository with many features. Aside from having historical and real-time data, howstat.com also contains a superb searching and sorting application to make data requests simple and easy to use.

Football
NFL.com
The National Football League, governing body of American football, also keeps data on their official league website of NFL.com. This data is fairly standard, composed of top ranked players, player comparisons, and team statistics.
Pro-football-reference.com
Pro-football-reference.com provides ample statistics, analysis, and commentary to hold any football enthusiast's interest. Users can peruse reams of data regarding coaches, the draft, historical boxscores, and team rosters over the years, much of which is unavailable through the official league's website.
AdvancedNFLStats.com
AdvancedNFLStats.com is a more research-driven collection of football enthusiasts who share their insights and passion for the sport. While this website does not contain the usual fare of historical or real-time data, it instead focuses on sabermetric-styled creations such as game excitement rating, comeback expectancy, etc.
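If you want to pull data from one of these sites into your own analysis, here is a minimal hedged sketch using pandas. It assumes the page you request serves its statistics as static HTML tables, which may change over time, so treat the URL and table index as placeholders rather than a documented API (and check the site's terms of use before scraping).

```python
# Hedged sketch: read statistics tables straight from a reference site,
# assuming the page serves static HTML tables. URL and table layout may change.
import pandas as pd

url = "http://www.basketball-reference.com/leagues/NBA_2013_per_game.html"
tables = pd.read_html(url)       # returns a list of DataFrames, one per HTML table

per_game = tables[0]             # first table on the page (layout may change)
print(per_game.head())           # quick look at the columns and first few rows
print(per_game.columns.tolist())
```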


Tutorial: Python 3.3.0

If we need to use Amazon EC2 for the Big Data project, we have to know the Python programming language. The example we used in class was explained by Xinyu. Python runs on Windows, Linux/Unix, and Mac OS X, and has been ported to the Java and .NET virtual machines.

Python is free to use. Today I want to show you how to install the Python software, followed by a detailed tutorial.

1. Go to www.python.org


2.  Choose the right installer for your OS on the download page. I chose Python 3.3.0 Windows X86-64.


3.  Follow the steps to install it.
4. Choose IDLE in your programs to start programming.

5. If you want to program in files, choose File >> New Window. After writing your program, click Run >> Run Module.
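For example, a trivial first script you might type into that new window, save, and run with Run >> Run Module could look like this:

```python
# first_script.py -- a trivial test program to run with Run >> Run Module
def greet(name):
    """Return a greeting for the given name."""
    return "Hello, " + name + "!"

for person in ["Alice", "Bob"]:
    print(greet(person))

print("Python is working:", 2 + 2 == 4)
```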


Now you know how to install and use Python. Next, here is a detailed tutorial for programming in Python. It is a YouTube playlist: https://www.youtube.com/watch?v=4Mf0h3HphEA&list=ECEA1FEF17E1E5C0DA. You might need about 10 hours to get through it, but it is very helpful.





How Netflix Recommendations Are Made



Netflix uses a wide array of Big Data techniques to generate their above-average recommendations. Netflix uses machine-learning algorithms heavily, essentially before or after almost every other step in generating recommendations. This focus is important because it raises significant issues with processing. With online processing, user interactions are responded to rapidly, but the amount of data that can be processed and the computational complexity of the processing are limited. Offline processing alleviates both of these issues but lowers responsiveness, increasing the likelihood of data becoming outdated during processing. Nearline processing is a middle ground that allows online-style processing but is not required to occur in real time. Each of these possibilities comes with complex consequences and side effects. To manage this, Netflix uses a combination of all three methods of processing, run on Amazon Web Services, in the architecture illustrated below.


As you can see, this is an extremely complex setup. Netflix uses offline processing for calculating overarching trends or other things that require no user input, as well as machine learning to develop algorithms that can be used for result calculations. Nearline processing is used largely to develop backup plans should online processing fail to produce results as quickly as required. Nearline is also used in situations where time is of less importance than accuracy, for instance updating recommendations to show that a movie has been watched, while the user is watching the movie. Online computing is used largely in response to user activity, such as searching for a category. Netflix’s hybrid approach is particularly useful in situations where intermediate results can be batch processed and then used to calculate more specific results in real time in response to user activity. Most of Netflix’s model training and machine learning is done offline and then used online.
Netflix's hybrid approach is particularly important to big data, because it manages to create very strong recommendations, less likely to be accomplished using only online or nearline methods, while still maintaining a fast response time that would not be possible using only offline approaches.
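To make the offline/online split concrete, here is a minimal hedged sketch of the general pattern rather than Netflix's actual code: item-to-item similarities are precomputed offline in a batch job, and the online step only combines those precomputed similarities with a user's most recent activity.

```python
# Hedged sketch of the offline/online split for recommendations.
# All data and titles below are made up; this illustrates the pattern only.
from collections import defaultdict

# ---- Offline: heavy, batch-computed item-to-item similarities ----
# In a real system this would come from a large batch job (e.g., on Hadoop),
# not a hand-written dictionary.
item_similarity = {
    "House of Cards": {"The West Wing": 0.8, "Breaking Bad": 0.5},
    "Breaking Bad": {"The Wire": 0.9, "House of Cards": 0.5},
    "The West Wing": {"House of Cards": 0.8, "The Wire": 0.4},
}

# ---- Online: cheap, per-request combination with recent user activity ----
def recommend(recently_watched, top_n=3):
    """Score unseen titles by summing precomputed similarities to recent views."""
    scores = defaultdict(float)
    for title in recently_watched:
        for similar_title, sim in item_similarity.get(title, {}).items():
            if similar_title not in recently_watched:
                scores[similar_title] += sim
    return sorted(scores.items(), key=lambda x: -x[1])[:top_n]

print(recommend(["House of Cards", "Breaking Bad"]))
# [('The Wire', 0.9), ('The West Wing', 0.8)]
```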

Source: http://techblog.netflix.com/2013/03/system-architectures-for.html

"Google facing fines in EVERY EU country as Information Commissioner launches probe into search giant's privacy policy"

I found this article online at Dailymail.co.uk. It looks like a really interesting article and even discusses some of the topics mentioned today in class. Take a quick read and let me know what you think in the comment box.


http://www.dailymail.co.uk/sciencetech/article-2302870/Google-facing-legal-action-EVERY-EU-country-data-goldmine-collected-users.html