Sunday, April 7, 2013

Data mining using geographical data


Most organizations collect and maintain some type of geographic data, yet many ignore this data during analysis. Any business has some record of customer addresses, for instance, but this data is usually formatted in an awkward, non-numeric form. Geographic data can be very predictive, though, since behaviours being predicted often have some correlation to location.

So, how might one use geographic data? Possible answers depend on several factors, most importantly the volume and type of such data. A company serving a national market in the United States, for instance, will have customer shipping and billing addresses (not necessarily the same thing) for each customer (possibly for each transaction). These addresses normally come with a range of spatial granularities: street address, town, state, and associated ZIP Code (a 5-digit postal code).

Even at the largest level of aggregation, the state level, there may be over 50 distinct values (besides the 50 states, American addresses may be in Washington D.C. [technically not part of any state], or any of a number of other American territories, the most common of which is probably Puerto Rico). With 50 or so distinct values, significant data volume is needed to amass the observations needed to draw conclusions about each value. In the best-case scenario, in which all states exhibit equal observation counts, 1,000 observations breaks out into 50 categories of merely 20 observations each, not even enough to satisfy the old statistician's rule of thumb of 30 observations. In data mining circles, we are accustomed to having much larger observation counts, but consider that the distribution of state values is never uniform in real data.

Using individual dummy variables to represent each state may be possible with especially large volumes, possibly with an "other" category covering the least frequent states. Another technique which I have found to work well is to replace the categorical state variable with a numeric variable representing a summary of the target variable, conditioned by state. In other words, all instances of "Virginia" are replaced by the average of the target variable for all Virginia cases, all instances of "New Jersey" are replaced by the average of the target variable for all New Jersey cases, and so on. This solution concentrates information about the target which comes from the state in a single variable, but makes interactions with other predictors more opaque. Ideally, such summaries are calculated on a special hold-out set of data, used just for this purpose, so as to avoid over-fitting. Again, it may be necessary to lump the smallest states together as "other". While I have used American states in my example, it should not be hard for the reader to extend this idea to Canadian provinces, French departments, etc.
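As a minimal sketch of this state-summary ("target encoding") idea, assuming a pandas DataFrame with hypothetical columns state and target: a hold-out slice supplies the per-state averages, and infrequent states fall back to the overall mean. The threshold and column names are illustrative, not a prescribed recipe.

```python
import pandas as pd

def state_target_summary(df, min_count=200, holdout_frac=0.3, seed=42):
    """Replace the categorical state with the target mean estimated on a hold-out slice."""
    holdout = df.sample(frac=holdout_frac, random_state=seed)  # used only to build the summary
    modeling = df.drop(holdout.index)                          # used to fit the actual model

    counts = holdout["state"].value_counts()
    overall = holdout["target"].mean()                         # fallback for rare or unseen states
    means = holdout.groupby("state")["target"].mean()
    means = means.where(counts >= min_count, overall)          # lump infrequent states into "other"

    modeling["state_target_mean"] = modeling["state"].map(means).fillna(overall)
    return modeling

# Hypothetical usage:
# df = pd.read_csv("customers.csv")          # columns: state, target, ...
# modeling_df = state_target_summary(df)
```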

Most American states are large enough to provide robust summaries, but as a group they may not provide enough differentiation in the target variable. Changing the spatial scale implies a trade-off: smaller geographic units exhibit worse summary variance, but improved geographic differentiation. American town names are not necessarily unique within a given state, and similar names may be confused (Newtown, Pennsylvania is quite a distance from Newtown Square, Pennsylvania, for instance). In the United States, county names are unambiguous and present finer spatial detail than states. County names do not, however, normally appear in addresses, but they are easily attached using ZIP Code-to-county tables found on-line. Another possible aggregation is the Sectional Center Facility, or "SCF", which is the first 3 digits of the ZIP Code.
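As a quick illustration of working at these finer scales, here is a hypothetical pandas snippet that restores leading zeros in ZIP Codes, derives the SCF from the first three digits, and attaches counties from a ZIP-to-county lookup table; the file and column names are assumptions.

```python
import pandas as pd

# Hypothetical lookup table (found on-line) with columns "zip" and "county"
zip_county = pd.read_csv("zip_to_county.csv", dtype={"zip": str})

orders = pd.read_csv("orders.csv", dtype={"zip": str})
orders["zip"] = orders["zip"].str.zfill(5)               # restore leading zeros dropped by spreadsheets
orders["scf"] = orders["zip"].str[:3]                    # Sectional Center Facility = first 3 digits
orders = orders.merge(zip_county, on="zip", how="left")  # attach the county for each ZIP Code
```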

In the American market, other types of spatial definitions which can be used include: Census Bureau definitions, telephone area codes, and Metropolitan Statistical Areas ("MSAs") and related groupings defined by the U.S. Office of Management and Budget. The Census Bureau is a government agency which divides the entire country into spatial units which vary in scale, down to very small areas (much smaller than ZIP Codes). MSAs are very popular with marketers. There are 366 MSAs at present, and they do not cover the entire land area of the United States, though they do cover about 85% of its population.

It is important to note that nearly all geographic entities change in size, shape and character over time. While existing American state and county boundaries almost never change any more, ZIP Code boundaries and Census Bureau definitions, for instance, do change. Changing boundaries obviously complicates analysis, even though historic boundary definitions are often available. Even among entities whose boundaries do not change, radical changes in behaviour may happen in geographically distinct ways. Consider that a model built before Hurricane Katrina may no longer perform well in areas affected by the storm.

Also note that some geographic units, by definition, "respect" other definitions. American counties, for instance, only contain land from a single state. Others don't: the third-most populous MSA, Chicago-Joliet-Naperville, IL-IN-WI, for example, overlaps three different states.

Being creative when defining model inputs can be as helpful with geographic data as it is with more conventional data. In addition to the billing address itself, consider transformations such as: Has the billing address ever changed (1) or not (0)? How many times has the billing address changed? How often has the billing address changed (number of times changed divided by number of months the account has been open)? How far is the shipping address from the billing address? And so on...
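As a rough sketch of such derived inputs, assuming a hypothetical account record that carries a billing-address history and geocoded billing/shipping coordinates (the field names are mine, purely for illustration):

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def address_features(account):
    """account: hypothetical dict with billing history, months open, and geocoded points."""
    changes = len(account["billing_address_history"]) - 1
    return {
        "billing_ever_changed": int(changes > 0),
        "billing_change_count": changes,
        "billing_changes_per_month": changes / max(account["months_open"], 1),
        "ship_bill_distance_miles": haversine_miles(
            account["bill_lat"], account["bill_lon"],
            account["ship_lat"], account["ship_lon"]),
    }
```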

Much more sophisticated use may be made of geographic data than has been described in this short posting. Commercial software is available which will determine drive-time contours around locations, which would be useful, for instance, when modelling retail store revenue. Additionally, there is an entire branch of statistics, called spatial statistics, which defines a whole class of analysis procedures specific to this sort of data.


Data mining in Restaurants


With the food business thriving again in the midst of America's economic upswing, consistently claiming a whopping 4 percent of GDP, some of the nation's top eateries are quietly embracing data mining to eke out profit in a tough economy. Timothy is a waiter at the Landmarc restaurant at the Time Warner Center in New York. Timothy has always been a great worker: he clocks in on time and never forgets an order. But his sales of beverages and side dishes were falling short last year. In one month, Timothy served 426 customers, pulling in $17,991.50 in gross sales with a per-check average of $42.23. That's $3.84 below the overall per-check average at the Landmarc. It turns out that while Timothy was beating the rest of the waitstaff in add-on sales like bacon or cheese on a burger, he was lagging 2 percent behind everybody else in red wine and liquor sales, and a whopping 14 percent behind his peers in sides like French fries and creamed spinach.
The bottom line was $1,636 of lost sales opportunity in a month — the money Timothy would have made if he’d hit the server average.
We know all this because every item sold at Landmarc — down to the last malbec, martini and red quinoa pilaf — is individually logged and enumerated by a sophisticated software package called Slingshot. The software slices, dices and crunches the data every night, and then serves it to managers with breakfast the next morning.
So when Timothy was up for a performance review last summer, the restaurant’s general manager knew everything about him — information she incorporated into a heart-to-heart talk about improving his average.
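The arithmetic behind those figures is straightforward; here is a quick check using the numbers quoted above (not Slingshot's actual output, just the back-of-the-envelope version):

```python
checks = 426
gross = 17991.50
gap_vs_house = 3.84                        # dollars below the house per-check average

per_check = gross / checks                 # ≈ 42.23
lost_opportunity = gap_vs_house * checks   # ≈ 1635.84, roughly the $1,636 quoted

print(round(per_check, 2), round(lost_opportunity, 2))
```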
Another example is Applebee's. Data mining has provided Applebee's with immense insight into how back-of-the-house operations affect front-of-the-house performance. They were able to track timing in certain areas of the restaurant's performance but left many other areas to be explored. Applebee's could use the knowledge they have gained to analyze other areas such as support positions, the bar area, and even customer waiting time.
Many restaurants employ busboys, hostesses, drink runners, bar backs, and other support positions for the main staff. The servers benefit most from this support and can sometimes cater to many more tables than normal because of the support staff. Could Applebee's use their data to discover the optimum number of tables for a server during a given amount of time? Or to find out how many support staff the servers need? For example, if there were normally X busboys for Y tables, and staffing then dropped to X-1 or X-2 busboys, one could measure how many extra servers are needed, since the servers now have to spend more time clearing tables. Data mining could be used to discover whether it makes sense for the bartender to do his or her own stocking throughout the night or to employ a position known as a bar back. The information could also be used to measure the time taken to make drinks, which could reveal the relationship between delayed drinks and delays in service.
The wait time at a restaurant can be a critical factor in deciding whether a customer dines with you or goes down the street. Often, a wait is inevitable on weekends or "hallmark" holidays. Applebee's could use data mining to discover how long customers wait before they hand back the beeper and leave. They could test whether offering a drink or perhaps a free appetizer keeps customers around when the wait exceeds Z minutes. Also, Applebee's could investigate the results of giving customers smaller menus to flip through appetizers and/or desserts during wait periods. Would that make customers more likely to order an appetizer or dessert? Would customers order dessert after the main course because they have been thinking about that brownie sundae for the last hour?
Data mining has provided Applebee’s with many ideas and strategies.  If they continue using their data mining operation they will possibly be able to gain a competitive advantage and hence increase profitability. 

How will Big Data change trials?

If you have ever looked through the pages of a statute, you may have felt dizzy from the very long sentences full of legal terms whose meaning general readers cannot immediately grasp. The same goes for trial records and precedents, and there is a diversity of possible interpretations of the law. This work has long been thought of as the exclusive domain of professionals. With the advent of the Big Data era, there has been much discussion about how Big Data will affect the legal profession. The article I refer to offers some evidence and interviews addressing this paradigm change.

Quantitative legal prediction

It is relevant to how legal matters and costs are managed, how legal arguments are crafted, and whether, how, and where to file a lawsuit. The critical requirement for building such predictions is gathering usable data that computers can understand. As an example, e-discovery uses algorithms to review mountains of documents and predict which are likely to be relevant in a given case.
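This is not the actual e-discovery software, but a minimal sketch of the underlying idea, assuming a small set of documents that attorneys have already labeled as relevant or not: train a text classifier on the labeled set, then rank the unreviewed documents by predicted relevance. The example documents and labels are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical: documents already reviewed by attorneys, 1 = relevant, 0 = not relevant
labeled_docs = ["notice of contract termination effective June first ...",
                "cafeteria lunch schedule for the week ..."]
labels = [1, 0]
unreviewed_docs = ["proposed amendment to the termination clause ..."]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(labeled_docs)
clf = LogisticRegression().fit(X, labels)

# Rank the mountain of unreviewed documents by predicted relevance
scores = clf.predict_proba(vec.transform(unreviewed_docs))[:, 1]
ranked = sorted(zip(scores, unreviewed_docs), reverse=True)
print(ranked)
```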

TyMetrix
Wolters Kluwer Corporate Legal Services, a vendor of e-billing and matter management systems for corporate law departments, collects data on billings and legal matters with its customers' permission. It is for benchmarking law firm rates and identifying the factors that drive them. It also offers a free app for mobile devices. 

Fantasy SCOTUS
It is a web-based fantasy league for predicting Supreme Court decisions. This site combines the crowd-sourced data with data from publicly available court filings, then uses an algorithm and decision engine to make predictions.

Lex Machina
It focuses on patent litigation.  The database holds information from 128,000 IP (Intellectual Property) cases, 134,000 attorney records, 1,399 judges, 63,000 law firms and 64,042 parties, spanning the last decade.

Reference: http://www.law.com/jsp/lawtechnologynews/PubArticleLTN.jsp?id=1202555605051&Big_Data_Meets_Big_Law&slreturn=20130302131047

Expectation Maximization Algorithm


The expectation-maximization (EM) algorithm is a broadly applicable approach to the iterative computation of maximum likelihood (ML) estimates, useful in a variety of incomplete-data problems. In particular, the EM algorithm simplifies considerably the problem of fitting finite mixture models by ML, where mixture models are used to model heterogeneity in cluster analysis and pattern recognition contexts.
The EM algorithm is an iterative method for finding maximum likelihood or maximum a posteriori probability (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables.
The EM algorithm has a number of appealing properties, including its numerical stability, simplicity of implementation, and reliable global convergence. There are also extensions of the EM algorithm to tackle complex problems in various data mining applications; it is highly desirable, however, that these extensions preserve its simplicity and stability. Maximum likelihood estimation and likelihood-based inference are of central importance in statistical theory and data analysis. The EM algorithm is used to find the maximum likelihood parameters of a statistical model in cases where the equations cannot be solved directly. Maximum likelihood estimation is a general-purpose method with attractive properties. Finite mixture distributions provide a flexible, mathematically grounded approach to the modeling and clustering of data observed on random phenomena.
The EM algorithm is an iterative algorithm. Its iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step.
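As a minimal sketch of these two steps, here is EM for a two-component univariate Gaussian mixture in NumPy; the toy data and starting values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 200)])  # toy data

pi, mu, var = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])  # initial guesses

def normal_pdf(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for _ in range(100):
    # E-step: responsibility of component 1 for each point under the current parameters
    p1 = pi * normal_pdf(x, mu[0], var[0])
    p2 = (1 - pi) * normal_pdf(x, mu[1], var[1])
    r = p1 / (p1 + p2)

    # M-step: re-estimate parameters to maximize the expected log-likelihood
    pi = r.mean()
    mu = np.array([np.sum(r * x) / r.sum(), np.sum((1 - r) * x) / (1 - r).sum()])
    var = np.array([np.sum(r * (x - mu[0]) ** 2) / r.sum(),
                    np.sum((1 - r) * (x - mu[1]) ** 2) / (1 - r).sum()])

print(pi, mu, var)  # mixing weight, means, and variances of the two fitted components
```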

Reference:
Wu, X., & Kumar, V. (2009). The Top Ten Algorithms in Data Mining. Chapman & Hall/CRC.

Saturday, April 6, 2013

Data Mining in Dairy Farming


I have already discussed the company mentioned in the first half of this article. However, I had not thought about big data uses in dairy farming, which is the focus of the second portion of the article.

The Holstein bull Badger-Bluff Fannie Freddie is considered America's best. He had fathered 346 dairy cows as of September of last year. The article goes on to explain that he is considered the best Holstein bull in the States because of his genetic makeup. The USDA analyzed approximately 50,000 different markers in his DNA that are supposedly related to better milk production and, based on what they found, proclaimed him to be the best bull for procreating good dairy cows. A post on the Sustainable America blog claims that "dairy breeding is a perfect field for quantitative analysis of the sort that machine learning algorithms can offer today. Taking vast amounts of data and scanning for key information is what these algorithms are created to do." That quoted passage is basically the purpose and definition of data mining.

The post states that nearly 100 years ago dairy cows were expected to produce only around 5,000 pounds of milk in a lifetime, while today that number is up to 21,000. This is the product of a century of selective breeding. Mining the massive amount of information DNA provides for the select markers of good milk production will enable dairy farmers to achieve a whole new level of breeding, thereby increasing this average even faster.

Surely over the next few years, we will see data mining taking a much more prominent role not only in agricultural farming as expected, but in dairy farming as well.

Integrating Data mining into smartphones


Smartphones can obtain information about their owners, and many researchers are dedicated to finding ways to gather and interpret the most useful information. Modern smartphones are packed with powerful sensors that enable the phone to collect data about you. Although that may alarm anyone who is concerned about privacy, the sensors also present an opportunity to help smartphone users in previously impossible ways. The WISDM (Wireless Sensor Data Mining) Lab, led by Dr. Gary Weiss, is concerned with collecting sensor data from smartphones and other mobile devices and mining it for useful knowledge. Smartphones contain more sensors than most people would ever imagine. Android phones and iPhones include an audio sensor (microphone), image sensor (camera), touch sensor (screen), acceleration sensor (tri-axial accelerometer), light sensor, proximity sensor, and several sensors (including the Global Positioning System) for establishing location.

Their first goal was to use the accelerometer to perform activity recognition: identifying the physical activity, such as walking, that a smartphone user is performing. This could serve as the basis for many health and fitness applications, and could also make the smartphone more context-sensitive, so that its behavior takes into account what the user is doing. The phone could then, for example, automatically send phone calls to voice mail if the user is jogging. They have used existing classification algorithms to identify activities, such as walking, and to map accelerometer data to those activities. They have also found that one's gait, as measured by a smartphone accelerometer, is distinctive enough to identify a person from a pool of several hundred smartphone users with 100 percent accuracy, given previously collected data samples. This application is important since gait problems are often indicators of other health problems. All of these applications are based on the same underlying classification methods as the activity recognition work.
They have collected a small amount of labeled "training" data from a panel of volunteers for each of these activities (walking, jogging, climbing stairs, sitting, standing, and lying down), with the expectation that the model the system generates will be applicable to other users. Initially, the system could identify the six activities listed above with about 75 percent accuracy. These results are adequate for obtaining a general picture of how much time a person spends on each activity daily, but are far from ideal. However, if a small amount of data can be obtained that a user actively labels as being connected with a particular activity, they can build a personal model for that user, with accuracy in the 98-99 percent range. This shows that people move differently and that these differences are important when identifying activities. A system called Actitracker allows you to review reports of your activities via a web-based user interface and determine how active or inactive you are. These reports may serve as a wake-up call to some and, it is hoped, will lead to positive changes in behavior. Such a tool could also be used by a parent to monitor the activities of their child, and thus could even help combat conditions such as childhood obesity.
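This is not WISDM's actual pipeline, but a sketch of the general approach, assuming the raw tri-axial accelerometer stream has already been cut into fixed-length labeled windows: summarize each window with simple statistics and train an off-the-shelf classifier. The features, window size, and random stand-in data are all illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def window_features(window):
    """window: array of shape (n_samples, 3) holding x/y/z accelerometer readings."""
    return np.concatenate([window.mean(axis=0),                            # average per axis
                           window.std(axis=0),                             # variability per axis
                           np.abs(np.diff(window, axis=0)).mean(axis=0)])  # "roughness" per axis

# Stand-in for labeled windows collected from volunteers
rng = np.random.default_rng(1)
windows = [rng.normal(size=(200, 3)) for _ in range(60)]
labels = rng.choice(["walking", "jogging", "sitting"], size=60)

X = np.array([window_features(w) for w in windows])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

# A new, unlabeled window from the phone would be classified the same way
print(clf.predict([window_features(rng.normal(size=(200, 3)))]))
```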
This category of applications is part of a growing trend towards mobile health. As new sensors become available and existing sensors are improved, even more powerful smartphone-based health applications should appear. For example, other researchers are boosting the magnification of smartphone cameras so that they can analyze blood and skin samples. Researchers at MIT's Mobile Experience Lab are even developing a sensor that attaches to clothing, which will allow smartphones to track their users' exposure to ultraviolet radiation and the potential for sunburn. Smartphone sensor technology, especially when combined with data mining, offers tremendous opportunities for new and innovative applications. Looking at these applications, it is estimated that there will be a flood of new sensor-based apps over the next decade.

Visualization: Unemployment Rate vs. Labor Participation Rate using iCharts

For this visualization project, I am using iCharts to help illustrate the changes in the labor force participation rate and unemployment rate of the United States over the years 1948 to 2013. Historically, the unemployment rate has been used as a measure of the economic strength of our nation. The unemployment rate is calculated as a percentage by dividing the number of unemployed individuals by all individuals currently in the labor force. According to Wikipedia, the labor force participation rate is the percentage of the population 16 years of age and older (residing in the 50 states and the District of Columbia, not inmates of institutions such as penal or mental facilities or homes for the aged, and not on active duty in the Armed Forces) that is either employed or unemployed and actively seeking a job.
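As a small sketch of how those two rates relate, using made-up round numbers rather than actual BLS figures:

```python
def unemployment_rate(unemployed, employed):
    labor_force = employed + unemployed
    return 100 * unemployed / labor_force

def participation_rate(unemployed, employed, civilian_pop_16plus):
    labor_force = employed + unemployed
    return 100 * labor_force / civilian_pop_16plus

# Illustrative values in millions (not actual BLS data)
print(unemployment_rate(11.7, 143.3))          # unemployed / labor force, ≈ 7.5%
print(participation_rate(11.7, 143.3, 245.0))  # labor force / population 16+, ≈ 63.3%
```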

From the above unemployment rate graph, the rate has been steadily decreasing since 2009, indicating an improving economy. According to the Bureau of Labor Statistics (BLS), persons not in the labor force who want and are available for a job and who have looked for work sometime in the past 12 months (or since the end of their last job if they held one within the past 12 months), but who are not currently looking because they believe there are no jobs available or there are none for which they would qualify are called “discouraged workers”. Furthermore, the BLS states that the unemployment rate does not include these discouraged workers that have not been actively looking for work in the past four weeks, while the labor force participation rate does.


Since the beginning of the economic crisis in 2008, the labor force participation rate has shown a decline in the percentage of adults participating in the labor force. As a result of that decline, there are a greater number of discouraged workers who are increasingly likely to be dependent on government support. With greater dependence on government support and fewer productive workers in the marketplace, a theoretical decline in economic output would follow. Therefore, these graphs suggest that the labor force participation rate can, in many ways, serve as a more accurate metric of economic health.

Data mining in MMORPG

An MMORPG (massively multiplayer online role-playing game) is another kind of social network, though users of this kind of network do not share much personal information. In my opinion, data mining could help the companies running these games improve them.

In this blog, I will use World of Warcraft (WOW), which I am familiar with, to share my ideas about potential applications of data mining in this area.

I think WOW is the most successful MMORPG of recent years. It went online in different regions around 2004-2006. I began to play it in 2005; now, due to my busy life, I have stopped playing, but I am still very interested in it.

In 2009, Blizzard had crafted 7,650 quests, 70,000 spells, 40,000 NPCs, 1.5 million assets, and 5.5 million lines of code; some 4,000 employees, 13,250 server blades, and 75,000 CPU cores keep the game running. These numbers have only grown since. The picture below is a screenshot of the character summary page; as we can see, each character has many variables describing it, and each variable carries a lot of information. Hence, there is plenty of data to mine to improve this game.


I think mining could be performed in the areas below.

- Tracking fraud. Users who want to cheat or commit fraud usually have a lot of characters, and those characters usually sit in the "main cities" of the WOW world. By filtering on these features, characters with a high likelihood of cheating could be identified and tracked by Blizzard.

- Optimizing game specifications. This is a big problem which, in my opinion, goes beyond debugging. In the game, each character belongs to a class and a race, which makes performance differ. For example, the DPS of a warrior is preset, and so is a shaman's. Blizzard wants to make the game "balanced", meaning the outcome of a fight is determined by the players, not by the preset specifications. Through data mining, they could examine the results of fights between two such characters, or use association rules to find, when a warrior wins a fight or a kill, which classes are usually paired with warriors. The specifications of those classes could then be modified.

- Market mining. There is a virtual market in the game. Association rules could find which items are always bought together; from that, Blizzard may learn which specifications or kinds of items are most wanted, and new dungeons could offer those items. A sketch of this idea appears below.
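A minimal sketch of the association-rule idea applied to auction-house baskets, in plain Python; the item names and the support threshold are invented.

```python
from itertools import combinations
from collections import Counter

# Hypothetical baskets: items bought together in one auction-house session
baskets = [
    {"flask", "herb", "ore"},
    {"flask", "herb"},
    {"ore", "bar"},
    {"flask", "herb", "bar"},
]

item_counts, pair_counts = Counter(), Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

# Rule A -> B: support = count(A,B) / number of baskets, confidence = count(A,B) / count(A)
n = len(baskets)
for (a, b), c in pair_counts.items():
    support = c / n
    if support >= 0.5:  # arbitrary minimum support
        print(f"{a} -> {b}: support={support:.2f}, confidence={c / item_counts[a]:.2f}")
```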

These are my ideas on MMORPG data mining. I did search online, but there are not many resources on this topic; I guess the reason is that no public data is available.


Armchair Activism and an Equals Sign

Undoubtedly, if you are a Facebook user you have witnessed a lot of profile picture changes within the last few weeks, specifically around March 26, when the Human Rights Campaign challenged their followers to change their profile pictures to one of the images below (the far left is the most popular, and the latter two were for giggles).

In an article posted on the Fast Company website, an overview is given of some of the analysis that the Facebook Data Science Team cooked up. You can find it here. What's especially interesting about this event is how it has given researchers some significant insights into activism and how different demographics respond. The team observed that 120% more users (than the previous Tuesday) changed their profile picture over the course of a day. As you can see below, after applying a time-series model, the data shows a very obvious, positive trend.
 
The team used the changes to indicate the "stance" of each user on the marriage-equality issue. This resulted in giving the team data on the gender of "activists" as well as their age. Even more interestingly, it gave them geographic information in the form of frequency per county (below). Wouldn't you love to have access to their numbers? To read the full break down, visit here.
Further, as we've discussed in class, there is a lot that one can learn from the images themselves. I was interested that the team didn't do any data extraction on the actual images. I think one reason might be that as the images were saved and re-saved while transferring from user to user, the quality of the picture degraded (as you can see below), so pixel data may have been skewed. But I ask the question because I observed a lot of people I know changing their profile pictures in support of Proposition 8 (the legislation in question), which is in opposition to equal marital rights (defining marriage as between a man and a woman). So the mere fact that profile pictures were changing doesn't necessarily (to me) represent a full indication of the frequency of support for one side or the other. My observation was that people were changing their pictures, in large part, as a response to what others were doing.
 And you also have people (like me) that chose to use the tense climate to recognize things that are TRULY significant... like the fact that April is Mathematics Appreciation Month which coincidentally had a strong association with the symbols being used in this virtual human rights rally.

In closing, I think that the truly significant and telling statistics would be things like:
  • The number of people that changed their profile picture and are registered to vote.
  • Or, that have ever written to, called, or even heard of their state elected officials.
  • Or, taken any other action whatsoever beyond clicking "edit profile picture".
Pardon my cynicism, and let me explain. I have seen microcosms of the same event transpire on campus time and again. We have a very active and vocal student body that has some great things to say in regards to tuition, state appropriations, academic excellence, etc. However, if I were to weigh the number of times I've seen people post an uninformed, aggressive comment on social media against the number of times I've seen those same individuals at a Board of Trustees meeting, SGA Senate meeting, or University Senate meeting, the scale would bottom out. I wholeheartedly believe that for our country to move past this climate of partisanship, we will have to engage in informed, healthy debate. And while social media is an excellent platform for this to take place, ultimately policy is decided by elected officials, so it is our duty to first be an informed electorate, vote for the best candidates, and hold them accountable for their actions by keeping our voice known to them directly.


Friday, April 5, 2013

Binary search trees: structure for data retrieval





If we recall from earlier in the semester, we discussed a method for discovering the similarity of two different documents through a process of shingling, minhashing, and then locality-sensitive hashing. Briefly, we discussed hash tables and how they are used to map keys to values in a "signature matrix". In homework 3, question two, we were required to perform a minhash on a data table given a certain order of rows. Essentially, it was a method for data retrieval. I would like to introduce a type of data structure called a binary search tree, which has characteristics that might make it more useful than a typical hash table for data retrieval under certain search conditions.





I have drawn most of my understanding of binary search trees from this video lecture from UC Berkeley's computer science department. Essentially, a binary search tree is a type of data structure with 4 specific characteristics. Directly from Wikipedia, those characteristics are the following:

  1. The left subtree of a node contains only nodes with keys less than the node's key.
  2. The right subtree of a node contains only nodes with keys greater than the node's key.
  3. Both the left and right subtrees must also be binary search trees.
  4. There must be no duplicate nodes.
     To clear up some of the jargon: essentially, a key is some predefined value on which the search is based. In the video this value is numeric. A node is a position in the tree which has a key value assigned to it.
     Say a search is running on some binary search tree. It will begin at the top, or root node, check the key value of that node, and decide whether the key being searched for is equal to, less than, or greater than that node's key. If it is less, the search moves to the left sub-tree; if it is greater, it moves to the right sub-tree. Once the search is sitting on the new node, it repeats the previous process until it finds the key. If it doesn't find the key, it returns a null, and most algorithms can be written so that, if a null is returned, the key that was being searched for can be inserted where the null was located.
      The interesting thing about binary trees versus hash tables is that a binary tree is better suited to finding inexact matches, whereas hash tables can more efficiently retrieve exact matches for a particular key. In the video, the lecturer mentions two nodes that will be found, if they exist, when someone searches for an arbitrary key with value k which does not exist in the tree. These two nodes are as follows.
  1. Node containing the smallest key value greater than k.
  2. Node containing the largest key value less than k.

     If you pay attention to the lecture at time 20:11 in the video, the lecturer gives a more visual explanation of what I just described. Imagine you had a binary tree full of information and were searching for a key value you knew did not exist; a standard find algorithm run on this type of data structure will find, if they exist, the two nodes containing key values closest to the one being searched for. It is possible that meaningful information exists in those two nodes. Hence, in my view, this demonstrates the advantage of binary search trees as a data structure well suited to finding inexact but potentially very closely related information. Please watch the video for a more in-depth understanding.
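Here is a minimal sketch of a binary search tree in Python, including the nearest-neighbor lookup described above (the two nodes bracketing a missing key k); the class and function names are my own.

```python
class Node:
    def __init__(self, key, value=None):
        self.key, self.value = key, value
        self.left = self.right = None

def insert(node, key, value=None):
    if node is None:
        return Node(key, value)
    if key < node.key:
        node.left = insert(node.left, key, value)
    elif key > node.key:
        node.right = insert(node.right, key, value)
    return node  # equal keys are ignored: no duplicate nodes (characteristic 4)

def find_neighbors(node, k):
    """Return (largest key < k, smallest key > k), or the exact key twice if it exists."""
    floor = ceil = None
    while node is not None:
        if k == node.key:
            return node.key, node.key      # exact match
        if k < node.key:
            ceil = node.key                # candidate for "smallest key greater than k"
            node = node.left
        else:
            floor = node.key               # candidate for "largest key less than k"
            node = node.right
    return floor, ceil                     # key absent: its two closest neighbors

root = None
for key in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, key)
print(find_neighbors(root, 45))  # (40, 50)
```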





Civil Engineering and Data Mining



Data Mining (DM) is a multi-disciplinary field that encompasses techniques from a number of areas, including information technology, statistical analysis, machine learning (ML), pattern recognition, artificial intelligence (AI), and database management.
Reinforced concrete is a widely used construction material. Its properties depend on the bond between the reinforcing bar and the concrete as much as on the compressive strength of the concrete or the properties of the reinforcing bar, because structural components are exposed to flexural and bond stresses together with compressive loads. The compressive and flexural strength properties of the reinforcing bar are taken as the basis of a construction design, but constructions and buildings are not exposed only to compressive, flexural, or tensile loads. In addition to these loads, there are a variety of effects, such as bond and flexural bond, and the performance of reinforced concrete structures depends on adequate bond strength between the concrete and the rebar. Bond strength is one of the most important properties controlling the behavior of reinforced concrete structures. However, determining its effects requires special equipment.
The bond characteristics between the concrete and the reinforcement are commonly determined through pull-out, push-in, and related testing methods; the pull-out test is the easiest and oldest of these. The relationships among the obtained data are not always linear; sometimes they are non-linear or cannot easily be understood. Quantitative models of bond properties can be defined using statistical approaches or machine-learning-based data mining approaches.
Data mining (DM), also known as knowledge discovery in databases (KDD), is the process of analyzing data from different angles and summarizing it into useful information. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, DM is the process of finding correlations or patterns among dozens of fields in large relational databases. Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data.
Recently, many have tried to apply optimization models to DM and numerous models have been proposed for classification, clustering, and other DM functionalities which have enhanced both the theoretical foundation and practical applications of DM in different scientific fields, such as social or education science, marketing, communications and engineering science. The DM process can be used to estimate relationships between bond and flexural bond properties and the flexural strength, compressive strength and tensile stress of the rebar. In order to find these properties, algorithms in WEKA (Waikato Environment for Knowledge Analysis) can be used in a DM process.
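The source uses WEKA; as a rough, language-agnostic sketch of the same modeling step, here is a hypothetical regression in Python relating pull-out test results to mix and rebar properties. The data file and column names are assumptions, not the study's actual variables.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical pull-out test records: one row per specimen
df = pd.read_csv("pullout_tests.csv")
features = ["compressive_strength_mpa", "rebar_tensile_stress_mpa",
            "rebar_diameter_mm", "embedment_length_mm"]
target = "bond_strength_mpa"

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.3, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("Mean absolute error:", mean_absolute_error(y_test, model.predict(X_test)))
```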


Reference: "Modeling by data mining process," Civil Engineering Department, Faculty of Engineering and Architecture, Suleyman Demirel University, Turkey; Department of Construction Education, Faculty of Technical Education, Suleyman Demirel University, Turkey.