Analytics and Visualization of Big Data: Association Rules

Friday, April 5, 2013

Binary search trees: structure for data retrieval

If we recall from earlier in the semester, we discussed a method for discovering the similarity of two different documents through a process that went through shingling, minhashing, and then local sensitive hashing. Briefly we discussed hash tables and how they are used in mapping certain keys to values into a “signature matrix”. In homework 3 question two, we were required to perform a minhash on a data table given a certain order of rows. Essentially, it was a method for data retrieval. I would like to introduce a new type of data structure called a binary search tree which has certain characteristics that might be more useful for data retrieval than a typical hash table when searching under certain conditions.

I have drawn most of my understand on binary search trees from this video lecture from UC Berkeley's computer science department. Essentially, a binary search tree is a type of data structure which can be described as having 4 specific characteristics. Directly from Wikipedia, those characteristics are the following:

The left subtree of a node contains only nodes with keys less than the node's key.
The right subtree of a node contains only nodes with keys greater than the node's key.
Both the left and right subtrees must also be binary search trees.
There must be no duplicate nodes.

To clear up some of the jargon, essential a key is some predefined value for which the search is based on. In the video this value is numeric. A node is a position along the tree which has a key value assigned to it.

Say a search is running on some binary search tree, it will begin with the top , or root node, will check the key value of that root node, and decide if the key value being searched for is greater than or equal to the key of that node. If it is less, the search will go to the bottom left sub-tree and if the key is greater, it will go to the bottom right sub-tree. After the search is sitting on the new node, it will repeat the previous process until it finds the key. If it doesn't find the key, it will return a null, and most algorithms can be written so that if a null is returned, the key that was being searched on can be indexed into where the null was previously located.

The interesting thing about binary trees vs hash tables is that a binary tree is better suited to find inexact matches as where hash tables can more efficiently withdraw exact matches to some particular key. In the video, the lecturer mentions two keys that will be found if they exist in the tree given that someone is searching for some arbitrary key with value k which does not exist in the tree. These two nodes are as follows.

Node containing the smallest key value greater than k.
Node containing the largest key value less than k.

If you pay attention to the lecture at time 20:11 in the video, the lecturer gives a more visual explanation of what I just described. Imagine you had a binary tree full of information and were searching for a key value you knew did not exist, a standard find algorithm run across this type of data structure will find, if they exist, the two nodes containing key values closest to the one which was being searched for. It is possible that meaningful information might exist in those two nodes. Hence, from my take on the advantage of binary search trees, the characteristics as a type of data structure good for finding inexact but potentially very closely related information is demonstrated. Please watch the video for a more in depth understanding.

Sources:

http://en.wikipedia.org/wiki/Binary_search_tree

Friday, February 8, 2013

A Follow on Similarity and Association Rules Heart Monitor Example

The last two lectures have been focused on Association Rules and Similarity. During the class, there seemed to be some confusion relating the theory behind Association Rules and finding the frequent pairs. On Thursday (02/07/13), Dr. Megahed introduced the Chapter 3 topic of finding similar items. He gave a real world Industrial Engineering application of similarity with a Coordinate Measuring Machine (CMM) that measures a car door for defects. The CMM measures the car door at 11 distinct places to determine if the gap between the door and car was sufficient and within spec. He then broke down the differences in Similarity and Association Rules when inspecting the car door. The Association Rules included a 3D scanner to identify each doors unique association. With this blog I will attempt to give another real work example to hopefully clarify how Association Rules exist in the Industrial Engineering Community.

Throughout the years, people had to visit the doctor or special clinics with special equipment to monitor their heart rate, blood pressure, and other heart related metrics. These visits were usually scheduled and preplanned. Most visits returned no abnormalities and everything was ok, which simulates the similarity example.

Similarity

With the routine visit to the doctor, one can only monitor if there is a fault or abnormality at that specific time. You can't verify your hearts activity since the last visit or until the next visit.

With the advancement in smart phone technology, AliveCor has come out with an iPhone case that offers real-time EKG readings. The user must open up the app and place the case in their hands or on their chest for 30-second intervals. There are also continous monitoring capabilities out there from Everist Genomics. A wrist band that attaches electrodes to a patient's skin offers around-the-clock arrhythmia detection. The wrist band, with a built in Bluetooth device, collects the heart data, and the app on the smart phone analyzes the information and sends it to your physician for constant monitoring.

Association Rules

This continous monitoring is a good example of Association. With the smart phone, the doctor could detect any variation patterns and hopefully eliminate critical defects and abnormalities. After several months, the physician could use the Association Rules data to better predict when regular scheduled checkups should be planned.