If we recall from earlier in the semester, we discussed a method for discovering the similarity of two different documents through a process that went through shingling, minhashing, and then local sensitive hashing. Briefly we discussed hash tables and how they are used in mapping certain keys to values into a “signature matrix”. In homework 3 question two, we were required to perform a minhash on a data table given a certain order of rows. Essentially, it was a method for data retrieval. I would like to introduce a new type of data structure called a binary search tree which has certain characteristics that might be more useful for data retrieval than a typical hash table when searching under certain conditions.
I have drawn most of my understand on binary search trees from this video lecture from UC Berkeley's computer science department. Essentially, a binary search tree is a type of data structure which can be described as having 4 specific characteristics. Directly from Wikipedia, those characteristics are the following:
- The left subtree of a node contains only nodes with keys less than the node's key.
- The right subtree of a node contains only nodes with keys greater than the node's key.
- Both the left and right subtrees must also be binary search trees.
- There must be no duplicate nodes.
To clear up some of the jargon, essential a key is some predefined value for which the search is based on. In the video this value is numeric. A node is a position along the tree which has a key value assigned to it.
Say a search is running on some binary search tree, it will begin with the top , or root node, will check the key value of that root node, and decide if the key value being searched for is greater than or equal to the key of that node. If it is less, the search will go to the bottom left sub-tree and if the key is greater, it will go to the bottom right sub-tree. After the search is sitting on the new node, it will repeat the previous process until it finds the key. If it doesn't find the key, it will return a null, and most algorithms can be written so that if a null is returned, the key that was being searched on can be indexed into where the null was previously located.
The interesting thing about binary trees vs hash tables is that a binary tree is better suited to find inexact matches as where hash tables can more efficiently withdraw exact matches to some particular key. In the video, the lecturer mentions two keys that will be found if they exist in the tree given that someone is searching for some arbitrary key with value k which does not exist in the tree. These two nodes are as follows.
- Node containing the smallest key value greater than k.
- Node containing the largest key value less than k.
If you pay attention to the lecture at time 20:11 in the video, the lecturer gives a more visual explanation of what I just described. Imagine you had a binary tree full of information and were searching for a key value you knew did not exist, a standard find algorithm run across this type of data structure will find, if they exist, the two nodes containing key values closest to the one which was being searched for. It is possible that meaningful information might exist in those two nodes. Hence, from my take on the advantage of binary search trees, the characteristics as a type of data structure good for finding inexact but potentially very closely related information is demonstrated. Please watch the video for a more in depth understanding.