Monday, February 25, 2013

Jaccard's Similarity

Jaccard’s Similarity Coefficient 
The Jaccard similarity is a common index for binary variables. For two objects described by binary variables, it is the ratio of the number of variables present in both objects to the number present in at least one of them. It is useful when you are interested in binary (presence/absence) differences between two or more objects. Equivalently, the Jaccard coefficient measures similarity between sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets:
                                       J(A, B) = \frac{|A \cap B|}{|A \cup B|}
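As a quick illustration of the formula, here is a minimal Python sketch that computes the coefficient for two small word sets. The function name `jaccard`, the sample sentences, and the choice to return 1.0 for two empty sets are my own assumptions for the example, not something taken from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


# Two tiny "documents", split into word sets (made-up sample data).
doc1 = "the quick brown fox".split()
doc2 = "the lazy brown dog".split()

# Shared words: {"the", "brown"}; all distinct words: 6.
print(jaccard(doc1, doc2))
```

The two sentences share 2 of their 6 distinct words, so the coefficient is 2/6, or about 0.33.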
Information retrieval, measuring similarity and dissimilarity, and choosing and implementing the correct measure are at the heart of data mining. An important class of problems that Jaccard similarity addresses well is finding textually similar documents in a large corpus such as the Web or a collection of news articles.
What follows discusses a paper that compares a production data-based similarity coefficient with Jaccard’s similarity coefficient. It was published by professors from Industrial and Systems Engineering at the University of Wisconsin-Milwaukee. I found the topic interesting because it applies data mining tools to the lean manufacturing world, specifically to the analysis of cellular manufacturing techniques. I will not present their equations or the other comparison methods they used, but just talk about how they implemented their solution methodology. I have provided a link to this paper at the end of the blog.
Several performance measures have been developed for the evaluation of cellular manufacturing systems, including the sum of intercellular and intracellular material handling costs, group efficiency, group capability index and so on. In the paper, a number of machine-component charts, taken from the literature or randomly generated, are used to form machine-component groups using Jaccard's similarity coefficient and the production data-based similarity coefficient. The sum of intercellular and intracellular material handling costs is then calculated for each machine-component group, and the result is used to compare the performance of the two similarity coefficients. The authors carried out several experiments using randomly generated production-based data and present their results in the paper.
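To make the manufacturing use a little more concrete, here is a small Python sketch of how Jaccard similarity between machines could be computed from a machine-component chart. The chart values, the machine names, and the function `machine_jaccard` are made up for illustration. This is only the plain Jaccard coefficient on binary rows, not the paper's production data-based coefficient or its grouping procedure.

```python
from itertools import combinations

# Hypothetical machine-component chart: one row per machine, one column
# per component; 1 means the component is processed on that machine.
chart = {
    "M1": [1, 1, 0, 0, 1],
    "M2": [1, 1, 0, 0, 0],
    "M3": [0, 0, 1, 1, 0],
}


def machine_jaccard(row_a, row_b):
    """Components visiting both machines / components visiting either."""
    both = sum(1 for x, y in zip(row_a, row_b) if x and y)
    either = sum(1 for x, y in zip(row_a, row_b) if x or y)
    # Assumed convention: machines with no components have similarity 0.
    return both / either if either else 0.0


# Similarity for every pair of machines.
pairs = {
    (m, n): machine_jaccard(chart[m], chart[n])
    for m, n in combinations(chart, 2)
}

for (m, n), score in pairs.items():
    print(m, n, round(score, 2))
```

In this toy chart M1 and M2 share two of the three components that visit either machine, so they would be natural candidates for the same cell, while M3 shares nothing with them.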
This is a primary area of interest for me, so I did some reading on applying data analytics to the manufacturing world and stumbled across this paper. Although we have not discussed Jaccard’s similarity coefficient in class, it is an extended application of finding similar item sets.

