Friday, February 15, 2013

An example of Apriori algorithm

This tutorial/example below would help you understand the Apriori algorithm. Let us consider a simple example of the items being purchased at a store with different transaction ID and find out the frequently bought items at the store.
Let us set the support threshold to be 3.
Transaction ID
Items Bought
T1
{Mango, Onion, Nintendo, Key-chain, Eggs, Yo-yo}
T2
{Doll, Onion, Nintendo, Key-chain, Eggs, Yo-yo}
T3
{Mango, Apple, Key-chain, Eggs}
T4
{Mango, Umbrella, Corn, Key-chain, Yo-yo}
T5
{Corn, Onion, Onion, Key-chain, Ice-cream, Eggs}
So for here it should be bought at least 3 times.

For simplicity
M = Mango
O = Onion
And so on……

So the table becomes

Original table:
Transaction ID
Items Bought
T1
{M, O, N, K, E, Y }
T2
{D, O, N, K, E, Y }
T3
{M, A, K, E}
T4
{M, U, C, K, Y }
T5
{C, O, O, K, I, E}


Step 1: Count the number of transactions in which each item occurs, Note ‘O=Onion’ is bought 4 times in total, but, it occurs in just 3 transactions.

Item
No of transactions
M
3
O
3
N
2
K
5
E
4
Y
3
D
1
A
1
U
1
C
2
I
1


Step 2: Now remember we said the item is said frequently bought if it is bought at least 3 times. So in this step we remove all the items that are bought less than 3 times from the above table and we are left with

Item
Number of transactions
M
3
O
3
K
5
E
4
Y
3


This is the single items that are bought frequently. Now let’s say we want to find a pair of items that are bought frequently. We continue from the above table (Table in step 2)

Step 3: We start making pairs from the first item, like MO,MK,ME,MY and then we start with the second item like OK,OE,OY. We did not do OM because we already did MO when we were making pairs with M and buying a Mango and Onion together is same as buying Onion and Mango together. After making all the pairs we get,

Item pairs
MO
MK
 ME
MY
OK
OE
OY
KE
KY
EY


Step 4: Now we count how many times each pair is bought together. For example M and O is just bought together in {M,O,N,K,E,Y}
While M and K is bought together 3 times in {M,O,N,K,E,Y}, {M,A,K,E} AND {M,U,C, K, Y}
After doing that for all the pairs we get

Item Pairs
Number of transactions
MO
1
MK
3
ME
2
MY
2
OK
3
OE
3
OY
2
KE
4
KY
3
EY
2


Step 5: Golden rule to the rescue. Remove all the item pairs with number of transactions less than three and we are left with

Item Pairs
Number of transactions
MK
3
OK
3
OE
3
KE
4
KY
3

These are the pairs of items frequently bought together.
Now let’s say we want to find a set of three items that are brought together.
We use the above table (table in step 5) and make a set of 3 items.

Step 6: To make the set of three items we need one more rule (it’s termed as self-join),
It simply means, from the Item pairs in the above table, we find two pairs with the same first Alphabet, so we get
·         OK and OE, this gives OKE
·         KE and KY, this gives KEY

Then we find how many times O,K,E are bought together in the original table and same for K,E,Y and we get the following table

Item Set
Number of transactions
OKE
3
KEY
2

While we are on this, suppose you have sets of 3 items say ABC, ABD, ACD, ACE, BCD and you want to generate item sets of 4 items you look for two sets having the same first two alphabets.
·         ABC and ABD -> ABCD
·         ACD and ACE -> ACDE

And so on … In general you have to look for sets having just the last alphabet/item different.

Step 7: So we again apply the golden rule, that is, the item set must be bought together at least 3 times which leaves us with just OKE, Since KEY are bought together just two times.

Thus the set of three items that are bought together most frequently are O,K,E.


4 comments:

  1. What if I buy two onions in each transaction, how do I represent it in step 2? should I indicate it as two O's for onions? Can you please work out association rules for the same problem with two onions in each transaction.

    ReplyDelete
    Replies
    1. Sam,

      Typically, the focus is on sets so it does not make a difference.It you want to consider bags of items (instead of sets and you can extend the logic above to achieve this).

      Fadel

      Delete
  2. The algorithm uses the minimum support value to create a set of common repetitive elements as the criterion. The minimum support level depends on the context. If an item-set has a frequency larger than minimum support, then it is called as a frequent item-set. In addition, any subset if a frequent item-set is also a frequent item-set. So, the amount of how many times an element takes place in a set is not important for this algorithm.

    ReplyDelete
  3. how to calculate chi-square and lift correlation if i split them to two categories
    cat1{mango-apple-keychain}
    cat2{onion-eggs}
    thanks

    ReplyDelete