This tutorial/example below would help
you understand the Apriori algorithm. Let us consider a simple example of the
items being purchased at a store with different transaction ID and find out the
frequently bought items at the store.
Let us set the support threshold to be
3.
Transaction ID
|
Items Bought
|
T1
|
{Mango, Onion, Nintendo, Key-chain, Eggs, Yo-yo}
|
T2
|
{Doll, Onion, Nintendo, Key-chain, Eggs, Yo-yo}
|
T3
|
{Mango, Apple, Key-chain, Eggs}
|
T4
|
{Mango, Umbrella, Corn, Key-chain, Yo-yo}
|
T5
|
{Corn, Onion, Onion, Key-chain, Ice-cream, Eggs}
|
So for here it should be bought at
least 3 times.
For simplicity
M = Mango
O = Onion
And so on……
So the table becomes
Original table:
Transaction ID
|
Items Bought
|
T1
|
{M, O, N, K, E, Y }
|
T2
|
{D, O, N, K, E, Y }
|
T3
|
{M, A, K, E}
|
T4
|
{M, U, C, K, Y }
|
T5
|
{C, O, O, K, I, E}
|
Step 1: Count the
number of transactions in which each item occurs, Note ‘O=Onion’ is
bought 4 times in total, but, it occurs in just 3 transactions.
Item
|
No of transactions
|
M
|
3
|
O
|
3
|
N
|
2
|
K
|
5
|
E
|
4
|
Y
|
3
|
D
|
1
|
A
|
1
|
U
|
1
|
C
|
2
|
I
|
1
|
Step 2: Now remember we
said the item is said frequently bought if it is bought at least 3 times. So in
this step we remove all the items that are bought less than 3 times from the
above table and we are left with
Item
|
Number of transactions
|
M
|
3
|
O
|
3
|
K
|
5
|
E
|
4
|
Y
|
3
|
This is the single items that are
bought frequently. Now let’s say we want to find a pair of items that are
bought frequently. We continue from the above table (Table in step 2)
Step 3: We start making pairs
from the first item, like MO,MK,ME,MY and then we start with the second item
like OK,OE,OY. We did not do OM because we already did MO when we were making
pairs with M and buying a Mango and Onion together is same as buying Onion and Mango
together. After making all the pairs we get,
Item pairs
|
MO
|
MK
|
ME
|
MY
|
OK
|
OE
|
OY
|
KE
|
KY
|
EY
|
Step 4: Now we count how many
times each pair is bought together. For example M and O is just bought together
in {M,O,N,K,E,Y}
While M and K is bought together 3
times in {M,O,N,K,E,Y}, {M,A,K,E} AND {M,U,C, K, Y}
After doing that for all the pairs we
get
Item Pairs
|
Number of transactions
|
MO
|
1
|
MK
|
3
|
ME
|
2
|
MY
|
2
|
OK
|
3
|
OE
|
3
|
OY
|
2
|
KE
|
4
|
KY
|
3
|
EY
|
2
|
Step 5: Golden rule to the
rescue. Remove all the item pairs with number of transactions less than three
and we are left with
Item Pairs
|
Number of transactions
|
MK
|
3
|
OK
|
3
|
OE
|
3
|
KE
|
4
|
KY
|
3
|
These are the pairs of items frequently
bought together.
Now let’s say we want to find a set of
three items that are brought together.
We use the above table (table in step
5) and make a set of 3 items.
Step 6: To make the set of
three items we need one more rule (it’s termed as self-join),
It simply means, from the Item pairs in
the above table, we find two pairs with the same first Alphabet, so we get
· OK and OE, this gives OKE
· KE and KY, this gives KEY
Then we find how many times O,K,E are
bought together in the original table and same for K,E,Y and we get the
following table
Item Set
|
Number of transactions
|
OKE
|
3
|
KEY
|
2
|
While we are on this, suppose you have
sets of 3 items say ABC, ABD, ACD, ACE, BCD and you want to generate item sets
of 4 items you look for two sets having the same first two alphabets.
· ABC and ABD -> ABCD
· ACD and ACE -> ACDE
And so on … In general you have to look
for sets having just the last alphabet/item different.
Step 7: So we again apply the
golden rule, that is, the item set must be bought together at least 3 times
which leaves us with just OKE, Since KEY are bought together just two times.
Thus the set of three items that are
bought together most frequently are O,K,E.
What if I buy two onions in each transaction, how do I represent it in step 2? should I indicate it as two O's for onions? Can you please work out association rules for the same problem with two onions in each transaction.
ReplyDeleteSam,
DeleteTypically, the focus is on sets so it does not make a difference.It you want to consider bags of items (instead of sets and you can extend the logic above to achieve this).
Fadel
The algorithm uses the minimum support value to create a set of common repetitive elements as the criterion. The minimum support level depends on the context. If an item-set has a frequency larger than minimum support, then it is called as a frequent item-set. In addition, any subset if a frequent item-set is also a frequent item-set. So, the amount of how many times an element takes place in a set is not important for this algorithm.
ReplyDeletehow to calculate chi-square and lift correlation if i split them to two categories
ReplyDeletecat1{mango-apple-keychain}
cat2{onion-eggs}
thanks