Frequent pattern mining is an important subcategory of data mining, and the Apriori algorithm is one of the most basic and well-known approaches used to mine frequent patterns. This article explains how the algorithm works using a simple example.
First we need to clarify two terms used in this field: support and confidence. We will use the following example (Table 1) to clarify things easily. The table contains a list of items bought from a supermarket. The code names A, B, C, … identify items, and each Transaction ID corresponds to one transaction made by a customer.
Transaction ID | Items Bought |
---|---|
T100 | M,O,N,K,E,Y |
T200 | D,O,N,K,E,Y |
T300 | M,A,K,E |
T400 | M,U,C,K,Y |
T500 | C,O,O,K,I,E |
Table 1
Support
Support is the probability that a transaction contains a given k-item set. It is usually given as a percentage, a fraction (probability) or an integer (number of occurrences), and is denoted by sup(X). As an example,
sup(M) =3/5 = 0.6 ( 60% or 3)
sup(M,E) = 2/5 = 0.4 (40% or 2 )
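Counting support is straightforward to sketch in code. Here is a minimal Python version of the idea, using the transactions from Table 1 (the function and variable names are just illustrative choices):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

transactions = [
    {"M", "O", "N", "K", "E", "Y"},  # T100
    {"D", "O", "N", "K", "E", "Y"},  # T200
    {"M", "A", "K", "E"},            # T300
    {"M", "U", "C", "K", "Y"},       # T400
    {"C", "O", "K", "I", "E"},       # T500 (the duplicate O counts once)
]

print(support({"M"}, transactions))       # 0.6
print(support({"M", "E"}, transactions))  # 0.4
```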
Confidence
Confidence is the conditional probability that a transaction containing X also contains Y. It is also given as a percentage, fraction or integer.
Confidence(X => Y) = sup(X ∪ Y) / sup(X)
where X ∪ Y denotes the item set containing all items of both X and Y.
As an example,
Confidence (K => Y) = sup(K,Y) / sup(K) = (3/5) / (5/5) = 3/5 = 0.6 (60% or 3)
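The same calculation can be sketched in Python by reusing the support idea above (names are illustrative):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """P(transaction contains Y | it contains X) = sup(X ∪ Y) / sup(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

transactions = [
    {"M", "O", "N", "K", "E", "Y"}, {"D", "O", "N", "K", "E", "Y"},
    {"M", "A", "K", "E"}, {"M", "U", "C", "K", "Y"}, {"C", "O", "K", "I", "E"},
]

print(confidence({"K"}, {"Y"}, transactions))  # 0.6
```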
Apriori-Pruning Principle
To understand how the Apriori algorithm works, it is important to understand the Apriori-pruning principle:
If an item set is infrequent, none of its supersets should be generated or tested.
This simply means that if {A} is infrequent, we can skip checking {A, B} as a frequent item set.
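In code, the prune test checks that every k-item subset of a (k+1)-item candidate is among the known frequent k-item sets. A minimal sketch, seeded with the frequent 2-item sets that this example arrives at in Step 2 (names are illustrative):

```python
from itertools import combinations

def passes_apriori_prune(candidate, frequent_k_sets):
    """A (k+1)-item candidate survives only if all its k-item subsets are frequent."""
    k = len(candidate) - 1
    return all(frozenset(sub) in frequent_k_sets
               for sub in combinations(candidate, k))

frequent_2 = {frozenset(s) for s in
              [{"M", "K"}, {"O", "K"}, {"O", "E"}, {"K", "E"}, {"K", "Y"}]}

print(passes_apriori_prune({"E", "K", "O"}, frequent_2))  # True
print(passes_apriori_prune({"K", "M", "O"}, frequent_2))  # False (MO is infrequent)
```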
Method
Step 1 :
Scan the database for frequent 1-item sets (k = 1 initially). Remove item sets that do not satisfy the minimum support requirement.
For the example above, let's assume the minimum support is 60% and the minimum confidence is 80%.
1-item set | Support |
---|---|
M | 60% |
O | 60% |
N | 40% |
K | 100% |
E | 80% |
Y | 60% |
D | 20% |
A | 20% |
U | 20% |
C | 40% |
I | 20% |
Table 2
From this table we remove the 1-item sets whose support is below the required minimum. After that operation we have the following table (Table 3).
1-item set | Support |
---|---|
M | 60% |
O | 60% |
K | 100% |
E | 80% |
Y | 60% |
Table 3
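Step 1 can be sketched in a few lines of Python: count how many transactions contain each item, then keep the items at or above the threshold (variable names are illustrative):

```python
from collections import Counter

transactions = [
    {"M", "O", "N", "K", "E", "Y"}, {"D", "O", "N", "K", "E", "Y"},
    {"M", "A", "K", "E"}, {"M", "U", "C", "K", "Y"}, {"C", "O", "K", "I", "E"},
]
min_support = 0.6

# Count occurrences of each individual item across all transactions.
counts = Counter(item for t in transactions for item in t)
frequent_1 = {item for item, n in counts.items()
              if n / len(transactions) >= min_support}

print(sorted(frequent_1))  # ['E', 'K', 'M', 'O', 'Y']
```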
Step 2 :
Generate 2-item sets from the frequent 1-item sets.
Looking at Table 3, the possible 2-item combinations are MO, MK, ME, MY, OK, OE, OY, KE, KY and EY.
Then we calculate the support for each of them (see Table 4).
2-item set | Support |
---|---|
MO | 20% |
MK | 60% |
ME | 40% |
MY | 40% |
OK | 60% |
OE | 60% |
OY | 40% |
KE | 80% |
KY | 60% |
EY | 40% |
Table 4
Remove the 2-item sets whose support is below the required minimum. When we do that we will have Table 5.
2-item set | Support |
---|---|
MK | 60% |
OK | 60% |
OE | 60% |
KE | 80% |
KY | 60% |
Table 5
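Step 2 can also be sketched directly: pair up the frequent 1-item sets, compute each pair's support, and keep the pairs that meet the threshold (names are illustrative):

```python
from itertools import combinations

transactions = [
    {"M", "O", "N", "K", "E", "Y"}, {"D", "O", "N", "K", "E", "Y"},
    {"M", "A", "K", "E"}, {"M", "U", "C", "K", "Y"}, {"C", "O", "K", "I", "E"},
]
frequent_1 = ["E", "K", "M", "O", "Y"]
min_support = 0.6

def support(itemset, txns):
    return sum(itemset <= t for t in txns) / len(txns)

# Every 2-item combination of the frequent 1-item sets is a candidate.
candidates = [frozenset(p) for p in combinations(frequent_1, 2)]
frequent_2 = {c: support(c, transactions) for c in candidates
              if support(c, transactions) >= min_support}

for s in sorted(frequent_2, key=sorted):
    print("".join(sorted(s)), frequent_2[s])  # EK 0.8, EO 0.6, KM 0.6, KO 0.6, KY 0.6
```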
Step 3 :
Generate (k+1)-item sets from the frequent k-item sets. If a (k+1)-item candidate contains a k-item subset that is not among the frequent k-item sets from the previous step, remove it (according to the Apriori-pruning principle). Then check the support as usual and remove candidates without enough support. We continue this process until no frequent candidate set is generated.
All possible 3-item candidates for the above example are KMO, EKM, KMY, EKY, KOY and EKO. KMO cannot be in the list since its subset MO is not in the 2-item list. The same applies to EKM (ME is infrequent), KMY (MY), EKY (EY) and KOY (OY). EKO remains in the table since all of its 2-item subsets (KE, KO and EO) are in the list.
Then we calculate support for the 3-item set we have. (See table 6)
3-item set | Support |
---|---|
EKO | 60% |
Table 6
EKO has the required minimum support, so it remains in the table.
It is clear that we cannot generate a 4-item set from this table since we only have one 3-item set, so the process terminates.
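Putting the three steps together, the whole procedure can be sketched as one loop: keep the frequent k-item sets, join them into (k+1)-item candidates, prune, check support, and repeat until nothing survives. This is a minimal illustrative implementation, not a production one:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent item set with its support, as a dict."""
    n = len(transactions)
    def sup(s):
        return sum(s <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items if sup(frozenset([i])) >= min_support}
    frequent, k = {}, 1
    while level:
        frequent.update({s: sup(s) for s in level})
        # Join step: merge k-item sets into (k+1)-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step (Apriori principle) plus the usual support check.
        level = {c for c in candidates
                 if all(frozenset(sub) in level for sub in combinations(c, k))
                 and sup(c) >= min_support}
        k += 1
    return frequent

transactions = [
    {"M", "O", "N", "K", "E", "Y"}, {"D", "O", "N", "K", "E", "Y"},
    {"M", "A", "K", "E"}, {"M", "U", "C", "K", "Y"}, {"C", "O", "K", "I", "E"},
]
result = apriori(transactions, 0.6)
print(result[frozenset({"E", "K", "O"})])  # 0.6
print(max(len(s) for s in result))         # 3
```

Running it on the Table 1 data recovers exactly the item sets derived above: five frequent 1-item sets, five frequent 2-item sets, and the single 3-item set EKO.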
Thank you for reading this post. In a follow-up post I will show how confidence affects this process and how it is used in practice.