Friday, February 8, 2013

Simple model to generate association rules in RapidMiner


In this post, I show how to build a simple RapidMiner model that creates association rules. To demonstrate the process, I created an example based on the Health Care example presented on page 6 of the 8th lecture's material. The example considers the possibility of two different side effects occurring when a patient takes a combination of six different drugs. First, I generated the table in CSV format and imported it into RapidMiner. As figures 1 and 2 show, the input table has nine attributes, all of which are binomial except for PID, which is an integer.
Figure 1.
Figure 2.
To generate rules, we need the FP-Growth operator, which accepts only binomial attributes. Since we do not need the PID attribute in our model, we exclude it with the Select Attributes operator. Add the Select Attributes operator to the process window and connect it to the input data. In the Attribute Filter Type drop-down box, select Subset and press the Select Attributes button. The Select Attributes window is displayed as in figure 3.
Figure 3
Add all binomial attributes to the Selected Attribute window as indicated in figure 4.
Figure 4
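For readers who want to see the same preparation step outside RapidMiner, here is a minimal Python sketch of dropping PID and keeping only the true/false columns. The sample rows and the column names D1, D2 and SE1 are made up for illustration (the real table is only shown in the figures), and the helper name load_binomial_columns is hypothetical.

```python
import csv
import io

# Hypothetical stand-in for the CSV file; the real table is not reproduced here.
SAMPLE_CSV = """PID,D1,D2,SE1
1,true,false,true
2,true,true,false
"""


def load_binomial_columns(text, drop=("PID",)):
    """Read CSV text and return each row as a dict of booleans, skipping dropped columns."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        rows.append({
            name: value.strip().lower() == "true"
            for name, value in record.items()
            if name not in drop
        })
    return rows


rows = load_binomial_columns(SAMPLE_CSV)
print(rows[0])
```

This plays roughly the same role as the Select Attributes step: the identifier column is left out, so only binomial attributes reach the pattern-mining stage.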
In the search field of the Operators tab, search for the FP-Growth operator and add it to your model. The FP in FP-Growth stands for Frequent Pattern. Frequent pattern analysis is used in many kinds of data mining and is a necessary component of association rule mining. Without the frequencies of attribute combinations, we cannot determine whether any pattern in the data occurs often enough to be considered a rule. One important parameter of this operator is Min Support: the number of times that a pattern occurs, divided by the number of observations in the data set. For this example, we leave it at its default value.
Figure 5
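To make the Min Support definition concrete, the following Python sketch computes support by hand and lists every itemset that reaches a threshold. It is a brute-force enumeration, not the FP-Growth algorithm itself, and the four transactions are invented for illustration; the 0.5 threshold is also just an assumption, not RapidMiner's default.

```python
from itertools import combinations

# Hypothetical data: each set holds the attributes that are true for one patient.
transactions = [
    {"D1", "D2", "SE1"},
    {"D1", "SE1"},
    {"D2", "SE2"},
    {"D1", "D2", "SE1"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Brute-force search (not FP-Growth) for itemsets with support >= min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            s = support(candidate, transactions)
            if s >= min_support:
                result[candidate] = s
    return result


frequent = frequent_itemsets(transactions, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), s)
```

Lowering min_support in this sketch lets more itemsets through, which mirrors the advice about adjusting Min Support when no frequent patterns appear.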
Run the model and switch to the result window (figure 6). Some of our attributes do appear in frequent patterns. In fact, this example produces many frequent patterns because the data set is so small. If your model does not generate any frequent patterns, you may need to lower the Min Support value until you get a reasonable result.
We can investigate the possible connection further by adding one final operator to our model.
Figure 6
In the search field of the Operators tab, search for the Create Association Rules operator and drag it into your model, as illustrated in figure 7. This operator takes the frequent pattern data and seeks out patterns that occur so frequently that they could be considered rules. The Create Association Rules operator produces both a set of rules (through the rul port) and a set of associated items (through the ite port). In this model we only want the rules, so we simply connect its rul port to the res port of the process window.
Figure 7
One of the influential parameters of this operator is Min Confidence. The confidence percentage measures how confident we are that, when one attribute is flagged as true, the associated attribute will also be flagged as true. It is obtained by dividing the number of times that a rule holds by the number of times that it could have held. If your model does not generate any rules, you may need to decrease the confidence threshold. In this example, because we used only a small amount of artificial input data, we ended up with many rules with high confidence percentages, which is not typical of real-world problems.
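The confidence calculation described above can be sketched in a few lines of Python. The transactions are invented for illustration, and the function names are hypothetical; the sketch only shows the arithmetic, not how RapidMiner implements it.

```python
# Hypothetical data: each set holds the attributes that are true for one patient.
transactions = [
    {"D1", "D2", "SE1"},
    {"D1", "SE1"},
    {"D2", "SE2"},
    {"D1", "D2", "SE1"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(premise, conclusion, transactions):
    """Times the rule holds divided by times the premise occurs."""
    premise_count = count(premise, transactions)
    if premise_count == 0:
        return 0.0  # the rule never had a chance to apply
    return count(premise | conclusion, transactions) / premise_count


print(confidence({"D1"}, {"SE1"}, transactions))
print(confidence({"D2"}, {"SE1"}, transactions))
```

In this toy data every patient who took D1 also shows SE1, so that rule reaches full confidence, while D2 leads to SE1 in only two of three cases.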
As you see in figure 8, many redundant rules were generated that have either side effects (SE#) in Premises or drugs (D#) in Conclusion. Looking at the input data, one can say that the meaningful rules should contain only drugs in Premises and only side effects in Conclusion. So, in figure 8, only rules of this type are highlighted by red arrows; the rest are redundant.
Figure 8
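Picking out the useful rules by eye works for a small table; the same filter can be sketched in Python. The rule list below is invented for illustration, and the only thing taken from the example is the naming convention that drug attributes start with D and side-effect attributes start with SE.

```python
# Hypothetical rules: (premise, conclusion, confidence).
rules = [
    (frozenset({"D1"}), frozenset({"SE1"}), 1.0),
    (frozenset({"SE1"}), frozenset({"D1"}), 1.0),
    (frozenset({"D1", "D2"}), frozenset({"SE1"}), 1.0),
    (frozenset({"D1", "SE1"}), frozenset({"D2"}), 0.67),
]


def is_drug(name):
    """Assumes drug attributes are named D1, D2, ..."""
    return name.startswith("D")


def is_side_effect(name):
    """Assumes side-effect attributes are named SE1, SE2, ..."""
    return name.startswith("SE")


def drug_to_side_effect(rule):
    """Keep a rule only if its premise is all drugs and its conclusion is all side effects."""
    premise, conclusion, _ = rule
    return all(is_drug(p) for p in premise) and all(is_side_effect(c) for c in conclusion)


kept = [rule for rule in rules if drug_to_side_effect(rule)]
for premise, conclusion, conf in kept:
    print(sorted(premise), "->", sorted(conclusion), conf)
```

A filter like this keeps the direction of the rule explicit, which is the same judgment the red arrows in the figure express.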
Again, bear in mind that the small amount of artificial input data and the particular nature of the data in this example led to many rules with high confidence and support. In a real problem this is usually not the case, and typically only a limited number of rules are generated with high or moderate confidence and support.

2 comments:

  1. This comment has been removed by the author.

  2. Thank you for this great tutorial!
    I need to create an association rule between texts (strings), not binomial/boolean variables. Is there a way to do that?
