In this post, I am going to show how
to build a simple model that creates association rules in RapidMiner. To demonstrate
the process, I created an example based on the Health Care example presented on
page 6 of the 8th lecture material. In this example, the possibility
of having two different side effects is considered based on consuming a
combination of 6 different drugs. First, the table was generated
in CSV format and then imported into RapidMiner. As can be seen in
figures 1 and 2, the input table has 9 attributes, all of which are binomial except
the PID attribute, which is an integer.
Figure 1. |
Figure 2. |
To generate rules, we
need to use the FP-Growth operator, which accepts only binomial attributes. Since
we do not need the PID attribute in our model, we are going to exclude it by using
the Select Attributes operator. Add the Select Attributes operator to
the process window and connect it to the input data. In the Attribute Filter Type
drop-down box, select Subset and press the Select Attributes button. The Select
Attributes window is displayed as in figure 3.
Figure 3 |
Add all binomial attributes to the
Selected Attribute window as indicated in figure 4.
Figure 4 |
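For readers who want to see the same step outside RapidMiner, here is a minimal Python sketch of what the Select Attributes step does. The rows are made up for illustration and the column names (PID, D1, D2, SE1) are my assumptions, not the actual table from the lecture material.

```python
# Toy rows, made up for illustration; the column names are assumptions.
rows = [
    {"PID": 1, "D1": True, "D2": False, "SE1": True},
    {"PID": 2, "D1": False, "D2": True, "SE1": False},
]


def select_attributes(rows, keep):
    """Return copies of the rows containing only the attributes in `keep`."""
    return [{name: row[name] for name in keep} for row in rows]


# Keep every binomial (boolean) attribute; this drops the integer PID column.
binomial = [name for name, value in rows[0].items() if isinstance(value, bool)]
selected = select_attributes(rows, binomial)
print(selected)
```

The type check stands in for the manual selection in the Select Attributes window: anything that is not a true/false column is left out.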
In the search field in the operator tab, search for the FP-Growth operator and add it to your model. The FP in
FP-Growth stands for Frequent Pattern. Frequent pattern analysis is used
in many kinds of data mining and is a necessary component of association rule
mining. Without the frequencies of attribute combinations, we cannot
determine whether any of the patterns in the data occur often enough to be
considered rules. One important parameter of this operator is Min Support. It is
the number of observations in which an attribute combination occurs, divided by the total number of
observations in the data set. For this example, we leave it at its default value.
Figure 5 |
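To make the support definition concrete, here is a small Python sketch. It is a brute-force enumeration, not the FP-Growth algorithm itself, and the transactions and item names (D1, D2, SE1, SE2) are made up for illustration. Each transaction lists the attributes flagged true for one patient.

```python
from itertools import combinations

# Made-up transactions; each set holds the attributes that are true for one patient.
transactions = [
    {"D1", "D2", "SE1"},
    {"D1", "SE1"},
    {"D2", "SE2"},
    {"D1", "D2", "SE1"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Brute-force search (not FP-Growth) for itemsets with support >= min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(frozenset(combo), transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result


freq = frequent_itemsets(transactions, min_support=0.5)
for itemset, s in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Raising `min_support` shrinks the result and lowering it grows the result, which is the same trade-off the Min Support parameter controls in the operator.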
Run
the model and switch to the result window (figure 6). Some of our attributes
appear to have frequent patterns in them. In fact, many
frequent patterns are observed in this example because it contains only a small amount of data. If your
model does not generate any frequent patterns, you may need to decrease the Min
Support value until you get a reasonable result.
We
can investigate the possible connections further by adding one final operator to
our model.
Figure 6 |
In the
search field in the operator tab, search for the Create Association Rules operator
and drag it into your model, as illustrated in figure 7. This operator takes in
the frequent pattern data and seeks out any patterns that occur so
frequently that they could be considered rules. The Create Association Rules
operator generates both a set of rules (through the rul port) and a set of
associated items (through the ite port). In this model we are only interested in the
rules, so we simply connect its rul port to the res port of
the process window.
Figure 7 |
One of
the influential parameters of this operator is Min Confidence. Confidence is a measure of how confident we are that when one attribute is flagged
as true, the associated attribute will also be flagged as true. It is obtained by
dividing the number of times that a rule occurs by the number of times that it
could have occurred, that is, the number of observations in which the premise is true. If your model does not generate any rules, you may need to
decrease the Min Confidence value. In this example, since we used only a
small amount of artificial input data, we ended up with many rules with high
confidence, which is not typical of real-world problems.
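The confidence calculation can be sketched in a few lines of Python. As before, the transactions and item names are made up for illustration; this shows the arithmetic only and is not how RapidMiner implements the operator.

```python
# Made-up transactions; each set holds the attributes that are true for one patient.
transactions = [
    {"D1", "D2", "SE1"},
    {"D1", "SE1"},
    {"D2", "SE2"},
    {"D1", "D2", "SE1"},
]


def confidence(premise, conclusion, transactions):
    """Times the rule holds divided by times the premise occurs."""
    premise_count = sum(1 for t in transactions if premise <= t)
    both_count = sum(1 for t in transactions if (premise | conclusion) <= t)
    # A premise that never occurs gives the rule no chance to hold.
    return both_count / premise_count if premise_count else 0.0


def passes(premise, conclusion, transactions, min_confidence):
    """True when the rule premise -> conclusion meets the confidence threshold."""
    return confidence(premise, conclusion, transactions) >= min_confidence


print(confidence({"D1"}, {"SE1"}, transactions))
print(confidence({"D2"}, {"SE1"}, transactions))
print(passes({"D2"}, {"SE1"}, transactions, min_confidence=0.8))
```

In this toy data every patient who took D1 also shows SE1, so the rule D1 -> SE1 has full confidence, while D2 -> SE1 holds in only two of the three cases where D2 occurs and is dropped at a threshold of 0.8.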
As you can see in figure 8, many redundant rules were generated that have either side effects (SE#) in Premises or drugs (D#) in Conclusion. Looking at the input data, one can say that the correct rules should contain only drugs in Premises and only side effects in Conclusion. So, in the picture below, only rules of this type are highlighted by red arrows; the rest are redundant.
Figure 8 |
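The manual filtering described above could also be done programmatically once the rules are exported. The sketch below assumes that drug attributes are named with a D prefix and side effects with an SE prefix, following the D# and SE# labels; the rule list itself is hypothetical and not taken from the actual result.

```python
# Hypothetical rules as (premise, conclusion) pairs; not the actual RapidMiner output.
rules = [
    ({"D1"}, {"SE1"}),
    ({"SE1"}, {"D1"}),
    ({"D1", "SE1"}, {"D2"}),
    ({"D1", "D2"}, {"SE1"}),
]


def is_side_effect(name):
    """Assumed naming convention: side effects are SE1, SE2, ..."""
    return name.startswith("SE")


def is_drug(name):
    """Assumed naming convention: drugs are D1, D2, ..."""
    return name.startswith("D")


def drug_to_side_effect(rule):
    """Keep a rule only if its premise is all drugs and its conclusion all side effects."""
    premise, conclusion = rule
    return all(is_drug(i) for i in premise) and all(is_side_effect(i) for i in conclusion)


kept = [rule for rule in rules if drug_to_side_effect(rule)]
for premise, conclusion in kept:
    print(sorted(premise), "->", sorted(conclusion))
```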
Again, keep in mind
that the small amount of artificial input data and the particular kind of data in this example led to many rules with high levels of confidence and support. In a
real problem this is usually not the case, and typically only a limited number of rules are generated with high or
moderate levels of confidence and support.
Thank you for this great tutorial!
I need to create an association rule between texts (strings), not binomial/boolean variables. Is there a way to do that?