Friday, February 1, 2013

How to define Date attributes in RapidMiner


In this post I am going to show how to prepare the OSHA data for text mining in RapidMiner. The first step is to import the raw data into RapidMiner. There are several ways to do this; in this post, we are going to use the ‘Import CSV’ function. In the Repositories area, the second icon from the left contains the ‘Import CSV File’ option. Click on it, as indicated by the red rectangle in figure 1.
Figure 1.

When the Data Import Wizard opens, navigate to the location where you have stored the OSHA data set, then click Next.
Figure 2.

In the second step, choose the comma as the column separator, as indicated in figure 3, then click Next.
Figure 3.

At step 3, we can mark the first row as containing the attribute names. Click on the annotation drop-down box on the first row and select ‘Name’, as indicated in figure 4. Click Next.
Figure 4.
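
For readers who want to see what the wizard settings amount to, here is a minimal Python sketch of the same two choices: a comma as the column separator and the first row used as attribute names. It is only an illustration, not what RapidMiner does internally. The sample rows, the `Id` and `Description` column names, and the `read_rows` helper are made up for this example; only the two date column names come from the OSHA data set.

```python
import csv
import io

# Hypothetical stand-in for the OSHA file. The rows and the "Id" and
# "Description" columns are invented for illustration only.
SAMPLE = (
    "Id,Summary Report Date,Date of Incident,Description\n"
    '1,02/01/2012,01/15/2012,"Worker fell, ladder slipped"\n'
    "2,03/10/2012,03/02/2012,Forklift struck a rack\n"
)


def read_rows(text, separator=","):
    """Split each line on the separator and use the first row as attribute names."""
    reader = csv.DictReader(io.StringIO(text), delimiter=separator)
    return list(reader)


rows = read_rows(SAMPLE)
print(list(rows[0].keys()))    # attribute names taken from the first row
print(rows[0]["Description"])  # a quoted comma stays inside one value
```

Note that a comma inside a quoted value is not treated as a separator, which matters for free-text columns like the incident descriptions.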

In the next window, we can define the data type of each attribute. RapidMiner proposes its best guess for each attribute, and we can either accept the proposed data types and change them later with Operators, or modify them now. Let’s change them at this step. We know that the second and third columns are dates, so change the attribute type of these columns to ‘Date’ and then define the date format in the highlighted box in figure 5. RapidMiner offers some predefined date and time formats in the drop-down box, but in this case none of them matches our date format, so we need to define it ourselves. As indicated in figure 5, the format of our date columns is ‘MM/dd/yyyy’, so type this format in the “Date format” box. Note that the month must be entered in upper case and the day and year in lower case. The pattern is case sensitive: lower-case ‘mm’ stands for minutes, so RapidMiner will not read the month and will treat every date as falling in January.
There is a check box above each attribute. If you do not want to import a particular column, just uncheck its box at this step.


Figure 5.
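
The case-sensitivity pitfall is easy to reproduce outside RapidMiner. The sketch below uses Python's `datetime.strptime`, whose format codes are different from the pattern typed into the “Date format” box: here `%m` is the month and `%M` is the minute, so the upper/lower case convention is reversed. The sample date and the `parse_date` helper are assumptions made for illustration.

```python
from datetime import datetime

# Python counterpart of the 'MM/dd/yyyy' pattern:
# %m = month, %d = day of month, %Y = four-digit year.
DATE_FORMAT = "%m/%d/%Y"


def parse_date(text, fmt=DATE_FORMAT):
    """Parse a date string with the given format (hypothetical helper)."""
    return datetime.strptime(text, fmt)


good = parse_date("07/15/2012")

# Deliberately wrong case: in Python %M means minutes, so no month is read
# and strptime falls back to its default month, January.
bad = parse_date("07/15/2012", "%M/%d/%Y")

print(good.month)             # month read correctly
print(bad.month, bad.minute)  # month defaulted, "07" ended up in the minutes
```

The wrong pattern does not raise an error; it silently produces a January date, which is why it is worth checking a few imported values against the raw file.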

The final step is to store the data set in the Repository folder and give it a name.

Figure 6.

Now we can see that the OSHA data set is available under the Repository folder. To add this data set to our model, drag the OSHA icon and release it in the Process window. Your Process window should look like figure 7.
Figure 7

Now, if you run the model the results should look like figures 8 and 9.
Figure 8 



Figure 9
As you can see, the attributes "Summary Report Date" and "Date of Incident" are date types and the last two attributes are text types.
As I mentioned before, if you accept the default attribute types while importing the data set into RapidMiner, you can still change their type or their role later in your model. In the Operators area, under the ‘Data Transformation’ folder tree, you will see the ‘Type Conversion’ folder, which contains different Operators to convert various attribute types.
Figure 10

Assume that for our analysis we need to extract the incident months. To do this, drag the “Date to Numerical” Operator to the Process window and set up its properties as shown in figure 11.
Figure 11


Since we checked the box “Keep old attribute”, RapidMiner keeps the old attribute and adds the new attribute to our model. We may want to change the default name proposed by RapidMiner for the new attribute. In the search box of the Operators area, type “Rename”, then add the Rename operator to our model and set its properties as shown in figure 12.
Figure 12
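
As a rough picture of what these two operators do to a single example, here is a small Python sketch: it extracts the month of a date as a number, optionally keeps the original attribute, and lets the new attribute be renamed. The `add_month` helper, the `_month` default suffix and the “Incident Month” name are hypothetical choices for this illustration, not RapidMiner's actual defaults or the values shown in the figures.

```python
from datetime import date


def add_month(row, source, keep_old=True, new_name=None):
    """Return a copy of row with the month of row[source] as a number (1-12).

    keep_old mirrors the idea of the "Keep old attribute" check box;
    new_name plays the part of the Rename step. Both are assumptions.
    """
    result = dict(row)
    month = result[source].month
    if not keep_old:
        del result[source]
    # Hypothetical default name when no new name is given.
    result[new_name or source + "_month"] = month
    return result


row = {"Date of Incident": date(2012, 7, 15)}
out = add_month(row, "Date of Incident", new_name="Incident Month")
print(out)
```

With `keep_old=False` only the numeric month would remain, so keeping the original date is the safer choice if later steps still need the full date.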

Run the model and make sure your results look like figures 13 and 14.
Figure 13

Figure 14





3 comments:

  1. Shahab,

    Thank you for putting this together.

    Fadel

  2. Hi, have you ever tried to get the name of the month ? or the day ? for example get the "July" string in another column

    thank you

  3. https://auburnbigdata.blogspot.com/2013/02/count-words-with-amr.html?showComment=1564726187313#c8209480406364372152
