In this post I am going to show a method to prepare the OSHA data
for text mining in RapidMiner. The first step is to import the raw data into
RapidMiner. There are several ways to do this;
in this post, we are going to use the ‘Import CSV’ function. In the Repositories area,
the second icon from the left contains the ‘Import CSV File’ option. Click on it,
as indicated by the red rectangle in figure 1.
Figure 1.
When the Data Import Wizard opens, navigate to the
location where you have stored the OSHA data set, then click Next.
Figure 2.
In the second step, choose the comma as the column separator,
as indicated in figure 3, then click Next.
Figure 3.
At step 3, we can mark the first row as the row that holds the attribute
names. Click on the annotation drop-down box on the first row and select ‘Name’, as
indicated in figure 4. Click Next.
Figure 4.
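If you want to check the file outside RapidMiner, the two wizard settings above (comma as the column separator, first row as the attribute names) can be sketched in Python with the standard csv module. This is only an illustration: the `Id` column and every value in the sample are made up, and only the two date column names come from the OSHA data set.

```python
import csv
import io

# Made-up sample standing in for the OSHA CSV file (hypothetical values).
SAMPLE_CSV = (
    "Id,Summary Report Date,Date of Incident\n"
    "1,08/01/2012,07/15/2012\n"
    "2,03/20/2012,02/28/2012\n"
)


def read_rows(text, separator=","):
    """Parse CSV text, using the first row as the attribute names."""
    reader = csv.DictReader(io.StringIO(text), delimiter=separator)
    return list(reader)


rows = read_rows(SAMPLE_CSV)
print(list(rows[0].keys()))  # attribute names taken from the first row
print(len(rows))             # number of data rows, header excluded
```

To read a real file instead of the sample string, open it with `open(path, newline="")` and pass the file object to `csv.DictReader` in the same way.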
In the next window, we can define the data type of each
attribute. RapidMiner proposes its best guess for each attribute, and we can
either accept the proposed data types and change them later with Operators,
or modify them now. Let’s change them at this step. We know that the second and
third columns are dates, so change
the attribute type of these columns to ‘Date’ and then define the date format
in the highlighted box in figure 5. RapidMiner offers some predefined date
and time formats in the drop-down box, but in this case none
of them matches our date format; therefore, we need to define it
ourselves. As indicated in figure 5, the format of our date columns
is ‘MM/dd/yyyy’, so type this format in the “Date format” box. Note
that you should enter the month in upper case and the day and year in lower
case; otherwise RapidMiner will not recognize the months and will treat every month
as January.
There is a check box above each attribute. If you do not
want to import a particular column, just uncheck its box at this step.
Figure 5.
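The upper-case/lower-case warning is easier to see in code. The sketch below assumes that RapidMiner's date patterns follow the Java SimpleDateFormat letters, where `MM` means month and `mm` means minute; that assumption is not shown in the figures. Python's `strptime` has the same kind of distinction (`%m` is the month, `%M` is the minute), so it can reproduce the effect with a made-up date.

```python
from datetime import datetime

raw = "07/15/2012"  # made-up value in the post's MM/dd/yyyy layout

# Correct: %m is the month field (the analogue of upper-case MM).
good = datetime.strptime(raw, "%m/%d/%Y")

# Wrong: %M is the minute field (the analogue of lower-case mm).
# No month is parsed, so it falls back to the default month, January.
bad = datetime.strptime(raw, "%M/%d/%Y")

print(good.month)  # 7
print(bad.month)   # 1
print(bad.minute)  # 7
```

With the wrong letter, the "07" ends up in the minute field and every date lands in January, which matches the behavior described above.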
The final step is to store the data set in a Repository
folder and give it a name.
Figure 6.
Now we can see that the OSHA data set is
available under the Repository folder. To add this data set to our model, drag the OSHA icon and release it in the Process window. Your Process
window should look like figure 7.
Figure 7
Now, if you run the model the results should
look like figures 8 and 9.
Figure 8
Figure 9
As you can see, the attributes "Summary Report Date" and "Date of
Incident" are date types and the last two attributes are text types.
As I mentioned before, if you accept the default attribute types while importing the
data set into RapidMiner, you can always change their type
or their role later in your model. In the Operators area, under the ‘Data Transformation’
folder tree, you will see the “Type Conversion” folder, which contains different
Operators to convert various attribute types.
Figure 10
Assume that for our analysis we need to extract the incident
months. To do this, drag the “Date to Numerical” Operator to the Process
window and set up its properties as indicated in figure 11.
Figure 11
Figure 12
Run the model and make sure your results look
like figures 13 and 14.
Figure 13
Figure 14
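The same month extraction can be sketched outside RapidMiner. The snippet below is a rough Python equivalent, not the “Date to Numerical” Operator itself: it parses a few made-up "Date of Incident" values and keeps the month number, with January as 1 in this sketch. The helper name `incident_month` is hypothetical.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # Python spelling of MM/dd/yyyy


def incident_month(value):
    """Return the month number (1-12) of a date string."""
    return datetime.strptime(value, DATE_FORMAT).month


# Made-up "Date of Incident" values for illustration only.
incident_dates = ["07/15/2012", "02/28/2012", "12/01/2011"]
months = [incident_month(d) for d in incident_dates]
print(months)  # [7, 2, 12]
```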
Shahab,
Thank you for putting this together.
Fadel
Hi, have you ever tried to get the name of the month or the day? For example, to get the "July" string in another column?
Thank you