
Sunday, April 21, 2013

Association Rules with RapidMiner



Here is a video I made on how to do association rule mining on text documents in RapidMiner.
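If you'd like to see the underlying idea in code as well, here is a minimal Python sketch of association rule mining over text documents, assuming the mlxtend library; the document word lists below are made up for illustration, and the video itself does everything inside RapidMiner.

```python
# A minimal sketch of association rule mining on text documents,
# assuming the mlxtend library (pip install mlxtend pandas).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Treat each document as a "transaction" of the words it contains.
docs = [
    ["data", "mining", "rapidminer"],
    ["text", "mining", "rapidminer"],
    ["data", "text", "mining"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(docs).transform(docs), columns=te.columns_)

# Find frequent word sets, then derive rules like {mining} -> {rapidminer}.
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```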

Thursday, April 4, 2013

Text Mining Techniques for the Hiring Process


This tutorial is about resume sorting and clustering using text mining techniques. Hiring new employees is a crucial function of every company, and the pool of resumes a company receives during recruitment is usually far larger than the number of positions to fill. Text mining techniques can sort and filter resumes by keywords such as internships, relevant skills, and experience. Based on those keywords, categories can be defined and resumes grouped, ultimately leading to the selection of better candidates. The video posted below shows the different techniques used to filter resumes.
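If you want to experiment with the idea outside RapidMiner, here is a minimal Python sketch of keyword-based resume categorization; the keyword categories and the sample resume are made up for illustration, and the video uses RapidMiner's text processing operators instead.

```python
# A minimal sketch of keyword-based resume categorization in plain Python.
import re

# Hypothetical keyword categories; real ones would come from the job posting.
CATEGORIES = {
    "internships": ["intern", "internship", "co-op"],
    "skills": ["python", "sql", "java", "excel"],
    "experience": ["years of experience", "managed", "led"],
}

def categorize(resume_text):
    """Return the categories whose keywords appear in the resume."""
    text = resume_text.lower()
    return [name for name, words in CATEGORIES.items()
            if any(re.search(r"\b" + re.escape(w), text) for w in words)]

resume = "Software internship; 2 years of experience with Python and SQL."
print(categorize(resume))  # ['internships', 'skills', 'experience']
```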


Wednesday, April 3, 2013

Web Crawling with RapidMiner



For this blog post I am going to show you how to use RapidMiner to crawl web pages for you. First, when you open RapidMiner, make sure you have the Web Mining extension installed. If not, click on the Help menu at the top of the screen, click on "Update RapidMiner", and then select and download the Web Mining Extension.

Once you have the Web Mining Extension installed, open the Web Mining folder under the Operators section, then select the Crawl Web operator and drag it onto the Process panel.



Once you have done this, you have to choose a website to crawl. How about we crawl this very blog? Copy and paste the URL into the url box on the right side of the screen, under the Parameters tab.


Then you have to select an output directory for RapidMiner to save your files to; I've just chosen a folder on my desktop called "tmp". Next, select a file extension; I've chosen .txt, so RapidMiner will save the pages it crawls as text files. The max depth is how many consecutive links the crawl will follow; I've kept the default of 2. The domain parameter controls whether the crawl stays on the same server or is allowed to roam the entire web; I've left it at the default of web. I also set max threads, the number of parallel threads the crawl will use, to 4 in order to speed up the crawl. Finally, I changed the user agent to that of my browser; to do this, just go to http://whatsmyuseragent.com/ and copy and paste your user agent into the box.
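To make those parameters concrete, here is a rough Python sketch of what a crawler like this does, assuming the requests and beautifulsoup4 packages; the names OUTPUT_DIR, MAX_DEPTH, and so on are my own stand-ins for the operator's parameters, and the parallelism from max threads is omitted for brevity.

```python
# A rough sketch of a depth-limited crawler, mirroring the Crawl Web
# parameters above (output directory, extension, max depth, user agent).
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

OUTPUT_DIR = "tmp"            # the output directory parameter
EXTENSION = ".txt"            # the file extension parameter
MAX_DEPTH = 2                 # how many consecutive links to follow
USER_AGENT = "Mozilla/5.0"    # paste your own user agent string here

def crawl(url, depth=0, seen=None):
    seen = seen if seen is not None else set()
    if depth > MAX_DEPTH or url in seen:
        return
    seen.add(url)
    page = requests.get(url, headers={"User-Agent": USER_AGENT}).text
    # Save the page to the output directory, like RapidMiner does.
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    fname = os.path.join(OUTPUT_DIR, f"page_{len(seen)}{EXTENSION}")
    with open(fname, "w", encoding="utf-8") as f:
        f.write(page)
    # Follow each link one level deeper.
    for a in BeautifulSoup(page, "html.parser").find_all("a", href=True):
        crawl(urljoin(url, a["href"]), depth + 1, seen)
```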
 
Now we need to set up some crawling rules. Click on the button next to crawling rules that says "Edit lists", and a dialog box will open.
As you can see, you can add crawling rules. The first rule establishes which links the crawl will follow. I have set it to .+auburnblogspot.+, which tells the crawler to follow any link with auburnblogspot in the URL; the .+ matches any number of characters before and after auburnblogspot. The second rule stores only the pages that have auburnblogspot in the URL.
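To see how that pattern behaves, here is a quick Python check using the re module; the example URLs are made up, and I'm assuming the rule has to match the whole URL, which is why the .+ on each side matters.

```python
# Checking the crawling-rule pattern against some hypothetical URLs.
import re

RULE = r".+auburnblogspot.+"

urls = [
    "http://auburnblogspot.blogspot.com/2013/04/web-crawling.html",
    "http://example.com/unrelated-page",
]
for url in urls:
    # The URL is followed/stored only if the whole URL matches the rule.
    print(url, "->", bool(re.fullmatch(RULE, url)))
```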

OK, everything should be set. Hit the play button and RapidMiner will crawl the web, saving the pages to the directory you specified. Then you can perform your analysis on the saved files.