Wednesday, April 3, 2013

Web Crawling with RapidMiner



In this blog post I am going to show you how to use RapidMiner to crawl a website for you.  First, when you open RapidMiner, make sure you have the Web Mining extension installed.  If not, click the Help menu at the top of the screen, click "Update RapidMiner", and then select and download the Web Mining extension.

Once you have the Web Mining extension installed, open the Web Mining folder under the Operators section, then select the Crawl Web operator and drag it onto the Process panel.



Once you have done this, you have to choose a website to crawl. How about we crawl this very blog?  Copy and paste its URL into the url box on the right side of the screen, under the Parameters tab.


Then you have to select an output directory for RapidMiner to save the crawled files to; I've chosen a folder on my desktop called "tmp".  Next, select a file extension. I've chosen .txt, so RapidMiner will save the pages it crawls as text files.  The max depth is how many consecutive links the crawl will follow; I've kept the default of 2.  The domain setting controls whether the crawl stays on the same server or is allowed to crawl the entire web; I've left it at the default of web.  I also set max threads, which is the number of parallel threads the crawler uses; I set it to 4 to speed up the crawl.  Finally, I changed the user agent to that of my browser. To find yours, just go to http://whatsmyuseragent.com/ and copy and paste your user agent into the box.
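If it helps to see these ideas outside the RapidMiner GUI, here is a rough Python sketch of what the parameters control: the max depth, the output directory, the file extension, and the user agent. This is only an illustration, not what RapidMiner runs internally; the folder name, the user-agent string, and the link-extraction regex are placeholder assumptions, and it ignores the domain and thread settings.

```python
import os
import re
import urllib.request
from urllib.parse import urljoin

OUTPUT_DIR = "tmp"          # output directory: where crawled pages are written (assumed name)
USER_AGENT = "Mozilla/5.0"  # replace with the user agent copied from your browser
MAX_DEPTH = 2               # max depth: how many consecutive links to follow

def fetch(url):
    """Download one page, sending the custom user agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="ignore")

def crawl(start_url):
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    seen, frontier = set(), [(start_url, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        if url in seen or depth > MAX_DEPTH:
            continue
        seen.add(url)
        try:
            html = fetch(url)
        except OSError:
            continue  # skip pages that fail to download
        # save the page as a .txt file, like RapidMiner's extension parameter
        path = os.path.join(OUTPUT_DIR, "page_%d.txt" % len(seen))
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)
        # queue every absolute link on the page for the next depth level
        for link in re.findall(r'href="(http[^"]+)"', html):
            frontier.append((urljoin(url, link), depth + 1))

crawl("http://auburnbigdata.blogspot.com/")
```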
 
Now we need to set up some crawling rules. Click the button next to crawling rules that says "Edit List", and a dialog box will open.
As you can see, you can add crawling rules here.  The first rule determines which links the crawler will follow. I have set it to .+auburnbigdata.+, which allows it to follow any link containing auburnbigdata in the URL; the .+ on either side matches any number of characters before and after it.  The second rule saves only the pages that have auburnbigdata in the URL, as shown in the sketch below.
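To make the rules concrete, here is a small Python sketch (not RapidMiner itself) of how a follow rule and a store rule behave as plain regular expressions. The sample links and the exact pattern are assumptions for illustration only.

```python
import re

# Both rules use the same pattern in this walkthrough; adjust to your own site.
follow_rule = re.compile(r".+auburnbigdata.+")  # which links the crawler may follow
store_rule = re.compile(r".+auburnbigdata.+")   # which fetched pages get saved

links = [
    "http://auburnbigdata.blogspot.com/2013/04/web-crawling-with-rapidminer.html",
    "http://www.blogger.com/profile/12345",     # hypothetical external link
]

for link in links:
    follows = "follow" if follow_rule.match(link) else "skip"
    stores = "store" if store_rule.match(link) else "discard"
    print(link, "->", follows, "/", stores)
```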

OK, everything should be set.  Hit the play button and RapidMiner will crawl the web and save the webpages to the directory you specified.  Then you can perform your analysis on the saved files.
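Once the crawl finishes, the output directory simply contains one text file per page. As a small illustration (assuming the "tmp" folder and .txt extension chosen above), a few lines of Python are enough to load the saved pages back for whatever analysis you plan to run:

```python
import os

OUTPUT_DIR = "tmp"  # the output directory chosen in the Crawl Web parameters (assumed)

pages = {}
for name in sorted(os.listdir(OUTPUT_DIR)):
    if name.endswith(".txt"):
        with open(os.path.join(OUTPUT_DIR, name), encoding="utf-8") as f:
            pages[name] = f.read()

# quick sanity check: how many pages were saved, and how large each one is
print(len(pages), "pages crawled")
for name, text in pages.items():
    print(name, len(text), "characters")
```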

Comments:

  1. Thanks for your post. RapidMiner does lack some web-crawling functionality, though. The Crawl Web operator really only works when the URL pattern stays consistent as you move from page to page. Take this blog as an example: our posts all live under "http://auburnbigdata.blogspot.com/", so your crawling rules can easily grab them. If we want to grab something else from, say, amazon.com, it is hard to tell what crawling rules to use, since Amazon stores its pages on a different platform. In cases like that we have to use other, more complicated tools to crawl the data we want.
  2. Hi, I have followed the steps mentioned in the blog but I'm getting an error:
    Failed to create directory: C:\Users\krijain\Desktop\Web Crawl
    Please check if you have permissions to create a directory in the specified location
    I'm not sure what went wrong. Can anybody help?
  3. Hi, is there a way to skip pages that require authentication?
  4. Hi, I have followed all the instructions, but there are no webpage files. RapidMiner contains nothing. Not sure where I went wrong or what the outcome should contain. Thanks.
  5. I am new to this software and got the same issue as Kate did: I followed all the steps but didn't get any output. Did I do anything wrong? Can anyone help? I am using version 6.3.000.
    Thanks,