Wednesday, April 3, 2013

Web Crawling with RapidMiner



In this blog post I am going to show you how to use RapidMiner to crawl a website for you.  First, when you open RapidMiner, make sure you have the Web Mining extension installed.  If not, click the Help menu at the top of the screen, click "Update RapidMiner", and then select and download the Web Mining extension.

Once you have the Web Mining extension installed, open the Web Mining folder under the Operators section, then select the Crawl Web operator and drag it onto the Process panel.



Once you have done this, you have to choose a website to crawl. How about we crawl this very blog?  Copy and paste its URL into the url box on the right side of the screen, under the Parameters tab.


Then you have to select an output directory for RapidMiner to save the crawled files to; I've chosen a folder on my desktop called "tmp".  Next, select a file extension. I've chosen .txt, so RapidMiner will save the pages it crawls as text files.  The max depth is how many consecutive links the crawl will follow; I've kept the default of 2.  The domain setting controls whether the crawl stays on the same server or is allowed to crawl the entire web; I've left it at the default of web.  I also set max threads, which is the number of parallel threads the crawler uses; I set it to 4 to speed up the crawl.  Finally, I changed the user agent to that of my browser. To find yours, just go to http://whatsmyuseragent.com/ and copy and paste your user agent into the box.
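If it helps to see these ideas outside the RapidMiner GUI, here is a rough Python sketch of what the parameters control: the max depth, the output directory, the file extension, and the user agent. This is only an illustration, not what RapidMiner runs internally; the folder name, the user-agent string, and the link-extraction regex are placeholder assumptions, and it ignores the domain and thread settings.

```python
import os
import re
import urllib.request
from urllib.parse import urljoin

OUTPUT_DIR = "tmp"          # output directory: where crawled pages are written (assumed name)
USER_AGENT = "Mozilla/5.0"  # replace with the user agent copied from your browser
MAX_DEPTH = 2               # max depth: how many consecutive links to follow

def fetch(url):
    """Download one page, sending the custom user agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="ignore")

def crawl(start_url):
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    seen, frontier = set(), [(start_url, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        if url in seen or depth > MAX_DEPTH:
            continue
        seen.add(url)
        try:
            html = fetch(url)
        except OSError:
            continue  # skip pages that fail to download
        # save the page as a .txt file, like RapidMiner's extension parameter
        path = os.path.join(OUTPUT_DIR, "page_%d.txt" % len(seen))
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)
        # queue every absolute link on the page for the next depth level
        for link in re.findall(r'href="(http[^"]+)"', html):
            frontier.append((urljoin(url, link), depth + 1))

crawl("http://auburnbigdata.blogspot.com/")
```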
 
Now we need to set up some crawling rules. Click the button next to crawling rules that says "Edit List", and a dialog box will open.
As you can see, you can add crawling rules here.  The first rule determines which links the crawler will follow. I have set it to .+auburnbigdata.+, which allows it to follow any link containing auburnbigdata in the URL; the .+ on either side matches any number of characters before and after it.  The second rule saves only the pages that have auburnbigdata in the URL, as shown in the sketch below.
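To make the rules concrete, here is a small Python sketch (not RapidMiner itself) of how a follow rule and a store rule behave as plain regular expressions. The sample links and the exact pattern are assumptions for illustration only.

```python
import re

# Both rules use the same pattern in this walkthrough; adjust to your own site.
follow_rule = re.compile(r".+auburnbigdata.+")  # which links the crawler may follow
store_rule = re.compile(r".+auburnbigdata.+")   # which fetched pages get saved

links = [
    "http://auburnbigdata.blogspot.com/2013/04/web-crawling-with-rapidminer.html",
    "http://www.blogger.com/profile/12345",     # hypothetical external link
]

for link in links:
    follows = "follow" if follow_rule.match(link) else "skip"
    stores = "store" if store_rule.match(link) else "discard"
    print(link, "->", follows, "/", stores)
```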

OK, everything should be set.  Hit the play button and RapidMiner will crawl the web and save the webpages to the directory you specified.  Then you can perform your analysis on the saved files.
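Once the crawl finishes, the output directory simply contains one text file per page. As a small illustration (assuming the "tmp" folder and .txt extension chosen above), a few lines of Python are enough to load the saved pages back for whatever analysis you plan to run:

```python
import os

OUTPUT_DIR = "tmp"  # the output directory chosen in the Crawl Web parameters (assumed)

pages = {}
for name in sorted(os.listdir(OUTPUT_DIR)):
    if name.endswith(".txt"):
        with open(os.path.join(OUTPUT_DIR, name), encoding="utf-8") as f:
            pages[name] = f.read()

# quick sanity check: how many pages were saved, and how large each one is
print(len(pages), "pages crawled")
for name, text in pages.items():
    print(name, len(text), "characters")
```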

Comments:

  1. Thanks for your post. RapidMiner does lack some web-crawling functionality, though. The Crawl Web operator really only works when the URL pattern stays consistent as you move from page to page. Take this blog as an example: our posts all live under "http://auburnbigdata.blogspot.com/", so your crawling rules can easily grab them. If we want to grab something else from, say, amazon.com, it is hard to tell what crawling rules to use, since Amazon stores its pages on a different platform. In cases like that we have to use other, more complicated tools to crawl the data we want.
  2. Hi, I have followed the steps mentioned in the blog but I'm getting an error:
    Failed to create directory: C:\Users\krijain\Desktop\Web Crawl
    Please check if you have permissions to create a directory in the specified location
    I'm not sure what went wrong. Can anybody help?
  3. Hi, is there a way to skip pages that require authentication?
  4. Hi, I have followed all the instructions, but there are no webpage files. RapidMiner contains nothing. Not sure where I went wrong or what the outcome should contain. Thanks.
  5. I am new to this software and got the same issue as Kate did: I followed all the steps but didn't get any output. Did I do anything wrong? Can anyone help? I am using version 6.3.000.
    Thanks,