Tuesday, April 23, 2013

OutWit Hub: Web-scraping made easy

I read a blog post earlier this term on web scraping and decided to check it out. I started with the suggested software and quickly realized that there are only a few really good web-scraping tools that support Mac OS. So, after reading a few reviews, I landed on OutWit Hub.

OutWit Hub comes in two versions: Basic and Pro. The difference is in the available tools. In Basic, the "words" tool isn't available; it shows you the frequency of every word as it occurs on the page you are currently viewing. Several of the scraping tools are disabled as well. I've upgraded to Pro; it's only $60 per year, and I was curious to see what else it can do.
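To give you a sense of what the "words" tool produces, counting word frequencies on a page boils down to something like the Python sketch below. This is just my illustration of the idea, not OutWit's actual code, and the URL is a placeholder:

```python
# Rough sketch of what a word-frequency tool computes for a page.
# My illustration, not OutWit's code; the URL is a placeholder.
import re
import urllib.request
from collections import Counter

url = "https://example.org"  # placeholder for the page you're viewing
html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")

# Strip tags crudely, then count words (a real tool parses the page properly)
text = re.sub(r"<[^>]+>", " ", html)
counts = Counter(word.lower() for word in re.findall(r"[A-Za-z']+", text))

for word, n in counts.most_common(10):
    print(f"{n:5d}  {word}")
```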

I'm not a computer scientist, by a long shot, but I have a general grasp of coding and how computers operate. For this reason, I really like OutWit Hub. The tutorials on their site are incredible: they walk you through examples, and you can interact with the UI while the tutorial is running. A lot of the tools are also pretty intuitive to use. If you're not sold on getting the Pro version, I'd encourage you to visit their website and download the free version just to check out the tutorials. They're really great.

I've used the software for several examples just to test it. I needed to get all of the email addresses off of an organization's website, so instead of copy/pasting everything and praying for the best, I used the "email" feature in OutWit, and the names and email addresses of every member on the page populated an exportable table. #boom
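For the curious, my guess is that a feature like this just pattern-matches addresses in the page source. Here is a minimal Python sketch of that idea; the URL is a placeholder, and this is my own illustration, not OutWit's implementation:

```python
# Minimal sketch of what an "email" extractor likely does: fetch a page
# and pattern-match addresses in its source. Illustration only, not
# OutWit's implementation; the URL is a placeholder.
import re
import urllib.request

url = "https://example.org/members"  # placeholder for the organization's page
html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")

# A simple (not RFC-complete) email pattern
emails = sorted(set(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html)))

for address in emails:
    print(address)
```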

Then, I wanted to see if it could be harnessed for Twitter and Facebook. Using the source-code approach to scraping, I was able to extract text from the loaded parts of my Twitter and Facebook feeds. The problems I encountered were: not knowing enough about the code to make the scraper dynamic enough to page through content that hadn't loaded yet, and not knowing how to automate it to build a larger dataset (i.e., continuously run the scraper over a set amount of time by repeatedly reloading the page and harvesting the data; it's possible, I just didn't figure it out).
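If you know a little code, the "reload and harvest on a timer" idea I couldn't pull off inside OutWit looks roughly like the sketch below. It's conceptual only: the URL, the interval, and the scrape_page() stand-in are all made up, and a real Facebook or Twitter feed would also require an authenticated session:

```python
# Sketch of the "repeatedly reload the page and harvest the data" idea.
# Conceptual only: the URL and interval are placeholders, and
# scrape_page() stands in for whatever extraction you already have.
import time
import urllib.request

URL = "https://example.org/feed"  # placeholder; real feeds need a login session
INTERVAL_SECONDS = 300            # harvest every five minutes

def scrape_page(html):
    """Stand-in for the real extraction step (e.g., a marker-based scraper)."""
    return [line.strip() for line in html.splitlines() if "status" in line]

seen = set()
while True:
    html = urllib.request.urlopen(URL).read().decode("utf-8", errors="ignore")
    for item in scrape_page(html):
        if item not in seen:  # keep only items we haven't seen on earlier reloads
            seen.add(item)
            print(item)
    time.sleep(INTERVAL_SECONDS)
```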

So, I've recorded a video tutorial on how to use OutWit Hub Pro's scraper feature to scrape the loaded part of your Facebook news feed. The written instructions are below, and the video at the bottom gives you the visual.

Essentially, you will:
1.) Launch OutWit Hub (presuming you've downloaded and upgraded to Pro).
2.) Login to your profile on Facebook.
3.) Take note of whatever text you want to capture so you have a reference point when you go looking in the code. (This assumes you don't know how to read HTML.) For example, if the first person on your news feed says "Hey check out this video!", then take note of that statement.
4.) Click the "scrapers" item on the left side of the screen.
5.) In the search window, type in the text "Hey check out this video" and observe the indicators in the code that mark the beginning and end of that text.
6.) In the window below the code, click the "New" button.
7.) Type in a name for the scraper.
8.) Click the checkbox in row 1 of the window.
9.) In the first column, enter a title/description for the information you're collecting. Using the same example: "Stuff friends say on FB" or "Text". This really only matters if you're going to extract other data from the same page and want to keep it separate.
10.) Under the "Marker Before" column, type the HTML code you identified as the beginning of the data you want to extract. (The sketch after this list shows roughly what these markers do.)
11.) Repeat step 10 for the next column, using the HTML code you identified as the end of the data.
12.) Click "Execute".
13.) Your data is now available for export in several formats: CSV, Excel, SQL, HTML, and TXT.
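If you're curious what the "Marker Before" and "Marker After" columns are actually doing, the logic is roughly "grab everything that sits between two fixed chunks of HTML." Here is a minimal Python sketch of that idea; the markup and markers are invented for the demo, not Facebook's real code or OutWit's internals:

```python
# Rough sketch of what "Marker Before" / "Marker After" scraping does:
# collect every substring that sits between two fixed pieces of HTML.
# The markup below is invented for the demo, not Facebook's real code.
def scrape_between(html, marker_before, marker_after):
    results, pos = [], 0
    while True:
        start = html.find(marker_before, pos)
        if start == -1:
            break  # no more occurrences of the opening marker
        start += len(marker_before)
        end = html.find(marker_after, start)
        if end == -1:
            break  # opening marker without a closing one; stop
        results.append(html[start:end].strip())
        pos = end + len(marker_after)
    return results

sample = ('<div class="msg">Hey check out this video!</div>'
          '<div class="msg">Nice!</div>')
print(scrape_between(sample, '<div class="msg">', "</div>"))
# -> ['Hey check out this video!', 'Nice!']
```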

Here is a YouTube video example of me using it to extract and display comments made by my Facebook friends that appeared on my news feed.

19 comments:

  1. Thank you so much for posting this! It is such an awesome tutorial. I have attempted web scraping before using RapidMiner (I was even going to post a tutorial about it, assuming I could get it to work), but I was unable to find more than a couple of resources on how to do so. My attempt only allowed me to, for example, scrape the first page of search results on a common realtor site. Learning how to scrape the remaining results required me to be proficient in regular expressions, Python, or both (YUCK). OutWit Hub looks like a great option for those like me who may not be proficient in a particular language (like HTML); its GUI lets us figure it out rather easily on our own and get our data!

  2. I have a site where I have:
    - A listing page: this is like a category page in a directory, with links to products.
    - Product pages: each one has the specification for a particular product.

    The listing pages are paginated, so there are about 500 listing pages and around 20,000 product pages.

    I am able to get the details manually, but I cannot click through 500 pages by hand.

    How can I do this in OutWit?
