Tuesday, March 26, 2013

StumbleUpon - The red-headed stepchild of web crawling

If you have red hair or are a stepchild... I wouldn't read any further... Kidding.

Our recent exposure to the births and lives of search engines, coupled with our focus on how users interact with recommendation systems, made me think about the construction of a famous "social bookmarking" site we're all familiar with - StumbleUpon.

The reason for this blog title is the nature of StumbleUpon. If you research (via Google) how StumbleUpon is constructed and peruse the site yourself, you come away with a sense of mystery and interest - much like the specific personality type indicated in the post title.

There is a site called makeuseof.com that has quite a few posts that attempt to showcase things on the web and tell users how they work. The post about StumbleUpon (found here) is pretty interesting because the guy (also his name) who posted it doesn't really know whether he's right, since the site keeps its web crawling process a well-guarded secret. However, he does come to some pretty logical conclusions. So, using his post and a few other references, I'm going to take a stab at hypothesizing how this site crawls the web.

First, let me tell you how this site works. After creating a profile, you select from a set of initial interests. These are keywords like "movies" or "quotes" that become the starting point for your stumbling experience. Then, you begin to stumble. Much like Pandora, as you happen upon a site that you like and want to see more of (content-wise), you click the "thumbs up" icon at the top of the screen, and the "thumbs down" icon for things you don't want to see. Other options, such as bookmarking the site in lists that you create, are available as well.
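To make the thumbs up / thumbs down idea concrete, here is a minimal sketch of how a profile like that could be tracked. This is purely my guess, not how StumbleUpon actually does it: the `StumbleProfile` class, the topic tags on each page, and the 0.5 step size are all made-up assumptions for illustration.

```python
from collections import defaultdict


class StumbleProfile:
    """Hypothetical per-user interest profile; not StumbleUpon's real data model."""

    def __init__(self, initial_interests):
        # Each interest picked at sign-up starts with a weight of 1.0 (assumed value).
        self.weights = defaultdict(float)
        for topic in initial_interests:
            self.weights[topic] = 1.0

    def rate(self, page_topics, thumbs_up, step=0.5):
        # A thumbs up raises the weight of every topic on the page;
        # a thumbs down lowers it by the same (assumed) step.
        delta = step if thumbs_up else -step
        for topic in page_topics:
            self.weights[topic] += delta

    def score(self, page_topics):
        # A page's score is the sum of the user's weights for its topics.
        return sum(self.weights.get(topic, 0.0) for topic in page_topics)


profile = StumbleProfile(["movies", "quotes"])
profile.rate(["movies", "design"], thumbs_up=True)   # liked a movie-poster design page
profile.rate(["quotes"], thumbs_up=False)            # disliked a quotes page
```

After those two ratings, a page tagged with "movies" and "design" scores higher for this user than another quotes page would, which is the kind of drift in suggestions you'd expect to see as you keep stumbling.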

Guy McDowell, the author of the post, begins to address this question with an assumption - that this site operates from a database. He believes that SU has a vast archive of websites that have been categorized based on content. I went through a large number of blogs and sites to see if anyone had postulated where that starting point actually comes from, but I couldn't find anyone who had. If you do, please let me know. I'm interested.

After your initial SU DNA is loaded, you begin to surf sites that SU suggests based on the interests you selected. So, just like any of the sites we've mentioned before, as the amount of user activity increases, so does the precision of the suggestions.

Here is a look at what Guy thinks the process looks like:
There are a few things that interest me about what this guy is saying. Essentially, SU would depend solely on the activity of its users to recommend sites. The page ranking is done by the users, and the categories (I assume) come from some vast archive. I'm having trouble believing that. It seems that there would be more to it.

So, I figured I'd look at it from the other side - sites being suggested. I found an article that instructs sites on how to increase their presence on StumbleUpon (found here). What I gained from this article is that there is a lot of significance in the social presence of the site. Things like the number of "followers" and the quality of posts are encouraged areas of focus. This leads me to believe that Guy may be accurate. However, the site prides itself on being a machine of "discovery" as opposed to a search engine, in that it exposes users to unique sites that they would not have otherwise found. So, they have shunned the "most connected with links" approach and replaced it with recommended utility and interest. However, this is oxymoronic to me (I made that word up). How can you claim to be showing unique and special sites that are off the beaten path when your model suggests sites based on the highest net utility?

I think that SU does have some type of archive of websites that they categorize. However, I don't think that's the starting block. I think that they rely on some method of web crawling using keywords to identify sites that could be of interest. Then, they migrate each site into their archive for a trial period. During the trial period, they expose the site to users and monitor the response. If the site is a dud, it gets moved to the StumbleUpon recycle bin. If it does well, it stays in circulation. I think an algorithm similar to Amazon's is used, one that compares interests among demographics and user responses to home in on a user's preferences.
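Here is a rough sketch of the trial-period idea I just described. Again, this is only my hypothesis written out as code: the `Candidate` record, the 20-view minimum, and the 60% approval cutoff are numbers I invented to illustrate the flow, not anything SU has published.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for a crawled site sitting in the archive."""
    url: str
    ups: int = 0
    downs: int = 0
    status: str = "trial"  # "trial", "circulating", or "recycled"

    @property
    def views(self):
        # Only rated views are counted in this sketch.
        return self.ups + self.downs


def record_vote(site, thumbs_up):
    # Tally one user's response to the site.
    if thumbs_up:
        site.ups += 1
    else:
        site.downs += 1


def review(site, min_views=20, min_approval=0.6):
    # Leave the site on trial until enough users have rated it (assumed threshold).
    if site.status != "trial" or site.views < min_views:
        return site.status
    approval = site.ups / site.views
    # Duds go to the recycle bin; sites that do well stay in circulation.
    site.status = "circulating" if approval >= min_approval else "recycled"
    return site.status


# Example: 15 thumbs up and 5 thumbs down out of 20 ratings.
site = Candidate("http://example.com/photo-blog")
for i in range(20):
    record_vote(site, thumbs_up=(i % 4 != 0))
result = review(site)
```

With those made-up thresholds, the example site ends its trial with 75% approval and stays in circulation. A site with the opposite split would be recycled, and a site with only a handful of ratings would stay on trial.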

I don't know the answer, but this model is pretty cool because it really does work well in some cases, especially for topics like design and photography.
