Saturday, February 16, 2013

Facebook's Corona

Facebook 're-invents' the wheel 


We have discussed Hadoop in our data mining class. Although we did not work with it extensively in class, Hadoop has numerous applications and can be applied to many domains in data mining. It is open source software, which makes it easy to deploy on various platforms, and it enables the distributed processing of large data sets across clusters of servers. A significant feature of the software is that it detects failures at the application layer rather than relying on the hardware.

The following few lines summarize an article showing that, with its own scheduler on the Hadoop platform, Facebook makes better use of its clusters than it did with MapReduce. Facebook wrote its own code for scheduling workloads on Hadoop; this scheduler, known as 'Corona', has achieved superior results, utilizing 95% of a cluster compared to 70% utilization under MapReduce. That matters because Facebook's operations and users generate more than half a petabyte of data each day. Corona offers a number of additional benefits as well, including faster loading of workloads and a more flexible way of upgrading the software: it took around 55 seconds to fill an empty workspace, a 17% improvement over MapReduce. Typically, analysis jobs running on Hadoop are scheduled through the MapReduce framework, which breaks jobs into multiple parts so they can be executed across many computers in parallel.
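To make that concrete, here is a minimal sketch of the kind of job MapReduce schedules: a word count written for Hadoop Streaming, a standard Hadoop facility that runs scripts as map and reduce tasks (the file names mapper.py and reducer.py are my own, for illustration). Hadoop splits the input, runs one copy of the mapper per split in parallel, then sorts the mapper output by key and feeds it to the reducers -- that is the "breaks jobs into multiple parts" step in action.

#!/usr/bin/env python3
# mapper.py -- Hadoop runs many parallel copies of this, one per input split.
# Reads raw text on stdin and emits one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

#!/usr/bin/env python3
# reducer.py -- Hadoop sorts the mapper output by key before calling this,
# so all the pairs for a given word arrive contiguously; we just sum them.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, _, value = line.rstrip("\n").partition("\t")
    if word != current_word:
        if current_word is not None:
            print(current_word + "\t" + str(count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(current_word + "\t" + str(count))

On a 2013-era Hadoop 1.x cluster this would be launched with the streaming jar (the exact jar path varies by distribution), roughly: hadoop jar contrib/streaming/hadoop-streaming-*.jar -input in/ -output out/ -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py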

Corona ensures that cluster utilization does not drop during peak scheduling load, and the company is now deploying it for all workloads. One limitation of MapReduce is that it used to delay jobs before executing them. That does not work well for Facebook, which relies on quick scheduling; any downtime wreaks havoc and generates negative publicity. Corona is software that scales more easily and makes better use of clusters: it offers lower latency for smaller jobs and can be upgraded without disrupting the system. Corona was developed in house by the Facebook team, which also evaluated alternatives such as YARN, Apache's overhaul of MapReduce, but was unsure YARN could handle clusters for a social networking application that processes petabytes of data.
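The latency point comes down to how tasks get assigned. In classic Hadoop, the job tracker hands out work only when a task tracker polls it with a periodic heartbeat (on the order of every three seconds by default), whereas a push-based scheduler such as Corona can hand resources to a job as soon as they free up. The toy simulation below is my own illustration of that gap, with made-up arrival times, not Facebook's measurements.

import random

HEARTBEAT = 3.0  # seconds between task tracker heartbeats (classic Hadoop default is ~3 s)

def pull_delay(arrival):
    """Pull model: a job waits for the next heartbeat after it arrives."""
    next_beat = (arrival // HEARTBEAT + 1) * HEARTBEAT
    return next_beat - arrival

def push_delay(arrival):
    """Push model: the scheduler assigns the task immediately (a free slot is assumed)."""
    return 0.0

random.seed(42)
arrivals = [random.uniform(0, 60) for _ in range(10000)]
print("avg delay, pull:", sum(map(pull_delay, arrivals)) / len(arrivals))  # ~1.5 s
print("avg delay, push:", sum(map(push_delay, arrivals)) / len(arrivals))  # 0.0 s

For a small job whose tasks run only a few seconds each, an extra second and a half of scheduling delay per wave of tasks is substantial, which is why the push model helps small jobs the most.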

Corona can also run non-MapReduce jobs, making a Hadoop cluster more of a general-purpose parallel computing cluster. Facebook is also working to incorporate online upgrades, which would mean a cluster doesn't have to come down every time part of the management layer undergoes an update.

Facebook is extremely confident that this in-house product will work successfully. They have officially posted a note on the Facebook website explaining the scheduling of MapReduce jobs and the limitations they have overcome by implementing 'Corona':
http://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920

Reference: Joab Jackson, IDG News Service, http://www.infoworld.com/d/business-intelligence/facebook-boosts-hadoop-scheduling-muscle-206731

5 comments:

  1. The reason for the delay and overload is mainly how Facebook was running its jobs: a single job tracker program worked with many task trackers to process the data. The job tracker's role was to manage the cluster resources and schedule all the user jobs, funneling the work down to the individual task tracker programs. But as the number of jobs increased, the job tracker just couldn't handle its responsibilities adequately for Facebook's needs; the company said its cluster utilization would drop because of the overload.
    Other frustrations Facebook found with MapReduce included the fixed "slot-based resource management model," which divides the cluster into a fixed number of map and reduce slots defined by a specific configuration. Facebook felt this was inefficient because slots are wasted any time the cluster workload doesn't match the configuration (the toy sketch below illustrates this). Also, when software upgrades needed to happen, Facebook found that all running jobs needed to be killed, resulting in significant downtime.
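    To see the slot waste concretely, here is a toy sketch of that fixed slot model (the 70/30 split and the workloads are made-up numbers, purely for illustration):

MAP_SLOTS, REDUCE_SLOTS = 70, 30  # static split fixed by the cluster configuration

def utilization(map_tasks, reduce_tasks):
    # A map task cannot use a reduce slot, or vice versa, so mismatched
    # workloads leave part of the cluster idle.
    used = min(map_tasks, MAP_SLOTS) + min(reduce_tasks, REDUCE_SLOTS)
    return used / (MAP_SLOTS + REDUCE_SLOTS)

print(utilization(100, 5))   # 0.75 -- map slots saturated, 25 reduce slots sit idle
print(utilization(70, 30))   # 1.00 -- the workload happens to match the configuration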

    Replies
    1. Sam,

      Please provide references. Also, see my comment to Prashant below.

      Fadel

    2. Reference: http://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920

  2. Prashant,

    Can you please provide the URL for the reference? Also, are there any other articles that support this view? I am assuming that this is a controversial topic so it would be good to see both sides of the story :)

    Fadel

  3. Thanks for the feedback on this post. It is a new implementation built as their own in-house project. Facebook's data warehouse has grown over 2,500 times in the past 4-5 years, and they are confident this implementation will result in more efficient processing.
    Facebook will keep upgrading Corona as time goes by, since it is now a core part of the company's infrastructure. This was published in November 2012, and the cluster may not yet be fully converted to scheduling with Corona, so it will be interesting to see how this turns out. It could possibly solve their data issues on a 'Facebook' scale. I have not found any adverse comments on this implementation, though we could hear about trouble they faced with 'Corona' in the coming months. This is an interesting piece, and I will follow up and post if I find any articles on the downside of 'Corona'. For now, all seems to be good with their latest in-house innovation :)

    Thanks
    Prashant
