Steve Yun at the Strata Conference in New York, October 25, 2012
Steve Yun, Principal Predictive Modeler at Allstate's Research and Planning Center, presented a comparison of four different approaches to fitting big-data insurance models at Strata Conference NYC 2012. The four methods are:
- Proc GENMOD in SAS (the standard practice at Allstate)
- Installing a Hadoop cluster
- Using open-source R
- Running the data through Revolution R Enterprise
According to the article, Yun explained that the comparison was motivated by a productivity problem: the current practice of running Proc GENMOD in SAS (a very popular modeling tool in the insurance industry) takes approximately five hours to return results for a model with 150 million observations.

With Hadoop, the model needed five to ten iterations at roughly 1.5 hours each, and even then, according to Yun, the Hadoop approach left irregularities such as singularities in the design matrix. Since the total time to run the model in Hadoop was about twice that of the current method, Yun next tried open-source R on a server with 250GB of RAM (R computes in-memory). Even with that much RAM, the memory was insufficient: the data simply would not load, even after three days. By partitioning the data into 10 clusters, Yun was able to run the program in about 30 minutes, but he determined that it would be difficult for managers to accept a process that relied on sampling.

Finally, Yun consulted Revolution Analytics' Joe Rickert to see how quickly the model would run in Revolution R Enterprise. Revolution R Enterprise processed the data in 5.7 minutes, producing the same output as SAS in approximately 1/50th of the time. Steve Yun's presentation video and slides for Strata NYC 2012 can be found here.
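For readers who want to see what these approaches look like in practice, below is a minimal sketch of the two R-based options. The data file, variable names, and model formula are hypothetical stand-ins (the article does not publish Yun's actual code), and a Poisson GLM is assumed since that family is common for insurance claim counts; rxGlm() is the external-memory GLM function in the RevoScaleR package that ships with Revolution R Enterprise.

```r
## Hypothetical sketch only: the file, variables, and formula are made up.

## 1) Open-source R: glm() must hold the full data set in RAM, which is
##    why the 150M-row load failed even on a 250GB server.
claims <- read.csv("claims.csv")
fit_mem <- glm(claim_count ~ age + vehicle_age + prior_claims,
               family = poisson(link = "log"),
               data = claims)

## 2) Revolution R Enterprise: rxGlm() streams the data in chunks from
##    an on-disk .xdf file, so available RAM is no longer the limit.
library(RevoScaleR)
rxImport(inData = "claims.csv", outFile = "claims.xdf")
fit_xdf <- rxGlm(claim_count ~ age + vehicle_age + prior_claims,
                 family = poisson(link = "log"),
                 data = "claims.xdf")
summary(fit_xdf)
```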
In class this past week, we have been discussing several clustering methods for processing extremely large amounts of data. While the article may not discuss clustering in detail, it does briefly describe a potential limitation of sampling data with clustering when trying to process a big-data model. As Yun stated, R was able to process the data effectively once it was partitioned, but the sampled model did not represent the entire data set as accurately, and it therefore would not be accepted for use by the company managers.
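To illustrate that sampling concern in code (continuing the hypothetical data set from the sketch above), the snippet below fits the same GLM on 10 random partitions and measures how much the coefficients vary from partition to partition; that variability is exactly what a full-data fit avoids.

```r
## Hypothetical continuation: partition the (made-up) claims data into
## 10 random subsets and fit the same Poisson GLM on each one.
set.seed(42)
claims$part <- sample(1:10, nrow(claims), replace = TRUE)

coef_by_part <- sapply(split(claims, claims$part), function(chunk)
  coef(glm(claim_count ~ age + vehicle_age + prior_claims,
           family = poisson(link = "log"), data = chunk)))

## Each column holds one partition's coefficients; the row-wise spread
## shows how far a sampled fit can drift from the full-data answer.
round(apply(coef_by_part, 1, sd), 4)
```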