Applied Predictive Modeling

Monday, August 19, 2013 - 15:30


I just finished working through a terrific new book, “Applied Predictive Modeling,” by Max Kuhn and Kjell Johnson. Kuhn and Johnson present a comprehensive practitioners’ guide “to the predictive modeling process and a place where one can come to learn about the approach and to gain intuition about the many commonly used and modern, powerful models.” APM’s orientation is applied and conceptual, with little fixation on arcane math. And since hands-on work is the goal, there’s plenty of well-written code in the R statistical language to learn from.

The book primarily emphasizes the supervised learning problem, in which models are developed to predict an outcome or dependent variable as a function of features or independent variables. The outcome measure can be either numeric, in which case the models are called regression, or categorical, in which case they’re considered classification. In both cases, features can be numeric or categorical. Some of the methods discussed produce model coefficients that can be interpreted, while others are “black box.”
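As a minimal sketch of that distinction (my own illustration, not a book example), base R can fit both flavors on the built-in mtcars data:

```r
# Regression: a numeric outcome (mpg) as a function of
# numeric (wt) and categorical (cyl) features
reg_fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# Classification: a categorical (here binary) outcome (am),
# fit via logistic regression
cls_fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

summary(reg_fit)   # interpretable coefficients
summary(cls_fit)
```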

Max Kuhn is a luminary in the R world, author of the highly regarded caret (Classification And REgression Training) package that provides a “set of functions that attempt to streamline the process for creating predictive models.” For those of us who struggle with the disparate interfaces of R’s many modeling packages, the unified framework and power of caret is a godsend. Indeed, for many predictive modeling problems, the functions of caret can handle all the work, from pre-processing to training to tuning to evaluation. I can assure predictive modelers who haven’t tried caret that they’ll appreciate it.
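To give a flavor of that unified framework, here’s a hedged sketch of an end-to-end caret workflow; the data set, model and tuning choices are my own illustrations, not the book’s:

```r
library(caret)

set.seed(123)
in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[in_train, ]
testing  <- iris[-in_train, ]

# One train() call covers pre-processing, resampling and tuning
ctrl <- trainControl(method = "cv", number = 10)
fit <- train(Species ~ ., data = training,
             method     = "knn",
             preProcess = c("center", "scale"),
             tuneLength = 5,
             trControl  = ctrl)

# Evaluate on the held-out data
confusionMatrix(predict(fit, testing), testing$Species)
```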

If, like me, you purchase the hardcover edition of APM through Amazon, you may have to wait several weeks for delivery. Don’t let the time lapse deter your learning. R programmers have access to the AppliedPredictiveModeling package from CRAN. Once that’s installed and loaded, you can use the getPackages function to install/load the book’s referenced R packages, chapter data and sample code. The wealth of R library functions installed via the book’s packages is alone worth the investment in APM. Once getPackages has completed its work, the function scriptLocation will tell you where the code files reside on your computer. From there, you have all you need to run through the book’s comprehensive examples. If that’s not enough, there’s also an informative APM web site with the not-surprising URL, “appliedpredictivemodeling.com”.
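In R, that bootstrap process looks roughly like this (the getPackages argument form is my assumption; check the package documentation, as it may vary by version):

```r
# Install the book's companion package from CRAN
install.packages("AppliedPredictiveModeling")
library(AppliedPredictiveModeling)

# Install/load the packages referenced by the book
# (here for Chapter 2; argument form assumed)
getPackages("2")

# Report where the chapter code files live on this machine
scriptLocation()
```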

Those whose exposure to statistics is limited to introductory and regression courses are likely to be surprised by some of the “statistical” content of “Applied Predictive Modeling.” Rather than resembling the material you learned in school, APM reads more like statistical learning, a discipline that sits between statistics and data mining. Indeed, many of the methods discussed are core to the SL “bible,” “The Elements of Statistical Learning,” by Stanford statistics professors Trevor Hastie, Robert Tibshirani and Jerome Friedman. Coincidentally, along with former students Gareth James and Daniela Witten, Hastie and Tibshirani have just published “An Introduction to Statistical Learning with Applications in R,” a text that addresses the same applied modeling space as APM.

Readers will see much less mention of parametric, multivariate normal, coefficients, standard errors, p-values, t-tests, ANOVA and best linear unbiased estimates (BLUE) in APM than in college statistics texts, but much more of bootstrap, cross-validation, regularization, shrinkage, bagging and boosting. You’ll also experience something you might have thought impossible back in the day: procedures that can accommodate data with more features than cases – i.e. p > N!
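A quick hedged illustration of the p > N point, using the glmnet package (one regularized implementation; the simulated data are mine):

```r
library(glmnet)

set.seed(42)
N <- 50                     # observations
p <- 200                    # features: p > N
X <- matrix(rnorm(N * p), N, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(N)

# Ordinary least squares is rank-deficient here, but the
# cross-validated lasso fits and yields sparse coefficients
cv_fit <- cv.glmnet(X, y, alpha = 1)
coef(cv_fit, s = "lambda.min")
```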

What you’ll assuredly be exposed to in APM is a dizzying array of models for both regression and classification problems. My recommendation for consuming APM’s bounty is to very carefully cover chapters 1-6, which include a short tour of the modeling process, data handling, over-fitting/testing/tuning, and the linear regression modeling ecosystem, including regularization. Work through all the examples with the code and data sets provided, making sure you understand the computations. To test your learning, use a data set outside the book materials and work through similar calculations with modified scripts you develop.

Once you’ve reached that competence level, you can pick and choose the models to investigate, driven by whether your needs are for regression or classification. I’d recommend a look at an ensemble technique built on re-sampling, such as random forests or gradient boosting, and a “black box” method such as neural networks or support vector machines. The case study is quite informative, and the chapters on model performance, feature selection and predictor importance should not be missed.
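caret also makes that kind of head-to-head look easy; the sketch below (my own choices of models and data, not the book’s) compares a random forest with a radial SVM over identical cross-validation folds:

```r
library(caret)

set.seed(7)
# Share the folds so both models are resampled identically
folds <- createFolds(iris$Species, k = 5, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", index = folds)

rf_fit  <- train(Species ~ ., data = iris, method = "rf",
                 trControl = ctrl)          # needs randomForest
svm_fit <- train(Species ~ ., data = iris, method = "svmRadial",
                 trControl = ctrl)          # needs kernlab

# Side-by-side resampled accuracy and kappa
summary(resamples(list(RF = rf_fit, SVM = svm_fit)))
```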

One critique I have of statistical learning pedagogy is the absence of computational performance considerations in the evaluation of different modeling techniques. With its emphasis on bootstrapping and cross-validation to tune and test models, SL is quite compute-intensive. Add to that the re-sampling embedded in techniques like bagging and boosting, and you have the specter of computation hell for supervised learning on large data sets. In fact, R’s in-memory constraints impose pretty severe limits on the size of models that can be fit by top-performing methods like random forests. Though SL does a good job calibrating model performance against small data sets, it’d sure be nice to understand predictive performance versus computational cost for larger data.
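Absent such benchmarks in the texts, one crude do-it-yourself gauge is to time the same model under increasingly heavy resampling (illustrative only; real costs depend on your data and hardware):

```r
library(caret)

set.seed(1)
for (b in c(5, 25, 50)) {
  elapsed <- system.time(
    train(Species ~ ., data = iris, method = "rf",
          trControl = trainControl(method = "boot", number = b))
  )["elapsed"]
  cat(b, "bootstrap resamples:", round(elapsed, 2), "seconds\n")
}
```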

With my own work, I love random forests (RF) and gradient boosting (GB) for small (< 50,000 record) data sets. For medium to large data, I often have success with OLS regression with cubic splines, the elastic net, generalized additive models (GAM) and multivariate adaptive regression splines (MARS), each of which is much less resource-consuming than RF or GB. I’m now in the process of investigating the SL possibilities of Python built on top of the NumPy and SciPy libraries.
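For the curious, those lighter-weight alternatives are readily available in R; here’s a toy sketch using glmnet for the elastic net and the earth package for MARS (my package choices, on a stand-in data set):

```r
library(glmnet)
library(earth)

X <- as.matrix(mtcars[, -1])   # predictors
y <- mtcars$mpg                # numeric outcome

# Elastic net: alpha between 0 (ridge) and 1 (lasso)
enet_fit <- cv.glmnet(X, y, alpha = 0.5)

# Multivariate adaptive regression splines
mars_fit <- earth(mpg ~ ., data = mtcars)
summary(mars_fit)
```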

Two enthusiastic thumbs up for “Applied Predictive Modeling.” If you invest the time and effort, you’ll learn an awful lot about state-of-the-art predictive modeling in APM. You’ll also learn a good deal about R programming, become conversant in caret and turbo-charge your R code library.
