One of the collateral benefits of participating in Strata+Hadoop World is the opportunity to catch up with old colleagues. After one particular session, a group of five got together to talk data science. Two of us were pretty much R guys, two were Python, and the fifth, SAS. Over beers, the topic of favorite supervised learning models came up. Not surprisingly, there was no shortage of opinion on the relative merits of different techniques.
The last to weigh in, I had the benefit of time to put my thoughts together as I listened to the others. Rather than just list my favorites, I thought it best to first identify a few practical criteria I deem important when examining models. Though not intended to be comprehensive, I noted the following as factors I generally consider when choosing model(s) for prediction tasks.
- resources/performance – time and compute demands
- can it handle large N (# of records) and large p (# of features)?
- white vs. black box – are the models produced interpretable?
- can it handle both regression and classification problems?
- ability to penalize/regularize? – to avoid the overfitting that plagues predictive modeling
- can it handle curvature and interactions?
- fit with the R environment –
    - both formula and matrix interfaces?
    - comprehensive model object and accessor functions?
    - feature importance function?
    - easy-to-use prediction function?
    - cross-validation functions?
    - recognition by caret? – an R package that abstracts and simplifies model building
When it was finally my turn, I identified four models: “lm” from the stats package, and “earth”, “glmnet”, and “randomForest” from the packages of the same names. Each of these functions is known to caret.
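As a quick illustration (a sketch of my own, not from the discussion), all four can be driven through caret's uniform train() interface; the simulated data frame and the particular method chosen here are hypothetical:

```r
# Sketch: fitting one of the four models through caret's uniform interface.
# The data frame is simulated; any of the method names "lm", "earth",
# "glmnet", or "rf" could be substituted below.
library(caret)

set.seed(123)
df <- data.frame(x1 = rnorm(500), x2 = runif(500))
df$y <- sin(df$x1) + 2 * df$x2 + rnorm(500, sd = 0.25)

ctrl <- trainControl(method = "cv", number = 5)          # 5-fold cross-validation
fit  <- train(y ~ x1 + x2, data = df, method = "earth",  # swap in "lm", "glmnet", "rf"
              trControl = ctrl)
fit$results                                              # resampled performance
```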
lm is R's modeling workhorse – performant and capable, within R's memory limits, of handling “large” N and middling p. As a statistical regression model, lm is a white box, with feature importance indicated by parameter significance levels. I generally use cubic spline functions to represent curvature in lm, while the formula interface accommodates interactions. lm fits into the R programming environment quite well and has easy-to-deploy cross-validation functions. Alas, lm has limited ability to regularize, and it isn't suitable for classification problems, though its glm cousin is.
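A minimal sketch of what that looks like in practice, on simulated data of my own invention – natural cubic splines for curvature plus an interaction term through the formula interface:

```r
# Sketch: curvature via ns() natural cubic splines and an interaction
# through lm's formula interface; data and settings are illustrative.
library(splines)

set.seed(42)
N  <- 10000
df <- data.frame(x1 = rnorm(N), x2 = runif(N))
df$y <- sin(df$x1) + 2 * df$x2 + df$x1 * df$x2 + rnorm(N, sd = 0.25)

# ns() supplies the spline basis; x1:x2 adds the interaction
fit <- lm(y ~ ns(x1, df = 5) + x2 + x1:x2, data = df)

summary(fit)$coefficients          # significance levels as feature importance
head(predict(fit, newdata = df))   # easy-to-use prediction function
```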
earth is R’s adaptation of multivariate adaptive regression splines (MARS). It has become my go-to model, seemingly always a capable performer regardless of the challenge I throw at it. earth performs quite well, though it is a bit more resource-consuming than lm. It can accommodate both regression and classification challenges, automatically handles non-linearity with splines, has built-in cross-validation functions, and can penalize to avoid overfitting. Over the years, I've successfully deployed earth for regression, classification, and time series forecasting problems.
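A hedged sketch on the same sort of simulated regression problem – the data and settings are illustrative, not from any real deployment:

```r
# Sketch: earth (MARS) on simulated data. degree = 2 permits interaction
# terms; nfold enables the built-in cross-validation.
library(earth)

set.seed(42)
df <- data.frame(x1 = rnorm(2000), x2 = runif(2000))
df$y <- sin(df$x1) + 2 * df$x2 + df$x1 * df$x2 + rnorm(2000, sd = 0.25)

fit <- earth(y ~ x1 + x2, data = df, degree = 2, nfold = 5)

summary(fit)   # selected hinge/basis functions
evimp(fit)     # variable importance
```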
I was first exposed to glmnet a few years ago at a statistical learning seminar taught by package co-authors Trevor Hastie and Rob Tibshirani. glmnet covers most of my evaluation bases, as the vignette introduction attests: “glmnet is a package that fits a generalized linear model via penalized maximum likelihood. The regularization path is computed for the lasso or elasticnet penalty at a grid of values for the regularization parameter lambda. The algorithm is extremely fast, and can exploit sparsity in the input matrix. It fits linear, logistic and multinomial, poisson, and Cox regression models. A variety of predictions can be made from the fitted models. It can also fit multi-response linear regression.” Sensitive to the problem of overfitting, I'm ok with the lack of a formula interface, and find myself turning to glmnet increasingly.
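To show what the matrix interface looks like, a small sketch with cross-validated elasticnet on simulated data (the alpha value and dimensions are illustrative):

```r
# Sketch: glmnet's matrix interface. alpha = 0.5 blends lasso (alpha = 1)
# and ridge (alpha = 0); cv.glmnet chooses lambda by cross-validation.
library(glmnet)

set.seed(42)
n <- 1000; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)

cvfit <- cv.glmnet(X, y, alpha = 0.5, nfolds = 10)

coef(cvfit, s = "lambda.1se")                      # sparse coefficient vector
predict(cvfit, newx = X[1:5, ], s = "lambda.min")  # predictions at chosen lambda
```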
randomForest is R's implementation of Breiman's computationally intensive algorithm, which combines bootstrap aggregation (bagging) of decision trees with random feature selection at each split. It satisfies all my criteria except resources/performance, where it lags the choices above. Though randomForest is thus consigned to smaller-N problems for me, its prediction performance is always near the top of the list.
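A sketch on a deliberately small N, again with invented data, showing the feature importance and prediction accessors:

```r
# Sketch: randomForest on simulated data; ntree and the data are
# illustrative. importance = TRUE computes per-feature importance.
library(randomForest)

set.seed(42)
df <- data.frame(x1 = rnorm(500), x2 = runif(500), x3 = rnorm(500))
df$y <- sin(df$x1) + 2 * df$x2 + rnorm(500, sd = 0.25)

fit <- randomForest(y ~ ., data = df, ntree = 500, importance = TRUE)

importance(fit)                  # %IncMSE and IncNodePurity per feature
head(predict(fit, newdata = df))
```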
Interestingly, each of the other discussants' choices overlapped mine to some degree, providing at least partial affirmation of my picks. For R data scientists, the good news – as well as the bad – is that there are so many competing models from which to choose.