R for Data Science with Hadley Wickham

Tuesday, September 15, 2015 - 17:15

Hadley Wickham wasn't available at precisely the scheduled time for our Skype chat, but that was no problem on my end: he was deep in programming mode on the next version of his R ggplot2 statistical graphics package, which will be released later this year. While at it, the prolific developer is putting the finishing touches on the second edition of the accompanying ggplot2 book. Also on the docket for the fall are updates to data management library dplyr and data loading package readr, in addition to enhanced functionality for interactive graphics library ggvis. R analysts are the beneficiaries of this largesse.

Though Wickham didn't recall, we met at useR! 2007 at Iowa State University, where he was then a PhD student in statistics. An R proselyte since 2000, I've seen Hadley emerge as an R tour de force over the last 10 years. Received with some apprehension as a “newbie” in the early days, Wickham is now rightly recognized as a leading contributor to and spokesperson for the R community.

Wickham is probably more closely aligned with the data-driven and computational branches of statistical science – i.e., data science – even though he was trained in a traditional mathematics-based statistics PhD curriculum. He fully subscribes to the case for evolving undergraduate statistics curricula: “A growing number of students are completing undergraduate degrees in statistics and entering the workforce as data analysts. In these positions, they are expected to understand how to utilize databases and other data warehouses, scrape data from Internet sources, program solutions to complex problems in multiple languages, and think algorithmically as well as statistically. These data science topics have not traditionally been a major component of undergraduate programs in statistics. Consequently, a curricular shift is needed to address additional learning outcomes.”

At the same time, Hadley agrees with eminent Stanford statistician Brad Efron that the skepticism and formal modeling of Statistics remain essential – and contrast with the frequent credulousness of machine learning and big data. He's quick to note that Statistics simultaneously supports both informal exploration and formal inference.

Looking beyond R's purely statistical capabilities, Wickham sees his primary role in the R community as teaching and developing tools for data scientists. Several of his R packages, such as ggplot2, dplyr, tidyr, lubridate, and readr, don't so much provide new functionality as they do a consistent grammar and coherence around the critical data science tasks of data ingestion, munging, tidying, management, and exploration. Like BASF, he sometimes doesn't provide totally new capabilities, but rather makes existing capabilities better. Yes, as critics have claimed, much of the work could be accomplished with base R capabilities, but at a quirky and often frustrating programming cost. And Hadley's tools fit together seamlessly.

Having traded a full-time academic position at Rice University in Houston for an adjunct teaching role, Wickham now has his dream job at R commercial open source vendor RStudio. The company makes money selling server versions of its ubiquitous development and web application framework platforms. Hadley sees a hybrid role for himself at RStudio and in the R community – statistician, computer scientist, cognitive psychologist and communicator/educator – and seems equally facile with each. At RStudio, he gets to do all the things he loves: design and program tools for R data scientists, conduct statistics and data analysis, write books and papers, and present to the R and statistical communities. Life is good.

Like many, Wickham acknowledges that R isn't the perfect language, but argues convincingly for its functional capabilities in his excellent book Advanced R. Sensitive to R's competition with Python for the mantle of leading data science language, Hadley prototypes capabilities in R to compete with established Python libraries like the screen scraper Beautiful Soup. Python package developer Wes McKinney, author of pandas, which competes with R's data frames, in turn does similar work for Python. The R-Python competition is nothing but good for the data science world.

Wickham accepts R's limitations as an interpreted language that holds its data in memory, but notes that machines with a terabyte of RAM and R's ability to work alongside tools such as SQL and SparkR mitigate the deficiencies. Better to have R do well what it's good at rather than try unsuccessfully to be all things to all analysts. At the same time, he agrees with John Chambers, creator of the S programming language and a core member of the R project, that having multiple ways of accomplishing a task in R is a benefit, not a liability.

Given its growing presence in the commercial world and its establishment as the lingua franca of academic statistical computing, Wickham sees a bright future for R. He opines that Microsoft's acquisition of commercial R vendor Revolution Analytics was on balance a significant positive for R, providing validation of the platform's capabilities to the larger marketplace.

I await his updated fall statistical goodies.
