Inquidia recently completed our seventh season of college recruiting, interviewing 30 candidates from three top universities, ultimately hiring two Spring 2015 grads. The newbies include a natural sciences major and a joint econ/computer science student.
We look for expertise in the scientific method, statistics/machine learning, data management, and computation – which we consider the foundations of both business intelligence and data science. Ideally, students will have an intermediate undergraduate level of statistical knowledge that includes experimental design, regression and forecasting. Data skills include SQL and wrangling/munging with scripting languages like Python and Ruby. Computation can include programming expertise in Java/C++, Python/Ruby and R/SAS. The expertise can be acquired through coursework or self-taught. A capstone project or internship combining stats, data and programming is ideal.
I'm often asked what majors produce these candidates. The good and bad news: There seems to be no "perfect" curriculum for Inquidia's needs. Inquidia hires computer science, engineering, science, business, math, econ, actuarial science and statistics majors – as well as some humanities students.
We generally find that specific majors aren't a perfect fit – and are often pleasantly surprised by "non-predictor" candidates who connect the academic/commerce dots in the interview process. Often it's the research and internship work that seals the deal. Computer science and math majors can be great but can also be overly theoretical, while students without pertinent work experience may not have the requisite data and computation skills. Several recent physical/natural science majors with strong research backgrounds have become capable apprentices.
The Statistics Challenge
A statistics major back in the day, I naturally gravitate towards the discipline when first looking at resumes. After all, exploratory data analysis and predictive analytics/modeling have statistical foundations. Alas, I'm pretty disappointed with most current statistics curricula, which seem to have changed little from my time, emphasizing high-end math to the detriment of computation and relying on curated, well-behaved data to exercise the methods.
Indeed, I corroborate the observations of statistics PhD data scientist Rachel Schutt, who noted on starting her first post-graduation job at Google, "It was clear to me pretty quickly that the stuff I was working on at Google was different than anything I had learned in school when I got my PhD in statistics...But there were many skills I had to acquire on the job at Google...(I) picked up more computation, coding, and visualization skills, as well as domain expertise while at Google.”
Schutt's Columbia advisor, eminent statistician Andrew Gelman, piles on in distinguishing academic statistics from commercial data science. "There’s so much that goes on with data that is about computing, not statistics....statistics is the fun part and the part that, in this context, is new. The tech industry has always had to deal with databases and coding; that stuff is a necessity. The statistical part of data science is more of an option. To put it another way: you can do tech without statistics but you can’t do it without coding and databases.”
All's not gloom with undergraduate statistics, fortunately, as a recently-published paper, “Data Science in Statistics Curricula: Preparing Students to 'Think with Data'”, attests. The authors first elaborate on the frustrations reported by Inquidia, Schutt, and Gelman:
“A growing number of students are completing undergraduate degrees in statistics and entering the workforce as data analysts. In these positions, they are expected to understand how to utilize databases and other data warehouses, scrape data from Internet sources, program solutions to complex problems in multiple languages, and think algorithmically as well as statistically. These data science topics have not traditionally been a major component of undergraduate programs in statistics. Consequently, a curricular shift is needed to address additional learning outcomes....To be relevant and able to tackle the data-rich problems of today, as well as the ever-increasing data challenges for the future, our students’ statistical problem solving skills must be enhanced to incorporate practical computational aspects.... computational thinking involves “the thought processes involved in formulating problems and their solutions so that the solutions are represented in a form that can effectively be carried out by an information-processing agent.”
The paper continues: "As statisticians our first inclination is to focus on the inferential questions instead of the computational aspects. However, as we encounter more and more complicated questions, data structures, and algorithms, the computational and algorithmic aspects become an integral part of arriving at a principled and statistically sound solution....(D)ata science is the computational aspect of carrying out a complete data analysis, including acquisition, management, and analysis of data (Johnstone & Roberts, 2014). We would also add data munging and manipulating, visualization, and statistical machine learning to the NSF list of data analysis tasks.”
The curriculum prototypes proffered by the authors offer encouragement to organizations seeking budding data scientists. All seven appear to have merit, but two courses/programs, one offered jointly by UC Berkeley and UC Irvine, the other from Johns Hopkins University, are noteworthy.
Johns Hopkins' Courses
The Data Science Specialization, developed by Johns Hopkins professors Roger Peng, Jeff Leek, and Brian Caffo, brings the university curriculum the trio developed to the Coursera massive open online course (MOOC) platform. The nine segments (each four weeks) cover the breadth of data science: (1) The Data Scientist’s Toolbox; (2) R Programming; (3) Getting and Cleaning Data; (4) Exploratory Data Analysis; (5) Reproducible Research; (6) Statistical Inference; (7) Regression Models; (8) Practical Machine Learning; and (9) Developing Data Products.
Inquidia consultants routinely take the Johns Hopkins program and without exception find it worthwhile. My nephew, a biochemistry PhD now self-training in bioinformatics, has completed six of the segments and is a believer. And I know someone who recently completed an MS in Analytics program who swears the Data Science Specialization is a more rigorous curriculum.
The long-overdue expansion of college statistics curricula into data and computation is nothing but good for the analytics world. If I were just beginning my college days, I could think of no major I'd prefer to data science-complected statistics. I'd encourage undergrads seeking entry into data-driven business to give it serious consideration.