One of my colleagues recently asked if I'd do him a favor and discuss opportunities with a “data scientist” friend of his. I agreed, and the friend and I had a nice discussion about his insurance/risk modeling background. Over the years, he's done interesting work combining knowledge of the industry with SPSS predictive models, supported by the data engineering capabilities of his company's IT department. In the end, though, I confided to my colleague that I pegged the friend as a business analytics expert rather than as a data scientist.
A data scientist, to my thinking, has business knowledge and is skilled in both computation and statistics/machine learning. I kind of like the saying that data scientists can program better than statisticians and have more statistical chops than programmers. Many in the industry, though, equate a data scientist with a business statistician.
I recently ran into a pertinent presentation on the topic by statistician Diego Kuonen. There's a lot to like about Kuonen's position, especially as he takes on the data hype. The definition of data science he cites, “the scientific study of the creation, validation and transformation of data to create meaning,” is noteworthy. Alas, I was ultimately disappointed by Kuonen's declaration that “data science is rather a rebranding of data mining,” borrowed from “Data Science for Business” by Foster Provost and Tom Fawcett.
A broader positioning of data science, one that includes computer science and database systems in addition to machine learning, was offered by statistician Michael Mout in a KDnuggets blog post a while back. And statistical colleagues Schenker, Davidian and Rodriguez agree that data science is more than just statistics: “Ideally, statistics and statisticians should be the leaders of the Big Data and data science movement. Realistically, we must take a different view. While our discipline is certainly central to any data analysis context, the scope of Big Data and data science goes far beyond our traditional activities. As Bob noted in his column, the sheer scale and velocity of the data being generated from multiple sources requires new data management and computational paradigms. New techniques for analysis and visualization must be developed. And communication and leadership skills are critical.”
These disparate positions are not unlike those I noted in a blog post several months ago that compared “Data Science for Business” unfavorably with the outstanding “Doing Data Science: Straight Talk from the Frontline” by Rachel Schutt and Cathy O'Neil. Schutt confides that early career experiences shaped her thinking on data science. “It was clear to me pretty quickly that the stuff I was working on at Google was different than anything I had learned in school when I got my PhD in statistics...But there were many skills I had to acquire on the job at Google...(I) picked up more computation, coding, and visualization skills, as well as domain expertise while at Google.”
Perhaps it's instructive to depict the work of Provost and Fawcett as the New York University School, where Provost teaches, competing with Schutt and O'Neil's Columbia, where the two teach and Schutt earned her PhD. The NYU thinking equates data science with data mining/machine learning, relegating “data engineering” to a support role. The Columbia approach, on the other hand, sees the data scientist as “someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning… She spends a lot of time in the process of collecting, cleaning and munging data… This process requires persistence, statistics and software engineering skills.”
In his not-to-be-missed blog, Schutt's adviser at Columbia, noted statistician Andrew Gelman, is even more unabashed in chronicling differences between data science and statistics. “There’s so much that goes on with data that is about computing, not statistics. I do think it would be fair to consider statistics (which includes sampling, experimental design, and data collection as well as data analysis (which itself includes model building, visualization, and model checking as well as inference)) as a subset of data science... The question then arises: why do descriptions of data science focus so strongly on statistical tasks? (As Schutt and O’Neil write, “the media often describes data science in a way that makes it sound like as if it’s simply statistics or machine learning in the context of the tech industry.”) I think it’s because statistics is the fun part and the part that, in this context, is new. The tech industry has always had to deal with databases and coding; that stuff is a necessity. The statistical part of data science is more of an option. To put it another way: you can do tech without statistics but you can’t do it without coding and databases.” A second Gelman post, “Statistics is the least important part of data science,” reiterates these points.
Count me squarely in the Columbia camp of Schutt, O'Neil and Gelman in emphasizing the importance of both computation and statistics in defining data science. One question I ask myself about every prospective Inquidia (formerly OpenBI) data scientist is how long it would take them to become proficient in the growing Spark ecosystem, which combines big data, programming and statistics/machine learning. A big hurdle in front of any of the three will likely end the discussion early. And as Inquidia approaches college recruiting in October, I'll trade two courses in statistics for one in computation in a heartbeat.