Causal Modeling for Data Science

Monday, May 11, 2015 - 14:30

A few weeks ago, I received a Facebook link to the lecture notes “Research Design for Causal Inference”, from the Harvard graduate class Government 2001, “Advanced Quantitative Research Methodology”, taught by Professor Gary King.

Driven by a chance meeting and a common passion for R, I befriended Gary eight years ago and have subsequently benefited from both his work and research from Harvard's Institute for Quantitative Social Science (IQSS), which he heads. As I noted in a previous blog, some of the groundbreaking methodology being developed by the IQSS is a bellwether for what the business world calls data science.

This isn't by accident. “The IQSS has progressed quite a ways from the traditional political science methods of surveys, government statistics and one-off studies of people, places and events. In fact, even though King calls what he and his colleagues do quantitative social science, he’s just as comfortable with the business monikers of big data analytics and data science.”

I suspect Gary wouldn't be comfortable with big data/analytics as proselytized by the 2013 best-seller “Big Data: A Revolution That Will Transform How We Live, Work, and Think”. For authors Viktor Mayer-Schönberger and Kenneth Cukier, big data with correlations is often good enough, especially in the prediction world, where it's not necessarily important to understand underlying causes. “...big data may offer a fresh look and new insights precisely because it is unencumbered by the conventional thinking and inherent biases implicit in the theories of a specific field.”

For King, though, data's meaning is only uncovered through the analytics that give them life. His is a quest for interpretation, theory validation and “causal inference”.

This “prediction vs explanation” divide is often encountered in the statistical world, with commerce generally more focused on pure prediction, while research/academia is obsessed with explanation. And with model building, there's a trade-off between minimizing explanation “bias” and prediction “variance”. “In explanatory modeling the focus is on minimizing bias to obtain the most accurate representation of the underlying theory. In contrast, predictive modeling seeks to minimize the combination of bias and estimation variance, occasionally sacrificing theoretical accuracy for improved empirical precision.”

Thus, with prediction quality first and foremost, business may trade interpretable, theory-challenging models for “black box” algorithms that deliver smaller prediction errors – i.e. accept some bias in exchange for accuracy.

Research Design for Causal Inference is all about having the statistical cake and eating it too. Consider an example of a marketing campaign in which a company is interested in contrasting the impact of two different offers on the buying behavior of its customer/prospect base. Ideally, each prospect would be subjected to both offers, after which purchase outcomes could be measured and compared. Alas, that's generally not feasible and leads to the fundamental problem of causal inference: for each “subject” the “outcome” for only one of the “treatments” is ever observed.
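
To make the fundamental problem concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the purchase rates (10% under offer A, 13% under offer B), the field names and the simulated prospects are hypothetical, not figures from the lecture notes. The simulation "knows" both potential outcomes for every prospect, but the analyst's data set only ever reveals one of them.

```python
import random
from statistics import mean

def make_prospects(n, seed=0):
    """Simulate both potential outcomes for each hypothetical prospect."""
    rng = random.Random(seed)
    prospects = []
    for _ in range(n):
        u = rng.random()
        prospects.append({
            "y_a": int(u < 0.10),  # would this prospect buy under offer A? (assumed 10% rate)
            "y_b": int(u < 0.13),  # would this prospect buy under offer B? (assumed 13% rate)
        })
    return prospects

def observe(prospects, seed=1):
    """Give each prospect exactly one offer; the other outcome is never seen."""
    rng = random.Random(seed)
    observed = []
    for p in prospects:
        offer = rng.choice("AB")
        observed.append({
            "offer": offer,
            "y_a": p["y_a"] if offer == "A" else None,
            "y_b": p["y_b"] if offer == "B" else None,
        })
    return observed

prospects = make_prospects(20_000)

# Only computable inside the simulation, where both outcomes are known.
true_effect = mean(p["y_b"] - p["y_a"] for p in prospects)

# What an analyst actually gets: one outcome per prospect, the other missing.
observed = observe(prospects)
print(f"true average effect of B over A: {true_effect:.3f}")
print("first observed records:", observed[:3])
```

No individual-level difference `y_b - y_a` can be computed from `observed`; only group-level comparisons are possible, which is why the way prospects end up in each group matters so much.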

So how do analysts determine an unbiased assessment of the relative performance of the offers? The current platinum standard is, of course, the experiment in which prospects are randomly assigned to the offer A and offer B groups. Randomization serves to dampen potential bias, since factors other than the offer itself that could influence buying behavior should be “equal” between the groups. The most menacing of such potential biasing factors is often “selection”, in which the groups differ systematically by attributes such as geography or income that might reasonably explain purchase behavior instead of the offer itself.
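
The contrast between randomized assignment and “selection” can be sketched with a small simulation. The numbers below are assumptions chosen for illustration: a hypothetical true lift of 5 points for offer B, a 20-point purchase boost for high-income prospects, and a selection mechanism in which high-income prospects are far more likely to receive offer B.

```python
import random
from statistics import mean

TRUE_LIFT = 0.05  # assumed: offer B raises purchase probability by 5 points

def simulate(n, randomized, seed=0):
    """Return (got_offer_b, bought) pairs for n hypothetical prospects."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        high_income = rng.random() < 0.5
        if randomized:
            # Coin flip: assignment is unrelated to income.
            offer_b = rng.random() < 0.5
        else:
            # Selection: high-income prospects mostly end up with offer B.
            offer_b = rng.random() < (0.8 if high_income else 0.2)
        p_buy = 0.10 + (0.20 if high_income else 0.0) + (TRUE_LIFT if offer_b else 0.0)
        rows.append((offer_b, int(rng.random() < p_buy)))
    return rows

def naive_difference(rows):
    """Purchase rate under offer B minus purchase rate under offer A."""
    rate_b = mean(bought for offer_b, bought in rows if offer_b)
    rate_a = mean(bought for offer_b, bought in rows if not offer_b)
    return rate_b - rate_a

randomized_diff = naive_difference(simulate(50_000, randomized=True))
selected_diff = naive_difference(simulate(50_000, randomized=False))

print(f"assumed true lift:         {TRUE_LIFT:.3f}")
print(f"randomized groups:         {randomized_diff:.3f}")
print(f"self-selected groups:      {selected_diff:.3f}")
```

Under these assumptions the randomized comparison lands close to the true lift, while the self-selected comparison is inflated because income, not the offer, drives much of the difference between groups.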

But what if a randomized experiment isn't feasible? How can the potential bias of uncontrolled factors be minimized? This is where Research Design for Causal Inference (RDCI) shines, offering state-of-the-art methods to analytics practitioners.

One technique that's often used in non-experimental settings to mimic randomization is “matching”, in which factors that may differ systematically between treatments are identified and balanced between groups, so that “other things are equal”. Another common statistical ploy is “blocking” of nuisance factors to reduce variation. “Propensity models” statistically summarize the influence of outside factors and adjust the observed outcomes accordingly. Which approach is best depends on the situation and must be finessed by analysts.
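
As a rough illustration of the matching idea, the sketch below compares offers only among prospects who share the same value of an observed factor, then averages those within-group differences. It is a deliberately simplified stand-in (exact matching on a single binary covariate, sometimes called subclassification), not the specific methods from the lecture notes. The data, the `high_income` factor and the assumed 5-point lift are all hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean

TRUE_LIFT = 0.05  # assumed true effect of offer B in this simulation

def simulate_observational(n, seed=0):
    """Non-randomized data: high-income prospects mostly receive offer B."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        high_income = rng.random() < 0.5
        offer_b = rng.random() < (0.8 if high_income else 0.2)
        p_buy = 0.10 + (0.20 if high_income else 0.0) + (TRUE_LIFT if offer_b else 0.0)
        rows.append({
            "high_income": high_income,
            "offer_b": offer_b,
            "bought": int(rng.random() < p_buy),
        })
    return rows

def offer_difference(rows):
    """Purchase rate under offer B minus purchase rate under offer A."""
    rate_b = mean(r["bought"] for r in rows if r["offer_b"])
    rate_a = mean(r["bought"] for r in rows if not r["offer_b"])
    return rate_b - rate_a

def stratified_estimate(rows, factor):
    """Compare offers within each level of `factor`, weighting by group size."""
    strata = defaultdict(list)
    for r in rows:
        strata[r[factor]].append(r)
    return sum(len(g) / len(rows) * offer_difference(g) for g in strata.values())

rows = simulate_observational(100_000)
naive = offer_difference(rows)
adjusted = stratified_estimate(rows, "high_income")

print(f"assumed true lift:   {TRUE_LIFT:.3f}")
print(f"naive comparison:    {naive:.3f}")
print(f"balanced on income:  {adjusted:.3f}")
```

The adjustment works here only because the biasing factor is observed and is the sole source of imbalance in the simulation. With many factors, exact matching runs out of comparable prospects, which is one motivation for summarizing them in a propensity model instead.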

I'm a big fan of causal inference and think it's an important tool for data scientists. Though my “bias” is more to optimize prediction over explanation, and I sometimes feel theory can be too constraining, I believe the use of research designs for causal inference should be central to analytic methodology. Perhaps a data science linkage: theory (maybe)-->data-->designs-->analytics is appropriate. The ability to assure the reliability and validity of analytic findings is at stake.

