Big Data Science Technologies

The ability to run queries and programs on data interactively is a central capability for the practice of data science. Python and R have superb tools for interactive programming, visualization, and so on. The picture gets more complicated when the data volume is so large that the analysis cannot run on a single computer. There are several options we find ourselves using to perform interactive analysis on "big" data sets.

Hive

Hive has long been popular because SQL is so widely adopted and adapting business intelligence systems to Hive is straightforward. Additionally, as a declarative language, it is easy to use interactively. The sticking point is that it is inconvenient to capture query results in a more expressive language for higher-end analysis and visualization. We solve this problem with some custom code that converts Hive query results to a Python/Pandas DataFrame or an R data frame. In this arrangement, the "big data" code is written in HiveQL (SQL) and the higher-end analysis is done in Python or R. The advantage of this approach is that it is accessible and can be trivially deployed on almost any existing Hadoop cluster. The downside is that all "big data" code must be written in HiveQL rather than Python or R, which limits what kinds of programs can run in-cluster. This limitation can be eased, since Hive supports Java user-defined functions.
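To illustrate the handoff, here is a minimal sketch of the Python side of that conversion. It assumes the hive command-line client is on the PATH and the cluster is reachable; the table and query are hypothetical, and production code would handle types and errors more carefully.

    import io
    import subprocess

    import pandas as pd

    def hive_query_to_df(query):
        """Run a HiveQL query via the Hive CLI and return a pandas DataFrame."""
        result = subprocess.run(
            ["hive", "--hiveconf", "hive.cli.print.header=true", "-e", query],
            capture_output=True, text=True, check=True,
        )
        # The CLI emits tab-separated rows; the hiveconf above adds a header row.
        return pd.read_csv(io.StringIO(result.stdout), sep="\t")

    # Hypothetical usage: aggregate in-cluster, analyze the small result locally.
    df = hive_query_to_df("SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")
    print(df.describe())

The pattern is the point: the heavy lifting stays in HiveQL, and only the (small) aggregated result crosses over into Pandas or R.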

Spark

Spark is an in-memory compute engine for big data. It is written in a functional language, Scala, and has APIs for Scala, Java, and Python. The Python API allows arbitrary Python code to run in-cluster. When combined with an interactive Python interpreter such as IPython (with the web notebook), it makes a superb interactive analysis tool. The host of excellent libraries available from the Python community can run in-cluster as well (though they must be installed on the compute nodes). Spark's ability to compute on data in memory brings a level of interactivity that until now has been impossible. We think Spark will grow in importance over time, and that it may be worth the investment to bring Spark to your cluster environment.
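As a sketch of what this looks like in practice (assuming a PySpark installation; the HDFS path and record layout are hypothetical):

    from pyspark import SparkContext

    sc = SparkContext(appName="interactive-analysis")

    # Any Python callable can be shipped into the cluster, including functions
    # that use third-party libraries installed on the compute nodes.
    def parse_line(line):
        user, amount = line.split("\t")
        return (user, float(amount))

    # cache() keeps the parsed data in cluster memory, so repeated
    # interactive queries against it return quickly.
    events = sc.textFile("hdfs:///data/events").map(parse_line).cache()

    print(events.count())
    print(events.filter(lambda kv: kv[1] > 100.0).take(5))

The first action materializes the cached data; subsequent queries reuse it, which is what makes the in-memory model feel interactive from a notebook.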

Pig

Pig has been the tool of choice for assembling big data workflows, i.e., processes that require many steps and produce numerous intermediate results. The language is convenient, imperative, and expressive. It provides interfaces for several types of user-defined functions, which give neatly abstracted access to the Hadoop Mapper and Reducer interfaces, the partitioner, and data readers and writers. We view Pig as the default target language for complex big data processes, and Hive+Python/R or Spark as the prototyping environment.
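As one illustration, Pig's user-defined function interface can be reached from Python (run under Jython). The sketch below is hypothetical: the function name, schema, and the REGISTER statement shown in the comments are ours, not from a real workflow.

    # Pig would load this script with something along the lines of:
    #   REGISTER 'myudfs.py' USING jython AS myfuncs;
    #   clean = FOREACH raw GENERATE myfuncs.normalize_url(url);
    try:
        outputSchema  # injected by Pig's Jython engine at runtime
    except NameError:
        # No-op stand-in so the module also loads outside Pig.
        def outputSchema(schema):
            def wrap(fn):
                return fn
            return wrap

    @outputSchema("url:chararray")
    def normalize_url(url):
        """Drop the query string and lower-case the URL for grouping."""
        if url is None:
            return None
        return url.split("?", 1)[0].lower()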

