Even as recently as five years ago, most folks had one of two technologies in place for managing their data: relational databases or nothing. Some data was just too big and too unstructured to be worth managing. With the popularization and increasing stabilization of Hadoop, there's a new kid in town, and it's a pretty big deal. Like, an elephant-sized deal.
The Hadoop Ecosystem
Hadoop is now the standard for dealing with big data, and, at this point, is quite mature. Many of our clients have deployed Hadoop to meet their scalable storage, processing and, increasingly, query needs. It is not, however, just one thing -- it is a giant ecosystem of tools and technologies that solve a wide variety of problems.
For the uninitiated, Hadoop itself is a distributed filesystem and a framework for distributed computation. The Hadoop ecosystem, however, is vast, containing a variety of tools that interface with Hadoop and/or solve problems related to big data: tools for ingesting data from conventional RDBMSs, process orchestration, cluster management, monitoring, query engines, various distributed databases, machine learning libraries, and so on. We can help you choose the right set of tools from the ecosystem to meet your needs.
Ingesting data into Hadoop
Data is generated outside the cluster. Sometimes in a production transactional system. Sometimes as log files from web and ad servers. Sometimes by event-capture software producing a stream of JSON records. Sometimes it's purchased through an API or exported from a third party. In all cases, there is an appropriate tool to bring it into your big data environment.
We've worked with a variety of technologies to ingest data: Flume, Kafka, Storm, Sqoop, Spark Streaming, even HDFS command-line scripts. Whether you are doing nightly database copies, frequent micro-batches, or streaming live feeds, we'll help you engineer a reliable and scalable data ingestion solution.
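The micro-batch pattern those tools share is simple: buffer incoming events and land them in fixed-size batches rather than one file per record. A minimal pure-Python sketch of the idea (the class and names here are hypothetical, and the sink stands in for an HDFS writer):

```python
import json

# Hypothetical micro-batch ingester: accumulate incoming JSON events and
# flush them in fixed-size batches -- the pattern a Flume agent or Kafka
# consumer applies before landing files in HDFS.
class MicroBatcher:
    def __init__(self, batch_size, sink):
        self.batch_size = batch_size
        self.sink = sink      # callable that lands a batch (e.g. an HDFS writer)
        self.buffer = []

    def ingest(self, raw_event):
        self.buffer.append(json.loads(raw_event))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []

# Usage: collect flushed batches in memory instead of writing to HDFS.
batches = []
b = MicroBatcher(batch_size=2, sink=batches.append)
for e in ['{"id": 1}', '{"id": 2}', '{"id": 3}']:
    b.ingest(e)
b.flush()  # land the final partial batch
print(batches)  # [[{'id': 1}, {'id': 2}], [{'id': 3}]]
```

In production the sink, batch sizing, and delivery guarantees are exactly what tools like Flume and Kafka exist to get right.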
Processing data within Hadoop
Once your raw data is in Hadoop, you'll need to process it into datasets that serve various query and analytical purposes. Java MapReduce has long been the workhorse for Hadoop processing, but higher-level languages and libraries now provide greater data engineering productivity.
Pig is a dataflow language that provides a grammar for specifying relational algebraic operations on nested data structures. When Pig scripts execute, a sequence of MapReduce jobs is automatically generated and linked to produce the end result. If the Pig language does not provide the transformation you need, you can implement user-defined functions, making it extensible for just about any data programming task.
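Conceptually, a Pig `GROUP ... BY` followed by a `SUM` compiles down to a map phase that emits key/value pairs, a shuffle that sorts them by key, and a reduce phase that aggregates each group. A toy pure-Python sketch of that generated plan (data made up for illustration):

```python
from itertools import groupby
from operator import itemgetter

# Map phase: emit (key, value) pairs from raw records.
records = [("nyc", 3), ("sf", 1), ("nyc", 2), ("sf", 4)]
mapped = [(city, n) for city, n in records]

# Shuffle: sort by key so each group's values arrive together,
# as the MapReduce framework does between map and reduce.
mapped.sort(key=itemgetter(0))

# Reduce phase: aggregate each group, like Pig's FOREACH ... GENERATE SUM(...).
totals = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}
print(totals)  # {'nyc': 5, 'sf': 5}
```

Pig's value is that you write the one-line relational statement and it generates, links, and schedules jobs like this for you across the cluster.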
Hive provides SQL access to structured HDFS data using its HiveQL dialect. It's often the language of choice for SQL programmers as they transition to Hadoop. Like Pig, Hive generates a series of MapReduce jobs to produce its results. Hive queries are currently select-only and not meant for interactive BI applications. However, new initiatives like Stinger aim to make Hive a highly performant query engine that also supports ACID transactions.
Spark is perhaps the most exciting distributed processing engine in the Hadoop ecosystem. It provides batch, interactive, and stream processing all within a single framework. Writing big data programs against the Spark API for Python or Scala enables greater development productivity, since many third-party libraries can be integrated without switching languages. Spark's use of in-memory processing promises order-of-magnitude performance improvements over disk-intensive MapReduce-based languages.
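The Spark API expresses a job as a chain of transformations over a distributed collection. Here is the classic word count sketched in plain Python to show that shape -- in actual PySpark the same pipeline would be `sc.parallelize(lines).flatMap(...).map(...).reduceByKey(...)`, executed across the cluster:

```python
from collections import Counter

# Plain-Python stand-in for a Spark word-count pipeline; each step mirrors
# a Spark transformation (the real thing runs distributed and lazily).
lines = ["big data", "big deal"]
words = [w for line in lines for w in line.split()]   # flatMap: line -> words
pairs = [(w, 1) for w in words]                       # map: word -> (word, 1)
counts = Counter()
for w, n in pairs:                                    # reduceByKey: sum per word
    counts[w] += n
print(dict(counts))  # {'big': 2, 'data': 1, 'deal': 1}
```

The productivity win is that this same chained style scales from a laptop prototype to a cluster job with minimal change.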
Pentaho Data Integration
Pentaho Data Integration (PDI) provides unique capabilities for big data programmers. Long known for its excellent ETL capabilities, PDI provides integrated job orchestration between Hadoop and non-Hadoop processes as well as a graphical development environment for MapReduce programming -- enabling the full semantics of its ETL language for mapper and reducer programs.
HCatalog itself is not a processing technology, but it is an essential part of enabling consistent data management across differing processing engines. Organizations often utilize more than one processing language within Hadoop. HCatalog enables each processing language to share a common tabular view of the data. It's essential for avoiding a data mess.
Accessing data in Hadoop
You've ingested and processed your data. Now you want to actually take a look at it. There are a variety of technologies that can be used to extract data from Hadoop, but the real action resides in high-performance querying of data in Hadoop.
Speed-of-thought analysis is a cornerstone of efficient business-user data access. At present, Hive does not return queries "at the speed of thought," and it is neither hard nor unusual to write Hive queries that take 15 or 30 minutes to complete. Impala is a SQL engine that queries natively stored data in HDFS. It provides faster query response by processing data in memory and bypassing the Hadoop MapReduce engine.
Spark SQL is another in-memory SQL engine which provides fast queries against native data in HDFS. It is less mature than Impala, but has potential architectural advantages stemming from its integration with the Spark in-memory computing engine and the rapidly evolving Spark ecosystem of technologies. We see tremendous potential for Spark SQL in the coming months and years.
Embedded Proprietary Databases
Numerous database vendors have enabled their engines to operate in-cluster under YARN. They promise greater query performance due to their unique distributed processing and data storage formats. Of course, these vendors often require you to reload your existing HDFS data into their proprietary file formats...also stored in HDFS.
Orchestration
Most of our clients have existing orchestration technologies in place. Integrating Hadoop processes into existing workflows is a routine part of our work. In some cases we look to Oozie, which is widely used. In others, we find technologies like Pentaho very useful: Pentaho Data Integration has powerful features for job orchestration, scheduling, logging, and monitoring, and it provides a friendlier, more powerful way to orchestrate complicated workflows within and outside of Hadoop. In simple cases, Hadoop jobs can even be driven by simple shell scripts that integrate into other workflow processes.
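At its simplest, orchestration means running pipeline steps in order, logging each one, and stopping at the first failure. A minimal sketch of that shell-level pattern, with placeholder commands standing in for real steps such as a Sqoop import, a Pig script, or a Hive query:

```python
import subprocess
import sys

# Hypothetical three-step pipeline; each step is an external command.
# Here each command is a trivial Python one-liner standing in for a real job.
steps = [
    ("ingest",  [sys.executable, "-c", "print('ingest ok')"]),
    ("process", [sys.executable, "-c", "print('process ok')"]),
    ("publish", [sys.executable, "-c", "print('publish ok')"]),
]

results = []
for name, cmd in steps:
    proc = subprocess.run(cmd, capture_output=True, text=True)
    results.append((name, proc.returncode))
    if proc.returncode != 0:   # fail fast, as Oozie or PDI would on error
        break

print(results)  # [('ingest', 0), ('process', 0), ('publish', 0)]
```

Dedicated orchestrators earn their keep beyond this sketch: retries, scheduling, dependency graphs, alerting, and audit logs.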
Data formats and serialization
There are a wide variety of industry-standard data formats available. Choosing the right one(s) for the job is the challenge. Some data format attributes matter little for small and medium-size data sets but become much more relevant in big data land: the ability to add fields post-hoc, parsing efficiency, storage efficiency, and the feasibility of block compression.
The most ubiquitous formats are JSON, CSV, and XML -- widely used, though not necessarily optimal. Often we find ourselves ingesting data in these common formats and then restructuring it into one of the following for better performance.
Avro is a fast, compact binary data format with a flexible schema. Every Avro file also carries a copy of its schema, which is convenient and contributes to robustness.
Parquet features columnar organization and compression, which substantially improves performance for analytical queries. In exchange, the schema is less flexible -- i.e., there are fewer situations where you can mutate the schema without breaking compatibility with data files written under an incrementally different schema.
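The "add fields post-hoc" attribute is worth a concrete illustration. Avro handles it by resolving each record against a reader schema that supplies defaults for fields the writer didn't know about. This toy sketch mimics that mechanism in plain Python (it is not the Avro library; the schema shape here is simplified):

```python
# Toy illustration of reader-schema defaults: the mechanism that lets you
# add a field later without breaking files written under the old schema.
reader_schema = {"id": None, "name": None, "region": "unknown"}  # 'region' added later

def decode(record, schema):
    # Old records lack 'region'; the reader schema fills in the default.
    return {field: record.get(field, default) for field, default in schema.items()}

old_record = {"id": 1, "name": "alice"}                  # written before 'region' existed
new_record = {"id": 2, "name": "bob", "region": "west"}  # written after

print(decode(old_record, reader_schema))  # {'id': 1, 'name': 'alice', 'region': 'unknown'}
print(decode(new_record, reader_schema))  # {'id': 2, 'name': 'bob', 'region': 'west'}
```

With CSV, by contrast, an added column silently shifts or breaks every old parser -- which is why schema evolution matters more as data volume and file count grow.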
Learning and Analysis
Python + notebook, and libraries
The IPython notebook, together with Pandas, NumPy, SciPy, and scikit-learn, is an incredibly powerful tool set for data exploration and data analysis. Data can be brought into Python trivially, via Hive/JDBC or other means. Spark also offers an API that lets a data scientist write arbitrary Python code, using arbitrary libraries, that runs in-cluster and returns results to Python. This is an extremely productive toolset for data exploration and for prototyping systems that are ultimately written for Hadoop or Spark.
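A typical exploration pass is "pull rows back, group, summarize" -- a one-liner in Pandas (`df.groupby("source")["latency"].mean()`). Sketched here with nothing but the standard library, on made-up rows standing in for a Hive/JDBC result set:

```python
# Toy exploration pass of the kind Pandas makes a one-liner: summarize a
# metric per group from rows pulled back via Hive/JDBC (data made up here).
rows = [("ads", 120), ("ads", 80), ("web", 200), ("web", 220)]

by_source = {}
for source, latency in rows:
    by_source.setdefault(source, []).append(latency)

summary = {s: sum(v) / len(v) for s, v in by_source.items()}
print(summary)  # {'ads': 100.0, 'web': 210.0}
```

The notebook workflow shines because each such pass lives next to its output and its narrative, making the analysis reproducible and shareable.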
There is a growing trend to run this kind of deep analytics in-cluster. Many turn to libraries like Mahout, a collection of machine learning algorithms written for Hadoop (and some that are not). Additionally, there is MLlib, Spark's machine learning library.