Inquidia is frequently asked by our clients whether Hadoop can replace their existing RDBMS-based data warehouse and/or large data marts. Of course, being good consultants, we often respond with a series of more detailed questions related to data volume, variety and velocity. If you don’t have “big data”, then you probably don’t need Hadoop. However, many of our clients give responses that suggest Hadoop should be considered. This begs the question: How to get started with Hadoop quickly and with minimal risk?
Replace your RDBMS with Hadoop
A frequent pain point of RDBMS data marts is slow performance due to increasing data volumes. When end user queries begin to crawl, it’s often time to find a new platform. Certainly, there are non-Hadoop options: buy a larger RDBMS server, acquire a purpose-built analytic database technology like Vertica, Actian, Exasol, etc. But, a growing trend is to try the rapidly improving SQL on Hadoop options.
Cloudera Impala, Hortonworks Hive Stinger.next, Presto, Apache Drill, and Spark SQL are among a growing list of technologies aimed at providing interactive query against Hadoop-scale data. These technologies are maturing rapidly. In fact, Cloudera Impala and Hive both recently added analytic windowing functions and improved SQL support --- reducing feature gaps with classic relational databases. New file formats like Parquet and ORC have been invented to compress data and improve query performance. Also, an ever increasing amount of documentation, case studies and benchmarks continues to legitimize these technologies.
However, our clients still ask, “All this is well and good, but how can I quickly determine if these technologies will provide the functionality and performance for my analytic and data access requirements?” Enter Inquidia’s Hadoop Datamart Importer (HDI).
Rapidly Evaluate High Speed Query Access In Hadoop
We created HDI to help our clients rapidly migrate a set of datamart tables (e.g. a star schema) into Hadoop and then enable query access through Impala or Hive. With HDI, our clients can quickly assess whether SQL on Hadoop options solve their performance issues -- without having to invest a lot of time and money training developers on Hadoop and migrating ETL processes to MapReduce (or Pig or Hive or …).
HDI utilizes database metadata to automate the copy of tables from an RDBMS into Hadoop (HDFS). The utility not only migrates the data, but also stores it in a format optimized for query performance. Additionally, it creates the Hive and Impala schemas to query this data. Within hours you can have query access to a copy of your data warehouse in Hadoop.
Beyond the primary “proof-of-concept” use case, HDI can be applied to other common migration problems. It can be used to perform the one-time migration of historical data - avoiding the manual development required for a seemingly mundane task. Also, the scripts that HDI generates can easily be modified to enable incremental processing, allowing you to automate the development of ongoing Hadoop ingestion processes.
How does the Hadoop Datamart Importer Work?
HDI is a parameter driven code generation and execution framework that imports data from RDBMS tables into Hadoop/HDFS, optimizes the storage of HDFS data and enables query access via Hive and/or Impala. HDI first generates Sqoop scripts to automate the import of data into Hadoop from a set of tables residing in a source RDBMS. A second set of optionally generated scripts converts this data into the Parquet file format for optimal query performance. And, a third set of scripts are generated to create the Hive/Impala tables for query access. HDI uses the Pentaho Data Integration platform to generate the scripts and orchestrate their execution.
When generating the Sqoop script, the utility automatically handles several challenges inherent to Sqoop. First, it automatically determines a “split by” column for each extracted table. Ideally, the “split by” column contains uniformly distributed data. Since most modern database designs utilize surrogate primary keys, if HDI detects a primary key column, it will use it. If no primary key is found, HDI selects a column based on semi-intelligent data type rules. A future version of HDI will identify data distributions to inform column selection. Second, HDI ensures Sqoop imports date columns as strings instead of the default “seconds since epoch” integer. This makes data columns much more friendly to query.
HDI Uses High Speed Data Formats
For optimal query performance, we’ve added a process to convert the data into a high-speed format. Sqoop only imports into Text, Sequence, or Avro file formats. However, the most optimized formats for query performance are Parquet or ORC. Thus, the second script that HDI generates converts the Sqoop imported Avro files into Parquet format through the use of a packaged MapReduce jar. This second script can be, optionally, turned off when running HDI -- especially if you do not intend to use Impala.
The final step in the process queries the database metadata to get a list of fields, maps the data types to Hive and Impala compatible data types, and automatically creates the Hive and Impala DDL scripts for table creation.
As stated, HDI can be used to only generate scripts or it can be configured to also execute these scripts. If you do not need to customize or manually execute the three sets of scripts, HDI can be a one-step process. You simply set configuration parameter, launch and sit back while it migrates your data to Hadoop.
Once your data has been migrated, optimized and defined via Hive/Impala tables, you can immediately connect your existing reports and analytics tool of choice such as Tableau or Pentaho to begin analyzing your data. You can quickly verify if SQL on Hadoop can provide the functionality and performance required to replace your data warehouse.
Getting Started with HDI
Given this is an early release, Inquidia Hadoop Datamart Importer is available via Inquidia Consulting services. If you’d like to learn more about it, give us a call at (312) 863-8660 or send an email to firstname.lastname@example.org.
You can be assured that we’ll continue to evolve and extend HDI. Stay tuned to our blogs for more updates.