Experimenting with Hive Insert/Update/Delete and Pentaho: What works, what doesn’t, and what you should avoid
In December of last year, Hive 0.14 was released and introduced transactional tables that allow insert, update, and delete statements. Prior to this release it was not possible to issue an update or delete command against a Hive table. It was also only possible to insert if you were overwriting the entire table or partition, it was not possible to insert an individual record. This left us with the strategies available in my Hadoop: How to Update without Updateblog post which are challenging to implement.
For the past several years, Pentaho Data Integration (PDI) has been able to query data from Hive tables, execute HiveQL statements, and write to HDFS files that underlie a Hive table. Like any standard ETL tool, PDI also has table output, update, and delete steps for classic data manipulation. However, these steps never worked with Hive because Hive did not support record-level inserts, updates or deletes. With the new capabilities in Hive 0.14, I thought I would use PDI to test them out.
The first thing I tested was using the Table Output step, which inserts data into tables. This step contains the capability to use batch inserts, which follows the Hive recommendations for inserts, updates, and deletes perfectly. However, when I configured this step to write to a Hive table, the test failed. Pentaho uses the syntax “insert into table_name…”, but Hive requires the syntax “insert into table table_name…”. This means that you still cannot use the Table Output step with Hive. Since PDI generates inserts without the Hive required “table” keyword, its Insert/Update, Dimension lookup/update, and Combination lookup/update steps will not work with Hive.
Unlike the Table Output step, the Update step did work. However, don’t be fooled. The Update step includes a “Use batch updates?” option, but Hive still processes every record with an individual update command. And, since each update statement in Hive runs a MapReduce job, this means processing row at a time updates is very slow. In my tests, each record / statement took 30 seconds. This adds up quickly and is impractical if you are updating more than a few records.
Additionally, every insert/update/delete statement in Hive creates its own delta file. If you process 30 records through the Update step in Pentaho you will generate 30 delta files. As these delta files start to pile up the query performance against the Hive table degrades. Hive does include a process that compacts these delta files periodically. Running compactions more frequently can alleviate the delta file problem, but compactions also take cluster resources and have a negative impact on Hadoop performance.
Like the Update step, the PDI Delete step works. One delete statement is still sent per row on the stream, therefore the same performance challenges occur.
Real World Use
Beyond my limited test cases can the new ACID features in Hive be used with Pentaho? Yes, but there are limitations and they require very careful design considerations. Some of the limitations include:
- ACID features in Hive only work with tables based on ORC files. Since Cloudera Impala does not support ORC, these tables will not be accessible from Impala.
- Any Pentaho steps that execute an insert statement will not work.
There are also several design considerations to ensure the best performance possible:
- Since Hive runs a MapReduce job for ever insert/update/delete statement, the Update and Delete steps should not be used unless updating or deleting a very small number of records.
- Running bulk updates/deletes (update table set employees=0 where company in (‘a’,’b’,’c’,’d’)) is better than running 4 update statements, one for each company. Statements like this can be executed using the Execute Row SQL Script step or the SQL job entry.
- Even without the ability to use the Table Output step, data can still be inserted into tables using the Execute Row SQL Script step and SQL job entry. However batch inserts are preferred over single row inserts. Structure the insert statement like insert into table company (company,employees) values (‘a’,10),(‘b’,20) instead of running an insert statement for each company.
With the right design, using the new ACID capabilities in Hive is an attractive option for updating and deleting data in Hadoop. However, care must be taken to account for performance of the DML statements and to limit the impact of proliferate delta files on query performance.
Editors note 3/10/15:
After the writing of this blog, I learned that Pentaho has fixed the insert into table syntax that was preventing the Table Output step from working. This patch has not been formally released. However, after testing this patch I found that the same performance challenges found in the update and delete steps also apply to the insert steps.