The use of the Parquet file format in Hadoop has been steadily climbing since it was initially released, and Inquidia now sees Parquet being used in some form in most of the Hadoop deployments we work with. A lot of this adoption has been driven by the compression and performance benefits of Parquet over non-columnar file formats such as text or Avro. In fact, Parquet is the recommended file format for data that will be accessed via Cloudera Impala, where it provides the best performance of any available format. Hive, Pig, and Spark also see performance benefits from using Parquet. We discussed many of the benefits and challenges of the Parquet file format in much more detail in our recent Hadoop File Formats: It's not just CSV anymore blog post.
However, despite this rise in adoption, it remains challenging to ingest data into Hadoop in the Parquet format. In fact, it was only in November 2014 that the capability to write directly to Parquet files was added to Sqoop*. Other ingestion frameworks for Hadoop, such as Flume, do not currently include Parquet support. This means that in order to store data in Hadoop as Parquet files, you first have to ingest the data into Hadoop as text or Avro files and then run a MapReduce job to convert the data to Parquet.
Not being able to ingest data directly in the Parquet format is obviously a challenge, with the current workaround consuming cluster resources to rewrite the data after it has already been ingested. We presented this challenge to the Inquidia Labs team, and they went to work.
Inquidia Labs Presents the Parquet Output Plugin for Pentaho Data Integration
The Parquet Output Plugin for Pentaho Data Integration allows data to be written to HDFS in the Parquet file format without requiring intermediate steps. This allows for ingestion from virtually any data source into Hadoop as Parquet files, including sources such as Salesforce, Excel, SAP, text files, databases, and many more, enabling optimal query performance directly on your ingested data.
The Parquet Output Plugin requires Pentaho Data Integration version 5.0 or higher. It is recommended that the Parquet Output plugin be used only to write data to HDFS. If you wish to output to a local disk instead of HDFS, the Hadoop client must still be installed on the machine you are using.
The Parquet page size is the minimum chunk of data that must be read from a Parquet file. Larger page sizes result in smaller Parquet files but take longer to read. The default page size is 1024 KB; however, Cloudera generally recommends a page size of 8 KB.
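A rough sketch of the trade-off, with hypothetical sizes (the column-chunk size and the `pages_per_column_chunk` helper below are illustrations, not part of the plugin):

```python
# The page is the smallest unit a reader can fetch from a Parquet column
# chunk, so page size trades storage overhead against read granularity.

def pages_per_column_chunk(chunk_bytes, page_bytes):
    """Number of pages needed to hold a column chunk, rounded up."""
    return -(-chunk_bytes // page_bytes)  # ceiling division

chunk = 64 * 1024 * 1024  # a hypothetical 64 MB column chunk

# With the plugin's default 1024 KB page size: few, large pages,
# so less per-page header overhead (smaller files) but coarser reads.
print(pages_per_column_chunk(chunk, 1024 * 1024))  # 64 pages

# With the 8 KB page size Cloudera recommends: many small pages,
# so a selective read pulls in far less data per page.
print(pages_per_column_chunk(chunk, 8 * 1024))  # 8192 pages
```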
Parquet performs best when the entire Parquet file is stored in one HDFS block. Therefore, it is recommended that you manage your data volumes so that each Parquet file written fits into a single HDFS block.
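One way to apply this advice is to estimate how many rows to write per file so the file lands inside one block. The numbers and the `rows_per_file` helper below are assumptions for illustration only; your block size and compressed row size will differ:

```python
# Sizing sketch (assumed numbers): target one Parquet file per HDFS block
# so no reader ever has to cross a block boundary.

def rows_per_file(block_bytes, avg_row_bytes, fill_factor=0.9):
    """Rows to write per file, leaving ~10% headroom for metadata/footer."""
    return int(block_bytes * fill_factor // avg_row_bytes)

hdfs_block = 128 * 1024 * 1024   # a common HDFS block size
avg_row = 200                    # assumed average compressed row size, bytes

print(rows_per_file(hdfs_block, avg_row))  # roughly 600k rows per file
```

Measuring the average compressed row size on a sample of real output, then splitting the stream at that row count, keeps each file within one block.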
Let us know what you think
Please download, try it out, and let us know what you think.
*Parquet support in Sqoop was added in version 1.4.6. As of this writing, Sqoop 1.4.6 is not yet a stable release; however, CDH 5.3 includes this patch and has Parquet support enabled in Sqoop.