Updated Avro Plugin for PDI Version 2.1.0

Friday, October 31, 2014 - 08:45

The Avro Output Plugin for Pentaho Data Integration allows you to output Avro files using Kettle. Avro files are commonly used in Hadoop because the format supports schema evolution, truly separating the write schema from the read schema.
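
For example, a file written with an older schema can later be read with a newer one, as long as each added field carries a default. A minimal illustration (the record and field names here are hypothetical):

Write schema (the schema the file was created with):

    {"type": "record", "name": "Customer", "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"}
    ]}

Read schema (a later version; the new "email" field resolves via its default):

    {"type": "record", "name": "Customer", "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": null}
    ]}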

System Requirements

  • Pentaho Data Integration 5.0 or above

Installation

Using Pentaho Marketplace

  1. In the Pentaho Marketplace, find the Avro Output plugin and click Install
  2. Restart Spoon

Manual Install

  1. Place the AvroOutputPlugin folder in the ${DI_HOME}/plugins/steps directory
  2. Restart Spoon

Usage

Schema Requirements

Arrays are not supported by this step; it is currently not possible to output an Avro array using the Avro Output Plugin. All other Avro types are supported, including complex (nested) records.
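
For instance, a hypothetical .avsc with a nested record (a complex type) works with this step, while any schema containing an "array" type does not:

    {"type": "record", "name": "Customer", "namespace": "com.example",
     "fields": [
         {"name": "id", "type": "long"},
         {"name": "address", "type": {"type": "record", "name": "Address",
             "fields": [
                 {"name": "city", "type": ["null", "string"]},
                 {"name": "zip", "type": ["null", "string"]}
             ]}}
     ]}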

File Tab

  • Filename - The name of the file to output.
  • Automatically create schema? - Should the step automatically create the schema for the output records?
  • Write schema to file? - Should the step persist the automatically created schema to a file?
  • Avro namespace - The namespace for the automatically created schema.
  • Avro record name - The record name for the automatically created schema.
  • Avro documentation - The documentation for the automatically created schema.
  • Schema filename - The name of the Avro schema file to use when writing.
  • Create parent folder? - Create the parent folder if it does not exist.
  • Include stepnr in filename? - Should the step number be included in the filename? Used for starting multiple copies of the step.
  • Include partition nr in filename? - Used for partitioned transformations.
  • Include date in filename? - Include the current date in the filename in yyyyMMdd format.
  • Include time in filename? - Include the current time in the filename in HHmmss format.
  • Specify date format? - Specify your own format for including the date and time in the filename.
  • Date time format - The date time format to use.
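
As a hypothetical illustration of how these options combine (the separator characters are an assumption, not taken from the plugin's documentation): with "Include date in filename?" and "Include time in filename?" checked, a base filename of sales.avro might come out as sales_20141031_084500.avro, using the yyyyMMdd and HHmmss formats noted above.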

Fields Tab

  • Name - The name of the field on the stream.
  • Avro Path - The dot-delimited path to where the field will be stored in the Avro file. (If this is empty, the stream name will be used. If the schema file exists and is valid, the drop-down will automatically populate with the fields from the schema.)
  • Avro Type - The type used to store the field in Avro. Since Avro supports unions of multiple types, you must select a single type. (If the schema file exists and is valid, the drop-down will automatically limit the choices to types that are available for the selected Avro Path.)
  • Nullable? - Should the field be nullable in the Avro schema? Only used when "Automatically create schema?" is checked.
  • Get Fields button - Gets the list of input fields and tries to map each one to an Avro field by an exact name match.
  • Update Types button - Based on each field's Avro Path, makes a best-guess selection of the Avro Type that should be used.
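
For example (with hypothetical field names): a stream field city given the Avro Path address.city, the Avro Type string, and Nullable? checked would, when "Automatically create schema?" is enabled, be stored inside an address record as a nullable union along the lines of:

    {"name": "city", "type": ["null", "string"]}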

Changelog:

  • Enhancement 1: Added Snappy and Deflate compression support. Users can now choose to write their Avro files using Snappy or Deflate compression (see the sketch after this list). #1
  • Issue 2: Fixed bug where Avro schema file name was not being saved correctly when using a repository. #2
  • Issue 7: Fixed bug where Pentaho was not releasing the hold on the Avro schema file in certain cases. #7
  • Not Tracked: Fixed bug where the Avro Schema Filename browser was not filtering correctly for .avsc and .json files.
  • Not Tracked: Fixed bug where the Get Fields button doubled the $. prefix in the Avro path when "Automatically create schema" was not checked.
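
Snappy and Deflate are the standard Apache Avro container-file codecs, so the compression enhancement is roughly equivalent to the following plain-Java sketch using the Apache Avro library. This is illustrative only, not the plugin's actual code; the schema file, field names, and output path are made up, and Snappy additionally requires the snappy-java library on the classpath.

    import java.io.File;
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroCompressionSketch {
        public static void main(String[] args) throws IOException {
            // Parse the write schema from a (hypothetical) .avsc file.
            Schema schema = new Schema.Parser().parse(new File("customer.avsc"));

            try (DataFileWriter<GenericRecord> writer =
                    new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                // Pick a codec: snappyCodec(), or deflateCodec(level) with level 1-9.
                writer.setCodec(CodecFactory.snappyCodec());
                writer.create(schema, new File("customers.avro"));

                // Write one record; field names must match the schema.
                GenericRecord rec = new GenericData.Record(schema);
                rec.put("id", 1L);
                rec.put("name", "Acme");
                writer.append(rec);
            }
        }
    }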

Get the plugin from GitHub here!

Contact us today to find out how Inquidia can help you collect, integrate, and enrich your data. We do data. You can, too.

Would you like to know more?

Sign up for our fascinating (albeit infrequent) emails. Get the latest news, tips, tricks and other cool info from Inquidia.