Inquidia Labs Presents the PDI Token Replacement Plugin

Tuesday, December 2, 2014 - 08:00

Inquidia Consulting was an early adopter and remains a long-time proponent of the Pentaho Data Integration (PDI) platform, having successfully used and deployed it on hundreds of projects over the past 8 years. It offers excellent performance, a highly productive development environment, ubiquitous deployment options, and the unique ability to work with both classic relational and emerging big data platforms. But perhaps its best attributes are its extensible architecture and an actively engaged community of open-source developers.

PDI provides a plugin architecture that allows extension of the core functionality through the development of new data transformation steps. If a step does not exist for your unique requirement, you can use the plugin architecture to create a new, reusable component. Additionally, Pentaho supports a marketplace through which these plugins can be shared with the broader community. The developer simply publishes the new plugin and, voila, it is available for all to download. Over the years, we have published several plugins to the marketplace. Today, we announce yet another: the Inquidia Token Replacement plugin.

The Token Replacement plugin enables a template-based approach for generating large (or small) string/text objects. You provide the step with a string template containing named tokens, then map the tokens to fields from a stream of input records. The step will generate a newly constructed, token-replaced string for each input record.
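
To make the idea concrete, here is a minimal, standalone sketch of the concept in plain Java. This is an illustration only, not the plugin's code; the class, field names, and values are hypothetical.

    import java.util.Map;

    public class TokenReplacementSketch {
        // Replace each {token} in the template with the matching field value from one input record.
        static String fillTemplate(String template, Map<String, String> record) {
            String result = template;
            for (Map.Entry<String, String> field : record.entrySet()) {
                result = result.replace("{" + field.getKey() + "}", field.getValue());
            }
            return result;
        }

        public static void main(String[] args) {
            String template = "create external table if not exists {table_prefix}{table_name}";
            // One "input record": stream field names map onto the token names.
            Map<String, String> record = Map.of("table_prefix", "stg_", "table_name", "orders");
            // Prints: create external table if not exists stg_orders
            System.out.println(fillTemplate(template, record));
        }
    }

The step does this once per row on the PDI stream, so a thousand input records yield a thousand token-replaced strings.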

Necessity is the Mother of Invention

In our work with clients, we frequently use PDI for more than pure ETL. It is also an orchestration engine with a tremendous ability to be “metadata-driven”. We recently leveraged this with our Hadoop Datamart Importer (HDI) framework. Core to HDI is the ability to create Hive DDL -- one statement per table. The create table statements were functionally similar, differing only in a few table-specific parameters. To do this in PDI, we had four options:

  1. Use a series of Replace in String steps. For our HDI use-case, we’d need 4 or more consecutive Replace in String steps -- not elegant or maintainable.
  2. Use an Add Constants - Concat Fields step sequence. The constants would be the static string chunks of the statement which would have to be concatenated with the dynamic values coming from the database metadata -- again not elegant or maintainable.
  3. Use a Modified JavaScript step. We could construct the command by passing the database metadata into a JavaScript step and concatenating the static string components in JavaScript.
  4. Use a User Defined Java Expression step. Similar to the Modified JavaScript step, but more performant. This is the option we used. The code chunk below illustrates the Java expression that was required. It is a lot of string appends that mix template/constant strings with the variable fields (table_prefix, TABLE_NAME, warehouse_dir, and avro_schema_dir).

     new StringBuilder().append("create external table if not exists ")
         .append(table_prefix).append(TABLE_NAME)
         .append(" ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'")
         .append(" stored as inputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'")
         .append(" outputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'")
         .append(" location '").append(warehouse_dir).append("/").append(table_prefix).append(TABLE_NAME)
         .append("' tblproperties('avro.schema.url'='hdfs://").append(avro_schema_dir)
         .append("/").append(TABLE_NAME).append(".avsc');")
         .toString()

This string concatenation pattern is frequently encountered. In fact, shortly after developing HDI, we began prototyping a solution that was going to require concatenations to construct an entire file. If you think the example above is hard to read, you can only imagine how much worse it looks to create an entire file! Clearly there had to be a better way.

Enter the Inquidia Token Replacement Plugin for Pentaho Data Integration

At the core of our problem is simple token replacement. We want to define a “static” template string which can be dynamically customized by inserting “variable” input strings. This is done through the use of tokens. The code box below shows how we’d construct a template for the Hive table DDL example (tokens are enclosed in curly braces):

    create external table if not exists {table_prefix}{table_name}
    row format serde 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    stored as inputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    outputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    location '{warehouse_dir}/{table_prefix}{table_name}'
    tblproperties('avro.schema.url'='hdfs://{avro_schema_dir}/{table_name}.avsc');
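
For instance, with hypothetical input values such as table_prefix = stg_, table_name = orders, warehouse_dir = /user/hive/warehouse, and avro_schema_dir = namenode/schemas, the step would emit a statement like:

    create external table if not exists stg_orders
    row format serde 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    stored as inputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    outputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    location '/user/hive/warehouse/stg_orders'
    tblproperties('avro.schema.url'='hdfs://namenode/schemas/orders.avsc');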

Notice that we don’t have to deal with all of the Java syntax to define our template. There are no append method calls and we don’t have to concern ourselves with proper quoting. This is much easier to read and develop!

The Token Replacement plugin allows you to read your template string from a file, a field on the stream, or text you enter into the step. It performs token replacement on this template string before outputting a new string to either a file or a new field on the output stream. Tokens can be flexibly defined using opening and closing characters of your choosing -- completely configurable.
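
As a rough illustration of that delimiter idea (plain Java again; the method name and characters here are chosen purely for the example, not taken from the plugin):

    // Hypothetical helper: tokens are delimited by whatever open/close characters you configure.
    static String replaceToken(String template, String open, String close,
                               String tokenName, String value) {
        return template.replace(open + tokenName + close, value);
    }

    // Example: with "%" configured as both the opening and closing character,
    // replaceToken("location %dir%/%table%", "%", "%", "dir", "/warehouse")
    // returns "location /warehouse/%table%"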

Let us know what you think

We have made the Inquidia Token Replacement plugin freely available via GitHub and through the Pentaho Marketplace. Please download it, try it out, and let us know what you think.

Contact us today to find out how Inquidia can help you collect, integrate, and enrich your data. We do data. You can, too.
