Metadata Injection with Pentaho Data Integration

Monday, January 23, 2017 - 18:00

In November of 2010, Pentaho Data Integration (PDI) v4.1 introduced the concept of Metadata Injection.  The promise of creating generalized ETL transformations whose behavior can be changed at run-time became a reality.  However, this initial breakthrough was limited to just 4 steps.   Today, with the recent release of PDI 7.0, there are 74 supported steps, making Metadata Injection more powerful (and applicable) than ever before!

With Metadata Injection, developer agility and productivity accelerates.  Instead of creating and maintaining dozens of transformations built with a common pattern, developers define a single transformation template and change its runtime behavior by gathering and injecting metadata from property files or database tables.  At Inquidia, we’ve seen code lines shrink by as much as 50:1 through the use of Metadata Injection.

When is Metadata Injection Useful?

There is often confusion about the proper use of new technologies, and Metadata Injection is no different.  Over the past couple of years, we have observed many situations that benefited from Metadata Injection and others where its use added unnecessary complexity.  Invariably, our assessments generated questions about why and when Metadata Injection should be used.  So, we decided to update our Inquidia Reflex best practices to codify our thinking.

We took stock of our development experiences and identified the following traits that indicate a good candidate for Metadata Injection:

  1. There is a repeatable processing pattern,
  2. Whose variations are finite and declarative,
  3. And is required for a large number of transformations,
  4. That often have dynamic source(s) and/or target(s).

 

As we continue to innovate, we’d expect this list to evolve, but we think it is a great starting point.   Let’s elaborate on these traits through the use of two examples.  

Imagine you are tasked with creating an export process.  Each export will be sourced using a SQL SELECT statement.  The queried records will be placed into a delimited file with a specific name and directory.  It’s clear that the process has a pattern. (#1)  It simply issues a SELECT statement and writes the rows out to an export file.  The variations between each export are well-defined.  They consist of a distinct SELECT statement, a list of fields with formatting and other metadata, a record delimiter, a field delimiter, a field enclosure, and a filename/directory location.  None of the variations involve optional or new processing steps; rather, they consist of declarative values used to configure existing steps. (#2)  If you only need to do this for one or two exports, you might as well create one distinct transformation per export, but you have requirements for 20 exports and you know your users will be demanding more over time.  In fact, you expect some of the exports to be retired while others are added.  In short, you have volume. (#3)  Finally, since each export has a distinct source query and target file, you have dynamism. (#4)
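
To make the declarative variations concrete, they could be captured in a small metadata table like the hypothetical one sketched below.  The table and column names are illustrative only and are not part of the example that follows later in this post.

    -- Hypothetical metadata table describing each export; names are illustrative.
    -- Each row fully describes one export's runtime behavior.
    CREATE TABLE export_definitions (
        export_name      VARCHAR(100),   -- logical name of the export
        source_sql       VARCHAR(4000),  -- SELECT statement that sources the export
        record_delimiter VARCHAR(5),     -- e.g. CR/LF
        field_delimiter  VARCHAR(5),     -- e.g. ',' or '|'
        field_enclosure  VARCHAR(5),     -- e.g. '"'
        target_file      VARCHAR(500)    -- directory and filename for the output
    );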

Another common scenario involves loading a set of staging tables with delta data from source tables.  Staging tables are often loaded without much transformation, with each load process following the same pattern: a query is issued to get changed data from the source table, and the results are inserted into the staging table.  The process is patterned and repetitive. (#1)  Variation is limited to metadata about the source and target tables. (#2)  Most staging processes deal with dozens if not scores of tables, so there is usually volume. (#3)  And the source and target tables vary. (#4)
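
A minimal sketch of the per-table delta query, assuming each source table carries a last-updated timestamp; the last_updated_dt column and the ${LAST_LOAD_DT} variable are illustrative, not part of the example that follows.

    -- Hypothetical delta extract for one staging table; assumes the source
    -- table has a last_updated_dt column and ${LAST_LOAD_DT} is supplied at runtime.
    SELECT ${FIELD_LIST}
    FROM   ${TABLE_NAME}
    WHERE  last_updated_dt > '${LAST_LOAD_DT}';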

Admittedly, these are straightforward transformations that are easy to describe in a blog.  Rest assured that there are more complex patterns to be found.   For example, we’ve implemented big data ingestion processes that consume varying input feeds into standardized, yet customizable JSON or Avro files as well as complex change data capture patterns that require determination of logical deletes and target record versioning.   Regardless of how simple or complex, all of our experiences (to date) share the four traits outlined above.

A Metadata Injection Example

Enough of the generalized concepts; let’s look at some code!  We’ll expand on the staging load example outlined above.  Here is our repetitive processing pattern for each staging table:

  1. A SQL SELECT is used to extract source data,
  2. Which is written to a delimited text file,
  3. That is bulk loaded into a target staging table.

 

Below is a screenshot of the PDI job for the overall process:  j_run_staging_jobs.

It starts by truncating all of the target staging tables via the t_truncate_tables transformation (label 1; details not shown).  This transformation obtains a list of staging table names, uses a Concat Fields step to construct a “TRUNCATE TABLE table_name;” SQL statement for each, then passes these statements to an Execute SQL Statements step.  Although this transformation does not technically use Metadata Injection, it does utilize metadata and PDI’s ability to dynamically construct and execute SQL statements.
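
As a rough SQL Server sketch (not the transformation itself), the table list and TRUNCATE statements could be derived from the database catalog like this, assuming the staging tables live in a schema named staging:

    -- Rough equivalent of what t_truncate_tables assembles: list the staging
    -- tables, then build one TRUNCATE statement per table.
    SELECT 'TRUNCATE TABLE ' + table_schema + '.' + table_name + ';' AS truncate_stmt
    FROM   information_schema.tables
    WHERE  table_schema = 'staging'          -- assumed schema name
      AND  table_type   = 'BASE TABLE';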

After truncating, we use Metadata Injection within a processing loop to load each staging table (label 2).  The t_get_table_names transformation generates metadata for each staging table to load.  The j_load_staging_table subjob then executes once per staging table, using Metadata Injection to extract source data, generate a file, and bulk load it into the target staging table.

A couple of assumptions for this example:

  • The source and target databases are both SQL Server.  (Metadata Injection is not restricted to sourcing and targeting SQL Server tables; this same pattern could be adapted to just about any JDBC-compliant database that has a bulk load step.)
  • The source and target table structures are identical.  (We did this for simplicity’s sake; we can still use this same pattern if there is a simple mapping between fields or extra audit fields to be added.)

 

The following diagram depicts the transform (t_get_table_names) which acquires initial metadata for each staging table.

The numbered steps of the transformation are summarized below:

  1. Query information_schema.columns to produce the metadata required for each table.
  2. Remove excess fields from the stream.
  3. Use a Group By step to generate a concatenated, comma delimited list of column names for each table.
  4. Copy each row to the result buffer for processing by the subsequent j_load_staging_table job.  (We’ll be using Metadata Injection within this job!)
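
Steps 1 through 3 are roughly equivalent to the following SQL Server query, shown here with STRING_AGG (SQL Server 2017+) standing in for the PDI Group By step; the staging schema name is an assumption.

    -- Rough SQL equivalent of steps 1-3; the transformation itself uses a
    -- Table Input step plus a Group By step rather than STRING_AGG.
    SELECT table_name,
           STRING_AGG(column_name, ', ')
               WITHIN GROUP (ORDER BY ordinal_position) AS field_list
    FROM   information_schema.columns
    WHERE  table_schema = 'staging'          -- assumed schema name
    GROUP  BY table_name;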

 

The output rows for the “crtr: tables_columns” step (#4) will look something like:

table_name   field_list
table_1      column_1, column_2, column_3, column_4, column_5
table_2      column_1, column_2, column_3, column_4
table_3      column_1, column_2, column_3
table_...    column_...

You can see the metadata taking shape; we will use these fields to generate a SQL extract statement like SELECT ${FIELD_LIST} FROM ${TABLE_NAME}.

Next, let’s examine the j_load_staging_table job, which executes once per row returned from the t_get_table_names result set (i.e., one execution per table).  The following diagram depicts and numbers the steps within this job.

  1. Write an entry to the PDI file log denoting the parameters being used, i.e. the table and its column list.
  2. Use Metadata Injection to extract data from our source and generate a delimited file for bulk loading.
  3. Double-check that the data file was produced.
  4. Invoke a SQL Server bulk load process to populate the target table (a T-SQL sketch of this step follows the list).
  5. Delete the generated data file, freeing up space for the next iteration.
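
Here is a hedged T-SQL sketch of the bulk load in step 4; the actual job may invoke bcp or a PDI bulk loader step instead, and the file path and delimiters shown are illustrative.

    -- Sketch of step 4's bulk load, assuming PDI substitutes ${TABLE_NAME}
    -- before execution; path and delimiters below are assumptions.
    BULK INSERT staging.${TABLE_NAME}
    FROM 'C:\pdi_staging\${TABLE_NAME}.txt'  -- assumed file location
    WITH (
        FIELDTERMINATOR = ',',               -- must match the Text File Output delimiter
        ROWTERMINATOR   = '\n',
        FIRSTROW        = 2                  -- skip the header row, if one is written
    );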

 

The t_staging_metainject transformation (pictured below) obtains metadata for the table being processed and injects it into the t_staging_table_load transformation template.    

The metadata to be injected is:

  • Source SELECT statement
  • Target (staging) table name
  • Target (staging) column metadata

 

The source SELECT statement can be constructed from the table_name and field_list input parameters.  In our example, the target table will have the same name as the source table.  The target column metadata is obtained via the “ti: table_col_data_types” Table Input step, which uses the table_name input parameter to pull metadata from the staging database catalog.
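
A rough sketch of the catalog query behind the ti: table_col_data_types step might look like the following; the mapping from SQL Server catalog columns to PDI data types, formats, lengths, and precisions is simplified here and would be handled in the actual transformation.

    -- Simplified column-metadata lookup for one table; the real step may map
    -- SQL Server types to PDI metadata differently.
    SELECT table_schema      AS target_table_schema,
           table_name        AS target_table_name,
           column_name,
           data_type,
           numeric_precision AS length,     -- PDI length (numeric columns)
           numeric_scale     AS precision,  -- PDI precision (numeric columns)
           ordinal_position
    FROM   information_schema.columns
    WHERE  table_schema = 'staging'              -- assumed staging schema
      AND  table_name   = '${TABLE_NAME}'        -- table_name input parameter
    ORDER  BY ordinal_position;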

Once obtained, the metadata is passed into the “emi: t_staging_table_load” Metadata Injection step.   The metadata will look something like:

target_table_schema  target_table_name  column_name  data_type  format      length  precision  ordinal_position  sql_stmnt
staging              customer           customer_id  Integer    0           <null>  <null>     1                 SELECT customer_id, city, state, number_field1 FROM dbo.customer
staging              customer           city         String     <null>      <null>  <null>     2                 SELECT customer_id, city, state, number_field1 FROM dbo.customer
staging              customer           state        String     <null>      <null>  <null>     3                 SELECT customer_id, city, state, number_field1 FROM dbo.customer
staging              customer           number_fld1  Number     0           18      2          4                 SELECT customer_id, city, state, number_field1 FROM dbo.customer
staging              customer           create_date  Date       yyyy-mm-dd  <null>  <null>     5                 SELECT customer_id, city, state, number_field1 FROM dbo.customer

Before explaining how the Metadata Injection step is configured, let’s review the t_staging_table_load template transformation.   This is called a template because its steps are not completely configured, i.e. the steps will have their metadata injected at runtime.    The template transformation is pictured below.

This template is straightforward.  We will inject the source SELECT into the Table Input step (table_name_select) and the column metadata into the Text File Output step (bulk_load_file).   Note that we obtain the target record create_date using a Get System Info step.  There is no metadata injected into this step, but we do include metadata about the create_date column in the Text File Output step.

In the Inject Metadata tab of the “emi: t_staging_table_load” step, we’ll configure the Table Input step to use the SQL SELECT statement (sql_stmnt).  (Note that while sql_stmnt is repeated for every record in the input metadata, it is used only once and will be executed only once.)   We’ll also configure the Text File Output step with metadata using the table_name as the filename and the column metadata for the fields. The column metadata includes column name, format, length, precision, and data type.  

Voila!  We’re done.  All we need to do next is test and debug.  Alas, debugging Metadata Injection jobs, with their dynamic runtime content, is not easy.  Fortunately, PDI provides a helpful feature in the Metadata Injection step’s Options tab:

Setting this option saves the runtime transformation file (.ktr) that is generated after the metadata has been injected.  If your tests don’t produce the desired results, you can open the generated ktr(s) in Spoon and run them individually to find the bug -- a huge timesaver!

Stepping back, imagine that you have 100 staging tables to load.   Before metadata injection, you’d build the process for one table and then pay an army of programmers to copy and paste the remaining 99.   (Not to mention the maintenance costs!)   With Metadata Injection, you are loading 100 tables using just 2 jobs and 4 transformations.  BOOM!  

When Not to Use Metadata Injection

There’s always a flip side.  Let’s next consider when Metadata Injection would not be appropriate.  We have found four indicators that a set of processes is a poor fit for Metadata Injection:

  1. Business rules vary between process instances
  2. Substantial and complex transformation of the data
  3. Slowly changing dimension table loads
  4. Fact table loads

We suspect that there may be other indicators, but as with our traits for adoption, we need to start with what we have found and adjust as we learn more.

The first indicator implies that we cannot define a template transformation that uses the same set of processing steps for each process instance.   We need the variation to be declarative, meaning we can inject a finite set of attributes into well-defined, templated steps.  If the processing requires different steps or step sequences, it will be hard, if not impossible, to implement using Metadata Injection.

Determining whether a transformation pattern is too complex for Metadata Injection is often driven by the availability of injectable steps.  One can imagine a repeatable pattern with declarative variations that is nonetheless complex.  That said, there are steps in PDI that may not be compatible with Metadata Injection.  Before embarking on a Metadata Injection based design, confirm that all of the steps in your process template are supported.  The complete list of steps that work with Metadata Injection can be found on the Pentaho help site.

As of the current PDI release, the Dimension Lookup/Update step (used for maintaining slowly changing dimensions) does not support Metadata Injection.  But even if it did, it is unlikely that you would find repeatable patterns between dimensions.  For example, a transformation for loading dim_customer will likely differ vastly from one loading dim_product.

Considering fact tables: if a transformation template can accommodate the loading of more than one fact table, then there is a strong argument that the fact table design is flawed.  Fact tables typically have varying dimensional grain and metric columns that are computed distinctly for that grain.  These features make each load process unique in ways that are not suitable for Metadata Injection.

The Future of Metadata Injection

As PDI continues to grow, we can be certain that more steps will support Metadata Injection and that the opportunities to apply its magic will grow.  But heed the warning given to Peter Parker as he became Spider-Man: “with great power comes great responsibility.”  Knowing when to apply this capability is just as important as knowing how to use it.

With over ten years of experience working with PDI across many different industries and domains, you can be assured that Inquidia Consulting knows when and how to apply the dynamism and might of Metadata Injection.  If you’d like more information on Inquidia Reflex℠, Metadata Injection, or the rest of our data engineering services and how we can help you find success, drop us a note at info@inquidia.com.  We’d be happy to share our expertise to ensure your success.

Contact us today to find out how Inquidia can show you how to collect, integrate and enrich your data. We do data. You can, too.
