My company, Inquidia Consulting, recently participated in Pentaho World 2014 in Orlando. Pentaho is the successful Data Integration & Business Analytics software vendor we've partnered with for nine years – almost since they cut their teeth with open source BI in 2004. By all accounts the conference was a success, and the Inquidia attendees were no doubt happy with both the technical and business development opportunities presented throughout the two-plus days.
Though I didn't attend the conference, I had the opportunity to spend an hour on a call with Pentaho CEO Quentin Gallivan and Cloudera Chief Strategy Officer Mike Olson immediately following their first day keynotes.
Both execs were giddy about the current opportunities in the big data and analytics markets. Gallivan noted a sea change with big data in just 12 months. Last year, “it was really talking about just big data technology - data integration, how to move data, how to transform data.” Now, “the pendulum has swung. They were all talking about big data use cases.”
One application of particular interest to Pentaho revolves around personalization vendor RichRelevance. “So you get a lot of use cases around big data analytics for the restaurant industry or the oil and gas industry or the shipping industry. And so what they really do is they’re a personalization, reconciliation, customer engagement platform, helping equalize for all other retailers the kind of power that Amazon has.”
Hadoop Integration... And More
Cloudera's Olson stressed the collaboration/interoperability of Hadoop. The “other thing I say, is that it’s also not the all Hadoop all day show. Clearly Cloudera is a big bettor on Hadoop. But most of the vendors today talked about integrated approaches to data. So where’s it coming from? How do we bring different repositories together? How do we move data to the right place and build cross platform pictures of the data. And then build analytics on that.”
Both execs also emphasized the ever-growing “big” in big data. Gallivan mentioned two current Pentaho customers, the first generating 5 billion transactions and 500 terabytes of new information every day, the second with 30 billion transactions and 3 billion new records daily.
Olson went even further: “What I think is most exciting is that I think that this data flood is only beginning now to lap at our ankles....It may be that the aggregate amount of data in the world is going to grow by a factor of 10 in the short term. But it’s not going to be evenly distributed...large enterprises are going to see explosions in the data volumes they work with. While maybe some SME type apps see relatively less.”
The point of departure for Gallivan's Pentaho is the data refinery, in which its flagship Data Integration platform sits at the juncture of traditional databases and Hadoop, “blend(ing), enrich(ing) and refin(ing) any data source into secure, analytic data sets.”
Indeed, Pentaho appears to be focusing more on its data integration advantages now. “Well what we see you doing - what we want you to do is basically, the heavy lifting around the data preparation, the data manipulation, the data profiling and the delivery of the data - so that we can use any tool that we want.... And so we came up with this concept of a data orchestration platform for the data scientist.”
Tomorrow's Applications Approach
Like Gallivan, Olson feels the data refinery must be open to the disparate analytics tool preferences – Spark, Python, R, SAS, et al – of data scientists. In addition, though, he thinks the big data market is primed for an explosion in Hadoop-based analytics applications. “There are fewer of those applications. Stand by though. I think that this year or next year we’re going to see a real proliferation there. There’s a lot of value to be delivered and therefore, a lot of money to be made. And so we’re seeing a lot of exciting stuff go down.”
An area of increasing interest in the big data/analytics space is the interplay of Hadoop and the data warehouse. A few months ago, I wrote on the controversy surrounding the seemingly innocuous Cloudera marketing claim “CLOUDERA-BIG DATA – Turbocharge your data warehouse”. Many in the traditional DW community, including patriarch Bill Inmon, cried foul: “There simply is not the carefully constructed and carefully maintained infrastructure surrounding Big Data that there is for the data warehouse.”
My take was that the Cloudera statement was innocent marketing, and that DW dogma indeed needs modernizing, agreeing with Barry Devlin, “What we need to do is to look at how the data warehouse must evolve and how big data technologies and BI technologies interact to create a new biz-tech ecosystem where all information is maintained according to its value and its meaning.” As Hadoop technologies progress in SQL functionality, they could well become disruptive in the DW space, replacing expensive proprietary DW databases with much lower cost Impala or other Hadoop SQL dialects.
Olson, though, was ever conciliatory, noting that Cloudera had made peace with Inmon and is also now collaborating with DW stalwart Teradata. He opines that both Hadoop and traditional database/ETL technologies have their roles in the modern DW. “A lot of the ETL and data prep is being moved out of the big data warehouse and moving into Hadoop...remember Hadoop is cheap. And so you can keep a copy of the data there. What that means - especially with Impala and search - is that you can engage your business users who have always ached to get their hands on that EDW data.” The analytic workloads that are currently staples of the DW, on the other hand, are “inherently single shared addressed space - got to be able to look at everything. And that just doesn’t parallelize well. So I don’t see that workload in the near term moving to this big scaled out shared nothing Hadoop-y style architecture.”
Gallivan sees Pentaho as the Swiss army knife serving both DW and Hadoop masters, “being kind of in the middle of these next generation architectures.” Discussing a big telephone company customer, he notes, “And so what they store is all the customer profile information, the customer account information. And then in Hadoop they collect all the customer click stream activity. They collect all the customer service kind of activity - it’s more the unstructured. They blend that data...And so we see it as a natural coexistence where both the EDW and the Hadoop infrastructure are critical components to a major customer engagement motion.”
At the end of our enjoyable discussion, both Gallivan and Olson paid homage to the open source that drives both the Pentaho and Cloudera platforms. Gallivan opined, “LAMP stack, analytics, big data - I mean everything from the product standpoint is open - leverage open sources. And everything from how to run your business is SaaS.” And Olson weighed in, “I think platform software wants to be open source. We got to figure out good ways to monetize it. But you’d be crazy out of your minds to build a proprietary platform infrastructure for business today, for the enterprise - in my view.”
Perhaps Olson best summarizes the big data enthusiasm of both execs: “Bottom line, a thousand times more data in the world. If some of us can’t figure out a way to keep making a living, then we should probably be replaced by smarter people.”