Pentaho announced this morning that they will be adding features to Pentaho Data Integration (Kettle) and to their BI suite that make it easy to use Kettle to retrieve, manipulate, and store data in Hadoop, and that integrate Hadoop communication into the reporting and analysis layer.

They posted a nice five-minute screencast on their Hadoop landing page demonstrating a couple of pieces of Hive integration. In it, they retrieve data using Hive, and they also use a Hive user-defined function that is implemented as an embedded Kettle transformation.

I’m very excited to see this announcement.  Besides the significant work we’ve been doing on the Metrics team to integrate HBase into the Socorro project, we also have major plans for our Hadoop clusters for general data storage and processing.

Right now, we have Kettle jobs and transformations that manipulate gigabytes of data per hour, loading it into our data warehouse. One of the things I love about Kettle is the ability to quickly and easily define, review, and extend complex jobs such as our end-of-day data aggregation.

In the future, as we have more data stored in Hadoop, I want to be able to run transformations on that data. Sometimes, if the transformations involve lots of RDBMS work, I’ll want to stream the data out of HDFS. For other types of transformations that involve mostly business logic and text manipulation, being able to run that code directly in a Hadoop MapReduce job will be a fantastic feature.

My personal feeling is that people in the Hadoop community really need something visual and flexible like the Kettle interface for defining and manipulating this type of business logic. Great strides have been made with projects such as Cascading, but they still require writing raw code, and I feel that excludes a lot of people who could be getting work done faster and better if they had a good tool to help them adapt to the world of MapReduce.

Currently, someone can start up Kettle’s GUI and start constructing jobs and transformations simply by piecing together steps of work such as reading a set of text files, applying a regex to them, doing some value lookups, then aggregating the data. If they could then save that transformation and execute it as a Hadoop MapReduce job, I think it would be revolutionary for both the ETL and Hadoop worlds.
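To make the shape of that work concrete, here is a rough sketch in plain Python of how the example steps (regex, value lookup, aggregation) split into a map phase and a reduce phase. This is not Kettle or Hadoop code, and it is not how Pentaho implements anything. The line format, the lookup table, and every function name are hypothetical and exist only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical input format: "<date> <two-letter code> <count>"
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<code>[A-Z]{2}) (?P<count>\d+)$"
)

# Hypothetical lookup table standing in for a "value lookup" step
COUNTRY_NAMES = {"US": "United States", "DE": "Germany"}


def map_line(line):
    """Map phase: regex the line, look up a value, emit (key, value) pairs."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return []  # lines that do not match the pattern are dropped
    country = COUNTRY_NAMES.get(match.group("code"), "Unknown")
    return [((match.group("date"), country), int(match.group("count")))]


def reduce_counts(pairs):
    """Reduce phase: sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_pipeline(lines):
    """Run both phases in-process; a real cluster would shuffle between them."""
    pairs = []
    for line in lines:
        pairs.extend(map_line(line))
    return reduce_counts(pairs)
```

Calling `run_pipeline` on a list of text lines returns a dictionary of totals keyed by `(date, country)`. The regex and lookup steps only ever look at one line at a time, which is what would let them run in parallel as mappers, while the aggregation needs all values for a key together and so belongs in the reducer.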

When Mozilla Metrics starts tackling some of the Hadoop data processing jobs that we have scheduled, we’ll be making significant open source contributions to both communities to realize this vision, and I really hope that it will help widen the accessibility of Hadoop to a new group of potential users.
