Pentaho Data Integration (Kettle)

Welcome to the community home for Pentaho Data Integration Community Edition (PDI CE), also known as Kettle. Pentaho Data Integration delivers powerful Extraction, Transformation and Loading (ETL) capabilities using an innovative, metadata-driven approach. With an intuitive, graphical, drag-and-drop design environment and a proven, scalable, standards-based architecture, Pentaho Data Integration is increasingly the choice for organizations over traditional, proprietary ETL or data integration tools.

Community Edition is self-supported open source software. An Enterprise Edition (EE) of Pentaho Data Integration, including technical support, managed upgrades and enterprise features, is also available. For more information about EE or for screenshots and datasheets, visit Pentaho Data Integration EE on Pentaho's corporate site.

Recent News and Releases

- 2012-04-20 - Stable build of Kettle 4.3 released: More info.
- 2011-09-12 - Stable build of Kettle 4.2 released: download now.
- 2011-07-01 - Release Candidate 1 of Kettle 4.2 released: download now.
- 2010-11-30 - Stable build of Kettle 4.1 released: download now.
- 2010-11-30 - Kettle Agile BI Plugin 1.0.2-stable: download now.

Stable

Pentaho Data Integration 4.2.0 stable
This is a stable build of Pentaho Data Integration (Kettle) 4.2.0. New features:

Excel Writer step has advanced output functionality to control the look and feel.
Graphical performance and progress feedback for transformations
Google Analytics step allows download of statistics from your Google Analytics account
Pentaho Reporting Output step makes it possible for you to run your (parameterized) Pentaho reports in a transformation. It allows for easy report bursting of personalized reports.
Automatic Documentation step generates (simple) doc of your transformations and jobs.
Get repository names step retrieves job and transformation information from your repositories.
LDAP Writer step
Ingres VectorWise (streaming) bulk loader step
Greenplum (streaming) bulk loader step (for gpload)
Talend Job Execution job entry
Health Level 7 (HL7): HL7 Input step, HL7 MLLP Input and HL7 MLLP Acknowledge job entries
PGP File Encryption, Decryption and Validation job entries
Single Threader step for parallel performance tuning of large transformations
Allow a job to be started at a job entry of your choice (continue after fixing an error)
MongoDB Input step (including authentication)
ElasticSearch bulk loader
XML Input Stream (StAX) step to read huge XML files at optimal performance and flat memory usage by flattening the structure of the data.
Get ID from Slave Server step allows multi-host or clustered transformations to get globally unique integer IDs from a slave server. See the wiki doc for more info.
Memory tuning of the logging back-end with KETTLE_MAX_LOGGING_REGISTRY_SIZE, KETTLE_MAX_JOB_ENTRIES_LOGGED and KETTLE_MAX_JOB_TRACKER_SIZE, allowing flat memory usage for never-ending ETL in general and jobs specifically.
Multiway Merge Join step (experimental) allows for any number of data sources to be joined using one or more keys using an inner or a full outer join algorithm.
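
The streaming idea behind the XML Input Stream (StAX) step listed above can be illustrated with the StAX API that ships with the JDK. This is only a minimal sketch of the technique, not the step's actual implementation: the class name `StaxFlattenSketch`, the `flatten` method and the "path=value" row format are assumptions made for illustration. The reader pulls one event at a time and hands each flattened row to a sink, so memory use depends on the nesting depth rather than on the file size.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class; not part of Kettle.
class StaxFlattenSketch {

    /** Streams the XML and passes one "/path/to/element=value" row per text node to rowSink. */
    static void flatten(Reader in, Consumer<String> rowSink) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Deliver adjacent character data as a single CHARACTERS event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        Deque<String> path = new ArrayDeque<>(); // names of the currently open elements
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        path.addLast(reader.getLocalName());
                    } else if (event == XMLStreamConstants.CHARACTERS && !reader.isWhiteSpace()) {
                        rowSink.accept("/" + String.join("/", path) + "=" + reader.getText().trim());
                    } else if (event == XMLStreamConstants.END_ELEMENT) {
                        path.removeLast();
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<orders><order><id>1</id><total>9.50</total></order></orders>";
        // Prints /orders/order/id=1 and /orders/order/total=9.50
        flatten(new StringReader(xml), System.out::println);
    }
}
```

Passing rows to a `Consumer` instead of collecting them in a list is what keeps memory flat in this sketch; a real pipeline would forward each row to the next processing stage.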

Carte improvements:

reserve next value range from a slave sequence service
allow parallel (simultaneous) runs of clustered transformations
list (reserved and free) socket reservations service
new options in XML for configuring slave sequences
allow time-out of stale objects using environment variable KETTLE_CARTE_OBJECT_TIMEOUT_MINUTES
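
The range reservation mentioned in the first item above is a common way to hand out globally unique IDs without a network call per row. The sketch below shows the general pattern only, under stated assumptions: `SequenceService` is an in-memory stand-in for the remote slave sequence service, and all class and method names are hypothetical rather than Carte's real API.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration classes; not part of Kettle or Carte.
class IdRangeSketch {

    /** In-memory stand-in for a remote sequence service that reserves ranges of IDs. */
    static final class SequenceService {
        private final AtomicLong next = new AtomicLong(1);

        /** Reserves size consecutive IDs and returns the first one. */
        long reserveRange(long size) {
            return next.getAndAdd(size);
        }
    }

    /** Hands out IDs locally and only contacts the service when its range is used up. */
    static final class RangeAllocator {
        private final SequenceService service;
        private final long rangeSize;
        private long current; // next ID to hand out
        private long limit;   // exclusive end of the reserved range

        RangeAllocator(SequenceService service, long rangeSize) {
            this.service = service;
            this.rangeSize = rangeSize;
        }

        synchronized long nextId() {
            if (current >= limit) { // range exhausted, or nothing reserved yet
                current = service.reserveRange(rangeSize);
                limit = current + rangeSize;
            }
            return current++;
        }
    }

    public static void main(String[] args) {
        SequenceService service = new SequenceService();
        RangeAllocator hostA = new RangeAllocator(service, 3);
        RangeAllocator hostB = new RangeAllocator(service, 3);
        System.out.println(hostA.nextId()); // 1, first ID of range 1..3
        System.out.println(hostB.nextId()); // 4, first ID of range 4..6
        System.out.println(hostA.nextId()); // 2, served locally without a new reservation
    }
}
```

A larger range size means fewer round trips to the service but larger gaps in the ID sequence when a host stops before using up its range.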

Repository Import/Export:

Export at the repository folder level
Export and Import with optional rule-based validations
Import command line utility allows for rule-based (optional) import of lists of transformations, jobs and repository export files. See the wiki doc for more info.

ETL Metadata Injection:

Retrieval of rows of data from a step to the “metadata injection” step
Support for injection into the “Excel Input” step
Support for injection into the “Row normaliser” step
Support for injection into the “Row Denormaliser” step

Many bug fixes. See the Release Notes for 4.2.0 for more info.

- Downloads

- Source

- Documentation

- Forum

In Development

Developer Resources
- Roadmap
- Sprint Homepage
- Open Issues
- Continuous Integration Builds
- Source
- Documentation
- Developer Forum

PDI 4.4.0 (platform Release 5.0 - Sugar) - In Progress
The primary goal of PDI version 4.3 is Ease of Management, with features for conducting Lifecycle Management along with significant improvements to Administration and Monitoring capabilities.
- Task Board for 4.4.0 GA
- Prod Management for 4.4.0
- JIRA Cases for 4.4.0
- Source (trunk)

Upcoming Training

Mastering Pentaho Data Integration
Pentaho BI Suite Bootcamp
See all Courses

Quick Links

- Frequently Asked Questions
- Online Documentation
- Matt's blog
- Case Studies
- Java API Examples
- Screenshots
- Recorded Demos
- Partners
- Get Support

Contribute to the Project

You can participate by contributing new code, reporting bugs, testing new releases, answering questions and more. Email us your proposed contribution and any other relevant details. Welcome to the team.

- Write a tech tip
- Report a bug in JIRA
- Answer posts on the forums
- Write some code
- How to Contribute