Big Data Preparation
Ingest, Manipulate, Integrate, Access, Model and Orchestrate
From ingesting and manipulating data to modeling, Pentaho decreases the time and complexity involved in preparing data for analytics. Pentaho weaves big data technologies like Hadoop and NoSQL with relational data warehouses, data marts, and enterprise applications to deliver integrated, analysis-ready data.
Simple visual tools to improve developer productivity
Pentaho includes a visual extract-transform-load (ETL) tool to load and process big data sources in the same familiar way as traditional relational and file-based data sources. Instead of writing Java programs or Pig scripts, less technical developers can design and develop big data jobs using visual tools, resulting in greater team productivity and efficiency.
Pentaho works with any semi-structured or unstructured data type; for example, it can parse web log and application log files to extract data that provides insight into customer behavior.
In addition, Pentaho's visual interface can call custom code, for example, to analyze image and video files and extract metadata that identifies people and places.
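To make the log-parsing case concrete, here is a minimal hand-coded sketch of the kind of logic such a job performs. It is not Pentaho code: the function names are hypothetical, the sample lines are made up, and it assumes the logs follow the Apache/NCSA Common Log Format.

```python
import re
from collections import Counter

# Assumption: input lines follow the Apache/NCSA Common Log Format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" size means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

def page_hits(lines):
    """Count successful (2xx) requests per path, skipping unparseable lines."""
    hits = Counter()
    for line in lines:
        record = parse_line(line)
        if record and 200 <= record["status"] < 300:
            hits[record["path"]] += 1
    return hits

# Illustrative sample data (invented for this sketch).
SAMPLE = [
    '203.0.113.7 - alice [10/Oct/2023:13:55:36 +0000] "GET /products HTTP/1.1" 200 2326',
    '203.0.113.7 - alice [10/Oct/2023:13:56:01 +0000] "GET /cart HTTP/1.1" 200 512',
    '198.51.100.2 - - [10/Oct/2023:13:57:12 +0000] "GET /products HTTP/1.1" 200 2326',
    '198.51.100.2 - - [10/Oct/2023:13:57:40 +0000] "GET /missing HTTP/1.1" 404 -',
    'not a log line',
]

if __name__ == "__main__":
    for path, count in page_hits(SAMPLE).most_common():
        print(path, count)
```

In a visual ETL tool, the pattern match, type conversion, and aggregation would typically each be a configurable step rather than hand-written code.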
Pentaho also provides visual data modeling capabilities, making it quick and easy to deliver an end-user friendly view of the data source.
Visual job orchestration
Pentaho provides a rich graphical design tool for orchestrating the execution of jobs in Hadoop, NoSQL and high performance analytic databases, as well as traditional data stores.
Orchestration capabilities include conditional checking steps, event waiting steps, execution steps and notification steps. Together these steps can be combined to enable easy visual assembly of extremely powerful job flow logic, across multiple jobs and data sources.
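As a rough illustration of how these four step types combine, the sketch below chains a conditional check, an event wait, an execution step, and a notification step in plain Python. All names here (`wait_for`, `run_flow`, and their parameters) are hypothetical and stand in for what the graphical designer would assemble; this is not Pentaho's API.

```python
import time

def wait_for(event_ready, timeout_s=5.0, poll_s=0.01):
    """Event-waiting step: poll until event_ready() is true or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if event_ready():
            return True
        time.sleep(poll_s)
    return False

def run_flow(precondition, event_ready, job, notify, timeout_s=5.0):
    """Chain the four step types and return a final status string."""
    # Conditional checking step: skip the flow if the precondition fails.
    if not precondition():
        notify("skipped: precondition not met")
        return "skipped"
    # Event waiting step: for example, wait for an input file to arrive.
    if not wait_for(event_ready, timeout_s=timeout_s):
        notify("failed: timed out waiting for event")
        return "timeout"
    # Execution step: run the job and capture any failure.
    try:
        result = job()
    except Exception as exc:
        notify(f"failed: {exc}")
        return "failed"
    # Notification step: report the outcome.
    notify(f"succeeded: {result}")
    return "succeeded"

if __name__ == "__main__":
    status = run_flow(
        precondition=lambda: True,
        event_ready=lambda: True,
        job=lambda: "42 rows loaded",
        notify=print,
    )
    print(status)
```

Passing the steps in as callables keeps the flow logic separate from what each step does, which is roughly the separation a visual job designer provides.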
Pentaho also integrates with Hadoop-native utilities such as Oozie, an open source workflow and coordination service for managing Apache Hadoop data processing jobs. This integration is key for companies that have already defined Oozie jobs but would like to migrate to a visual, no-programming environment like Pentaho.
Processing data volumes and varieties with speed
Pentaho has powerful and innovative capabilities for processing massive data volumes within constrained time windows, including:
- High performance data flow engine: With a multi-threaded parallel processing architecture and in-memory data caching, Pentaho Data Integration (PDI) provides a world-class, enterprise-scalable data integration platform ideal for handling the largest big data challenges.
- Cluster support: PDI may be deployed in a cluster, enabling distributed processing of jobs across multiple nodes in the cluster.
- Run as Hadoop MapReduce: Pentaho's small-footprint, Java-based data integration engine is unique in its ability to execute as a Hadoop MapReduce job, running on every node of a Hadoop cluster of any size, up to thousands of nodes. Pentaho's support for Hadoop's distributed cache makes deployment of Pentaho across the cluster automatic and seamless.
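For readers unfamiliar with the MapReduce model mentioned in the last item, the following is a minimal single-process sketch of its three phases (map, shuffle, reduce). It illustrates the general pattern only; it does not use Hadoop or Pentaho, and the record layout and function names are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

def status_mapper(record):
    """Emit (status, bytes) for a hypothetical (host, status, bytes) record."""
    _host, status, size = record
    yield status, size

def sum_reducer(_key, values):
    """Total the values collected for one key."""
    return sum(values)

# Illustrative records (invented for this sketch): (host, HTTP status, bytes sent).
RECORDS = [("a", 200, 100), ("b", 200, 250), ("a", 404, 0), ("c", 500, 40)]

def bytes_by_status(records):
    """Run all three phases in one process and return bytes per status code."""
    return reduce_phase(shuffle(map_phase(records, status_mapper)), sum_reducer)

if __name__ == "__main__":
    print(bytes_by_status(RECORDS))
```

On a real cluster, the map and reduce functions run in parallel on many nodes and the shuffle moves data across the network; the single-process version above only shows the data flow.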
Instant and interactive analytics
Pentaho provides immediate access to data inside Hadoop, NoSQL, or other big data stores, with interactive analysis, rich visualization, and data discovery.