Extract, Transform, and Load (ETL) is a data warehousing process that uses batch processing to help business users analyze and report on data relevant to their business focus. The ETL process extracts data from a source, changes it according to pre-defined rules, and loads the transformed data into a database or BI platform. ETL tools are becoming increasingly popular in modern data warehouse architecture because both the volume of data and the variety of its structure are growing rapidly. A modern ETL solution must support importing a vast array of enterprise on-premises and web-based data sources into the cloud data warehouse. New data sources become available constantly, so modern ETL solutions need to be flexible, well maintained, and well tested. They need to handle schema changes as well as both structured and semi-structured data.
When it comes to ETL and open source, many solutions are offered by vendors that also sell enterprise products or services. There are nevertheless other open-source ETL tools maintained and operated by a community of developers, especially within the Apache Software Foundation ecosystem.
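To make the three stages concrete, here is a minimal sketch in plain Python, using only the standard library. The sample data, the `orders` table, and the cleaning rules are all hypothetical, chosen for illustration; a real pipeline would read from files, APIs, or databases and load into an actual warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data (an assumption for this sketch).
SOURCE_CSV = """order_id,customer,amount,currency
1,alice,10.50,usd
2,bob,,usd
3,carol,7.25,eur
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: apply pre-defined rules to each record."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # rule: skip records with a missing amount
        cleaned.append((
            int(row["order_id"]),
            row["customer"].title(),   # rule: normalize name casing
            float(row["amount"]),      # rule: amounts become numbers
            row["currency"].upper(),   # rule: currency codes in upper case
        ))
    return cleaned


def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_etl(text, conn):
    """Run the three stages in order as one batch."""
    load(transform(extract(text)), conn)


# An in-memory SQLite database stands in for the data warehouse.
conn = sqlite3.connect(":memory:")
run_etl(SOURCE_CSV, conn)
```

Keeping each stage in its own function mirrors how ETL tools separate connectors, transformation rules, and targets, so any one of them can be swapped without touching the others.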
Here are four leading open-source ETL tools that are ready for the enterprise:
- Apache Airflow: Apache Airflow is a platform that allows you to programmatically author, schedule, and monitor workflows. Users author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes tasks on an array of workers while following the specified dependencies. Airflow provides rich command-line utilities that make performing complex surgeries on DAGs simple. The user interface also lets users visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
- Apache NiFi: Apache NiFi is a system used to process and distribute data. It supports directed graphs of data routing, transformation, and system mediation logic. NiFi features a web-based user interface that lets users move seamlessly between design, control, feedback, and monitoring. It is highly configurable (dynamic prioritization, back pressure, flow modification at runtime) and is designed for extension. NiFi also offers multi-tenant authorization and internal authorization and policy management.
- Apache Kafka: Apache Kafka is a distributed streaming platform that enables users to publish and subscribe to streams of records, store streams of records, and process them as they occur. Kafka is most notably used for building real-time streaming data pipelines and applications. It runs as a cluster on one or more servers that can span more than one datacenter. The Kafka cluster stores streams of records in categories called topics, and each record consists of a key, a value, and a timestamp.
- Talend Open Studio: Provided as a packaged, out-of-the-box, ready-to-install platform, Talend Open Studio is one of the most widely used solutions. Note that this is enterprise-supported open-source software. The community of users is fairly large, but the community of contributors is small. Talend is a NASDAQ-listed company with more than USD 100 million in revenue.
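The idea of executing tasks "while following the specified dependencies", as described for Airflow above, can be sketched without Airflow itself. The example below uses Python's standard-library `graphlib` module (Python 3.9+) as a stand-in for a scheduler; it is not Airflow's API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_datasets": {"extract_orders", "extract_customers"},
    "load_warehouse": {"join_datasets"},
}


def run_pipeline(deps):
    """Run tasks in an order that respects the graph; return that order."""
    executed = []
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    while sorter.is_active():
        # Every task returned here has all of its dependencies finished,
        # so a real scheduler could hand these to workers in parallel.
        for task in sorted(sorter.get_ready()):  # sorted for a stable order
            executed.append(task)  # placeholder for actually running the task
            sorter.done(task)      # unblocks tasks that were waiting on it
    return executed


order = run_pipeline(dependencies)
```

In this sketch the two extract tasks become ready first, the join runs only after both have finished, and the load runs last. The acyclic requirement matters in practice: with a cycle, no valid execution order exists, which is why `prepare()` rejects such graphs.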
There are other open-source ETL software packages worth mentioning:
- Pentaho Data Integration (formerly Pentaho Kettle): Pentaho Data Integration is a set of open-source tools that allow you to manipulate data from various databases. It provides a graphical environment in which you describe what you want to do, not how you want to do it. As with Talend, note that this is enterprise-supported open-source software. Hitachi acquired Pentaho and appears to be continuing to invest in it. The community of users is fairly large, but the community of contributors is small.
- Scriptella: Scriptella is an open-source ETL and script execution tool written in Java and focused on simplicity. You just use SQL (or another scripting language suitable for the data source) to perform the required transformations.
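The SQL-driven style that Scriptella favors, where the transformation itself is written in SQL rather than in application code, can be illustrated as follows. This is not Scriptella's own script format; it is a rough Python sketch using the standard-library `sqlite3` module, and the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical staging table holding raw, untyped source data.
    CREATE TABLE staging_sales (region TEXT, amount TEXT);
    INSERT INTO staging_sales VALUES
        (' north ', '100.0'),
        ('NORTH', '50.5'),
        ('south', '20'),
        ('south', NULL);

    -- The whole transformation is expressed in SQL: normalize the region
    -- name, cast amounts to numbers, drop missing values, and aggregate.
    CREATE TABLE sales_by_region AS
    SELECT lower(trim(region)) AS region,
           SUM(CAST(amount AS REAL)) AS total
    FROM staging_sales
    WHERE amount IS NOT NULL
    GROUP BY lower(trim(region));
""")

result = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
```

Here `result` holds one row per normalized region, with the two "north" spellings merged into a single total. The host language only runs the script and moves data; all of the transformation rules live in the SQL.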
You may also hear about other ETL tools that are not listed above because they are deprecated or no longer open source:
- Apatar: Apatar was an open-source data integration and ETL tool written in Java, with powerful Extract, Transform, and Load capabilities. The software is no longer maintained; its last release dates from 2013.
- Enhydra Octopus and GeoKettle, a “spatially-enabled” version of Pentaho Data Integration (also known as Kettle), are no longer supported.
- CloverDX (formerly CloverETL) is no longer open source.