Here are four leading open-source ETL tools ready for the enterprise:
- Apache Airflow: Apache Airflow is a platform that allows you to programmatically author, schedule, and monitor workflows. Users author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes those tasks on an array of workers while following the specified dependencies. Airflow provides rich command-line utilities that make performing complex surgeries on DAGs simple. The user interface lets users visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
- Apache NiFi: Apache NiFi is a system used to process and distribute data. It supports directed graphs of data routing, transformation, and system mediation logic. NiFi features a web-based user interface that lets users move between design, control, feedback, and monitoring. It is highly configurable (dynamic prioritization, back pressure, flow modification at runtime) and is designed for extension. NiFi also offers multi-tenant authorization and internal authorization and policy management.
- Apache Kafka: Apache Kafka is a distributed streaming platform that enables users to publish and subscribe to streams of records, store streams of records, and process them as they occur. Kafka is most notably used for building real-time streaming data pipelines and applications. It runs as a cluster on one or more servers that can span more than one data center. The Kafka cluster stores streams of records in categories called topics, and each record consists of a key, a value, and a timestamp.
- Talend Open Studio: Provided as a packaged, out-of-the-box, ready-to-install platform, Talend Open Studio is one of the most widely used solutions. Please note that this is enterprise-supported open-source software. The community of users is fairly large, but the community of contributors is small. Talend is a NASDAQ-listed company with more than USD 100 million in annual revenue.
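To make the DAG idea from the Airflow entry concrete, here is a minimal sketch in plain Python. It does not use Airflow's API: the standard-library `graphlib` module stands in for the scheduler, and the names `run_pipeline`, `tasks`, and `dependencies` are hypothetical, chosen only for illustration. It shows tasks being executed once each, in an order that respects the declared dependencies.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, dependencies):
    """Run every task once, only after all of its upstream tasks have run.

    tasks:        {task_name: callable taking a dict of upstream results}
    dependencies: {task_name: set of upstream task names}
    """
    # TopologicalSorter raises graphlib.CycleError if the graph is not acyclic.
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
    return order, results


# A three-step extract -> transform -> load chain (illustrative data only).
tasks = {
    "extract": lambda up: [" alice ", "BOB", "carol"],
    "transform": lambda up: [s.strip().title() for s in up["extract"]],
    "load": lambda up: len(up["transform"]),
}
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

order, results = run_pipeline(tasks, dependencies)
print(order)                 # ['extract', 'transform', 'load']
print(results["transform"])  # ['Alice', 'Bob', 'Carol']
```

A real scheduler adds what this sketch leaves out, such as running independent tasks in parallel on workers, retries, and scheduling by time, but the dependency-ordering principle is the same.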
There are other open-source ETL software packages worth mentioning:
- Pentaho Data Integration (formerly Pentaho Kettle): Pentaho Data Integration is a set of open-source tools that allow you to manipulate data from various databases. It provides a graphical environment in which you describe what you want to do, not how you want to do it. As with Talend, please note that this is enterprise-supported open-source software. Hitachi acquired Pentaho and seems to continue to invest in it. The community of users is fairly large, but the community of contributors is small.
- Scriptella: Scriptella is an open-source ETL and script execution tool written in Java and focused on simplicity. You just use SQL (or other scripting languages suitable for the data source) to perform the required transformations.
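The SQL-first approach described for Scriptella can be illustrated with a short sketch. This is not Scriptella itself and does not reproduce its script format; it is a Python example using the standard-library `sqlite3` module, with hypothetical table and column names (`raw_orders`, `customer_totals`), to show a transformation expressed entirely in SQL.

```python
import sqlite3


def run_sql_etl(rows):
    """Load raw rows, transform them with a single SQL statement, return the result."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        # Extract: stage the raw input as-is.
        conn.execute("CREATE TABLE raw_orders (customer TEXT, amount_cents INTEGER)")
        conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)

        # Transform + load: normalize customer names and aggregate, all in SQL.
        conn.execute(
            """
            CREATE TABLE customer_totals AS
            SELECT UPPER(TRIM(customer)) AS customer,
                   SUM(amount_cents) / 100.0 AS total
            FROM raw_orders
            GROUP BY UPPER(TRIM(customer))
            """
        )
        return conn.execute(
            "SELECT customer, total FROM customer_totals ORDER BY customer"
        ).fetchall()
    finally:
        conn.close()


totals = run_sql_etl([(" alice", 1250), ("Alice ", 750), ("bob", 300)])
print(totals)  # [('ALICE', 20.0), ('BOB', 3.0)]
```

Keeping the transformation in SQL means the logic stays close to the data source and can be read by anyone who knows the database, which is the simplicity argument made above.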
You may also hear about other ETL tools that are not listed above because they are deprecated or no longer open source:
- Apatar: Apatar was an open-source data integration and ETL tool written in Java, with powerful Extract, Transform, and Load capabilities. The software is no longer maintained; the last release dates from 2013.
- Enhydra Octopus and GeoKettle (a “spatially-enabled” version of Pentaho Data Integration, also known as Kettle) are no longer supported.
- CloverDX (formerly CloverETL) is no longer open source.