Extract, Transform, and Load (ETL) is a data warehousing process that uses batch processing to help business users analyze and report on data relevant to their business focus. The ETL process extracts data from the source, changes it according to pre-defined rules, and loads the transformed data into a database or BI platform. ETL tools are becoming increasingly popular in modern data warehouse architectures because the volume of data, as well as the variety of its structure, is growing rapidly.
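The three steps can be sketched as a small batch job. This is only an illustration using Python's standard library, not how any of the tools listed here work internally; the sample data, the `orders` table, and the transformation rules are all assumptions made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real job would read from a file, API, or database.
SOURCE_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,not_a_number
3,carol,7.25
"""


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: apply pre-defined rules (title-case names, parse amounts)."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # assumed rule: skip rows whose amount cannot be parsed
        clean.append((int(row["order_id"]), row["customer"].strip().title(), amount))
    return clean


def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(text, conn):
    """Run one batch: extract, then transform, then load."""
    load(transform(extract(text)), conn)


# An in-memory SQLite database stands in for the warehouse or BI platform.
conn = sqlite3.connect(":memory:")
run_etl(SOURCE_CSV, conn)
result = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
```

Keeping each step in its own function means the rules in `transform` can be tested without touching the source or the target, which is roughly the separation that graphical ETL tools expose as distinct components.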
A modern ETL solution must support importing a vast array of enterprise on-premise and web-based data sources into the cloud data warehouse. New data sources become available constantly, so modern ETL solutions need to be flexible, well maintained, and well tested. They need to handle schema changes as well as both structured and semi-structured data.
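As a rough illustration of what handling schema changes and semi-structured data can mean in practice, the sketch below flattens nested JSON records and conforms them to a fixed column list. The column names, the dotted-path convention, and the helper functions `flatten` and `conform` are hypothetical choices for this example, not features of any particular tool.

```python
import json

# Hypothetical target columns expected by the warehouse table.
TARGET_COLUMNS = ["id", "name", "address.city"]


def flatten(record, prefix=""):
    """Turn nested dictionaries into a flat dict keyed by dotted paths."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=path + "."))
        else:
            flat[path] = value
    return flat


def conform(record, columns):
    """Map a record onto the target columns.

    Missing fields become None, and fields the target does not know about
    are reported so that schema drift is noticed rather than silently lost.
    """
    flat = flatten(record)
    row = {col: flat.get(col) for col in columns}
    unexpected = sorted(set(flat) - set(columns))
    return row, unexpected


# A semi-structured record with a nested object and one new field ("tier").
raw = '{"id": 1, "name": "Alice", "address": {"city": "Paris"}, "tier": "gold"}'
row, unexpected = conform(json.loads(raw), TARGET_COLUMNS)

# A sparse record from an older source version, lacking most fields.
sparse_row, sparse_unexpected = conform({"id": 2}, TARGET_COLUMNS)
```

Here `unexpected` lists the new `tier` field, while the sparse record is padded with `None` values. A production pipeline would then decide whether to add a column, route the record elsewhere, or raise an alert.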
When it comes to ETL and open source, many solutions are offered by vendors that also sell enterprise products or services. There are nevertheless other open source ETL tools maintained and operated by a community of developers, especially within the Apache Software Foundation ecosystem.
Here is our list of the leading ETL tools ready for the enterprise:
Talend Open Studio: Provided as a packaged, out-of-the-box, ready-to-install platform, Talend Open Studio is one of the most widely used solutions. Please note that this is enterprise-supported open source software. The community of users is fairly large, but the community of contributors is not. Talend is a NASDAQ-listed company with revenue of more than USD 100 million.
Pentaho Kettle: Pentaho Kettle is a set of open source tools that allow you to manipulate data from various databases. It provides a graphical user environment in which you describe what you want to do, not how you want to do it. As with Talend, please note that this is enterprise-supported open source software: the community of users is fairly large, but the community of contributors is not.
Other open source ETL software is worth mentioning:
Apatar is an open source data integration and ETL tool written in Java, with powerful Extract, Transform and Load capabilities, that enables anyone to join their on-premise data sources with the Web without coding.
Enhydra Octopus is an advanced relational Extraction/Transformation/Loading tool. It can connect to JDBC data sources and perform transformations defined in XML definitions. JDBC drivers for CSV and XML are included. Octopus supports Ant and JUnit.
Please note that CloverDX (formerly CloverETL) is no longer open source.