Extract, Transform and Load: ETL
Market instability —increasingly volatile due to globalisation, diversification, expansion, etc.— makes ETL processes more necessary than ever. Why? The answer is simple: ETL processes are essential for businesses to make data-driven decisions instead of basing their business strategies on arbitrary decisions or assumptions. Thus, applying an ETL process means gaining control over data.
ETL processes have grown in importance in the business world alongside the rise of Big Data. Companies today accumulate large amounts of data that are necessary to carry out business operations but which, if not managed well, can end up causing an overabundance of useless information or an overload for those in charge of analysing it. In addition, unintegrated and poorly managed data can be a cost overhead for the business, while the opposite turns it into a business asset.
What is Extract, Transform and Load (ETL)?
If you work in a modern company, you have probably heard of ETL, although perhaps you are still not clear about what exactly it is. ETL (Extract, Transform, Load) is, in short, a process of storing, processing and managing data. The process allows the transfer of data from multiple sources to a single place, whether it is a data warehouse, a data lake, a relational or non-relational database...
The acronym stands for Extract, Transform and Load, which are basically the phases of the process.
Differences between ETL and SSIS
It is common for people to confuse ETL and SSIS, assuming that they fulfil different functions or that they are similar processes with minor differences. In reality, SSIS is a Microsoft SQL Server ETL tool that, in addition to the ETL process, offers other functions such as merging, cleansing or aggregating data, and includes design tools, workflow functions, graphical tools, etc. In addition, SSIS simplifies the transition of data from one database to another and can extract data from a large number of data sources: SQL Server databases, Excel files, Oracle and DB2 databases, etc. As a Microsoft Power BI partner, Bismart works with this tool.
While ETL refers to the process of extracting, transforming and loading data (Extract, Transform and Load), SSIS is a tool that does just that: extracting, transforming and loading data from multiple sources into a single database such as a data warehouse.
More precisely, what ETL actually does is to read data from another database or from multiple data sources and convert it through lookup tables, applying rules or combining existing data with other data. Thus, the process ensures that the data meets the requirements set by the company or by the client of the process, to finally store it in a data warehouse. The process thus ensures the relevance, usefulness, accuracy, quality and accessibility of the data by its consumers.
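As a rough illustration of this kind of rule-based, lookup-driven transformation, here is a minimal Python sketch; the country codes, field names and rejection rule are invented for the example and do not correspond to any particular tool.

```python
# Minimal illustration of a lookup-table transformation (hypothetical data).
# A raw record's country code is replaced by a business-friendly value
# before the record is accepted into the warehouse.

country_lookup = {"ES": "Spain", "FR": "France", "DE": "Germany"}

raw_rows = [
    {"customer_id": 1, "country_code": "ES", "amount": "120.50"},
    {"customer_id": 2, "country_code": "XX", "amount": "80.00"},  # unknown code
]

def transform(row):
    """Apply a lookup and a simple business rule to one extracted row."""
    country = country_lookup.get(row["country_code"])
    if country is None:
        return None  # rule: reject rows that do not meet the requirements
    return {"customer_id": row["customer_id"],
            "country": country,
            "amount": float(row["amount"])}

clean_rows = [t for t in (transform(r) for r in raw_rows) if t is not None]
print(clean_rows)
```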
As we have already seen, SSIS refers to SQL Server Integration Services and its use extends to a large number of procedures related to the movement and migration of data assets, beyond the ETL process. SSIS is basically a data integration tool designed to solve business needs and workflow applications. Among its multiple systems, SSIS has a specific tool to perform ETL processes.
To achieve data integration and carry out the ETL process, SSIS follows these steps:
- SSIS starts from an operational data warehouse. That is, a type of database created to integrate data on which additional operations have to be performed.
- The process of extracting, transforming and loading data (ETL) is carried out.
- The data warehouse collects data from multiple data sources so that it can be used profitably.
- Finally, the data is stored in the data warehouse, where data from various sources can be managed to meet business needs and aid in decision making.
However, SSIS has the capability to perform other processes in addition to ETL. For example, the suite of systems allows data cleansing, aggregation and merging. It also simplifies moving data from one system or source to another and can connect to a large number of APIs to extract data from SQL Server, Oracle, DB2, Excel files, etc.
Beyond data migration, SSIS has graphical tools and can perform workflows such as FTP operations, send e-mails, etc.
It is clear, then, that talking about the differences between ETL and SSIS is not entirely accurate, since ETL is a process and SSIS is a Microsoft tool used to perform that process.
Stages of the ETL process
1. Extraction: Extraction is the first phase of an ETL process and consists of extracting the data from the systems and applications where it is located. In this phase, the data are converted into a single format and cleansed. In other words, all data are made compatible with each other so that they can be interpreted as a whole and those that are unnecessary are eliminated and errors are corrected.
2. Transformation: In this stage the data is transformed to match the structure we have determined in our data warehouse. That is to say, in a data warehouse we can group the data by freely chosen concepts -according to our business areas, the departments of the company or the use we are going to give them, for example-. In addition, in this phase the data is technically validated, normalised, and filters, joins and aggregations are applied.
3. Load: As its name suggests, in this phase the transformed data are loaded into the data warehouse where they are recorded forever - the data warehouse is a non-volatile system that prevents the loss or alteration of the information. In addition, the data warehouse is updated as new data is added and is the best way to obtain a historical record of all the company's information.
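To make the three stages concrete, here is a minimal sketch using only the Python standard library; the file, table and column names are placeholders, and a real ETL tool would of course handle far more sources, validations and volumes.

```python
# Sketch of the three ETL stages with the Python standard library.
# File, table and column names are placeholders.
import csv
import sqlite3

# 1. Extraction: read raw records from a source system (here, a CSV export).
with open("sales_export.csv", newline="") as f:
    raw = list(csv.DictReader(f))

# 2. Transformation: normalise types and filter out invalid rows.
clean = [
    (r["order_id"], r["customer"].strip().upper(), float(r["amount"]))
    for r in raw
    if r["amount"] not in ("", None)
]

# 3. Load: write the consolidated rows into the warehouse table.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean)
conn.commit()
conn.close()
```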
Carrying out an ETL process is not an easy task and, in fact, it can be very complex. Therefore, it is necessary to have the right knowledge to choose the right tool and type of processing depending on the amount of data available, the size and sector of the organisation, the operations to be performed, etc.
ETL processing
There are three possible ETL processing methods that are not mutually exclusive, i.e. they can be combined in one and the same ETL.
- Data processing: A data file is divided into different files so that the information can be consulted in parallel.
- Pipeline (segmentation) processing: This processing supports the simultaneous operation of several components in the same file. In other words, operations can be performed in columns 1, 2 and 3 in parallel, thus saving a lot of time and achieving results more quickly.
- Component processing: In this case, simultaneous operation occurs in multiple processes that may be in different files, but in a single data stream.
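As a simplified illustration of the data-partitioning idea described in the first point, the following Python sketch splits a data set into chunks and transforms them in parallel; the data and the transformation are invented for the example.

```python
# Sketch of the data-partitioning approach: a large extract is split into
# chunks that are transformed in parallel. Chunk contents are hypothetical.
from multiprocessing import Pool

def transform_chunk(chunk):
    # Each worker applies the same transformation to its own partition.
    return [value * 2 for value in chunk]

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i:i + 250_000] for i in range(0, len(data), 250_000)]
    with Pool(processes=4) as pool:
        results = pool.map(transform_chunk, chunks)
    transformed = [v for part in results for v in part]
    print(len(transformed))
```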
What features should an ETL have?
An ETL process should meet a number of requirements in order to work properly:
- Adaptability: An ETL must be able to connect to multiple systems that include data from different data structures such as relational and non-relational databases, XML, single files in multiple formats, ERP, CRM and SCM applications, web services such as Google Analytics, emails, office automation platforms, etc.
- Ability to deliver data: It must also have the ability to contribute data to other applications.
- Data transformation capability: Data transformation is one of the key phases of an ETL process. An optimal ETL should be able to perform basic transformations (simple calculations or type conversion), intermediate transformations (aggregations, sums, etc.) and complex transformations such as text analysis (a brief sketch of these three levels follows this list).
- Metadata and data modelling capabilities: An ETL must be able to retrieve original data models, create and maintain models, map from physical to logical models, synchronise metadata changes and have an open metadata repository.
- Design capability and development environment: The ETL must have competencies to graphically represent repository objects, support for testing and debugging, manage workflows, etc.
- Data management capability: i.e. validating data quality, profiling data, having data mining capability, etc.
- Adaptability to other platforms: ETL must adapt to the other platforms the company works with.
- Capacity for administrative operations such as monitoring data integrations, handling errors, executing security controls, etc.
- Architecture and integration capacity: An ETL must guarantee interoperability between the different elements that make up the data integration tool.
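The sketch below illustrates, with invented sample data, the three levels of transformation mentioned in the data transformation capability point: a basic type conversion, an intermediate aggregation and a very rough text analysis.

```python
# Illustrative transformations at the three levels mentioned above
# (sample data is invented for the example).
from collections import Counter

orders = [
    {"id": "1", "amount": "19.99", "comment": "fast delivery, great service"},
    {"id": "2", "amount": "5.50",  "comment": "late delivery"},
]

# Basic: type conversion and a simple calculation.
for o in orders:
    o["amount"] = float(o["amount"])
    o["amount_with_vat"] = round(o["amount"] * 1.21, 2)

# Intermediate: an aggregation over the whole set.
total_revenue = sum(o["amount"] for o in orders)

# Complex: a very rough text analysis (word frequency in free-text comments).
word_counts = Counter(word for o in orders for word in o["comment"].split())

print(total_revenue, word_counts.most_common(3))
```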
When do we need an ETL process?
Today, virtually every company requires an optimal data integration process such as the ETL process. This process establishes a business decision support system and makes data accessible from anywhere and at any time.
Furthermore, the great advantage of the ETL process is that it ensures that data is always clean, consolidated and ready for use. The repository where the data is stored after the ETL process, usually a data warehouse, is automatically updated as new data is added. In addition, the process provides a historical data repository: it stores all historical data without the possibility of loss and allows analysts to make full temporal comparisons, analyse different time periods, discover temporal trends and even predict future trends.
ETL maximises the value of our data assets, promotes data quality and assists decision makers in getting accurate answers to key business questions. In addition, by integrating a wide variety of data from multiple sources, ETL minimises production system processing, reducing the turnaround time required for analysis and reporting.
The primary purpose of an ETL process is to make data ready to guide the decision-making process. ETL builds and loads a data warehouse with consolidated, complete, reliable and useful data. Unnecessary, erroneous or ineffective data is not stored in the data warehouse, which is also a significant cost saving for the organisation.
In addition to cost savings, we can even use ETL to generate revenue from our data assets. How? Let's look at an example. Let's imagine that a hotel manager needs to collect data on average occupancy and room rates for each room in his hotel. Using the ETL process and other business intelligence tools, we can discover the aggregated revenue of each room and, for example, find statistics on the overall market share. By gathering this data, the hotel manager can gauge his position with respect to various markets in the sector, analyse the evolution of the trend over time and, consequently, decide to offer strategic discounts on his room rates.
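As a purely hypothetical illustration of the hotel example, the following Python snippet aggregates revenue per room from a handful of invented bookings and compares it against an assumed market figure; none of the numbers are real.

```python
# Hypothetical illustration of the hotel example: aggregate revenue per room
# from occupancy records, then compare against an assumed market figure.
bookings = [
    {"room": "101", "nights": 3, "rate": 95.0},
    {"room": "101", "nights": 2, "rate": 110.0},
    {"room": "102", "nights": 5, "rate": 80.0},
]

revenue_per_room = {}
for b in bookings:
    revenue_per_room[b["room"]] = revenue_per_room.get(b["room"], 0) + b["nights"] * b["rate"]

total_revenue = sum(revenue_per_room.values())
assumed_market_revenue = 50_000.0  # placeholder market total, not a real figure
market_share = total_revenue / assumed_market_revenue

print(revenue_per_room, f"estimated share: {market_share:.2%}")
```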
How to choose the right ETL tool?
Today, there are a large number of tools specifically dedicated to implementing ETL processes, but obviously not all of them have the same capabilities or are equally efficient for all companies. To ensure that the ETL process operates effectively, it is essential to know how to choose the right ETL tool for your company.
First of all, when choosing an ETL tool, we must ensure that the tool meets all the requirements —explained above— that an ETL tool should have according to Gartner.
Once we are clear about the capabilities the tool should have, we can move on to consider the different types of ETL tools that exist. It is essential to bear in mind that the effectiveness of the tool will depend on the characteristics of the company, the amount of data it works with, and the uses it wants to make of that data. In other words, all types of ETL tools have positive aspects; the key is to choose the type of tool that best fits our organisation and the purpose of our data.
ETL tools can be classified into the following categories:
- ETL Enterprise: They are tools that have a large number of functionalities and have the capacity to connect a wide variety of data sources in different formats. On the other hand, due to their capabilities, these tools tend to be more expensive than others and are therefore ideal for large corporations.
- ETL Open Source: They are open source ETL tools and, as such, free of charge. These types of tools are usually created from a general approach and, therefore, in order to adapt them to the particular needs of a company, they require customisation processes that are usually complex and lengthy. Furthermore, these customisation processes must be carried out by experts in the process who are usually not part of the company's team, which means that this type of tool has a high added cost despite being free of charge.
- Customized ETL: Within this category, we find tools developed by companies on a customised basis. This type of customised processes usually perfectly meet the specific objectives and needs of the organisations that design them but, on the other hand, the development process is difficult, involves a great deal of effort, is usually very costly and, obviously, also requires experts in the field.
- ETL Cloud: In this group we find the ETL services integrated in the cloud of corporations such as Google, Microsoft or Amazon. These services are more flexible than the other ETL tools and have other advantages, such as the fact that the implementation and start-up of the process is usually much faster than in the other cases.
Once we are familiar with the different types of ETL tools, we can begin to assess the characteristics of each one and reflect on what to base our choice of the optimal tool on.
What should we look out for?
When choosing an ETL tool, it is essential to look at least at the following aspects:
- Price: Taking into account the acquisition cost (the price of the tool itself), as well as the added costs, among which we find the costs of consultancy, support, employee training, etc. As we have already mentioned, some ETL tools have a higher acquisition cost than others and, on the other hand, some of them usually involve more additional costs than others.
- Usability: When choosing an ETL tool, it is essential to consider who is going to use it and for what purpose. Some tools are easier to use than others and each type of tool requires specific technical knowledge. Some can only be used by experts and others, through appropriate training, can be used by less technical users.
- Compatibility: It is a basic requirement that the ETL tool is compatible and can be connected to all systems, software, applications and devices in the company. Otherwise, the ETL process will not be fully implemented and the acquisition of the tool will not have made much sense. Therefore, it is necessary to get a tool that is compatible with the company's existing processes, devices and ways of working.
- Speed: ETL tools run at different speeds, which primarily depend on the computing capacity of the tool and the amount of data we need to extract, transform and load. As we have already mentioned, large corporations that work with Big Data usually opt for ETL Enterprise tools.
- Data quality: In addition to loading and extracting data, one of the basic functions of ETL is to transform, validate and consolidate our data. Therefore, the ETL tool we choose must have the capacity to perform these functions and to guarantee that our data is of high quality, as well as to ensure its governance.
- Management and control: Finally, it is desirable that our tool has built-in control systems to identify problems, errors and inefficiencies in the data so that they can be corrected.
Market leading tools according to Gartner
The technology consultancy Gartner publishes a report every year —the Gartner Magic Quadrant— in which the best technological tools are ranked according to different functionalities and areas of action. The quadrant classifies tools, platforms and APIs into four categories: Challengers, Leaders, Niche players and Visionaries.
Looking at the latest report on data integration tools published by Gartner in 2020, we see that the leading ETL tools according to the consultancy are Informatica, IBM, SAP, Oracle, SAS, Microsoft Azure, Qlik, Talend and TIBCO. The full Gartner Magic Quadrant for Data Integration Tools is shown below.
Advantages of an ETL process
Having an ETL process in place is a competitive advantage for any company, as it allows a company's employees and managers to access and work with data quickly and easily, even if they are not experts or do not have technical skills.
Furthermore, ETL is a basic element to ensure data integration, data management and data governance which, in turn, contribute to better strategic decision making.
Likewise, the ETL process is closely related to data quality, that is, to the quality analysis of our data, since when data goes through the process it is validated and cleansed, errors are corrected, etc. In other words, implementing an ETL process ensures that our data has the desired quality, which will allow us to make better business decisions, avoid operational errors, reduce the cost of data repair and free the data management team from unnecessary tasks.
The following is an overview of the main advantages of implementing an ETL process in a company:
- Data-driven decisions: ETL is one of the key processes that fosters a data-driven culture and enables data-driven decision making (DDDM). Thus, implementing an ETL process can be the impetus your company needs to make better business decisions based on consolidated, high-quality data collected with specific business objectives in mind.
- Big Data capabilities: The ETL process enables organisations to manage large amounts of data.
- Data integration: ETL involves the collection and integration of data in disparate formats and from multiple data sources in a single place. Thus, the process facilitates the integration of all the organisation's data and helps the company gain a complete and more reliable view of its business activity and situation.
- Data quality: As mentioned above, the ETL process ensures data quality, as the process itself involves data cleansing and transformation. By carrying out the ETL process, erroneous, inconsistent, redundant and unnecessary data are eliminated and useful data are validated and consolidated.
- Intelligibility: ETL has the ability to convert the most unstructured and complex data into information that is easily understandable, useful and ready to be analysed in data analysis applications such as Power BI. At Bismart, as a Microsoft Power BI partner, we use this tool to develop attractive, understandable and interactive reports and visualisations.
- Automation: One of the great benefits of ETL processes is that they automate processes, eliminate unnecessary and time-consuming tasks and allow data scientists and engineers to focus on more productive activities.
ETL Cloud Service
One of the newest trends related to ETL is its progressive development in the cloud: ETL Cloud Service. More and more companies are opting to carry out their ETL processes in cloud environments rather than on local servers. This is confirmed by an IDG study published in 2020, which shows that 81% of companies already have one or more applications and part of their infrastructure in the cloud. The report also highlights that of those companies, 92% have part of their IT environments in the cloud.
ETL's move to the cloud is no coincidence and is related to the evolution of the world towards the digital environment. As in other spheres of life, businesses today store the vast majority of their data assets, tools, services and software in the cloud.
As we explained in a previous post where we explained the benefits of cloud integration, acquiring or developing a cloud-based ETL process offers certain advantages such as greater speed or the fact that the installation and implementation of ETL is much faster. In addition, the cloud environment allows real-time streaming, favours integration and scalability, saves money and is a way to ensure the security of the company's data, since companies that offer cloud environments for ETL processes must constantly review, reinforce, strengthen and renew their security systems.
Differences between on-premise ETL and Cloud ETL
The ETL process differs depending on the environment in which it takes place. In an ETL process carried out on a local server, data is extracted from a local data source and, after transformation, loaded into another local server or local data warehouse. This type of warehouse or server is usually physically located within the company's own office.
ETL Cloud Service, or the ETL process carried out in the cloud, performs exactly the same function as local ETL and involves the same steps. The only difference, then, is that both the source data and the data warehouse where the consolidated data is finally loaded reside in the cloud.
This difference, however, conditions the process and implies that an on-premises ETL process and a cloud ETL process are carried out in somewhat different ways.
In the cloud environment, the process is developed from shared computing clusters that operate as independent entities and are located in different parts of the world. Thus, computing processes are developed through workspaces in cloud environments through systems such as, for example, Data Factory. These processes achieve higher levels of connectivity between data sources than local ETL and enable graphical management of the data flow through interfaces that connect the sources of origin with the data destination stores.
In addition to greater capabilities, speed and connectivity, the tools and systems involved in the cloud ETL process effectively address most of the limitations of the more traditional ETL process. The main problems the cloud solves are the high cost of data warehouses and physical servers, and the risk of losing all the information collected and consolidated whenever a technical failure occurs. Moreover, the cloud environment dispenses with tasks that are necessary in an ETL process carried out on a local server: updates, bug fixes, maintenance, etc.
All of this makes speed the biggest benefit of running an ETL process in the cloud. Organizations that still run their ETL processes on local servers are at a disadvantage compared to companies that have moved to the cloud, as they will hardly be able to compete with the agility and dynamism of the cloud service.
Another important advantage of the cloud is that it allows for greater scalability. The process, as it is carried out without requiring any type of installation or hardware, allows companies to expand their resources when they need them, without the need for large investments or paying for material that they do not use. This type of automatic expansion also saves time and money, since in the cloud companies only pay for the processing capacity and space they use. In contrast, on a local server, it is extremely difficult to adapt the size and capacity of the server to the specific needs of the business at any given time. Business needs are constantly evolving and scalability is a requirement for any company that intends to focus on expansion and growth.
Ultimately, the most noticeable and appreciated benefit of cloud ETL is the superior level of speed. The cloud speeds up computing tasks and favors the optimal and agile development of business intelligence activities, since local environments are more likely to stall as the volume of data organizations work with grows rapidly. Furthermore, the cloud environment connects to both local environments and other cloud-hosted services.
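As an illustrative sketch of the extract step in a cloud environment, the snippet below reads a raw export from Azure Blob Storage using the azure-storage-blob SDK; the connection string, container and blob names are placeholders, and a production pipeline would typically be orchestrated by a service such as Data Factory rather than a standalone script.

```python
# Sketch of the extract step against cloud object storage, using the
# azure-storage-blob SDK. Connection string, container and blob names
# are placeholders for illustration only.
from azure.storage.blob import BlobServiceClient

conn_str = "<your-storage-account-connection-string>"  # placeholder
service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client("raw-data")

# Download one raw export; in a real pipeline this feeds the transform stage.
downloader = container.download_blob("exports/sales_2023.csv")
raw_bytes = downloader.readall()
print(f"extracted {len(raw_bytes)} bytes from cloud storage")
```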
Advantages and disadvantages of ETL in the cloud
As we have already seen, as technology, digitization and the evolution of Internet capabilities have progressed, companies have been moving their ETL processes to the cloud environment. Along these lines, IDG announced in 2018 that 38% of organizations acknowledged that their IT departments were asking to move their entire IT infrastructure to the cloud.
As we have already seen, the reasons are manifold. The most prominent advantage of ETL Cloud Service is undoubtedly the increased speed. The computational tasks required by any ETL process are performed much faster in the cloud. In fact, on-premises servers, business intelligence processing and activities can be interrupted as a company's data assets expand and grow. Most on-premises servers have limited capacities and, when the time comes, they may no longer perform optimally. In the cloud, on the other hand, space can easily scale and organizations can increase capacity and processing when required. In addition, the cloud environment is far more flexible than the on-premises environment in that companies can pay only for what they use.
ETL Cloud: Data warehouse automation
Today, the data warehouse lifecycle can be automated using state-of-the-art technology based on sophisticated design patterns and processes that automate data warehouse planning, modeling and data integration. Automation serves to avoid time-consuming tasks such as ETL code generation in a database.
The cloud-based ETL process can be applied to automate data warehousing. A data warehouse is a type of database specifically designed to facilitate the storage, filtering and analysis of large amounts of data and to allow simultaneous and cross-referenced querying and analysis of data, without the need to combine and consolidate information from multiple data sources.
Data warehouse automation steps
In a traditional data warehouse, data goes through three phases:
- Relational database (OLTP): In this phase data is extracted from relational databases using SQL scripts. Prior to their transfer, the data are cleansed to avoid the inclusion of inaccurate information or data containing errors. At this stage, the data is based on a relational model between entities and is used for online transactional processing.
- Analytical Data Warehouse (OLAP): During this stage, transactional data is modeled and transformed using star or snowflake schemas. It is then transferred to an online analytical processing (OLAP) server, using a multidimensional or relational OLAP data model. This whole process is carried out to structure, sort and simplify the data so that its consumers can query and report on it. Once transformed, the data is loaded into a data warehouse, where the ETL process is completed.
- Data analysis and reporting: After the ETL process, the data is usually integrated into analysis and business intelligence platforms where data scientists explore, analyze and transform it into corporate reports or data visualizations.
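The following sketch gives a rough idea of the OLTP-to-OLAP step: a few invented transactional rows are reshaped into a minimal star schema (one dimension table and one fact table) and then queried analytically. Table and column names are illustrative only.

```python
# Rough sketch of the OLTP-to-OLAP step: transactional rows are reshaped
# into a small star schema (one dimension table, one fact table).
# Table and column names are invented for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, sale_date TEXT, amount REAL);
""")

transactions = [("Laptop", "2023-01-10", 900.0),
                ("Laptop", "2023-01-11", 950.0),
                ("Mouse",  "2023-01-10", 25.0)]

product_ids = {}
for name, sale_date, amount in transactions:
    if name not in product_ids:
        cur = conn.execute("INSERT INTO dim_product (name) VALUES (?)", (name,))
        product_ids[name] = cur.lastrowid
    conn.execute("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 (product_ids[name], sale_date, amount))

# A typical analytical query over the star schema.
for row in conn.execute("""
    SELECT p.name, SUM(f.amount) FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id GROUP BY p.name"""):
    print(row)
```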
If we use data warehouse automation software to perform this process, we can aggregate and move the disparate data directly from the source data sources into a single data warehouse. In addition, automation does not require code. The software automates the ETL code deployment and batch execution of the warehousing process and offers a seamless approach based on agile methodologies.
The automation software performs a wide variety of functionalities, including:
- Data modeling
- Data integration
- ETL and ELT processes
- Inclusion of denormalized, normalized and multidimensional structures.
- Connection with data providers and other BI tools
ETL to increase the value of a data warehouse
Performing an ETL process enables you to maximize the value of your data warehouse. A data warehouse simply acts as the place where data is stored. Business intelligence tools, on the other hand, serve to perform analysis on the data once it has been transformed. ETL is the intermediate process that prepares all data, wherever it comes from and in whatever format it is in, so that it can be analyzed and used.
In this sense, it is essential to understand ETL as a process linked to the acquisition of a data warehouse. Without the ETL process, the data warehouse does not allow the value of the data to be exploited.
Along the same lines, if our data warehouse is in a cloud environment, it is advisable to opt for a cloud ETL process.
From ETL to ELT: How Big Data has transformed the ETL process
The emergence of Big Data has brought about a significant transformation in the way data is managed and stored, leading to new demands on traditional data warehousing processing. As new requirements have emerged in terms of volume and velocity, ETL processes have evolved into a different perspective known as ELT. This new perspective has emerged in response to today's demands for volume, velocity and veracity in data integration and storage, and has changed the usual order of the ETL process.
The term "Big Data" first emerged in the late 1990s to describe the growing problem faced by organizations in relation to the amount of data generated. In 1997, a group of NASA researchers published a paper highlighting that the increase in data was becoming a challenge for existing IT systems. This situation spurred technological advancement towards platforms capable of handling massive data sets. In 2001, the US firm Gartner published research entitled "3D Data Management: Controlling Data Volume, Velocity and Variety", which first mentioned the "3Vs" that Big Data technologies needed to address: volume, velocity and variety.
Big Data posed new challenges for the ETL process. The increase in volume, velocity and variety demanded by Big Data challenged the capacity of ETL tools, which often could not handle the pace required to process massive data sets, resulting in a lack of capacity and velocity, as well as cost overruns.
The emergence of new data formats and sources, along with data consolidation requirements, revealed the rigidity of the ETL process and changed the traditional way of consuming data. The demand for greater velocity and variety led data consumers to need immediate access to raw data, rather than waiting for IT to transform it and make it accessible.
In addition, Big Data also drove the creation of data lakes: data repositories that, unlike traditional data warehouses, do not require a predefined schema and therefore introduced more flexible storage schemes.
ETL tools, which were built primarily with IT management in mind, are often complicated to install, configure and manage. Moreover, these tools conceive of data transformation as a task for IT professionals only, making it difficult for data consumers to access. According to this logic, consumers can only access the final product stored in a standardized data warehouse.
In this context, innovation emerged, reshaping the process and making it more suitable for working with Big Data and cloud services. ELT provides greater flexibility, scalability, performance and speed, while reducing costs.
However, ELT also presents its own challenges. Unlike ETL, ELT tools are designed to facilitate data access to end consumers, which democratizes data access and allows users to obtain data from any data source via a URL. However, this can pose risks to data governance.
Is ELT the future of ETL?
Although ELT has improved both extraction (E) and loading (L) of data, challenges remain in terms of transformation (T). Today, data analytics plays a critical role in business. Despite ELT's efforts, data transformation-based analytics has not been simplified and remains the purview of the IT department, especially engineers and data scientists. Transforming raw data into consumer-ready assets still requires multiple tools and complex processes that data consumers do not have the capacity to address.
In addition, the various processes and tools required for data transformation still present the same problems as ETL, such as the speed of the process, the amount of resources required, cost and lack of scalability.
Has the problem been solved? For ELT to definitively replace ETL, ELT tools would have to evolve. In terms of this evolution, it is expected that in the near future these tools will include data governance capabilities and will gradually address the remaining difficulties.
Both ELT (Extract, Load, Transform) and ETL (Extract, Transform, Load) are processes used to move raw data from a source system to a target database, such as a data lake or data warehouse. These data sources may reside in multiple repositories or legacy systems, and are then transferred to the target data warehouse via ELT or ETL.
What is ELT (Extract, Load and Transform)?
In a data processing approach known as ELT (Extract, Load, Transform), unstructured data is extracted from a source system and loaded directly into a data lake for further transformation. Unlike the traditional ETL (Extract, Transform, Load) approach, the data is immediately available to business intelligence systems without requiring prior preparation. This allows analysts and data scientists to perform ad-hoc transformations as needed.
The ELT approach is especially useful for performing basic transformations on data, such as validation or de-duplication. These processes are updated in real time and are applied to large volumes of data in their original state.
However, the ELT approach is relatively new and its technical development has not yet reached the same level of advancement as the ETL approach. Initially, the ELT process relied on hard-coded SQL scripts, which increased the risk of coding errors compared to the more advanced methods used in ETL.
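A minimal sketch of the ELT order, using SQLite purely as a stand-in for the target warehouse: raw rows are loaded untouched into a staging table, and the validation, de-duplication and typing happen afterwards, inside the database, with SQL. All names and data are invented.

```python
# Sketch of the ELT order: raw rows are loaded untouched into a staging
# table, and the transformation happens afterwards inside the target
# database with SQL. Names and data are placeholders.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_events (raw_user TEXT, raw_amount TEXT)")

# Load: raw data goes in as-is, with no prior preparation.
raw = [(" alice ", "10.5"), ("BOB", "invalid"), (" alice ", "4.5")]
conn.executemany("INSERT INTO staging_events VALUES (?, ?)", raw)

# Transform (later, in the warehouse): validation, de-duplication, typing.
conn.execute("""
CREATE TABLE events AS
SELECT DISTINCT TRIM(LOWER(raw_user)) AS user, CAST(raw_amount AS REAL) AS amount
FROM staging_events
WHERE raw_amount GLOB '[0-9]*'
""")
print(conn.execute("SELECT * FROM events").fetchall())
```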
What is ETL (Extract, Transform and Load)?
In an ETL process, data is extracted from its source and undergoes transformations to prepare it before it is loaded into the target systems.
In a traditional ETL scenario, unstructured data is extracted and loaded into a staging area, where it undergoes a transformation process. During this stage, the data is organized, cleansed and transformed into structured data. This transformation process ensures that the now structured data is compatible with the target data warehousing system, usually a data warehouse.
ETL or ELT? Main differences
Although ETL tools and the ETL process continue to dominate the transfer and integration of data in the business world, recently the ETL process is being approached from another perspective: ELT. While it is true that some companies are already opting for ELT instead of ETL, it is necessary to emphasize that ELT does not have to be a substitute for ETL and that both processes can complement each other and, in fact, some experts are already talking about the ETLT process that combines both perspectives, creating the following pipeline: Extract, Transform, Load and Transform.
The main difference between the two is the alteration of the usual ETL order: Extract, Transform and Load. In ELT, on the other hand, the order of the process is: Extract, Load and Transform. In other words, in an ELT process, data are extracted from the source, loaded into a single place —usually a data lake— and finally, once the data are already integrated into the same data warehouse, data transformations are carried out: data normalization, filtering, merging, data validation, aggregations, etc. In the ETL process, on the other hand, data is extracted and transformations are performed before loading the data into a data warehouse.
In practice, however, it is common for the various steps of these processes to occur in parallel. That is, in ETL processes, companies usually perform data extraction, transformation and loading at the same time to save time. The main difference between ELT and ETL, then, is not so much in the order of the process but in the location where the data transformations are performed. In the case of ETL, the transformations take place in a temporary data warehouse - where the data is stored after being extracted and before being loaded into the final data warehouse. This temporary warehouse is powered by a specialized processing engine that enables the transformations. In the case of ELT, on the other hand, the transformations happen in the back-end data warehouse, which is capable of performing transformations and does not require a specialized engine. In this sense, ELT has a simpler architecture than ETL. When data is extracted from the source, it is directly included in the staging area of the target data warehouse where transformations are performed on the raw data. Once transformed, the data is copied to another area of the data warehouse.
For the ELT process to be productive, it is essential to have the necessary computing and processing capabilities to be able to perform data transformations. In practice, this translates into the acquisition and use of tools such as Azure Databricks, Azure Data Lake or Azure Synapse Analytics. On the other hand, this order of processing requires an environment that can be scalable and that allows space and capacity to be increased when necessary. Environments such as Azure are ideal for this type of process, as Azure tools are pay-as-you-go. That is, companies pay for the space they use and can increase the amount of space whenever they want.
As we have seen, while ETL Cloud offers greater speed than on-premises ETL, ELT increases that speed even more, this being precisely its biggest plus point. ELT speeds up data ingestion by avoiding the heaviest pipeline operations and data copying, an essential step in an ETL process.
Below, we explore the main differences between ETL and ELT:
What are the benefits of ELT compared to ETL and vice versa?
The main advantage of ELT: higher velocity
The main advantage of ELT over ETL, and the reason why many companies are betting on this new trend, is undoubtedly its greater speed and flexibility. ELT has a much higher data ingestion speed than ETL because it omits the data copying that occurs in ETL and its pipeline avoids other arduous operations.
Also, the other great strength of ELT is that it provides flexibility to data analysts, who can load the data without having to define beforehand what they are going to do with it and can, on the contrary, perform the transformations they want at the moment they need to do so. In addition, ELT allows, for the same reason, data analysts or data scientists to load data without first determining its structure. ETL, on the other hand, is a more rigid process that requires analysts to define the structure and usage of the data before it is loaded and also makes it difficult to retrieve the original data.
Although ELT brings benefits that ETL does not, the reverse is also true. ETL, for example, is a more suitable process for working with structured data and can enhance data security, quality and governance.
In any case, both perspectives have their pros and cons and will be more or less optimal depending on the characteristics of each company and its data assets, as well as the intended use of the data. Even the parallel use of both could mean an increase in value for the company.
The main advantages of ETL: more security
The ETL process involves a number of additional steps compared to ELT, which makes it a bit slower. These steps include loading the data into a staging area to perform the necessary transformations. However, in exchange for this increased complexity, ETL offers a more secure process that results in cleaner data and less likelihood of coding errors.
One of the key advantages of ETL is its ability to perform transformations on the data before loading it into the target warehouse. This provides an additional layer of security, as the transformed data is loaded more reliably and ensures data integrity. On the other hand, ELT tools are designed to allow more direct access to data by end users, which implies a higher security risk and makes it more challenging to ensure data governance.
In addition, the ETL approach offers specific benefits in terms of compliance with data privacy regulations, such as the GDPR (General Data Protection Regulation). In an ELT process, sensitive data may be more exposed to risks of theft or hacking.
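As a hypothetical illustration of why transforming before loading can help with privacy, the sketch below pseudonymises a personal identifier during the transform stage so that the raw e-mail address never reaches the target warehouse; the hashing choice and field names are assumptions for the example, not a compliance recipe.

```python
# Hypothetical sketch: a personal identifier is pseudonymised during the
# transform stage, so the raw e-mail address never reaches the warehouse.
import hashlib

def pseudonymise(email: str) -> str:
    """Replace a personal identifier with a stable, non-reversible token."""
    return hashlib.sha256(email.strip().lower().encode()).hexdigest()[:16]

extracted = [{"email": "jane.doe@example.com", "purchases": 4}]

to_load = [{"customer_token": pseudonymise(r["email"]),
            "purchases": r["purchases"]} for r in extracted]

print(to_load)  # only the token, not the e-mail, is loaded downstream
```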
In terms of advanced capabilities, ETL tools have evolved significantly due to their longer time on the market. These include comprehensive data flow automation functions, rule recommendations for the extraction, transformation and loading process, a visual interface for specifying rules and data flows, support for complex data management, as well as additional security and privacy compliance measures. These features make ETL tools a more robust and established choice.
ETL vs. ELT: Which option is better?
Both ETL and ELT are valid options to carry out the data extraction, transformation and loading process. Their validity and prevalence over the other option will always be linked to the specific characteristics and needs of each company, as well as the nature and amount of data assets.
If your company already has an ETL processing system that works effectively, there is no reason to give it up and switch to an ELT process. However, if your company plans to acquire more data warehouses - especially cloud warehouses - in the future, it may be worthwhile to go with ELT.
Always keep in mind that ELT is not a system and does not depend on a specific tool. It is an architecture and, therefore, if you have an ETL tool such as SSIS, you can easily integrate it into the ELT process.
The ETLT process, which is nothing more than the marriage of ETL and ELT, should also not be discounted. ETLT can maximize the value of the ETL and ELT processes by leveraging the best of each. In an ETLT process, a tool could extract data from its sources and store it in a data lake, and the data would then be extracted from the data lake, transformed and loaded into a data warehouse.
There is no doubt that the new ETL perspective, ELT, is here to stay. However, one should not think that ELT will replace ETL.
When is it appropriate to use ETL and when is it appropriate to go for ELT?
When is it better to perform the transformations, before or after loading?
The choice between ETL and ELT depends on several factors and no definitive answer can be given as to which is better in all cases. The choice of the most appropriate approach should be based on the specific needs and requirements of each project.
In general, ETL has traditionally been used when a thorough transformation of the data is required before loading it into the target warehouse. This approach is preferable when data quality and consistency needs to be ensured prior to analysis. By performing transformations before loading, validation rules can be applied, erroneous or duplicate data can be cleaned, and data can be structured according to a predefined schema. In addition, ETL is often more secure in terms of data privacy and data governance compliance, since transformations are performed before the data reaches the warehouse.
On the other hand, ELT has become more popular in the context of Big Data and the need to process large volumes of data efficiently. ELT enables fast loading of raw data into a data lake or unstructured data warehouse, which speeds up the ingestion process. Transformations are performed later, when the data is available in the warehouse, allowing ad-hoc analysis and greater flexibility to explore the data without prior constraints. ELT is especially useful when high processing speed is required and agility in data exploration and discovery is prioritized.
It is important to carefully assess the needs of the project, considering factors such as the volume and variety of the data, the complexity of the required transformations, security and compliance, as well as the resources and skills available in the team. In many cases, it may be beneficial to combine elements of both approaches, using ETL for initial data cleansing and structuring, and then leveraging ELT for subsequent analysis and exploration. The choice will depend on the particular circumstances and the balance between the quality, speed and flexibility requirements of the project.
ETL and ELT: Use Cases
Companies must choose between an ETL process or an ELT process depending on the use case.
ETL Use Cases
The ETL process is efficient in scenarios where small to medium-sized data sets requiring complex transformations are handled. However, its efficiency decreases as data sets grow, as aggregation operations become more complicated.
What types of companies can benefit from using ETL?
Organizations that need to integrate and synchronize data from multiple sources find ETL a suitable solution. It is especially useful for companies that have multiple data repositories in different formats, as ETL allows them to unify the format of the data before loading it into the target location.
Also, organizations that need to migrate and update data from legacy systems benefit from the ETL process. Legacy systems often require transformations to adapt the data to the new structure of the target database, and ETL provides the necessary tools to perform this transformation efficiently.
ELT Use Cases
ELT is a solution specially designed for the efficient management and integration of large amounts of data, whether structured or unstructured. It is also particularly suitable for environments that require fast, real-time access to data.
There are certain types of companies that would benefit from choosing ELT:
- Enterprises with large volumes of data: Those organizations that handle massive amounts of data and transactions can take advantage of the speed and flexibility ELT offers.
- Businesses that need instant access to data: For companies that generate large volumes of data and need to query it in real time, delays can have significant negative consequences.
ETL, on the other hand, is best suited for companies that handle sensitive data and prioritize compliance with data protection regulations.
In summary, while ETL is better suited to small to medium-sized data sets that require complex transformations, ELT is the more appropriate choice when it comes to dealing with large volumes of data, both structured and unstructured. In addition, ELT is especially useful in environments that demand the use of real-time data and offers faster process execution.
Is your ETL compliant with GDPR?
The General Data Protection Regulation (GDPR) is the European Union's new regulatory framework regarding data protection that came into force in May 2018, after a two-year transition. The new legislation was born with the will to unify the data protection laws of all European Union countries and to strengthen citizens' control over their personal data and information.
The new law forced all companies and organizations dealing with EU citizens' data to review and adapt their data protection policies and, according to a Gartner study, the vast majority of companies had still not implemented the necessary measures to comply with the law months before the final implementation deadline.
In this sense, most companies that were working with ETL processes had a problem, since most ETL systems were not GDPR compliant. This forced companies to go for new systems such as Master Data Management / Enterprise Information Integration (MDM/EII). MDM/EII is a data integration and interoperability technology that enables querying data in multiple formats and from different data sources and is specifically designed to comply with GDPR. The system streamlines the transfer of information between systems and ensures that, at every step of the process, data collection and integration are carried out in compliance with the requirements of European legislation, guaranteeing the integrity of data that is consolidated, accurate and congruent with the purpose for which it is transferred and used.
Information-oriented integration does not require modifications to the systems involved, but only the implementation of the information exchange mechanism between the data repositories of the respective applications. It represents the simplest and least impactful form of integration compared to other types of integration: process or service oriented.
Likewise, after two years of GDPR, companies' compliance with personal data protection rules is still in question. In fact, Forbes magazine reported as early as 2017 that more than 50% of the companies surveyed were not prepared for the new regulatory measures. The forecast was confirmed by the law firm DLA Piper, which published in January 2021 that GDPR fines issued in 2020 had increased by 40% compared to the previous 20 months. This may be the result either of companies increasingly failing to comply with the regulation, which would be strange considering the consequences, or of the European Union progressively strengthening sanctions to force compliance with the new privacy and data protection policies.
This scenario leads us to reflect on the role of data quality and data governance in the application of regulatory measures. Compliance with the GDPR by organizations involves ensuring data quality and creating data governance measures to control, validate and consolidate data and, in particular, to be clear about what data the company stores, where all the data assets are, what processes they go through, whether they are secure, whether they are reliable and what their purpose is.
In this sense, then, the problem of non-compliance on the part of companies does not lie with ETL systems. While it is true that many were not prepared at the time, the operation of ETL tools is safe if their use is complemented by data quality and data governance policies, which are essential for compliance with the law.
The end of cookies in Google Chrome: One more step in data protection
In 2019, Google made public its intention to phase out third-party cookies in Chrome and block other covert tracking techniques, such as fingerprinting. Justin Schuh, Chrome's director of engineering, explained that this was Google's strategy to redesign web standards and ensure default privacy. In early March 2021, Google confirmed its commitment to completely eliminate third-party cookies in Chrome by 2022. The company plans to implement privacy protection measures through the Privacy Sandbox plan. Some of these new rules are already being tested in Chrome 89, with the expectation of offering them to Google Ads customers starting in April.
While the initiative has been well received by the general public, it has raised concerns among businesses and advertisers about the future of digital advertising and marketing in a cookie-less world.
👉 To help businesses adapt, Bismart has developed a practical guide that provides measures and precautions to be taken in this new scenario. You can download the "Guide to Surviving in a Cookie-Free World" below.