DATA Pipeline ATLAS 4 ETL Operational Modes

Status: Accepted
ADR-0032 · Author: Sybil Melton · Date: 2025-02-07 · Products: platform
Originally ADR-0062 DATA_PIPELINE_ATLAS-4-ETL_OperationalModes (v3) · Source on Confluence ↗

Context

The Atlas data pipelines receive data from various third-party sources through multiple channels. Some providers deliver only the updated data, while others provide the entire dataset. The delivery frequency also varies, so the Atlas architecture must be adaptable. Our objective is to process data efficiently in both scenarios while minimizing processing time.

Decision

ETL Modes

To meet these diverse requirements, the ETL system must be designed to support different modes of operation. There are two specific modes the ETL must accommodate:

  • default, which uses change data capture (CDC)
  • special, which uses batch processing.
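The two modes could be represented as a simple enum in the pipeline configuration. This is an illustrative sketch, not the actual Atlas code; the name `EtlMode` and the string values are assumptions:

```python
from enum import Enum

class EtlMode(Enum):
    """Operational modes of the Atlas ETL (illustrative sketch)."""
    DEFAULT = "cdc"    # change data capture: process only deltas
    SPECIAL = "batch"  # batch: reprocess the entire dataset
```

Modeling the mode as an enum rather than a free-form string keeps the set of valid modes closed, so a misspelled mode fails fast at configuration time.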

Default operation mode

The default mode of the ETL revolves around change data capture (CDC). This mode captures and processes the incremental changes that occur in the source data. By identifying and extracting only the modified or newly added records, the ETL system minimizes the processing time required for data integration and speeds up data delivery. This approach is particularly beneficial where real-time or near-real-time synchronization is essential, enabling timely and accurate decision-making based on the most up-to-date information available.
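
The CDC apply step can be sketched as merging a list of change records into a keyed target table, touching only the rows that actually changed. The names `ChangeRecord` and `apply_changes` are hypothetical, chosen for illustration; they are not Atlas APIs:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeRecord:
    key: str
    op: str                       # "upsert" or "delete"
    payload: Optional[dict] = None

def apply_changes(target: dict, changes: list) -> dict:
    """Merge CDC records into the target table in place.

    Only the keys present in the change set are touched; all other
    rows of the target are left untouched.
    """
    for change in changes:
        if change.op == "upsert":
            target[change.key] = change.payload
        elif change.op == "delete":
            target.pop(change.key, None)
    return target

# Example: one upsert and one delete against a small keyed table.
table = {"a": {"v": 1}, "b": {"v": 2}}
apply_changes(table, [ChangeRecord("a", "upsert", {"v": 9}),
                      ChangeRecord("b", "delete")])
# table is now {"a": {"v": 9}}
```

The cost of the merge scales with the size of the change set, not the size of the full table, which is the efficiency argument made above.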

Special operation mode

To prepare for rare scenarios such as disaster recovery or a system reset, our pipelines must be able to switch seamlessly from processing incremental changes to processing the entire dataset. This flexibility is vital to safeguard our data and ensure the continuity of our operations during unforeseen events: by transitioning smoothly to full-dataset processing, we minimize the risk of data loss and reduce system downtime.
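
In contrast to the CDC merge, the special mode rebuilds the output from scratch. A minimal sketch, assuming a hypothetical `batch_reload` helper and rows keyed by an `"id"` field:

```python
def batch_reload(source_rows, transform):
    """Rebuild the output table from the entire source dataset,
    discarding all previous state (hypothetical sketch)."""
    return {row["id"]: transform(row) for row in source_rows}

# Example: reprocess the full dataset and overwrite the output.
rows = [{"id": "a", "v": 1}, {"id": "b", "v": 2}]
out = batch_reload(rows, lambda r: r["v"] * 10)
# out == {"a": 10, "b": 20}
```

Because the result is derived solely from the current source snapshot, any inconsistent state accumulated in the output is wiped out, which is exactly what a disaster-recovery run needs.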

Consequences

  1. The ETL's default operational mode is CDC: only data changes are pushed and processed through the pipeline.
  2. For specific scenarios, the Atlas ETL architecture allows users to trigger a batch run for selected pipelines. This reprocesses the whole datasets and overwrites the output tables.
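
Triggering a batch run for selected pipelines while the rest stay in CDC mode could look like the following. The pipeline names and the `plan_runs` function are invented for illustration only:

```python
PIPELINES = ["orders", "customers", "inventory"]  # illustrative names

def plan_runs(batch_selection: set) -> dict:
    """Return the operational mode for each pipeline: batch for the
    explicitly selected ones, CDC for everything else (the default)."""
    return {p: ("batch" if p in batch_selection else "cdc")
            for p in PIPELINES}

# Example: reprocess only the "orders" pipeline in full.
plan = plan_runs({"orders"})
# plan == {"orders": "batch", "customers": "cdc", "inventory": "cdc"}
```

Keeping the batch trigger per pipeline, rather than global, limits the reprocessing cost to the datasets that actually need a rebuild.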

Alternatives Considered

  1. Only Batch runs:

Pros:

  • Maintaining data consistency is straightforward.
  • Simple data processing model.

Cons:

  • Hard to integrate sources that publish only data changes.
  • Every time the input dataset changes, the whole dataset is processed, even the rows that were not affected by the change. This adds unnecessary computation to the pipeline.
  • The impact of the point above grows with the data source's publishing frequency, since the extra processing introduces a lag in data delivery.
  2. Only CDC runs

Pros:

  • Resilient to changes in data publishing frequency

Cons:

  • In case of a bug, it is difficult to maintain data consistency.
  • In the long run, the metadata of Parquet / Delta Lake files can grow large and degrade processing speed. This can be mitigated with the lambda architectural pattern, although that pattern may introduce a lag into data processing.