DATA Pipeline ATLAS 5 ETL Batchmodestructure

Andi Lamprecht · Accepted
ADR-0052 · Author: Sybil Melton · Date: 2025-02-07 · Products: platform
Originally ADR-0063 DATA_PIPELINE_ATLAS-5-ETL_BatchModeStructure (v3)

Context

The Atlas data pipeline must support a batch mode for processing the whole dataset.

Decision

Batch mode is enabled by triggering the pipeline with the argument --batch=True. In this mode, an ETL loads and processes the entire input dataset instead of only recent changes. The output is written in overwrite mode (or selective_overwrite, depending on the scenario), effectively replacing all existing files. For delta table outputs, the write automatically updates the change log, so the changes propagate through the rest of the pipeline.
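A minimal sketch of how an entry point might interpret the flag. Only --batch=True, overwrite and selective_overwrite come from this ADR; the function names, the `selective` switch, and the "recent_changes" / "incremental" labels for the non-batch path are assumptions made for illustration.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse 'True'/'False' strings, since the flag is passed as --batch=True."""
    normalized = value.strip().lower()
    if normalized in {"true", "1", "yes"}:
        return True
    if normalized in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_args(argv: list[str]) -> argparse.Namespace:
    # Hypothetical entry point; the real pipeline may define more arguments.
    parser = argparse.ArgumentParser(description="Pipeline trigger (sketch)")
    parser.add_argument("--batch", type=str_to_bool, default=False)
    return parser.parse_args(argv)


def resolve_run_config(batch: bool, selective: bool = False) -> dict:
    """Map the batch flag to a read scope and a write mode."""
    if not batch:
        # Assumption: placeholder labels for the non-batch (recent-changes) path.
        return {"read_scope": "recent_changes", "write_mode": "incremental"}
    write_mode = "selective_overwrite" if selective else "overwrite"
    return {"read_scope": "full_dataset", "write_mode": write_mode}
```

Calling `resolve_run_config(parse_args(["--batch=True"]).batch)` selects the full dataset and the overwrite mode in this sketch.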

This mode enforces the following behavior on the blocks specified in DataArchitecture:

Bookmarks Prune

Each batch run should prune and overwrite the bookmarks directory. For more on this case, see ATLAS-6-ELT_CDCMODESTRUCTURE.
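As a rough illustration of the prune step, the sketch below assumes bookmarks are plain files in a local directory. The directory layout, the `.bookmark` suffix, and both function names are hypothetical; the actual bookmark format is described in the CDC mode ADR.

```python
import shutil
from pathlib import Path


def prune_bookmarks(bookmarks_dir: Path) -> None:
    """Remove every existing bookmark and leave an empty directory behind."""
    if bookmarks_dir.exists():
        shutil.rmtree(bookmarks_dir)
    bookmarks_dir.mkdir(parents=True, exist_ok=True)


def write_bookmark(bookmarks_dir: Path, name: str, position: str) -> Path:
    """Record the position reached by this run (file layout is an assumption)."""
    bookmark = bookmarks_dir / f"{name}.bookmark"
    bookmark.write_text(position)
    return bookmark
```

A batch run would call `prune_bookmarks` before processing and `write_bookmark` afterwards, so that later incremental runs start from the state the batch run produced.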

Ingestion Block

When this block is triggered in batch mode, its connector should skip the diff phase, load the whole dataset, and then overwrite the output delta lake table.
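The branching described above can be sketched with injected callables standing in for the connector. Everything here is hypothetical: `load_full`, `load_diff` and `write` are placeholders rather than the connector's real interface, and the "incremental" label for the non-batch write is an assumption.

```python
from typing import Callable, Iterable

Row = dict


def ingest(
    batch: bool,
    load_full: Callable[[], Iterable[Row]],
    load_diff: Callable[[], Iterable[Row]],
    write: Callable[[list, str], None],
) -> str:
    """Run one ingestion pass and return the write mode that was used."""
    if batch:
        # Batch mode: the diff phase is skipped and the whole dataset is loaded.
        rows = list(load_full())
        mode = "overwrite"
    else:
        # Assumption: the regular path loads only the computed diff.
        rows = list(load_diff())
        mode = "incremental"
    write(rows, mode)
    return mode
```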

Transformation Block

When this block is triggered in batch mode, its connector should request the whole delta lake file instead of recent changes only. After applying transformations, its output should be saved in overwrite mode (or selective_overwrite for specific scenarios).
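The difference between the two write modes can be illustrated with an in-memory stand-in for a partitioned table. This ADR does not define selective_overwrite; the sketch assumes it means "replace only the partitions present in the new output", and the `date` partition key and the `write_output` function are hypothetical.

```python
from collections import defaultdict


def write_output(table: dict, rows: list, mode: str, partition_key: str = "date") -> dict:
    """Return the table contents after writing `rows` in the given mode.

    `table` maps a partition value to the list of rows stored in it.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[partition_key]].append(row)

    if mode == "overwrite":
        # Everything previously stored is replaced by the new output.
        return dict(grouped)
    if mode == "selective_overwrite":
        # Assumption: only partitions present in the new output are replaced.
        merged = dict(table)
        merged.update(grouped)
        return merged
    raise ValueError(f"unsupported write mode: {mode!r}")
```

With this interpretation, a full overwrite drops partitions that the batch run did not produce, while a selective overwrite leaves them untouched.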

Data Replication Block

When this block is triggered in batch mode, its connector should request the whole delta lake file instead of recent changes only. It should overwrite the target table in the external source.

Data Validation Block

When this block is triggered in batch mode, its connector should request the whole delta lake file instead of recent changes only. After validating the whole dataset, it should overwrite the target curated table and generate a full data report for the new dataset.
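A small sketch of a full validation pass that yields both the rows for the curated table and a report. The report fields, the `is_valid` predicate, and the choice to exclude invalid rows from the curated output are assumptions; the ADR only states that the curated table is overwritten and a full report is generated.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class ValidationReport:
    total: int = 0
    valid: int = 0
    invalid_indexes: list = field(default_factory=list)


def validate_batch(rows: Iterable[dict], is_valid: Callable[[dict], bool]):
    """Validate every row and return (curated_rows, report)."""
    report = ValidationReport()
    curated = []
    for index, row in enumerate(rows):
        report.total += 1
        if is_valid(row):
            report.valid += 1
            curated.append(row)
        else:
            # Assumption: invalid rows are reported and kept out of the curated table.
            report.invalid_indexes.append(index)
    return curated, report
```

The returned `curated` list would then be written to the curated table in overwrite mode, replacing the previous contents.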

Data Validation Block

Since the sidecar exists outside of the pipeline, it should always read the whole dataset.

Consequences

  1. Batch mode is triggered only on special occasions.
  2. A pipeline’s batch run updates only that pipeline’s output, and may trigger other pipelines to run in batch mode as well.
  3. Data output is overwritten.
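To illustrate the second consequence, the sketch below walks a dependency map breadth-first to list the pipelines a batch run could cascade into. The map, the pipeline names and the function are hypothetical, and whether a downstream pipeline actually reruns in batch mode is a policy decision this ADR leaves open.

```python
from collections import deque


def pipelines_to_rerun(start: str, downstream: dict) -> list:
    """Return `start` and every pipeline reachable from it, in breadth-first order."""
    order = []
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        order.append(current)
        for child in downstream.get(current, []):
            if child not in seen:
                # Each pipeline is scheduled once, even if reachable by several paths.
                seen.add(child)
                queue.append(child)
    return order
```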

Alternatives Considered
