Data Pipeline ATLAS-5 ETL Batch Mode Structure
Originally ADR-0063 DATA_PIPELINE_ATLAS-5-ETL_BatchModeStructure (v3)
Context
The Atlas data pipeline must support a batch mode for processing the whole dataset.
Decision
Batch mode is enabled by triggering the pipeline with the argument --batch=True. In this mode, an ETL loads and processes the entire input dataset instead of only recent changes. The output is written with the overwrite mode set to true (or with selective_overwrite, depending on the scenario), effectively replacing the existing files. For Delta table outputs, this write automatically updates the change log, so the changes propagate through the rest of the pipeline.
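As an illustration, the flag handling described above could look like the following minimal sketch. The entry point and the helper names (str_to_bool, parse_args, resolve_write_mode) are hypothetical, and the "incremental" label for non-batch runs is an assumption; only --batch=True, overwrite, and selective_overwrite come from this ADR.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert the text after --batch= into a real boolean."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_args(argv: list[str]) -> argparse.Namespace:
    # Hypothetical entry point; the real pipeline trigger may differ.
    parser = argparse.ArgumentParser(description="ETL run (illustrative)")
    parser.add_argument("--batch", type=str_to_bool, default=False)
    return parser.parse_args(argv)


def resolve_write_mode(batch: bool, selective: bool = False) -> str:
    """Pick the output write mode for a run."""
    if not batch:
        # Assumption: the ADR does not name the non-batch write mode.
        return "incremental"
    return "selective_overwrite" if selective else "overwrite"
```

An explicit converter is used because argparse's type=bool would treat the string "False" as true, since any non-empty string is truthy in Python.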
This mode enforces the following behavior on the blocks specified in DataArchitecture:
Bookmarks Prune
Each batch run should prune and overwrite the bookmarks directory. For more on this case, see:
Ingestion Block
When this block is triggered in batch mode, its connector should skip the diff phase, load the whole dataset, and then overwrite the output Delta Lake table.
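The difference between the two ingestion paths can be sketched as follows. This is an illustrative model on in-memory records, not the actual connector: the Record shape, the "id" key, the function names, and the "append" mode for incremental runs are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def diff_records(source: List[Record], previous: List[Record]) -> List[Record]:
    """Diff phase: keep records that are new or changed since the last run."""
    previous_by_id = {record["id"]: record for record in previous}
    return [r for r in source if previous_by_id.get(r["id"]) != r]


def ingest(
    source: List[Record], previous: List[Record], batch: bool
) -> Tuple[List[Record], str]:
    """Return the records to write and the write mode for the output table."""
    if batch:
        # Batch mode: skip the diff phase and replace the whole output.
        return list(source), "overwrite"
    # Assumption: incremental runs append only the detected changes.
    return diff_records(source, previous), "append"
```

In batch mode the previous snapshot is never consulted, which is why the output has to be overwritten rather than merged.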
Transformation Block
When this block is triggered in batch mode, its connector should request the whole Delta Lake table instead of only recent changes. After applying transformations, its output should be saved in overwrite mode (or selective_overwrite for specific scenarios).
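One possible mapping of this behavior onto Delta Lake's Spark options is sketched below. It assumes that "recent changes" are read through Delta's change data feed (which must be enabled on the table) and that selective_overwrite corresponds to a replaceWhere predicate; neither mapping is confirmed by this ADR. The helper names are hypothetical; readChangeFeed, startingVersion, and replaceWhere are existing Delta Lake options.

```python
from typing import Dict, Optional


def read_options(batch: bool, last_version: Optional[int] = None) -> Dict[str, str]:
    """Spark reader options for the input Delta table."""
    if batch:
        # No options: a plain read returns the full current table.
        return {}
    if last_version is None:
        raise ValueError("incremental runs need the last processed version")
    # Change data feed read; startingVersion is inclusive.
    return {"readChangeFeed": "true", "startingVersion": str(last_version + 1)}


def write_options(selective_predicate: Optional[str] = None) -> Dict[str, str]:
    """Spark writer settings for a batch run."""
    options = {"mode": "overwrite"}
    if selective_predicate is not None:
        # Assumption: selective_overwrite maps to Delta's replaceWhere.
        options["replaceWhere"] = selective_predicate
    return options
```

With PySpark, the reader options would be passed as spark.read.format("delta").options(**read_options(...)).load(path). On the writer side, "mode" goes to .mode(...) and replaceWhere to .option(...).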
Data Replication Block
When this block is triggered in batch mode, its connector should request the whole Delta Lake table instead of only recent changes. It should then overwrite the target external source table.
Data Validation Block
When this block is triggered in batch mode, its connector should request the whole Delta Lake table instead of only recent changes. After validating the whole dataset, it should overwrite the target curated table and generate a full data report for the new dataset.
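A minimal sketch of a full-dataset validation pass is shown below. The validation rule (required fields must be present and not None), the report fields, and the function name are assumptions for illustration; the ADR does not specify what the data report contains.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def validate_batch(
    records: List[Record], required: List[str]
) -> Tuple[List[Record], Dict[str, int]]:
    """Validate every record and build a report covering the whole dataset."""
    curated = [
        r for r in records
        if all(r.get(field) is not None for field in required)
    ]
    report = {
        "total": len(records),
        "valid": len(curated),
        "invalid": len(records) - len(curated),
    }
    # The caller overwrites the curated table with `curated`.
    return curated, report
```

Because the report is computed over all records rather than a change set, its counts describe the new dataset as a whole.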
Data Validation Block
Since the sidecar exists outside the pipeline, it should always read the whole dataset.
Consequences
- Batch mode is triggered on special occasions
- A batch run updates only that pipeline’s output and may trigger other pipelines to run in batch mode as well
- Data output is overwritten