DATA Pipeline ATLAS 3 ETL Structure


Andi Lamprecht · 3 min read · Accepted
ADR-0041 · Author: Sybil Melton · Date: 2025-02-07 · Products: platform
Originally ADR-0059 DATA_PIPELINE_ATLAS-3-ETL_Structure (v3) · Source on Confluence ↗

Context

To process the data, Atlas needs to implement a data-pipeline technique that guarantees the order of data transformations.

The order of data pipeline transformations is driven by a third-party tool called an orchestrator.

Additionally, the data-processing solution must fit both the chosen data architecture and the product needs.
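To make the ordering guarantee concrete, here is a minimal sketch in plain Python of how an orchestrator can derive a valid execution order from block dependencies. The block names and the dependency graph are hypothetical stand-ins; the real orchestrator is a third-party tool, not this code.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each block maps to the set of
# blocks that must complete before it may run.
deps = {
    "ingestion": set(),
    "transformation": {"ingestion"},
    "validation": {"transformation"},
    "replication": {"validation"},
}

def run_order(deps: dict) -> list:
    """Return one execution order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())
```

Any order returned by `run_order` places ingestion before transformation, transformation before validation, and validation before replication, which is the ordering guarantee the decision relies on.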

Decision


Every data pipeline can be represented by the following types of code blocks:

  • Ingestion Block
  • Transformation Block
  • Data Replication Block (optional)
  • Data Validation Block
  • Data Validation Sidecar


Ingestion block

The ingestion block is the first operation of the data pipeline. It is responsible for taking data out of the data lake and normalizing it into the default lakehouse processing format, a Delta table.
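A minimal sketch of the normalization idea, in plain Python rather than the real lakehouse stack: raw lake files in different formats are read into one common row format. The `ingest` function and the list-of-dicts "table" are illustrative stand-ins for the actual Delta table output.

```python
import csv
import io
import json

def ingest(raw: str, fmt: str) -> list:
    """Normalize raw lake content (CSV or JSON-lines here) into one
    common row-dict format standing in for a Delta table."""
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(raw)))
    if fmt == "jsonl":
        return [json.loads(line) for line in raw.splitlines() if line.strip()]
    raise ValueError(f"unsupported format: {fmt}")
```

Whatever the source format, downstream blocks only ever see the one normalized shape, which is the point of keeping ingestion as a dedicated first block.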

Transformation block


This block represents a set of data transformations that create a meaningful product. If the product exists for quality purposes or internal Atlas team needs, it should be stored in the Silver layer; if the output of the transformation is a business-level product (e.g. Airspace), it should be stored in the Gold layer.
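The layer-routing rule can be sketched as follows. The transformation itself (deriving an area in km² from m²) and the `business_level` flag are hypothetical examples; the Silver/Gold routing mirrors the rule above.

```python
def transform(rows: list, business_level: bool) -> dict:
    """Apply a transformation and tag the output with its target layer:
    Gold for business-level products (e.g. Airspace), Silver otherwise.
    The derived column here is an illustrative example."""
    product = [{**r, "area_km2": float(r["area_m2"]) / 1e6} for r in rows]
    layer = "gold" if business_level else "silver"
    return {"layer": layer, "rows": product}
```

Keeping the routing decision inside the block means the pipeline definition, not the consumer, decides where a product lands.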

Data Replication block


This is an optional block for products used by systems other than OLAP (online analytical processing). An example is the Airspace table: this product is an input to applications, so to increase read speed it was replicated to a transactional DB (PostGIS).
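The replication step amounts to copying a finished product into a store optimized for keyed reads. This sketch uses an in-memory dict as a stand-in for the transactional DB; the `key` column and `replicate` helper are illustrative, not the actual PostGIS integration.

```python
def replicate(product_rows: list, store: dict, key: str) -> dict:
    """Copy an OLAP product into a key-indexed store (standing in for
    a transactional DB such as PostGIS) so apps get fast keyed reads."""
    for row in product_rows:
        store[row[key]] = row
    return store
```

Because the block only copies an already-validated product, it can be attached or omitted per product without changing the rest of the pipeline.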

Data Validation block


The data validation block is a gatekeeper for correct data processing. It is responsible for running user-defined checks on every row of the table.
This block has two outputs: a curated table and a data quality report.

  • The curated table consists only of the rows of the input table that fulfill the data quality expectations set in the block.
  • The data quality report is a user-readable summary of the checks.

Data Validation sidecar


The sidecar block is a check that runs nightly, for data quality tasks that require more time to process. It was separated from the main pipeline to lower latency and can be attached to any of the pipeline products. This block produces a human-readable report as an output. An example of a check running in this kind of block is data profiling.
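A minimal sketch of the data-profiling example: summarize one numeric column into a report line. The column name and report format are illustrative; a real profiling run would cover every column and far more statistics, which is exactly why it runs nightly in a sidecar rather than inline.

```python
from statistics import mean

def profile(rows: list, column: str) -> str:
    """Sidecar-style profiling check: summarize one numeric column
    into a human-readable report line."""
    values = [float(r[column]) for r in rows if r.get(column) is not None]
    nulls = len(rows) - len(values)
    return (f"{column}: count={len(values)}, nulls={nulls}, "
            f"min={min(values)}, max={max(values)}, mean={mean(values):.2f}")
```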

Consequences

Alternatives Considered

  1. Whole ETL as one transformation

Pros:

  • Data is processed and delivered faster
  • One big block simplifies the architecture

Cons:

  • Hard to maintain
  • Transformations between the input and output are hard to track; when a developer makes a mistake, the bug is hard to detect