DATA Pipeline ATLAS 3 ETL Structure
Originally ADR-0059 DATA_PIPELINE_ATLAS-3-ETL_Structure (v3) · Source on Confluence

Context
To process its data, Atlas needs to implement a data pipeline technique that guarantees the order of data transformations.
The order of the pipeline's transformations is defined by a third-party tool called an orchestrator.
Additionally, the data processing solution must fit both the chosen data architecture and the product needs.
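As a sketch of how an orchestrator fixes that order, the snippet below assumes Apache Airflow (the ADR does not name the tool); the DAG and task names are hypothetical and the callables are stubs:

```python
# Hypothetical Airflow 2.x DAG illustrating orchestrator-guaranteed ordering.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(dag_id="atlas_example_pipeline", start_date=datetime(2024, 1, 1)) as dag:
    ingest = PythonOperator(task_id="ingestion_block", python_callable=lambda: None)
    transform = PythonOperator(task_id="transformation_block", python_callable=lambda: None)
    validate = PythonOperator(task_id="data_validation_block", python_callable=lambda: None)

    # The orchestrator enforces this execution order.
    ingest >> transform >> validate
```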
Decision
Every data pipeline can be represented by the following types of code blocks:
- Ingestion Block
- Transformation Block
- Data Replication Block (optional)
- Data Validation Block
- Data Validation Sidecar
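As an illustration of the decision, the block types could be expressed as function signatures roughly like the sketch below; the interface is hypothetical and assumes PySpark DataFrames as the exchange format between blocks:

```python
# Hypothetical interfaces for the five block types; not Atlas's actual code.
from pyspark.sql import DataFrame


def ingestion_block(raw_path: str) -> DataFrame:
    """Read raw data from the data lake and normalize it to a Delta-ready frame."""


def transformation_block(df: DataFrame) -> DataFrame:
    """Apply transformations that produce a Silver- or Gold-layer product."""


def data_replication_block(df: DataFrame) -> None:
    """Optionally copy a product into a non-OLAP store (e.g. a transactional DB)."""


def data_validation_block(df: DataFrame) -> tuple[DataFrame, dict]:
    """Run row-level checks; return a curated table and a data quality report."""


def data_validation_sidecar(table_path: str) -> dict:
    """Run long, nightly checks (e.g. profiling) and return a report."""
```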
Ingestion block
The ingestion block is the part of the code that performs the first operation of the data pipeline. It is responsible for taking the data out of the data lake and normalizing it to the default data lakehouse processing format: a Delta table.
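A minimal ingestion sketch, assuming PySpark with Delta Lake support; the paths and source format are placeholders:

```python
# Hypothetical ingestion job: raw data lake files -> bronze Delta table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("atlas-ingestion").getOrCreate()

# Read raw files from the data lake (JSON here; could be CSV, Parquet, ...).
raw = spark.read.json("s3://data-lake/raw/flights/")

# Normalize to the default lakehouse processing format: a Delta table.
raw.write.format("delta").mode("overwrite").save("s3://lakehouse/bronze/flights")
```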
Transformation block
This block represents a set of data transformations that creates a meaningful product. If the product exists for quality purposes or internal Atlas team needs, it should be stored in the Silver layer; if the output of the transformation is a business-level product (e.g. Airspace), it should be stored in the Gold layer.
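A minimal transformation sketch under the same PySpark/Delta assumptions; the table names and the Airspace columns are hypothetical:

```python
# Hypothetical transformation job: bronze -> Silver (internal) and Gold (business).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("atlas-transformation").getOrCreate()

flights = spark.read.format("delta").load("s3://lakehouse/bronze/flights")

# Intermediate, team-internal quality product -> Silver layer.
cleaned = flights.dropDuplicates(["flight_id"]).filter(F.col("altitude_ft") > 0)
cleaned.write.format("delta").mode("overwrite").save("s3://lakehouse/silver/flights_clean")

# Business-level product (e.g. Airspace) -> Gold layer.
airspace = cleaned.groupBy("airspace_id").agg(F.count("*").alias("flight_count"))
airspace.write.format("delta").mode("overwrite").save("s3://lakehouse/gold/airspace")
```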
Data Replication block
This is an optional block for products consumed by systems other than OLAP (online analytical processing). An example is the Airspace table: this product is an input to applications, so to increase read speed it is replicated to a transactional DB (PostGIS).
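A minimal replication sketch using Spark's JDBC writer; the PostGIS connection details are placeholders, and the PostgreSQL JDBC driver must be on the Spark classpath:

```python
# Hypothetical replication job: gold Delta table -> transactional DB (PostGIS).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("atlas-replication").getOrCreate()

airspace = spark.read.format("delta").load("s3://lakehouse/gold/airspace")

# Copy the OLAP product into the transactional DB so apps get fast reads.
(airspace.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://postgis-host:5432/atlas")
    .option("dbtable", "public.airspace")
    .option("user", "atlas_app")
    .option("password", "***")
    .mode("overwrite")
    .save())
```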
Data Validation block
The data validation block is a gatekeeper for correct data processing. It is responsible for running user-defined checks on every row of the table.
This block has two outputs: a curated table and a data quality report.
- The curated table consists only of the rows of the input table that fulfill the data quality expectations set in the block.
- The data quality report is a human-readable summary of the checks.
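A minimal validation sketch using a plain PySpark filter as the row-level check; the expectation and table names are illustrative, not Atlas's actual checks:

```python
# Hypothetical validation job: split input into curated rows and a DQ report.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("atlas-validation").getOrCreate()

df = spark.read.format("delta").load("s3://lakehouse/gold/airspace")

# User-defined check applied to every row.
expectation = F.col("flight_count") >= 0

curated = df.filter(expectation)
rejected = df.filter(~expectation)

# Output 1: curated table with only the rows that fulfill the expectation.
curated.write.format("delta").mode("overwrite").save("s3://lakehouse/gold/airspace_curated")

# Output 2: human-readable data quality report.
total, bad = df.count(), rejected.count()
print(f"flight_count >= 0: {total - bad}/{total} rows passed ({bad} rejected)")
```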
Data Validation sidecar
The sidecar block is a check that runs nightly, for data quality tasks that require more time to process. It was separated from the main pipeline to lower latency, and it can be attached to any of the pipeline products. This block produces a human-readable report as its output. An example of a check run in this kind of block is data profiling.
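A minimal sidecar sketch that profiles a product table with PySpark's built-in summary(); the report path is a placeholder and scheduling the nightly run is left to the orchestrator:

```python
# Hypothetical sidecar job: nightly data profiling, detached from the pipeline.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("atlas-dq-sidecar").getOrCreate()

df = spark.read.format("delta").load("s3://lakehouse/gold/airspace")

# Profile the table (count, mean, stddev, min, quartiles, max per column)
# and persist the result as a human-readable report.
profile = df.summary()
profile.toPandas().to_csv("/reports/airspace_profile.csv", index=False)
```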
Consequences
Alternatives Considered
- Whole ETL as one transformation
Pros:
- Data is processed and delivered faster
- One big block simplifies architecture
Cons:
- Hard to maintain
- Transformations between the input and the output are hard to track; in case of a developer mistake, the bug is hard to detect