DATA Pipeline ATLAS 2 Data Architecture

Andi Lamprecht · 4 min read · Accepted
ADR-0031 · Author: Sybil Melton · Date: 2025-02-07 · Products: platform
Originally ADR-0055 DATA_PIPELINE_ATLAS-2-DataArchitecture (v3) · Source on Confluence ↗

Context

Atlas data pipelines need a data processing architecture that aligns with their specific needs, which fall into three categories:

  • Data pipeline
  • Data sources
  • Data pipeline products

Data Pipeline

  • Reliability: The production of data products must be dependable and consistent.
  • Auditability: The data generated by the pipeline should be traceable and easily auditable.
  • Error Reporting: Ingestion errors should be promptly identified and reported.
  • Data Security: The data must be securely stored to ensure its protection.
  • Scalability: The pipeline should efficiently handle increasing volumes of data.
  • Maintainability: The pipeline should be implemented with tooling that supports its future evolution and allows version control.

Data sources


Because Atlas relies on external data sources, the following considerations arise:

  • Data Quality: The quality of the data from these sources is uncertain or unknown.
  • Data Change Frequency: The frequency at which the data from these sources is updated or changed is uncertain or unknown.
  • Data Format Variability: The data obtained from these sources can be provided in any format, and the specific format may vary.

Data products


The outputs of the pipelines can be generated by multiple preceding pipelines and can serve as input for subsequent ones. To prevent potential bottlenecks, it is essential that the pipeline supports the following:

  • Concurrent Writing: The ability to handle simultaneous writes to the pipeline, allowing multiple processes or pipelines to write data concurrently without conflicts or delays.
  • Read Capabilities During Writing: The pipeline should provide the capability to read data while it is still being written, enabling concurrent reading and writing operations without compromising data integrity or consistency.

Decision

To fulfill the requirements above, Atlas implements a Data Lakehouse with a medallion architecture.
The Data Lakehouse stores data on a blob storage solution in the Delta Lake format.

Data storage format

Delta Lake is an open format built on Apache Parquet. It enables ACID (atomicity, consistency, isolation, durability) transactions for big data workloads. In Delta Lake, all changes made to the data are captured in a serialized transaction log, protecting the data's integrity and reliability and providing full, accurate audit trails. The transaction log provides a master record of every change made to the data, which makes it possible to recreate the exact state of a dataset at any point in time. Data versioning makes data analyses and experiments completely reproducible. The quality and consistency of the data are protected with robust schema enforcement, ensuring that data types are correct and complete and preventing bad data from corrupting critical processes. The format also supports data manipulation language (DML) operations, including merge, update, and delete commands, for compliance and for complex use cases such as streaming upserts, change data capture, and slowly-changing-dimension (SCD) operations.
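A minimal PySpark sketch of these capabilities, assuming the open-source delta-spark package is available on the cluster; the bucket path, table layout, and column names are illustrative only and not taken from the Atlas pipelines:

```python
# Illustrative Delta Lake usage: ACID writes, MERGE (upsert), time travel, history.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

path = "gs://example-bucket/silver/customers"  # hypothetical GCS location

# Initial load -- every write is an ACID transaction recorded in the log.
spark.createDataFrame(
    [(1, "alice", "DE"), (2, "bob", "FR")], ["id", "name", "country"]
).write.format("delta").mode("overwrite").save(path)

# DML: upsert (MERGE) new records -- the basis for CDC / SCD handling.
updates = spark.createDataFrame(
    [(2, "bob", "IT"), (3, "carol", "ES")], ["id", "name", "country"])
target = DeltaTable.forPath(spark, path)
(target.alias("t")
 .merge(updates.alias("u"), "t.id = u.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Time travel: recreate the exact state of the table before the merge.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Audit trail: the transaction log exposes every change as table history.
target.history().select("version", "timestamp", "operation").show()
```

The `history()` call at the end is what backs the auditability requirement: every commit is listed with its version, timestamp, and operation.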

A big advantage of Delta Lake is that it supports the ‘read_as_stream’ method. This allows the pipeline to switch into streaming mode if the source update frequency increases, without any code modifications.
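As a sketch of this batch-to-streaming switch (the paths and transformation below are assumptions; ‘read_as_stream’ corresponds to Spark's readStream API when reading the Delta format):

```python
# Same transformation code serves both batch and streaming reads of a Delta table.
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("delta-stream-demo").getOrCreate()
source = "gs://example-bucket/bronze/events"  # hypothetical source table

def transform(df: DataFrame) -> DataFrame:
    # Identical business logic for both modes.
    return df.filter("event_type = 'order'")

# Batch mode: process the table as it is right now.
batch_df = transform(spark.read.format("delta").load(source))

# Streaming mode: process new commits as they land, no change to transform().
stream_df = transform(spark.readStream.format("delta").load(source))
query = (stream_df.writeStream
         .format("delta")
         .option("checkpointLocation", "gs://example-bucket/_chk/orders")
         .start("gs://example-bucket/silver/orders"))
```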

Data storage architecture

Data is stored according to a data design pattern called medallion architecture. This pattern logically organizes data into layers, improving the structure and quality of the data as it flows through each layer of the architecture.

There are 3 layers for the data:

  • Bronze - Raw Ingestion
  • Silver - Filtered, Cleaned, Augmented
  • Gold - Business-level data


This approach allows the flow of data through the pipeline to be monitored at every major stage, as sketched below.
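The following sketch illustrates the bronze → silver → gold flow; the layer paths, schemas, and cleaning/aggregation rules are assumptions for illustration, not the actual Atlas pipelines:

```python
# Illustrative medallion pipeline: raw ingestion -> cleaned records -> business aggregates.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-demo").getOrCreate()
base = "gs://example-bucket"  # hypothetical lakehouse bucket

# Bronze: raw ingestion, stored as-is with ingestion metadata.
raw = (spark.read.json(f"{base}/landing/orders/")
       .withColumn("_ingested_at", F.current_timestamp()))
raw.write.format("delta").mode("append").save(f"{base}/bronze/orders")

# Silver: filtered, cleaned and augmented records.
silver = (spark.read.format("delta").load(f"{base}/bronze/orders")
          .dropDuplicates(["order_id"])
          .filter(F.col("amount") > 0))
silver.write.format("delta").mode("overwrite").save(f"{base}/silver/orders")

# Gold: business-level aggregates consumed by downstream products.
gold = (silver.groupBy("country")
        .agg(F.sum("amount").alias("revenue"), F.count("*").alias("orders")))
gold.write.format("delta").mode("overwrite").save(f"{base}/gold/revenue_by_country")
```

Because each layer is itself a Delta table, the transaction log of every layer can be inspected to monitor how much data arrived, was filtered out, or was aggregated at each stage.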

Consequences

Pros:

  • The same code to process both stream and batch workflows
  • No need for external data management tools; the data storage API is provided out of the box by the Delta file format.
  • Low costs - the only costs are generated by GCP bucket storage and the Spark cluster.
  • Flexibility regarding input file format
  • Flexibility regarding input update frequency (updates as frequent as every 1 ms are supported)
  • Built-in support for ACID transactions
  • Built-in support for data change tracking
  • Update limits are bounded only by GCP bucket file update limits
  • Unlimited partitions for each table
  • Possibility to provide scoped access to any of the tables for third parties. Delta Lake supports Delta Sharing, which integrates with all modern data processing / analysis solutions.
  • The modular approach allows scoped changes.

Cons:

  • The format is native to Databricks - it requires some work to integrate it with Google Dataproc (see the configuration sketch after this list)
  • No native geospatial engine. Apache Sedona is not as mature as the GCP BigQuery geospatial engine or PostGIS.
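As a sketch of what this integration work involves, the open-source Delta runtime can be attached to a plain Spark session (for example on Dataproc) via the configuration below; the package version is an assumption and must match the cluster's Spark and Scala versions:

```python
# Enabling open-source Delta Lake on a non-Databricks Spark cluster (e.g. Dataproc).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-on-dataproc")
    # Pull the open-source delta-spark artifact instead of relying on the Databricks runtime.
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.1.0")
    # Register Delta's SQL extension and catalog implementation.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```

On Dataproc, the same three properties can alternatively be supplied as cluster or job properties instead of in code.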

Alternatives Considered

  1. BigQuery-based data warehouse / data lake

Pros:

  • Cloud native solution
  • Simple integration with other GCP tools
  • Built-in geospatial engine

Cons:

  • Limited data source file format support (for example, JSON must be converted to NDJSON)
  • Transformations are applied via SQL, which is hard to maintain over the product lifecycle
  • Limits and quotas
  • Table partitioning limits
  2. Data Lake solution with Parquet files on GCS

Pros:

  • Every file format supported by spark is supported by this pattern
  • Unlimited partitions / file operations

Cons:

  • Parquet files do not provide ACID transactions
  • Parquet files do not support concurrent writing (the _SUCCESS flag blocks concurrency)
  • Requires additional layers for data change history tracking