Andi Lamprecht · Accepted
ADR-0127 · Author: Sybil Melton · Date: 2025-02-07 · Products: platform
Originally ADR-0054 AIRCHITECT_ATLAS_ARICHITECT_1_RealTimePipeline (v3) · Source on Confluence ↗

Atlas Airchitect Real-Time Pipeline

Context

The Atlas ETL architecture is based on data-processing pipelines centered around a data lake.
The pipelines are orchestrated via Cloud Composer (Apache Airflow); data is stored on GCS as Delta tables and processed by Dataproc (Apache Spark). This approach provides reliable and scalable processing for huge datasets; however, each element of this stack adds latency to the system measured in seconds.

Current data lake processing flow

[Diagram: current data lake processing flow]

Once the data lands in the data lake, it can no longer be called real-time. This setup cannot achieve sub-second processing of a single data row (e.g. injecting an NFZ).

Decision

[Diagram: proposed real-time streaming flow]

Instead of processing a dataset as a batch, it can be treated as a stream of messages. Each message can then be processed independently by a streaming ETL and propagated to the App layer and the data lake once processing is done. With this approach the data does not lose its real-time properties, because the data lake interaction is not on the processing path but happens in parallel with the Application Layer interaction.
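A minimal sketch of this fan-out pattern (all names here are hypothetical; in the real pipeline the two writes would be Pub/Sub publishes and Delta table appends):

```python
from concurrent.futures import ThreadPoolExecutor


def transform(message: dict) -> dict:
    """Streaming ETL step: enrich/validate a single row (placeholder logic)."""
    return {**message, "processed": True}


def publish_to_app_layer(row: dict) -> str:
    # In production this would publish to the App-layer Pub/Sub topic.
    return f"app:{row['id']}"


def write_to_data_lake(row: dict) -> str:
    # In production this would append the row to a Delta table on GCS.
    return f"lake:{row['id']}"


def handle_message(message: dict) -> dict:
    """Process one message, then propagate it to both destinations in parallel.

    The data-lake write runs alongside the App-layer publish, so it is
    off the real-time critical path.
    """
    row = transform(message)
    with ThreadPoolExecutor(max_workers=2) as pool:
        app_future = pool.submit(publish_to_app_layer, row)
        lake_future = pool.submit(write_to_data_lake, row)
        return {"app": app_future.result(), "lake": lake_future.result()}
```

The key property is that `handle_message` returns as soon as the App-layer publish completes; the lake write does not serialize ahead of it.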

An example real-time flow for the Airspaces pipelines is shown below:

[Diagram: Airspaces real-time pipeline flow]

Consequences

  • The interaction with the data lake (e.g. writes/reads from GCS) is no longer a bottleneck for data processing
  • Third-party tool latency (e.g. Cloud Composer delay) is no longer a bottleneck for data processing
  • The Data Pipeline is decoupled from the Data Lake
  • Keeping the data pipeline closely coupled to the data lake guaranteed its consistency; this approach requires a new mechanism to satisfy that (e.g. Pub/Sub replays)
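One way to make Pub/Sub replays safe for the consistency concern above is to deduplicate on message ID so that replayed or redelivered messages are applied at most once. A sketch under that assumption (a real implementation would use the Delta table's upsert/MERGE support rather than in-memory state):

```python
class IdempotentLakeWriter:
    """Deduplicates lake writes by Pub/Sub message ID.

    Replaying a subscription (e.g. a seek to an earlier timestamp)
    redelivers messages; skipping already-seen IDs keeps the lake
    consistent without exactly-once delivery guarantees.
    """

    def __init__(self) -> None:
        self._seen_ids: set[str] = set()
        self.rows: list[dict] = []

    def write(self, message_id: str, row: dict) -> bool:
        """Apply the row once; return False if this ID was already applied."""
        if message_id in self._seen_ids:
            return False  # replayed/redelivered message, already written
        self._seen_ids.add(message_id)
        self.rows.append(row)
        return True
```

With this in place, a replay of the whole topic is a no-op for rows that already landed in the lake.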

Alternatives Considered

Streaming Pipelines within Data Lake

Spark is capable of running a streaming pipeline within the data lake; however, this approach has some negative side effects:

  • Google does not support Pub/Sub integration with Spark (only Pub/Sub Lite)
  • Spark is a distributed processing engine designed for large datasets; to achieve this it carries processing overhead (e.g. worker management), which adds unnecessary latency for smaller datasets
  • Complex pipelines could still require interaction with the Data Lake, which adds unnecessary latency

