Atlas Airchitect Real-Time Pipeline
ADR-0054 AIRCHITECT_ATLAS_ARICHITECT_1_RealTimePipeline (v3)
Context
The Atlas ETL architecture is based on data processing pipelines centered around a data lake.
The pipelines are orchestrated by Cloud Composer (Apache Airflow); data is stored on GCS as Delta tables and processed by Dataproc (Apache Spark). This approach provides reliable and scalable processing for huge datasets, but each element of the stack adds latency measured in seconds.
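For illustration, a minimal sketch of this batch flow (project, cluster, and bucket names are hypothetical): a Composer-managed DAG submits a PySpark job to Dataproc, and each hop (the schedule tick, job submission, and GCS I/O) contributes its own latency.

```python
# A sketch of the current batch flow, assuming hypothetical project,
# cluster, and bucket names: Cloud Composer (Airflow) submits a PySpark
# job to Dataproc that processes Delta tables on GCS.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocSubmitJobOperator,
)

SPARK_JOB = {
    "reference": {"project_id": "atlas-project"},        # hypothetical
    "placement": {"cluster_name": "atlas-etl-cluster"},  # hypothetical
    "pyspark_job": {
        "main_python_file_uri": "gs://atlas-etl/jobs/airspaces_batch.py",  # hypothetical
    },
}

with DAG(
    dag_id="airspaces_batch_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",  # latency is bounded below by the schedule
    catchup=False,
) as dag:
    DataprocSubmitJobOperator(
        task_id="process_airspaces",
        project_id="atlas-project",  # hypothetical
        region="europe-west1",       # hypothetical
        job=SPARK_JOB,
    )
```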
Current data lake processing flow
(Diagram unavailable: current data lake processing flow)
Once the data lands in the data lake, it can no longer be called real-time. This setup cannot achieve sub-second processing of a single data row (e.g. injecting an NFZ).
Decision
Instead of processing a dataset as a batch, it can be treated as a stream of messages. Each message is then processed independently by a streaming ETL and propagated to the application layer and the data lake once processing is done. With this approach the data keeps its real-time properties: the data lake write does not sit on the data processing path, but runs in parallel with the application-layer interaction.
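A minimal sketch of this pattern, assuming hypothetical topic and subscription names: each message pulled from Pub/Sub is transformed independently, then published to the application-layer and data-lake sinks in parallel, so neither write blocks the other.

```python
# A sketch of the streaming ETL (topic and subscription names are
# hypothetical). Each message is processed independently; the result
# is fanned out to the app layer and the data lake in parallel.
import json

from google.cloud import pubsub_v1

PROJECT = "atlas-project"  # hypothetical
publisher = pubsub_v1.PublisherClient()
app_topic = publisher.topic_path(PROJECT, "airspaces-app-layer")   # hypothetical
lake_topic = publisher.topic_path(PROJECT, "airspaces-lake-sink")  # hypothetical


def transform(record: dict) -> dict:
    """Placeholder for the per-message ETL logic (validation, enrichment)."""
    record["processed"] = True
    return record


def on_message(message: pubsub_v1.subscriber.message.Message) -> None:
    payload = json.dumps(transform(json.loads(message.data))).encode()
    # Both publishes are in flight before either result is awaited,
    # so the data lake write is not on the real-time critical path.
    app_future = publisher.publish(app_topic, payload)
    lake_future = publisher.publish(lake_topic, payload)
    app_future.result()
    lake_future.result()
    message.ack()


subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path(PROJECT, "airspaces-input")  # hypothetical
streaming_pull = subscriber.subscribe(subscription, callback=on_message)
streaming_pull.result()  # block the main thread and keep consuming
```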
An example real-time flow for the Airspaces pipelines looks as below:
(Diagram unavailable: example real-time flow for the Airspaces pipelines)
Consequences
- Interactions with the data lake (e.g. reads and writes on GCS) are no longer a bottleneck for data processing
- Third-party tool latency (e.g. Cloud Composer scheduling delay) is no longer a bottleneck for data processing
- The data pipeline is decoupled from the data lake
- Keeping the pipeline closely coupled to the data lake guaranteed its consistency; this approach requires a new mechanism to provide the same guarantee, e.g. Pub/Sub replays, as sketched after this list
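One such mechanism, sketched below with hypothetical names, is a Pub/Sub replay: seeking a subscription back to a timestamp makes Pub/Sub redeliver retained messages, so the data lake sink can be rebuilt from the stream. This only works if the subscription retains acknowledged messages.

```python
# A sketch of a Pub/Sub replay as a consistency mechanism (names are
# hypothetical). Requires a subscription created with
# retain_acked_messages enabled.
import datetime

from google.cloud import pubsub_v1
from google.protobuf import timestamp_pb2

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path(
    "atlas-project", "airspaces-lake-sink-sub"  # hypothetical
)

# Replay everything published in the last hour.
replay_from = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=1)
timestamp = timestamp_pb2.Timestamp()
timestamp.FromDatetime(replay_from)

subscriber.seek(request={"subscription": subscription, "time": timestamp})
```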
Alternatives Considered
Streaming Pipelines within Data Lake
Spark can run a streaming pipeline within the data lake, but this approach has several negative side effects:
- Google does not provide a supported Pub/Sub integration for Spark (only for Pub/Sub Lite)
- Spark is a distributed processing engine designed for large datasets; achieving that scale carries processing overhead (e.g. worker management), which adds unnecessary latency for smaller datasets
- Complex pipelines could still require interaction with the data lake, which adds unnecessary latency (see the sketch below)
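For context, a sketch of this rejected alternative (resource paths are hypothetical): Spark Structured Streaming reads from Pub/Sub Lite, the only Pub/Sub variant with a Google-provided Spark connector, and writes a Delta table on GCS, keeping every hop inside the data lake stack.

```python
# A sketch of the rejected alternative: Spark Structured Streaming
# reading from Pub/Sub Lite via Google's pubsublite-spark connector
# and writing a Delta table on GCS. All paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("airspaces-streaming-in-lake").getOrCreate()

stream = (
    spark.readStream.format("pubsublite")
    .option(
        "pubsublite.subscription",
        "projects/atlas-project/locations/europe-west1-a/subscriptions/airspaces-input",  # hypothetical
    )
    .load()
)

# Every hop below rides on Spark worker scheduling and GCS I/O,
# which is where the extra latency comes from.
query = (
    stream.writeStream.format("delta")
    .option("checkpointLocation", "gs://atlas-etl/checkpoints/airspaces")  # hypothetical
    .start("gs://atlas-etl/delta/airspaces")  # hypothetical
)
query.awaitTermination()
```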