Data Pipeline Atlas 9: End-to-End Tests
Originally ADR-0069 DATA_PIPELINE_ATLAS-9-EndToEndTests (v3) · Source on Confluence
Context
As an Atlas engineer, I need end-to-end tests for my data pipelines to provide demonstrable proof that our system operates as designed.
Data pipeline testing is evolving quickly, but there is still a large gap to bridge with software engineering best practices.
The prevailing opinion among data engineers is that E2E testing is not worth the effort
(mainly because setting up the system and preparing the test data takes a lot of time), so they rely on data quality checks and unit tests instead.
Data pipelines are hard to test: you always need a fully deployed system, a data set prepared in advance, and mocks of external services.
They also differ from typical software applications, and the common E2E frameworks don't really fit the data use case.
Despite the prevailing approach in the data community, Atlas wants to break with this attitude and follow software engineering best practices.
Decision
Atlas E2E tests overview - Miro board
The general idea is to run the current data pipelines with prepared test data and check if the output is as expected.
Atlas data pipelines simple overview:
[Diagram unavailable; see the Miro board]
Main assumptions (for more, see the Miro board):
- Store code in a separate GitHub repository
- Dedicated Cloud Composer instance + dedicated GCS bucket (decouple test pipelines from data pipelines)
- Orchestrate and trigger scenarios via an Airflow DAG (pros: scheduling, observability and exception handling). Run jobs in K8s pods.
- Test the staging environment only
- Daily schedule + trigger after each deploy + manual trigger
- Generate JSON reports + Slack alerts
Main blocks of the system:
- Data pipeline (add a feature flag to the data DAGs that allows skipping unwanted tasks in the E2E tests, e.g. downloading the FAA data or archiving data)
- E2E test DAG: implements and runs the scenario and generates reports; provides scheduling and observability
- Test data generator
- Jira test management / E2E framework (Zephyr, Cucumber, Xray)
- Slack reporting
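The feature flag mentioned in the first block can be sketched in plain Python (Airflow omitted for brevity). All task names and the `e2e_mode` flag below are illustrative assumptions, not the actual Atlas task IDs:

```python
# Sketch of the feature-flag approach: when the pipeline runs under the
# E2E test DAG, tasks that should not run (external downloads, archiving)
# are filtered out before the DAG is built. Names are hypothetical.

ALL_TASKS = ["download_faa_data", "transform", "validate", "archive_data"]

# Tasks that make no sense in an E2E run (external I/O, archiving).
E2E_SKIPPED_TASKS = {"download_faa_data", "archive_data"}

def select_tasks(e2e_mode: bool) -> list[str]:
    """Return the task list for a normal run or an E2E test run."""
    if not e2e_mode:
        return list(ALL_TASKS)
    return [t for t in ALL_TASKS if t not in E2E_SKIPPED_TASKS]

print(select_tasks(e2e_mode=True))   # → ['transform', 'validate']
```

In an actual Airflow DAG the flag would typically arrive as a DAG param or an Airflow Variable, with the same filtering applied at DAG-definition time.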
E2E test flowchart:
[Diagram unavailable; see the Miro board]
Testing levels in scope:
- Data transformations
- Job failures
- Slack reports
- SLA (service level agreement)
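At the data-transformations level, the check boils down to comparing the pipeline output against a pre-computed expected ("golden") data set. A minimal sketch, assuming records keyed by an id (the record shape is invented for illustration):

```python
# Compare actual pipeline output rows against the expected ("golden") rows,
# keyed by a record id, and report mismatches. Record shape is illustrative.

def diff_datasets(expected: list[dict], actual: list[dict], key: str = "id"):
    """Return (missing_keys, unexpected_keys, changed_keys)."""
    exp = {row[key]: row for row in expected}
    act = {row[key]: row for row in actual}
    missing = sorted(exp.keys() - act.keys())
    unexpected = sorted(act.keys() - exp.keys())
    changed = sorted(k for k in exp.keys() & act.keys() if exp[k] != act[k])
    return missing, unexpected, changed

expected = [{"id": 1, "alt_ft": 400}, {"id": 2, "alt_ft": 250}]
actual = [{"id": 1, "alt_ft": 400}, {"id": 2, "alt_ft": 260}, {"id": 3, "alt_ft": 100}]
print(diff_datasets(expected, actual))  # → ([], [3], [2])
```

An E2E scenario passes when all three lists are empty; otherwise the diff feeds directly into the JSON report.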
Due to limited resources, Atlas will implement the E2E system in iterations:
First iteration - Testing level: Data transformations
- Adjust the current data pipelines - add a feature flag for skipping unwanted tasks
- Implement the E2E test DAG
- Implement the test data generator
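The test data generator can start as a deterministic script that emits a small, known input set, so the expected pipeline outputs can be prepared once by hand. A sketch under assumed field names (not the real Atlas schema):

```python
import json
import random

def generate_test_records(n: int, seed: int = 42) -> list[dict]:
    """Generate n deterministic fake flight records for an E2E run.

    A fixed seed keeps runs reproducible, so the expected pipeline output
    can be computed once and reused across daily scheduled runs.
    """
    rng = random.Random(seed)
    return [
        {
            "id": i,
            "callsign": f"TEST{i:04d}",   # hypothetical field names
            "alt_ft": rng.randrange(0, 4000, 50),
        }
        for i in range(n)
    ]

records = generate_test_records(5)
print(json.dumps(records[0]))
```

Determinism matters more than realism here: the same seed must yield the same input on every run, or the golden outputs drift.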
Second iteration
- Jira management
- Slack reporting
- JSON reports
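The JSON report and the Slack alert can share one summary structure. A stdlib-only sketch; the report fields and the one-line message format are assumptions, not an agreed contract (posting to Slack would be a single HTTP POST to an incoming-webhook URL, omitted here):

```python
import json

def build_report(scenario: str, results: dict[str, bool]) -> dict:
    """Summarise an E2E run as a JSON-serialisable report."""
    failed = sorted(name for name, ok in results.items() if not ok)
    return {
        "scenario": scenario,
        "total": len(results),
        "failed": failed,
        "status": "PASSED" if not failed else "FAILED",
    }

def slack_text(report: dict) -> str:
    """Render the report as a one-line Slack alert message."""
    return (f"E2E `{report['scenario']}`: {report['status']} "
            f"({len(report['failed'])}/{report['total']} checks failed)")

report = build_report("data_transformations", {"row_count": True, "schema": False})
print(json.dumps(report, indent=2))
print(slack_text(report))
```

Keeping the report as the single source of truth means the Slack message, the JSON artifact in GCS, and any Jira sync all render the same data.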
Next iterations - add further testing levels
Next iterations - include Pub/Sub and Themis in the tests so the full UTM path is covered by the E2E tests
Consequences
- Atlas will need to invest a significant amount of time to build the E2E test system and the tests themselves.
Alternatives Considered
- Don’t implement the E2E tests; rely on data quality checks, unit tests and integration tests instead