Release Strategy - QA - Level 4 - Data Quality Tests
Originally ADR-0051 QA_ATLAS-ReleaseStrategy-QA-LVL4-DataQualityTests (v7)
Context
Data quality tests play a pivotal role in today’s data-driven world, serving as the guardians of data integrity and reliability. These tests encompass a wide array of assessments, each focused on ensuring that data is accurate, consistent, complete, and trustworthy. They are applied to the dataset as a whole, testing each row of data to ensure the highest overall quality.
These tests report on dataset properties such as:
Data Consistency
These checks assess whether all required data elements are present in a dataset. They verify whether there are any missing values or empty fields and ensure that the data is uniform and follows the expected format or structure throughout the dataset.
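A minimal sketch of such a check, assuming a pandas DataFrame and hypothetical column names (`customer_id`, `email`, `created_at`) that are not part of any actual Atlas dataset:

```python
import pandas as pd

def check_completeness_and_consistency(df: pd.DataFrame) -> dict:
    """Report missing values and rows whose format deviates from the expected pattern."""
    report = {}

    # Completeness: count missing values per required column (hypothetical columns).
    required_columns = ["customer_id", "email", "created_at"]
    report["missing_values"] = {col: int(df[col].isna().sum()) for col in required_columns}

    # Consistency: verify that every email follows one uniform format.
    valid_email = df["email"].astype(str).str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
    report["malformed_emails"] = int((~valid_email).sum())

    return report
```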
Validity Tests
Validity tests assess whether data conforms to predefined constraints.
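For illustration only, a hedged sketch of a constraint check; the columns and allowed values below are assumptions, not taken from an Atlas data contract:

```python
import pandas as pd

def check_validity(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows that violate predefined constraints."""
    # Hypothetical constraints: age must lie in a plausible range and
    # status must come from a fixed set of allowed values.
    allowed_status = {"active", "inactive", "pending"}
    violations = df[~df["age"].between(0, 120) | ~df["status"].isin(allowed_status)]
    return violations
```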
Data Profiling
Data profiling involves generating statistical summaries, histograms, and distribution analyses to understand the characteristics of the data. It can reveal anomalies and outliers.
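A minimal profiling sketch using pandas; the `amount` column is a placeholder:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> None:
    """Print basic statistical summaries that help reveal anomalies and outliers."""
    # Per-column summary: counts, means, standard deviations, min/max, quartiles.
    print(df.describe(include="all"))

    # Rough distribution view (histogram-like binning) for a hypothetical numeric column.
    print(df["amount"].value_counts(bins=10).sort_index())
```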
Data Anomaly Detection
Advanced data quality tests may involve machine learning algorithms to detect anomalies and outliers that human-driven tests may miss.
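One possible sketch of such a test; scikit-learn's IsolationForest is an assumption here, as the source does not prescribe a specific algorithm:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def flag_anomalies(df: pd.DataFrame, numeric_columns: list[str]) -> pd.DataFrame:
    """Flag rows whose combination of numeric features looks anomalous."""
    model = IsolationForest(contamination=0.01, random_state=42)
    # fit_predict returns -1 for anomalous rows and 1 for normal ones.
    labels = model.fit_predict(df[numeric_columns])
    return df[labels == -1]
```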
Application
Atlas performs data quality tests at two levels:
- Incoming data
- Output data
Incoming data
Atlas lacks control over the consistency and validity of the data incoming from external sources. At this stage, data quality tests can be used to continuously monitor whether the incoming data still matches the assumptions made about it during the system’s design.
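A hedged sketch of how such monitoring could look at ingestion time; the column names and the design assumptions encoded below are illustrative, not the actual Atlas checks:

```python
import logging
import pandas as pd

logger = logging.getLogger("atlas.ingestion.quality")

def monitor_incoming(df: pd.DataFrame) -> pd.DataFrame:
    """Check a freshly ingested batch against design-time assumptions and log deviations."""
    # Assumed design-time expectation: every record carries a non-null identifier.
    missing_ids = int(df["record_id"].isna().sum())
    if missing_ids:
        logger.warning("%d incoming rows are missing record_id", missing_ids)

    # Assumed design-time expectation: timestamps are parseable date strings.
    unparseable = int(pd.to_datetime(df["event_time"], errors="coerce").isna().sum())
    if unparseable:
        logger.warning("%d incoming rows have unparseable event_time values", unparseable)

    # The batch is passed on either way; the purpose here is monitoring, not gating.
    return df
```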
Output data
Checks at this stage serve as proof for Atlas consumers that the data is of high quality and that its format is aligned with the one specified in the data contract.
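As an illustrative sketch, such a check could compare the produced schema with the agreed one; the contract excerpt below is hypothetical:

```python
import pandas as pd

# Hypothetical excerpt of a data contract: expected columns and dtypes.
DATA_CONTRACT = {
    "order_id": "int64",
    "customer_id": "int64",
    "total_amount": "float64",
    "currency": "object",
}

def check_output_contract(df: pd.DataFrame) -> list[str]:
    """Return the deviations between the output dataset and the data contract."""
    problems = []
    for column, expected_dtype in DATA_CONTRACT.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != expected_dtype:
            problems.append(f"{column}: expected {expected_dtype}, got {df[column].dtype}")
    return problems
```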
Main goals of this test level
Data quality tests aim to ensure the accuracy and reliability of data by identifying and rectifying errors, inconsistencies, and anomalies within the dataset. These tests serve a dual purpose: they validate both the incoming data from external sources and the output of our data pipeline against the agreed data contracts. Ultimately, the objective is to furnish stakeholders with dependable, high-quality data they can trust for their needs.
This test level answers the following questions:
Is the data incoming from external sources aligned with our expectations for its format and quality?
Is our system designed to support the data incoming from external sources?
Is the system generating data in compliance with the agreed-upon data contract?