ATLAS Release Strategy 7: Data Handling
Originally ADR-0047 ATLAS-ReleaseStrategy-7-DataHandling (v3), published on Confluence as "Release Strategy - Data Handling".
Handling data across three environments - development, staging, and production - involves meticulous care, especially when dealing with two distinct types of data:
- sensitive
- nonsensitive
Nonsensitive data
For nonsensitive data, a streamlined approach is adopted: the same dataset flows through all three environments. This allows solutions to be fine-tuned against real use cases and speeds up the data discovery phase.
This results in the following data structure:
| Environment | Data | Dataset size |
|---|---|---|
| Development | real data | full dataset |
| Staging | real data | full dataset |
| Production | real data | full dataset |
Sensitive data
When it comes to sensitive data, heightened security and privacy measures are implemented. In the development environment, sensitive data undergoes anonymization.
Data anonymization must be applied before ingestion of sensitive data into the development environment.
Anonymization ensures that personally identifiable information is replaced with pseudonymous or placeholder values; the same applies to any location data that cannot be processed directly. In the staging and production environments, the actual sensitive data is processed, but within a fortified security framework that safeguards it against unauthorized access and breaches. This careful data handling strategy keeps sensitive information protected throughout its lifecycle while enabling robust development and testing in a controlled environment.
With this approach, the data solution cannot be fully fine-tuned during the development phase, since developers may be missing some dataset information, but it remains compliant with the regulations.
| Environment | Data | Dataset size |
|---|---|---|
| Development | anonymized data | full dataset |
| Staging | real data | full dataset |
| Production | real data | full dataset |
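The anonymize-before-ingestion rule can be enforced as a simple pipeline guard. A minimal sketch in Python, assuming hypothetical environment names and classification labels that are not part of the ATLAS platform:

```python
# Illustrative policy table: which form of sensitive data each
# environment is allowed to receive. Names are assumptions, not
# actual ATLAS configuration.
SENSITIVE_ENV_POLICY = {
    "development": "anonymized",  # raw sensitive data must never land here
    "staging": "real",
    "production": "real",
}


def check_ingestion(environment: str, data_classification: str) -> None:
    """Reject raw sensitive data before it reaches the dev environment.

    `data_classification` is either "raw" or "anonymized" in this sketch.
    Raises ValueError when the policy would be violated.
    """
    required = SENSITIVE_ENV_POLICY[environment]
    if required == "anonymized" and data_classification != "anonymized":
        raise ValueError(
            f"Sensitive data must be anonymized before ingestion "
            f"into {environment}"
        )
```

A guard like this would typically run as the first step of each ingestion job, so a misrouted raw dataset fails fast instead of silently landing in dev.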
Data anonymization
At its core, data anonymization is a technique employed to safeguard the privacy and confidentiality of sensitive information while still allowing for meaningful analysis and research.
There are multiple data anonymization techniques; below are a few examples:
- Data Masking/Redaction:
- Original Data: John Doe’s social security number: 123-45-6789
- Anonymized Data: John Doe’s social security number: XXX-XX-XXXX
- Generalization:
- Original Data: Exact birthdate (e.g., 1990-05-15)
- Anonymized Data: Birth year (e.g., 1990)
- Original Data: Exact location (e.g., lat: 34.0522 lon: -118.2437)
- Anonymized Data: Geohash that contains this location (e.g., 9q5exr3h)
- Pseudonymization:
- Original Data: Full names (e.g., Sarah Johnson)
- Anonymized Data: Assigning unique pseudonyms (e.g., User1, User2)
- Tokenization:
- Original Data: Credit card number (e.g., 1234-5678-9012-3456)
- Anonymized Data: Replaced with a token or unique identifier (e.g., TOKEN-123456)
- Aggregation:
- Original Data: Individual salaries (e.g., $52,000; $61,000; $58,000)
- Anonymized Data: Average salary per group (e.g., $57,000)
- Data Encryption:
- Encrypting data in such a way that it can only be decrypted with a specific key, protecting the data’s confidentiality.
- Noise Addition:
- Original Data: Precise location coordinates
- Anonymized Data: Adding random noise to coordinates to obfuscate the exact location
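Several of the techniques above can be sketched in a few lines of Python. The function names, salt, and noise scale below are illustrative assumptions, not a production implementation: real tokenization would use a token vault, and encryption a vetted cryptography library.

```python
import hashlib
import random


def mask_ssn(ssn: str) -> str:
    """Masking/redaction: replace every digit with a placeholder."""
    return "XXX-XX-XXXX"


def generalize_birthdate(date_iso: str) -> str:
    """Generalization: keep only the birth year from a YYYY-MM-DD date."""
    return date_iso.split("-")[0]


def generalize_location(lat: float, lon: float, decimals: int = 1) -> tuple:
    """Generalization: round coordinates to a coarse grid cell
    (a geohash prefix achieves the same effect)."""
    return round(lat, decimals), round(lon, decimals)


def pseudonymize(name: str, registry: dict) -> str:
    """Pseudonymization: map each distinct name to a stable pseudonym."""
    if name not in registry:
        registry[name] = f"User{len(registry) + 1}"
    return registry[name]


def tokenize(card_number: str) -> str:
    """Tokenization: replace the value with an opaque token.
    Illustrative only -- a real system keeps the mapping in a token vault."""
    digest = hashlib.sha256(("demo-salt:" + card_number).encode()).hexdigest()
    return "TOKEN-" + digest[:6].upper()


def add_noise(lat: float, lon: float, scale: float = 0.01) -> tuple:
    """Noise addition: jitter coordinates to obfuscate the exact point."""
    return (lat + random.uniform(-scale, scale),
            lon + random.uniform(-scale, scale))
```

Note that the pseudonym registry keeps the mapping stable within a run, so the same person always receives the same pseudonym and cross-record analysis remains possible.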