Subscription Details
Originally ADR-0073-SUBSCRIPTION-DETAILS (v4) · UTM Airspace Subscription (source on Confluence)
Context
We have a complex pipeline consisting of multiple services designed to handle UTM airspace subscription updates. The pipeline needs to be reliable, performant, and maintainable.
Decision Drivers
- Data integrity
- Performance
- Scalability
- Fault tolerance
Decision
Continue with the current design while addressing the challenges and questions listed for each component below.
Components
1. Atlas Pipeline
Responsibilities
- Source of airspace data
- Runs at semi-random intervals
- Publishes creates, updates, and deletes to pub/sub
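The sketch below shows the kind of message the Atlas pipeline could publish for each create, update, or delete. It assumes a JSON body plus string attributes. The `Op` enum, `AirspaceMessage`, `build_message`, and the attribute names `op` and `schema_version` are hypothetical, because this ADR does not specify the wire format.

```python
import json
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Op(str, Enum):
    """Hypothetical operation types published by the Atlas pipeline."""
    CREATE = "create"
    UPDATE = "update"
    DELETE = "delete"


@dataclass
class AirspaceMessage:
    """Hypothetical envelope: a body plus string attributes."""
    data: bytes
    attributes: dict = field(default_factory=dict)


def build_message(op: Op, airspace_id: str, payload: Optional[dict],
                  schema_version: int = 2) -> AirspaceMessage:
    # Deletes carry no payload; creates and updates must carry the airspace record.
    if op is Op.DELETE:
        payload = None
    elif payload is None:
        raise ValueError(f"{op.value} requires a payload")
    body = json.dumps({"airspace_id": airspace_id, "payload": payload}, sort_keys=True)
    return AirspaceMessage(
        data=body.encode("utf-8"),
        # Operation and schema version travel as attributes, so a consumer can
        # pick a parser before decoding the body.
        attributes={"op": op.value, "schema_version": str(schema_version)},
    )
```

Putting the schema version in an attribute lets the consumer route each message to the right parser. This supports the message versioning described under the Charger's challenges.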
2. Charger
Responsibilities
Receives authenticated push updates from pub/sub
Creates, updates, and deletes data in the PostGIS airspace table
- The airspace table is used for fast queries
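This is a sketch of how the Charger could apply creates, updates, and deletes idempotently. It assumes pub/sub delivers at least once and possibly out of order, which is common for pub/sub systems but not stated in this ADR. SQLite's upsert stands in for the PostGIS table so the example runs anywhere. The table layout, `apply_change`, and the `updated_at` version column are hypothetical.

```python
import sqlite3

# SQLite stands in for PostGIS; geometry is stored as plain text for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS airspace (
    id TEXT PRIMARY KEY,
    geometry TEXT NOT NULL,
    updated_at INTEGER NOT NULL
)
"""


def apply_change(conn, op, airspace_id, geometry=None, updated_at=0):
    """Apply a hypothetical create/update/delete; stale or repeated updates are no-ops."""
    if op == "delete":
        conn.execute("DELETE FROM airspace WHERE id = ?", (airspace_id,))
    elif op in ("create", "update"):
        # Upsert: an existing row is overwritten only by a strictly newer version.
        conn.execute(
            """
            INSERT INTO airspace (id, geometry, updated_at) VALUES (?, ?, ?)
            ON CONFLICT(id) DO UPDATE SET
                geometry = excluded.geometry,
                updated_at = excluded.updated_at
            WHERE excluded.updated_at > airspace.updated_at
            """,
            (airspace_id, geometry, updated_at),
        )
    else:
        raise ValueError(f"unknown op: {op}")
    conn.commit()
```

PostgreSQL supports the same `INSERT ... ON CONFLICT ... DO UPDATE ... WHERE` form. Deletes are not guarded in this sketch, so a late, redelivered create could resurrect a deleted airspace. A tombstone row with its own timestamp would prevent that.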
Questions/Challenges
Pub/sub message format changes.
- Example: in v1, elevation was sent as a single value; v2 changed it to a map that includes unit and ref in addition to value
- We’ve implemented message versioning to handle schema changes
- All versions are currently handled; in the future we will support only the last 2-3 versions
- Versioning decouples the deployments of the producer and consumer services
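One way to handle the elevation change is to normalize every supported message version into a single internal type. This is a sketch only. The `version` and `elevation` keys, the `Elevation` type, and `parse_elevation` are hypothetical. Only the v1 single value and the v2 map of `value`, `unit`, and `ref` come from this ADR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Elevation:
    """Hypothetical internal representation shared by all message versions."""
    value: float
    unit: Optional[str] = None  # v1 messages carry no unit
    ref: Optional[str] = None   # v1 messages carry no reference


def parse_elevation(message: dict) -> Elevation:
    """Normalize the elevation field of a v1 or v2 message."""
    version = int(message.get("version", 1))
    raw = message["elevation"]
    if version == 1:
        # v1: elevation is a bare number.
        return Elevation(value=float(raw))
    if version == 2:
        # v2: elevation is a map with value, unit, and ref.
        return Elevation(value=float(raw["value"]), unit=raw.get("unit"), ref=raw.get("ref"))
    # Unknown versions fail loudly instead of being misparsed.
    raise ValueError(f"unsupported message version: {version}")
```

With this structure, limiting support to the last 2-3 versions means deleting the oldest branch.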
Some messages have contained invalid or unparsable data, i.e., invalid or problematic geometries
Validation is performed at the HTTP endpoint that receives pub/sub push messages. Any invalid message receives an appropriate error response and is observable via Honeycomb
Data quality pipelines run periodically to check our datasets at each layer of our medallion data architecture
- Great Expectations is used for monitoring and alerting (via Slack)
Additional validations have been added as invalid messages are monitored and logged
A thread-unsafe parsing library was found to be one cause of the remaining invalid geometries
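Below is a sketch of endpoint-side geometry validation that maps each failure to an HTTP status. It assumes a GeoJSON-style ring of `[lon, lat]` pairs and uses HTTP 400 as the "appropriate error response". The lock shows one way to serialize calls into a parser that is not thread-safe. `_parse_ring` is a hypothetical stand-in for that library, and the ADR does not say which remedy was chosen.

```python
import threading
from typing import List, Tuple

# Hypothetical: serializes access to a geometry parser that is not thread-safe.
_parser_lock = threading.Lock()


def _parse_ring(coords) -> List[Tuple[float, float]]:
    """Hypothetical stand-in for the thread-unsafe parsing library (2-D positions only)."""
    return [(float(lon), float(lat)) for lon, lat in coords]


def validate_polygon(coords) -> Tuple[int, str]:
    """Return an (HTTP status, reason) pair for one GeoJSON-style linear ring."""
    try:
        # Only one request thread may be inside the parser at a time.
        with _parser_lock:
            ring = _parse_ring(coords)
    except (TypeError, ValueError):
        return 400, "unparsable coordinates"
    if len(ring) < 4:
        return 400, "ring needs at least 4 positions"
    if ring[0] != ring[-1]:
        return 400, "ring is not closed"
    if any(not (-180 <= lon <= 180 and -90 <= lat <= 90) for lon, lat in ring):
        return 400, "coordinate out of range"
    return 200, "ok"
```

A lock is the smallest change, but it serializes all parsing. A per-thread parser instance (`threading.local`) or a thread-safe library would avoid that contention.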
Reconciling all airspace data between pub/sub and PostGIS
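Reconciliation can be framed as a set comparison between snapshots. This sketch assumes that both the upstream data (for example, an Atlas export) and the PostGIS table can be read as `{airspace_id: version}` maps. `reconcile` and `ReconcileReport` are hypothetical names.

```python
from typing import Dict, NamedTuple, Set


class ReconcileReport(NamedTuple):
    missing: Set[str]  # in the upstream snapshot, absent from PostGIS
    extra: Set[str]    # in PostGIS, absent upstream (e.g. a missed delete)
    stale: Set[str]    # in both, but PostGIS holds an older version


def reconcile(source: Dict[str, int], target: Dict[str, int]) -> ReconcileReport:
    """Compare hypothetical {airspace_id: version} snapshots from upstream and PostGIS."""
    missing = set(source.keys() - target.keys())
    extra = set(target.keys() - source.keys())
    stale = {k for k in source.keys() & target.keys() if target[k] < source[k]}
    return ReconcileReport(missing, extra, stale)
```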
Ensuring complete update propagation
How do we know we are getting all updates?
- Additional observability of the pipelines is needed
- Need for a proof-of-life mechanism or notifications
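The sketch below shows one possible proof-of-life check. It assumes each consumer can record when it last received an update. If nothing arrives within a configured window, the pipeline is flagged as silent. `ProofOfLife` and `max_silence_s` are hypothetical, and the injectable clock exists only to make the example testable.

```python
import time
from typing import Callable, Optional


class ProofOfLife:
    """Hypothetical watchdog: flags the pipeline as silent after max_silence_s without updates."""

    def __init__(self, max_silence_s: float, clock: Callable[[], float] = time.monotonic):
        self.max_silence_s = max_silence_s
        self._clock = clock
        self._last_seen: Optional[float] = None
        self.received = 0

    def record_update(self) -> None:
        self._last_seen = self._clock()
        self.received += 1

    def is_silent(self) -> bool:
        # Never having seen an update also counts as silent: there is no evidence of life.
        if self._last_seen is None:
            return True
        return self._clock() - self._last_seen > self.max_silence_s
```

Atlas runs at semi-random intervals, so silence alone cannot distinguish "no changes" from "broken pipeline". The threshold must exceed the longest expected gap, or the producer would need to emit its own periodic heartbeat.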
3. UTM Sharing (Redis)
Responsibilities
Inform all pods that an airspace has been updated
Currently implemented with Redis
- lightweight
- offers a variety of options (streams, pub/sub, lists)
- no configuration required
Challenges
Stream vs. pub/sub
Redis failure scenarios
- Updates may be skipped
- Clients need to be notified by closing their connections
- Possibly acceptable: the client can reconnect with an offset to retrieve missed updates
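Redis pub/sub is fire-and-forget: a subscriber that is disconnected when a message is published never receives it. Redis Streams retain entries under increasing IDs, so a reader can resume after the last ID it processed (`XREAD` / `XRANGE`). The sketch models only that resume-from-offset behavior in memory, using integer offsets in place of Redis stream IDs. `UpdateLog` is hypothetical and is not a Redis client.

```python
from collections import deque
from typing import Deque, List, Tuple


class UpdateLog:
    """Hypothetical in-memory stand-in for a capped Redis stream."""

    def __init__(self, maxlen: int = 10_000):
        # Each entry is (offset, airspace_id); the deque drops the oldest when full.
        self._entries: Deque[Tuple[int, str]] = deque(maxlen=maxlen)
        self._next = 1

    def append(self, airspace_id: str) -> int:
        offset = self._next
        self._entries.append((offset, airspace_id))
        self._next += 1
        return offset

    def read_after(self, offset: int) -> List[Tuple[int, str]]:
        """Return entries newer than offset; raise if some of them were already trimmed."""
        if self._entries and offset < self._entries[0][0] - 1:
            raise LookupError("offset trimmed from log; client needs a full resync")
        return [e for e in self._entries if e[0] > offset]
```

The trim check matters because a capped stream (for example, `XADD` with `MAXLEN`) eventually discards old entries. A client whose offset has been trimmed cannot be back-filled from the stream and needs a full resync.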
Verifying completeness of received updates
- Periodically query Atlas airspace for recent changes and compare the count with the number of updates received
- Close client connections if a discrepancy is identified
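The count comparison could look like the following sketch. It assumes each received update carries the Atlas change timestamp, so both counts use the same clock. `missing_update_count`, `check_window`, the `close_connections` callback, and `tolerance` are hypothetical.

```python
from typing import Callable, Iterable, Tuple


def missing_update_count(atlas_changed_at: Iterable[float], received_at: Iterable[float],
                         window: Tuple[float, float]) -> int:
    """Return how many more changes Atlas reports in [start, end) than this pod received."""
    start, end = window
    expected = sum(1 for t in atlas_changed_at if start <= t < end)
    got = sum(1 for t in received_at if start <= t < end)
    return max(expected - got, 0)


def check_window(atlas_changed_at, received_at, window: Tuple[float, float],
                 close_connections: Callable[[], None], tolerance: int = 0) -> bool:
    """Close client connections if the shortfall exceeds tolerance; return True if closed."""
    shortfall = missing_update_count(atlas_changed_at, received_at, window)
    if shortfall > tolerance:
        # Closing forces clients to reconnect with an offset and back-fill.
        close_connections()
        return True
    return False
```

A tolerance above zero absorbs updates still in flight at the window edge.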
4. UTM API Subscription API
Responsibilities
gRPC API for internal customers (but also appropriate for public use)
Offers an offset parameter so clients can immediately request airspaces updated since that offset
- Important for back-filling updates missed during lost connections
Continuous update stream
Questions/Challenges
Handling broken streams
- Inform the client by closing the stream
- The client can reconnect with an offset to fill in missing data
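On the client side, handling a broken stream means remembering the last offset that was fully handled and reconnecting from it. In this sketch, `open_stream(offset)` is a hypothetical stand-in for the subscription RPC, and updates are assumed to be `(offset, airspace_id)` pairs. `ConnectionError` stands in for the error the gRPC client actually raises (in Python, `grpc.RpcError`).

```python
from typing import Callable, Iterator, Tuple

# Hypothetical shape of a stream message: (offset, airspace_id).
Update = Tuple[int, str]


def consume_with_resume(open_stream: Callable[[int], Iterator[Update]],
                        handle: Callable[[Update], None],
                        start_offset: int = 0,
                        max_reconnects: int = 5) -> int:
    """Consume updates, reconnecting from the last handled offset when the stream breaks."""
    offset = start_offset
    reconnects = 0
    while True:
        try:
            for update in open_stream(offset):
                handle(update)
                offset = update[0]  # advance only after the update is handled
            return offset  # stream ended cleanly
        except ConnectionError:
            reconnects += 1
            if reconnects > max_reconnects:
                raise
            # Loop again: reopen with the last handled offset so the server back-fills.
```

A production client would also back off between reconnect attempts. The cap on reconnects keeps the sketch from looping forever.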
Need for a client ‘proof of life’
- Include an update count in the heartbeat message?
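If the heartbeat carries an update count, the client can compare it with what it has received. The `Heartbeat` fields (`sent_count`, `last_offset`) and `resume_offset_if_behind` are hypothetical. The check relies on gRPC delivering messages within a single stream in order, so every update sent before a heartbeat should already have arrived when the heartbeat is read.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Heartbeat:
    """Hypothetical heartbeat payload sent by the server on the subscription stream."""
    sent_count: int   # updates the server has sent on this stream so far
    last_offset: int  # offset of the most recent update sent


def resume_offset_if_behind(hb: Heartbeat, received_count: int,
                            last_received_offset: int) -> Optional[int]:
    """Return the offset to reconnect from if updates are missing, else None."""
    # Messages on one stream arrive in order, so anything sent before this
    # heartbeat should already have been received.
    if received_count < hb.sent_count or last_received_offset < hb.last_offset:
        return last_received_offset
    return None
```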
Questions for Review
- Are all challenges sufficiently addressed?
- Do we need a failover for Redis?
- Is there a need for more stringent proof-of-life mechanisms?
- How can observability be improved, particularly in the data pipelines?