ADR-0058: Atlas Airchitect Dual Write Problem (v3)
Originally published on Confluence as AIRCHITECT_ATLAS_AIRCHITECT_2_DualWrite.
Context
Atlas Airchitect is a service responsible for publishing boundary changes in real time. A boundary is an airspace created by a user through the Spaces App, in contrast to regular airspaces, which are created by authoritative sources and delivered through the Atlas Airspace Pipeline. Airchitect consumers (e.g. Airspace API) are integrated with it through Pub/Sub.
Since the data is generated by users, our service becomes the main source of truth for it. To enable data validation at the API request level, the service needs a database to maintain the current state. With this approach, request validation is done at request time instead of in a downstream Pub/Sub consumer, so the user receives immediate feedback about the request. The database is also required to maintain the history of user-created boundaries that are no longer active (archived boundaries). The Airchitect API can perform a soft delete in its database and propagate a Pub/Sub delete message to its consumers, who can perform a hard delete in their datastores.
Since each change processed by the Airchitect service needs to be written to both the database and Pub/Sub, the service encounters the dual-write problem: if the service fails between the two writes, the state in the database becomes inconsistent with the state published to Pub/Sub.
Example scenarios:
- Processing Updates
- Without DB
A user requests to patch the record with ID 123, but the record does not exist
The request reaches Airchitect; without the DB it cannot validate the request, so it returns a 201 code and forwards the message to subscribers
Subscribers reject the message since it is invalid; the user is not informed about this
- With DB
A user requests to patch the record with ID 123, but the record does not exist
The request reaches Airchitect, which validates it against the internal DB, discovers that ID 123 does not exist, and returns a 404 error to the user
- Supporting late joiners
Airchitect maintains the full state in the DB, so it can retransmit it to new subscribers
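The "With DB" scenario above can be sketched as follows. This is a minimal illustration, not the real Airchitect API: the table name, handler name, and SQLite stand-in database are all assumptions made for the example.

```python
import sqlite3

def patch_boundary(conn: sqlite3.Connection, boundary_id: str, patch: dict):
    """Hypothetical PATCH handler: validate against the internal DB first."""
    row = conn.execute(
        "SELECT id FROM boundaries WHERE id = ?", (boundary_id,)
    ).fetchone()
    if row is None:
        # Immediate feedback to the user instead of a silent downstream reject.
        return 404, {"error": f"boundary {boundary_id} not found"}
    # ...apply the patch and enqueue the Pub/Sub message here...
    return 200, {"id": boundary_id}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE boundaries (id TEXT PRIMARY KEY)")
status, body = patch_boundary(conn, "123", {"name": "renamed"})
print(status)  # 404: the record does not exist, the user is told right away
```

Without the local state, the same request would have been accepted and rejected only later by a subscriber, with no feedback to the user.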
Decision
The Airchitect service implements the transactional outbox pattern [1]. The data is first written to the database: it lands in the main table as well as in the transactional outbox table. After that, the data is published to Pub/Sub and removed from the outbox table. If the service fails after the database transaction, the message stays in the transactional outbox and can be picked up again by a background process responsible for re-publishing failed messages. This approach guarantees data consistency between the database and the endpoint, as well as an at-least-once delivery model, which matches the Pub/Sub delivery model.
Example implementation:
Endpoint process logic
The endpoint process starts a transaction on the database. It writes the data to the main table, then stores the Pub/Sub message model in the transactional outbox table, along with OTel data and a timestamp.
- If the app fails at this point, the transaction is rolled back and no data is preserved in the database
The endpoint publishes the message
- If the app fails at this point, the message stays in the transactional outbox
The endpoint removes the message from the outbox
- If the app fails at this point, the message stays in the transactional outbox
Garbage collection process logic
- The process queries the transactional outbox table every X seconds and grabs all remaining messages that failed to be published and are older than a minimum age (e.g. 5 seconds)
- The process re-publishes each message to Pub/Sub
- The process removes the message from the outbox
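The garbage-collection sweep can be sketched as follows, under the same assumptions as before: a SQLite stand-in database, a hypothetical `publish()` helper, and a simplified outbox schema.

```python
import sqlite3
import time

def publish(message: str) -> None:
    """Hypothetical stand-in for the Pub/Sub publish call."""
    print(f"re-published: {message}")

def sweep_outbox(conn: sqlite3.Connection, min_age_seconds: float = 5.0) -> int:
    """Re-publish and remove every outbox row older than min_age_seconds."""
    cutoff = time.time() - min_age_seconds
    rows = conn.execute(
        "SELECT id, message FROM outbox WHERE created_at < ?", (cutoff,)
    ).fetchall()
    for outbox_id, message in rows:
        publish(message)  # at-least-once: a crash here means a duplicate later
        with conn:
            conn.execute("DELETE FROM outbox WHERE id = ?", (outbox_id,))
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY, message TEXT, created_at REAL)"
)
# Simulate a message whose publish failed 10 seconds ago.
conn.execute(
    "INSERT INTO outbox (message, created_at) VALUES (?, ?)",
    ('{"op": "create", "id": "123"}', time.time() - 10),
)
conn.commit()
drained = sweep_outbox(conn)
print(drained)  # 1
```

The minimum-age filter keeps the sweeper from racing with an endpoint that is still between its publish and delete steps; the trade-off is that a duplicate publish is still possible, which is why consumers must tolerate at-least-once delivery.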
Alternatives considered
1. Write to Pub/Sub first; after a successful publish, use a separate subscriber to write the data back to the database
In this approach we risk propagating a record that then fails to be written to the database. This is especially important for update requests, which can conflict with uniqueness constraints.
Additionally, Pub/Sub has an at-least-once delivery model with no guarantee of message order, which creates a need for additional code to handle these cases (deduplication and reordering).
2. Write to the database first; use a CDC tool (e.g. Debezium or Datastream) to stream changes to Pub/Sub
This approach introduces extra latency to the process; in the current implementation we aim to publish the data in real time.