IAM
Originally ADR--0139-IAM (v7)
| Traceability Links | |
|---|---|
| Jama Requirements | UERQ-CMP-102 |
| Jira Tasks | |
Context
Decision
Realm Isolation
The term realm may have been inspired by Keycloak, where a realm is a concept and entity that maps closely to what is required from ATOMx realms.
A Keycloak instance and all the realms it manages are stored in a single database, where users from different realms are separated only logically by a realm_id column. This does not seem to meet the physical realm isolation requirements [UERQ-SYS-1985] and [UERQ-SYS-1511].

Mitigations may include enabling Row Level Security on some tables of the Keycloak database to prevent accidental or buggy cross-realm queries, which would provide partial, infrastructure-level realm separation.
Deploying a dedicated Keycloak instance and database per org would be prohibitively expensive.
Otherwise, Keycloak realms should scale to thousands without posing performance problems. Since roles and attributes are managed independently per realm in Keycloak, realm provisioning should include provisioning the shared roles and attributes to avoid configuration complexities. This can be done with the Keycloak REST API.
NOTE: multiple realms imply multiple signing keys, so ATOMx services will have to be configured with as many OpenID issuers as there are orgs, and reconfigured as orgs appear and disappear. This issuer discovery must be implemented in ATOMx.
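The issuer discovery described above could be sketched as a small registry that services consult when validating a token's iss claim. This is a minimal sketch, not a settled design: the names Issuer and IssuerRegistry are hypothetical, and the URL layout assumes a Keycloak deployment that serves realms under /realms/<realm> with keys at protocol/openid-connect/certs (no legacy /auth prefix).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Issuer:
    issuer_url: str  # expected value of the token's iss claim
    jwks_url: str    # where the realm's signing keys are published


class IssuerRegistry:
    """Hypothetical registry tracking one trusted OpenID issuer per org realm."""

    def __init__(self, keycloak_base_url: str):
        self._base = keycloak_base_url.rstrip("/")
        self._by_issuer: Dict[str, Issuer] = {}

    def add_realm(self, realm: str) -> Issuer:
        # Assumption: default Keycloak paths for the realm and its JWKS endpoint.
        issuer_url = f"{self._base}/realms/{realm}"
        issuer = Issuer(issuer_url, f"{issuer_url}/protocol/openid-connect/certs")
        self._by_issuer[issuer_url] = issuer
        return issuer

    def remove_realm(self, realm: str) -> None:
        # Called when an org disappears; removing an unknown realm is a no-op.
        self._by_issuer.pop(f"{self._base}/realms/{realm}", None)

    def resolve(self, iss_claim: str) -> Optional[Issuer]:
        # None means the issuer is not trusted and the token must be rejected.
        return self._by_issuer.get(iss_claim)
```

How add_realm and remove_realm get triggered (polling Keycloak, or events from the provisioning flow) is left open here.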
Token Revocation
It is required that revoking an RBAC role or ABAC attribute takes near-immediate effect. Namely, when an organization’s entitlement is revoked or suspended:
(a) the IAM Service shall immediately deny new access decisions for roles gated by that entitlement,
(b) the IAM Service shall revoke tokens associated with sessions exercising roles gated by that entitlement within 60 seconds (±10 seconds),
Requirement (a) is not a problem for any off-the-shelf IdP solution such as Keycloak. Requirement (b), however, is a problem, because OpenID relies on token expiry to eventually revoke access: the ID Token exp claim states the time on or after which the token must not be accepted, so an already issued token stays usable until then.
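To make the gap concrete, the sketch below checks whether expiry alone would cut a token off within the 60-second window of requirement (b). The function names are hypothetical, and the 300-second lifetime in the usage note is an illustrative assumption, not a value taken from the ATOMx configuration.

```python
import time
from typing import Optional

REVOCATION_DEADLINE_SECONDS = 60  # from requirement (b)


def is_expired(claims: dict, now: Optional[float] = None) -> bool:
    # A token must not be accepted on or after its exp time.
    now = time.time() if now is None else now
    return now >= claims["exp"]


def expiry_meets_deadline(claims: dict, revoked_at: float) -> bool:
    """True if relying on exp alone would end access within the deadline."""
    return claims["exp"] - revoked_at <= REVOCATION_DEADLINE_SECONDS
```

For a token issued with a 300-second lifetime and revoked right after issuance, expiry_meets_deadline returns False: the token would remain acceptable for almost five minutes, which is why an explicit revocation check is needed.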

Any ATOMx service, when validating a token, shall first check both the sub-level (user-level) revocations, keyed by the token’s sid, and the org-level revocations. We want to avoid revoking, one by one, the 5000 tokens that we might have in flight when an org-level revocation happens, so an org-level revocation is recorded once per org rather than once per token. Active user sessions can be retrieved from the Keycloak REST API, while the org for an org-level revocation is available in the revocation request being handled by the IAM Service.

    import redis

    r = redis.Redis()  # assumption: connection settings come from service config

    def is_token_valid(token):
        # Assumption: token.org, token.org_version and token.sid come from the token's claims.
        current_version = int(r.get(f"entitlement:version:org:{token.org}") or 0)
        if token.org_version < current_version:
            return False  # ANY org-level revocation happened since the token was issued
        if r.get(f"entitlement:sid:{token.sid}") is not None:
            return False  # this session has been terminated
        return True

Redis is a good store for such short-lived, immediately delivered revocations, and it will help enforce the don't-use-after-logout constraints.
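The write side of these checks might look as follows. This is a sketch under assumptions: the key names mirror those read during validation, revoke_org and revoke_session are hypothetical helpers, and InMemoryStore is only a stand-in that mimics the three redis-py calls used (incr, set with ex, get) so the example runs without a Redis server.

```python
class InMemoryStore:
    """Stand-in exposing the subset of redis-py calls used below."""

    def __init__(self):
        self._data = {}

    def incr(self, name):
        self._data[name] = int(self._data.get(name, 0)) + 1
        return self._data[name]

    def set(self, name, value, ex=None):
        self._data[name] = value  # the TTL (ex) is ignored by this stand-in
        return True

    def get(self, name):
        return self._data.get(name)


def revoke_org(store, org: str) -> int:
    # Bumping the version invalidates every token issued under an older one,
    # without touching the individual tokens.
    return store.incr(f"entitlement:version:org:{org}")


def revoke_session(store, sid: str, ttl_seconds: int) -> None:
    # The marker only needs to outlive the longest-lived token of the session.
    store.set(f"entitlement:sid:{sid}", "1", ex=ttl_seconds)
```

Note that an increment is not strictly idempotent: a retried delivery bumps the version twice, which is safe but also invalidates tokens issued between the two attempts.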
As Redis or the IdP can be temporarily down, it is further required that revocations are monitored and that failures are flagged and retried.
This can be solved with a database attached to the IAM Service and something akin to the outbox pattern: when an admin requests a revocation, the revocation is accepted as soon as it has been written to a local database. Its delivery is guaranteed by a separate process that first modifies the entitlements in Keycloak and then writes them to Redis. If either step fails, that process retries in the next cycle; because the Keycloak and Redis writes are idempotent, processing the same revocation twice is harmless.
Since each retry cycle would need to be spawned and logged by hand-written code, we could instead take advantage of a durable execution framework such as Restate or Temporal, which would further aid in managing the onboarding of Authorities.
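The outbox variant could be sketched as below, using sqlite3 purely as a stand-in for the IAM Service database. The table layout and function names are hypothetical, and update_keycloak and publish_to_redis are placeholder callables assumed to be idempotent.

```python
import sqlite3


def open_outbox(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        "id INTEGER PRIMARY KEY, org TEXT NOT NULL, "
        "delivered INTEGER NOT NULL DEFAULT 0, "
        "attempts INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def accept_revocation(db: sqlite3.Connection, org: str) -> int:
    # The revocation counts as accepted once this local write commits.
    with db:
        cur = db.execute("INSERT INTO outbox (org) VALUES (?)", (org,))
    return cur.lastrowid


def deliver_pending(db: sqlite3.Connection, update_keycloak, publish_to_redis) -> int:
    """Run one retry cycle; returns how many revocations were delivered."""
    delivered = 0
    rows = db.execute(
        "SELECT id, org FROM outbox WHERE delivered = 0 ORDER BY id"
    ).fetchall()
    for row_id, org in rows:
        try:
            update_keycloak(org)   # first Keycloak ...
            publish_to_redis(org)  # ... then Redis
        except Exception:
            # Record the failure so it can be flagged; the row stays pending.
            with db:
                db.execute(
                    "UPDATE outbox SET attempts = attempts + 1 WHERE id = ?",
                    (row_id,),
                )
            continue
        with db:
            db.execute("UPDATE outbox SET delivered = 1 WHERE id = ?", (row_id,))
        delivered += 1
    return delivered
```

If the Redis step fails after the Keycloak step succeeded, the next cycle calls update_keycloak again for the same row, which is exactly why both writes need to be idempotent. A durable execution framework would replace the deliver_pending loop and the attempts bookkeeping.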
User Deprovisioned via SCIM
For all of this to work, we also have to handle SCIM events from an external/federated IdP signalling that a user whose token may still be in circulation has been removed.
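A minimal sketch of such handling is shown below, assuming the federated IdP speaks SCIM 2.0 and signals removal either with DELETE /Users/{id} or with a PATCH that sets active to false. The function names are hypothetical, list_session_ids and revoke_session are placeholder callables (for example backed by the Keycloak REST API and the Redis sid markers), and full replacement via PUT and string-encoded booleans are not covered.

```python
from typing import Optional


def is_deprovisioning(method: str, path: str, body: Optional[dict] = None) -> bool:
    """Return True if a SCIM 2.0 request removes or deactivates a user."""
    if not path.startswith("/Users/"):
        return False
    if method.upper() == "DELETE":
        return True
    if method.upper() != "PATCH" or not body:
        return False
    for op in body.get("Operations", []):
        # Some IdPs capitalise the op name, so compare case-insensitively.
        if str(op.get("op", "")).lower() != "replace":
            continue
        value = op.get("value")
        if op.get("path") == "active" and value is False:
            return True
        if op.get("path") is None and isinstance(value, dict) and value.get("active") is False:
            return True
    return False


def handle_scim_request(method, path, body, list_session_ids, revoke_session) -> int:
    """Revoke every active session of a deprovisioned user; returns the count."""
    if not is_deprovisioning(method, path, body):
        return 0
    user_id = path.split("/")[2]
    sids = list(list_session_ids(user_id))
    for sid in sids:
        revoke_session(sid)
    return len(sids)
```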
Consequences
What becomes easier or more difficult to do because of this change?
Alternatives Considered
Formal Impact
List any systems or services that are impacted by this architectural decision.