
IAM

Andi Lamprecht · 4 min read · Draft
ADR-0269 · Author: Remek Zajac · Date: 2026-02-26 · Products: platform
Originally ADR--0139-IAM (v7) · Source on Confluence ↗

Title

Traceability Links
Jama Requirements: UERQ-CMP-102
Jira Tasks

Context

Decision

Realm Isolation

The term realm may have been inspired by the Keycloak concept and entity of the same name, which maps closely to what is required from ATOMx realms.

A Keycloak instance and all the realms it manages are stored in a single database, where users from different realms are separated only logically, by a realm_id column. This does not seem to meet the physical realm isolation requirements [UERQ-SYS-1985] and [UERQ-SYS-1511].

Screenshot 2026-02-18 at 12.40.44 PM.png

Mitigations may include enabling Row Level Security on some tables of the Keycloak database to prevent accidental or buggy cross-realm queries, which would provide a partial, infrastructure-level realm separation.

Deploying a dedicated Keycloak instance and database for every org would be prohibitively expensive.

Otherwise, Keycloak realms should scale to thousands without posing performance problems. Since roles and attributes are managed independently per realm in Keycloak, realm provisioning should also provision the shared roles and attributes, to avoid configuration drift between realms. This can be done with the Keycloak REST API. NOTE that multiple realms imply multiple signing keys, so ATOMx services will have to be configured with as many OpenID issuers as there are orgs, and reconfigured as orgs appear and disappear. This discovery must be implemented in ATOMx.
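The issuer discovery mentioned above could look roughly like the sketch below: a per-service registry of trusted issuers that is updated as orgs come and go. The class name, the base URL and the one-realm-per-org mapping are assumptions made for illustration. The URL layout (`/realms/<realm>` for the issuer and `/protocol/openid-connect/certs` for the signing keys) follows recent Keycloak versions and should be checked against the deployed one.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class IssuerRegistry:
    """Tracks the OpenID issuers a service trusts, one per org realm (hypothetical)."""

    base_url: str  # assumed Keycloak base URL, e.g. https://idp.example.com
    issuers: Dict[str, str] = field(default_factory=dict)  # issuer URL -> realm

    def issuer_for(self, realm: str) -> str:
        # Assumes the issuer of a realm is <base>/realms/<realm>.
        return f"{self.base_url.rstrip('/')}/realms/{realm}"

    def add_org(self, realm: str) -> None:
        """Start trusting tokens issued by the org's realm."""
        self.issuers[self.issuer_for(realm)] = realm

    def remove_org(self, realm: str) -> None:
        """Stop trusting the org's realm; unknown realms are ignored."""
        self.issuers.pop(self.issuer_for(realm), None)

    def jwks_url(self, issuer: str) -> Optional[str]:
        """Return the signing-key URL for a trusted issuer, or None if untrusted."""
        if issuer not in self.issuers:
            return None
        return f"{issuer}/protocol/openid-connect/certs"
```

A service would look up the token's iss claim in the registry before fetching keys, and reject the token outright when the lookup returns None.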

Token Revocation

It is required that an RBAC role or ABAC attribute is revoked with semi-immediate effect. Namely, when an organization’s entitlement is revoked or suspended:

(a) the IAM Service shall immediately deny new access decisions for roles gated by that entitlement,

(b) the IAM Service shall revoke tokens associated with sessions exercising roles gated by that entitlement within 60 seconds (±10 seconds).

(a) isn’t a problem for any off-the-shelf IdP solution such as Keycloak. (b), however, is a problem, because OpenID relies on token expiry to eventually revoke access: the ID Token exp claim states the time after which the token must not be accepted, and until then the token remains valid.

image-20260226-074731.png

Any ATOMx service, when validating a token, shall first check for sub-level (user-level) revocations, keyed by the token’s sid, and for org-level revocations. The org-level check avoids having to individually revoke the 5000 or so tokens that might be in flight when an org-level revocation happens. Active user sessions can be retrieved from the Keycloak REST API, while the org for an org-level revocation is available in the revocation request being handled by the IAM Service.

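def is_token_valid(token, redis):
    # redis: a connected client; token.org is assumed to carry the org identifier.
    current_version = redis.get(f"entitlement:version:org:{token.org}")
    # A missing key means no org-level revocation has been recorded yet.
    if current_version is not None and token.org_version < int(current_version):
        return False  # ANY org-level revocation happened since the token was issued
    if redis.get(f"entitlement:sid:{token.sid}") is not None:
        return False  # This session has been terminated
    return True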

Redis is a good store for such short-lived, immediately delivered revocations, and it will also help enforce the don't-use-after-logout constraints.

As Redis or the IdP can be temporarily down, it is further required that revocations are monitored and that failures are flagged and retried.
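The write side of the check above might be sketched as follows. An org-level revocation is a single version bump, and a session-level revocation is a key that only needs to live as long as the session's tokens could still be presented. `InMemoryStore` is a hypothetical stand-in that mirrors only the `get`, `set(..., ex=...)` and `incr` calls of a Redis client, so the sketch runs without a server. The key layout and both function names are assumptions.

```python
class InMemoryStore:
    """Stand-in for a Redis client; mirrors only get, set(ex=...) and incr."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value, ex=None):
        # A real Redis would drop the key after `ex` seconds; ignored here.
        self._data[key] = value

    def incr(self, key):
        self._data[key] = int(self._data.get(key, 0)) + 1
        return self._data[key]


def revoke_org(store, org):
    """Invalidate every token of an org with one write by bumping its version."""
    return store.incr(f"entitlement:version:org:{org}")


def revoke_session(store, sid, remaining_token_lifetime_s):
    """Mark one session as revoked until its tokens would have expired anyway."""
    store.set(f"entitlement:sid:{sid}", "revoked", ex=remaining_token_lifetime_s)
```

Setting an expiry on the session key keeps the store from growing without bound, since an entry for an already expired token no longer changes any access decision.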

One way to solve this is a database attached to the IAM service combined with something akin to the outbox pattern: when an admin requests a revocation, that revocation is accepted as soon as we succeed in writing it to the local database. Its delivery is guaranteed by a separate process that first modifies the entitlements in Keycloak and then writes them to Redis. If either step fails, that process retries in the next cycle, and the idempotency of the Keycloak and Redis writes makes it safe to process the same revocation twice. Since each retry cycle would have to be driven and logged by hand-written code, the alternative is to take advantage of a durable execution framework such as Restate or Temporal.
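A minimal sketch of the outbox variant described above, using SQLite as the local database so that it runs standalone. The table layout, the function names and the two callables `update_idp` and `publish_to_redis` are hypothetical; both callables are assumed to be idempotent, as the text requires.

```python
import sqlite3


def open_outbox(path=":memory:"):
    """Open the local database and create the outbox table if needed."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        " id INTEGER PRIMARY KEY,"
        " org TEXT NOT NULL,"
        " idp_done INTEGER NOT NULL DEFAULT 0,"
        " redis_done INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def accept_revocation(db, org):
    """The revocation counts as accepted once this local write commits."""
    with db:
        return db.execute("INSERT INTO outbox (org) VALUES (?)", (org,)).lastrowid


def deliver_pending(db, update_idp, publish_to_redis):
    """Run one delivery cycle; return how many revocations are still pending."""
    rows = db.execute(
        "SELECT id, org, idp_done FROM outbox WHERE redis_done = 0 ORDER BY id"
    ).fetchall()
    for row_id, org, idp_done in rows:
        try:
            if not idp_done:
                update_idp(org)  # IdP first, as described in the text
                with db:
                    db.execute("UPDATE outbox SET idp_done = 1 WHERE id = ?", (row_id,))
            publish_to_redis(org)  # then Redis
            with db:
                db.execute("UPDATE outbox SET redis_done = 1 WHERE id = ?", (row_id,))
        except Exception as exc:
            # Flag the failure; the row stays pending and is retried next cycle.
            print(f"revocation {row_id} for org {org} failed: {exc}")
    return db.execute("SELECT COUNT(*) FROM outbox WHERE redis_done = 0").fetchone()[0]
```

A scheduler would call `deliver_pending` periodically. Recording the two steps separately means a Redis outage does not cause the IdP update to be repeated on every cycle, although repeating it would still be safe.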

image-20260226-074747.png

A durable execution framework would further aid in managing the onboarding of Authorities.

User Deprovisioned via SCIM

For all of this to work, we also have to handle SCIM events from an external/federated IdP which communicate that a user, whose token may still be in circulation, has been removed.
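As a rough illustration, the handler for such events could first decide whether an incoming SCIM 2.0 request on a user resource deprovisions the user, and if so revoke that user's active sessions. The sketch assumes deprovisioning arrives either as a DELETE or as a PATCH that sets active to false in one of the two common PatchOp shapes. Federated IdPs vary in how they encode this, so the exact payloads would need to be confirmed per provider. The function names and the `revoke_session` callable are hypothetical.

```python
def is_deprovisioning(method, body=None):
    """Return True if a SCIM request on a user removes or deactivates that user."""
    if method.upper() == "DELETE":
        return True
    if method.upper() != "PATCH" or not body:
        return False
    for op in body.get("Operations", []):
        if str(op.get("op", "")).lower() != "replace":
            continue
        path, value = op.get("path"), op.get("value")
        # Shape 1: {"op": "replace", "path": "active", "value": false}
        if path == "active" and value is False:
            return True
        # Shape 2: {"op": "replace", "value": {"active": false}}
        if path is None and isinstance(value, dict) and value.get("active") is False:
            return True
    return False


def handle_scim_user_request(method, body, active_sids, revoke_session):
    """Revoke every active session of a deprovisioned user; return the count."""
    if not is_deprovisioning(method, body):
        return 0
    for sid in active_sids:
        revoke_session(sid)
    return len(active_sids)
```

Here `active_sids` stands for the user's sessions as retrieved from the IdP, and `revoke_session` for whatever writes the session-level revocation to the store.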

Consequences

What becomes easier or more difficult to do because of this change?

Alternatives Considered

Formal Impact

List any systems or services that are impacted by this architectural decision.
