
Multi-Instance Infrastructure

Andi Lamprecht · 7 min read · Draft
Traceability Links
Jama Requirements
Jira Tasks: CORE-2083

Context

Onboarding multiple Organizations and Tenants requires reorganizing the product architecture and infrastructure components so the system can be used smoothly at different levels, and establishing a safe, reliable release process that keeps us and our clients compliant with all applicable regulations.

Decision Drivers

  • Group all key components and applications into a Global Production Release for auditing, traceability and ease of deployment.
  • Maintain multiple Release Candidates in parallel within a Global Production Release so each Agent controls when to switch versions.
  • Organize multi-tenant infrastructure where different Agents work on a shared platform without impacting or noticing each other.
  • Define the shared services that run as a single replica and provide platform capabilities for multiple agents.
  • Optimize resource consumption across Agents and Organizations.
  • Deploy new organizations that can provide services for different subsets of agents.

Out of Scope

  1. Logical separation of multiple Tenants under one Organization.
  2. Environment-Controller architecture and capabilities: the component responsible for preparing and deploying the Global Production Release, enforcing approval gates, and providing traceability of changes from commit hash to Jira Story, Jama requirement, and V&V evidence.
  3. Separating Instances onto individual GKE clusters: instead, two GKE clusters (dev and prod) will host all workloads for all Organizations (Instances). Partner Organization Instances will always run on the prod cluster (regardless of the Instance name: test, stg, prod, etc.), and internal development Instances will mostly run on the dev cluster, with exceptions such as the DroneUp Staging installation on prod.
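The cluster-placement rule above can be sketched in a few lines. This is a hypothetical illustration, not production code; the organization and instance names, and the `PROD_EXCEPTIONS` set, are assumptions for the example.

```python
# Hypothetical sketch of the cluster-placement rule: partner Organization
# Instances always land on the prod cluster regardless of Instance name,
# while internal Instances default to dev with explicit exceptions such as
# DroneUp Staging on prod.

PROD_EXCEPTIONS = {("droneup", "staging")}  # internal Instances forced to prod


def target_cluster(org: str, instance: str, is_partner: bool) -> str:
    """Return the GKE cluster ("dev" or "prod") that should host an Instance."""
    if is_partner:
        return "prod"  # partner Instances: always prod, whatever the name
    if (org, instance) in PROD_EXCEPTIONS:
        return "prod"  # explicitly listed internal exceptions
    return "dev"       # internal development default


assert target_cluster("partner-x", "test", is_partner=True) == "prod"
assert target_cluster("droneup", "staging", is_partner=False) == "prod"
assert target_cluster("droneup", "test", is_partner=False) == "dev"
```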

Decision

Adopt a hybrid multi-tenant platform architecture in which:

Multi-instance deployment is orchestrated by the Environment-Controller (Argo Workflows plus a Kubernetes Operator), where one Instance consists of:

  • A set of Release Candidates in the Global Production Release, allowing on-demand version switching without redeployment. Candidates are packaged as immutable Helm releases that can be pinned per Tenant to meet DO-178C-driven configuration-control and multi-version-support requirements.
  • A Shared platform services bundle (identity, observability, UTM integration, shared data pipelines, operational API) operating in a global control plane with clearly defined tenancy-aware interfaces, deployed using a separate Software Profile.
  • A shared infrastructure bundle: a persistence layer that provides shared storage, organizing data through logical separation and multi-tenant awareness. For example, the same database will be called by both mission-service v1.1 and mission-service v1.2, which constrains the development process to never introduce breaking changes (captured in this ADR).
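The no-breaking-changes constraint on the shared persistence layer can be made concrete with a minimal sketch. The schema representation and version numbers below are illustrative assumptions, not the actual mission-service schema; the point is that a new schema may only add to what older running versions rely on.

```python
# Hypothetical sketch: shared-database compatibility check between two
# service versions (e.g. mission-service v1.1 and v1.2). Because both
# versions call the same database, a new schema may only ADD tables or
# columns; dropping or retyping anything breaks the older version.


def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """Return True if every table/column the old version uses is unchanged."""
    for table, columns in old_schema.items():
        new_columns = new_schema.get(table)
        if new_columns is None:
            return False  # table dropped -> old version breaks
        for name, col_type in columns.items():
            if new_columns.get(name) != col_type:
                return False  # column removed or retyped -> breaking change
    return True


v1_1 = {"missions": {"id": "uuid", "status": "text"}}
v1_2_ok = {"missions": {"id": "uuid", "status": "text", "priority": "int"}}
v1_2_bad = {"missions": {"id": "uuid", "status": "int"}}  # retyped column

assert is_backward_compatible(v1_1, v1_2_ok)       # additive change: allowed
assert not is_backward_compatible(v1_1, v1_2_bad)  # breaking change: rejected
```

A check like this could gate CI for any service that shares persistence across versions, turning the ADR's process constraint into an enforceable rule.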

Isolation between Instances is enforced through GKE boundaries (network policies and/or service mesh) and tightly scoped service-account permissions to prevent cross-project or cross-instance resource access.

Isolation between Tenants is provided by identity management and fine-grained permission controls rather than infrastructure segmentation.
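The identity-based Tenant isolation can be pictured as a tenancy-aware authorization check. This is a minimal sketch under assumed names (the `Identity` shape, tenant identifiers, and permission strings are illustrative, not the platform's actual identity model): every request carries a tenant claim, and cross-tenant access is denied before any permission is evaluated.

```python
# Hypothetical sketch: tenancy-aware authorization on a shared platform.
# Tenants share infrastructure, so requests are scoped by the tenant claim
# in the caller's identity rather than by network boundaries.

from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    user: str
    tenant: str                # e.g. "droneup", "horizon-aerobotics"
    permissions: frozenset     # fine-grained permission strings


def authorize(identity: Identity, resource_tenant: str, action: str) -> bool:
    """Allow an action only on resources within the caller's own tenant."""
    if identity.tenant != resource_tenant:
        return False           # cross-tenant access is denied outright
    return action in identity.permissions


alice = Identity("alice", "droneup", frozenset({"missions:read"}))
assert authorize(alice, "droneup", "missions:read")
assert not authorize(alice, "horizon-aerobotics", "missions:read")
```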

    graph TB
    subgraph Instance["Organization = Instance ~ DroneUp Production / Partner Production"]

        subgraph Parent["Parent (DroneUp)"]
            direction TB
        end

        subgraph Shared["Shared Services"]
            S1["atlas"]
            S2["UTM shared features"]
            S3["operational-api"]
            S4["etc"]
        end

        subgraph Persistence["Shared Persistence / Infrastructure"]
            P1["DBs"]
            P2["PubSub topics"]
            P3["Redis"]
            P4["etc"]
        end

        subgraph Versions["Multi Versions = Dependency Helm-Charts"]
            subgraph Q2["q2-2025"]
                Q2a["mission-service v2.2.3
                uncrew-frontend v1.0.3
                hubops-frontend v2.2.4
                utm-mission-service v4.2.4
                etc."]
            end
            subgraph Q1["q1-2025"]
                Q1a["mission-service v2.0.0
                uncrew-frontend v1.0.3
                hubops-frontend v2.0.0
                utm-mission-service v3.5.4
                etc."]
            end
            subgraph Q4["q4-2024"]
                Q4a["mission-service v1.2.3
                uncrew-frontend v1.0.3
                hubops-frontend v2.0.0
                utm-mission-service v3.0.4
                etc."]
            end
        end

        subgraph Agents["Clients = Tenants = Agents (Logical Separation)"]
            A["Agent A (DroneUp)"]
            B["Agent B (Horizon Aerobotics)"]
            C["Agent C (Cherokee Nation)"]
        end

        Parent -.-> Versions
        Parent --> Shared
        Shared -.-> Persistence
        Versions -.-> Persistence

        A --> Q2
        B --> Q1
        C --> Q4
    end
  

Consequences

Supporting multi-instance, multi-versioning, and multi-tenancy together adds complexity at both the operations and software-development layers, but it reduces cost by hosting multiple Agents and software versions on the same infrastructure. It also lets each client move through versions independently without operations involvement: the deployment mechanism adds new software versions without changing previously deployed sets, so the release decision rests with the client.
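The client-controlled switch amounts to a per-tenant pin over the set of Release Candidates already deployed in the Instance: changing the pin reroutes the tenant, and nothing is redeployed. A minimal sketch (candidate and tenant names are taken from the diagram above for illustration; the pin store is an assumption):

```python
# Hypothetical sketch: per-tenant Release Candidate pinning. All candidates
# stay deployed side by side; switching a tenant's version only updates a
# routing pin, so no workloads change and no operations step is needed.

deployed_candidates = {"q4-2024", "q1-2025", "q2-2025"}
tenant_pins = {"droneup": "q2-2025", "cherokee-nation": "q4-2024"}


def switch_version(tenant: str, candidate: str) -> None:
    """Repoint a tenant at an already-deployed Release Candidate."""
    if candidate not in deployed_candidates:
        raise ValueError(f"{candidate} is not deployed in this Instance")
    tenant_pins[tenant] = candidate  # routing change only, no redeploy


switch_version("cherokee-nation", "q1-2025")
assert tenant_pins["cherokee-nation"] == "q1-2025"
```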

This doesn’t apply to shared applications and shared infrastructure, because they change for all users at once and will likely require a separate communication process.

Migration from the current setup to the new environment-controller managed setup will require a staged approach and data migration.

Alternatives Considered

Option 2: Multi-Instance per Agent with Multi-Versioning

Dedicated persistence infrastructure components per client, shared GKE cluster.

  • Pros: Strong isolation, simplified compliance per Tenant, ability to switch versions on demand.
  • Cons: Operational overhead scales linearly; resource usage is lower than full per-version duplication, but many components still run at low utilization; slower onboarding; deployment and release steps remain separate because of multi-versioning.
    graph TB
    subgraph Option2["Option 2: Agent = Instance"]
        subgraph Parent["Parent (DroneUp)"]
            direction TB
        end

        subgraph SharedSvc["Shared Services"]
            SS1["atlas"]
            SS2["UTM shared features"]
            SS3["operational-api"]
            SS4["etc"]
        end

        Parent --> SharedSvc

        subgraph AgentA["Agent A (DroneUp)"]
            subgraph PersA["Shared Persistence"]
                PA["DBs · PubSub · Redis · etc."]
            end
            subgraph VerA["Multi Versions = Dependency Helm-Charts"]
                VA1["V1.2.3: mission-service
                uncrew-frontend · hubops-frontend
                utm-mission-service · etc."]
                VA2["V1.3.3: mission-service
                uncrew-frontend · hubops-frontend
                utm-mission-service · etc."]
                VA3["V1.4.3: mission-service
                uncrew-frontend · hubops-frontend
                utm-mission-service · etc."]
            end
        end

        subgraph AgentB["Agent B (Horizon Aerobotics)"]
            subgraph PersB["Shared Persistence"]
                PB["DBs · PubSub · Redis · etc."]
            end
            subgraph VerB["Multi Versions = Dependency Helm-Charts"]
                VB1["V1.2.3: mission-service
                uncrew-frontend · hubops-frontend
                utm-mission-service · etc."]
                VB2["V1.3.3: mission-service
                uncrew-frontend · hubops-frontend
                utm-mission-service · etc."]
                VB3["V1.4.3: mission-service
                uncrew-frontend · hubops-frontend
                utm-mission-service · etc."]
            end
        end

        subgraph AgentC["Agent C (Cherokee Nation)"]
            subgraph PersC["Shared Persistence"]
                PC["DBs · PubSub · Redis · etc."]
            end
            subgraph VerC["Multi Versions = Dependency Helm-Charts"]
                VC1["V1.2.3: mission-service
                uncrew-frontend · hubops-frontend
                utm-mission-service · etc."]
                VC2["V1.3.3: mission-service
                uncrew-frontend · hubops-frontend
                utm-mission-service · etc."]
                VC3["V1.4.3: mission-service
                uncrew-frontend · hubops-frontend
                utm-mission-service · etc."]
            end
        end

        SharedSvc --> AgentA
        SharedSvc --> AgentB
        SharedSvc --> AgentC
    end
  

Option 3: Pure Multi-Instance per Agent per Version

Dedicated persistence infrastructure components per client per version, shared or dedicated GKE cluster.

  • Pros: Strong isolation, simplified compliance per Tenant, no restriction on introducing breaking changes if they are synced across components (persistence layer is accessible by one version at a time).
  • Cons: Operational overhead scales linearly; inefficient resource usage; slower onboarding; upgrading a client to a newer version goes through multiple stages, where the version is first tested on a lower-environment Instance and then promoted on request to the production Instance.
    graph TB
    subgraph Option3["Option 3: Agent Version = Instance"]
        subgraph Parent["Parent (DroneUp)"]
            direction TB
        end

        subgraph SharedSvc["Shared Services"]
            SS1["atlas"]
            SS2["UTM shared features"]
            SS3["operational-api"]
            SS4["etc"]
        end

        Parent --> SharedSvc
        AgentA_ref["Agent A (DroneUp)"] -.-> AProd
        AgentA_ref -.-> AStage
        AgentB_ref["Agent B (Horizon Aerobotics)"] -.-> BProd
        AgentB_ref -.-> BStage

        subgraph AProd["Agent A, V1.3.3, Prod (DroneUp)"]
            AP_Pers["Dedicated Persistence
            DBs · PubSub · Redis · etc."]
            AP_Ver["V1.3.3:
            mission-service
            uncrew-frontend
            hubops-frontend
            utm-mission-service · etc."]
            AP_Ver --> AP_Pers
        end

        subgraph BProd["Agent B, V1.5.3, Prod (Horizon Aerobotics)"]
            BP_Pers["Dedicated Persistence
            DBs · PubSub · Redis · etc."]
            BP_Ver["V1.5.3:
            mission-service
            uncrew-frontend
            hubops-frontend
            utm-mission-service · etc."]
            BP_Ver --> BP_Pers
        end

        subgraph AStage["Agent A, V1.4.3, Stage1 (DroneUp)"]
            AS_Pers["Dedicated Persistence
            DBs · PubSub · Redis · etc."]
            AS_Ver["V1.4.3:
            mission-service
            uncrew-frontend
            hubops-frontend
            utm-mission-service · etc."]
            AS_Ver --> AS_Pers
        end

        subgraph BStage["Agent B, V1.6.3, Stage1 (Horizon Aerobotics)"]
            BS_Pers["Dedicated Persistence
            DBs · PubSub · Redis · etc."]
            BS_Ver["V1.6.3:
            mission-service
            uncrew-frontend
            hubops-frontend
            utm-mission-service · etc."]
            BS_Ver --> BS_Pers
        end

        SharedSvc --> AProd
        SharedSvc --> BProd
        SharedSvc --> AStage
        SharedSvc --> BStage
    end
  

Formal Impact

All application and infrastructure components require review and integration under the Global Production Release, with ad-hoc installation and infrastructure components shared across multiple versions.

Appendix: Key Terms

Mostly taken from the Organizations & Tenancy concept.

| Term | Definition |
| --- | --- |
| Operator | Operators hold certificates. For example, DroneUp has a Part 135 Certification that affords certain privileges. Operators control which versions of software are made available to Tenants in Global Production Releases and have operational control over their child Agents. |
| Agent | Agents leverage an Operator’s certification (e.g. Part 135 Certification) for their operations. Agents must adhere to the GOM, GMM, and SMS of the Operator to be compliant. Both Agents and Operators are Tenants, but with different privileges. |
| Tenancy | A Tenant is a group of users who share common access with specific privileges to the software instance. |
| Multi-Tenancy | An architecture in which a single occurrence of a software application serves numerous clients, giving each the illusion of a standalone application. |
| Instance | A separate, independent copy of a software application or service running on the same physical or virtual infrastructure (also: Environment). |
| Multi-Instance | The same source code deployed into separate environments, isolated such that the services within them cannot communicate across environments. |
| Release Blueprint | Specifies the repositories and components that constitute a manageable product line; includes component names and repository URLs. |
| Release Candidate | A selection of specific versions for each component defined in a Release Blueprint. |
| Global Production Release | A selection of specific, versioned releases from multiple, independent software and infrastructure bundles. |
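The relationship between Release Blueprint and Release Candidate defined above can be expressed as a minimal data model. This is a sketch; the field names, the example component, and the repository URL are illustrative assumptions, not the Environment-Controller's actual schema.

```python
# Hypothetical sketch of the release vocabulary: a Blueprint names the
# components of a product line, and a Candidate pins a version for each
# component the Blueprint defines.

from dataclasses import dataclass, field


@dataclass(frozen=True)
class ReleaseBlueprint:
    name: str
    components: tuple            # (component name, repository URL) pairs


@dataclass(frozen=True)
class ReleaseCandidate:
    blueprint: ReleaseBlueprint
    versions: dict               # component name -> pinned version

    def is_complete(self) -> bool:
        """A candidate must pin every component in its blueprint."""
        return all(name in self.versions for name, _ in self.blueprint.components)


blueprint = ReleaseBlueprint(
    "core",
    (("mission-service", "https://example.invalid/mission-service"),),  # placeholder URL
)
candidate = ReleaseCandidate(blueprint, {"mission-service": "v2.2.3"})
assert candidate.is_complete()
assert not ReleaseCandidate(blueprint, {}).is_complete()
```

Under this model, a Global Production Release is simply a collection of such candidates drawn from independent blueprints (software and infrastructure bundles alike).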
