GCP GKE Architecture
0006-GCP-GKE-ARCHITECTURE (v2)
Context and Problem Statement
The goal is to build and support cost-efficient, scalable, resilient, and fault-tolerant infrastructure that can be used to host different solutions and products.
Our current architecture is documented in the following [ADR](confluence-title://PE/ADR20: GCP resources and environments) and implements the many-to-many option, where each product/league has its own set of GCP projects and GKE clusters (dev/prod).
Our experience with this approach has identified the following areas which require improvement:
- The number of clusters to maintain (operational overhead)
- The GKE control-plane fee paid for every cluster (direct GCP spend)
- The cost of underutilized compute resources (direct GCP spend)
- Services experience avoidable latency in the existing multi-cluster architecture
Our experience with this approach has benefitted DroneUp in the following areas:
- Our security posture is improved by isolating resources in separate GCP projects
- Our security posture is improved because teams can configure IAM quickly in separate GCP projects
- Billing can be tracked for each project/product separately
- All Terraform resources for one product are controlled in one place
Decision Outcome
The many-to-one option is selected, where:
- All GKE workloads are deployed into one set of shared, centralized clusters
- All other resources remain controlled within the existing product projects
(Architecture diagram: a shared GCP project with the shared GKE clusters and supporting resources on the left; the existing per-product projects on the right.)
The left part of the diagram represents a shared GCP project that hosts the shared GKE clusters for dev and prod, including shared supporting resources:
- Central observability stack:
  - Logging
  - Metrics
  - Tracing
  - Alerting
- Artifact Registry
- Networking:
  - Cato integration
  - Firewall
- Backup of sunsetted tool (?)
Each cluster will contain a number of namespaces to separate workloads for different products, plus namespaces for maintenance workloads (service mesh, cert-manager, external-dns, ingress controllers).
Cluster creation and post-creation setup will be controlled by two separate repositories (four TFC workspaces) to reduce overhead and dependency-management issues.
Existing product infrastructure repositories will stay as they are (right part of the diagram), but will be updated with an option to deploy resources into the shared clusters in the product's individual namespace, for example the Cloud SQL proxy or Workload Identity service accounts.
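As an illustration, a per-product namespace plus a Workload Identity-annotated service account could look like the minimal sketch below, written with the Kubernetes Python client. The namespace, service-account, and project names are hypothetical, and in practice these objects would be created by the product's Terraform rather than a script.

```python
# Sketch: a per-product namespace in the shared cluster, plus a Kubernetes service
# account bound to a GCP service account via Workload Identity.
# "league-alpha" and the GSA email are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run from a pod
core = client.CoreV1Api()

namespace = "league-alpha"  # hypothetical product namespace
gcp_sa = "svc-league-alpha@product-project.iam.gserviceaccount.com"  # hypothetical GSA

# Namespace that isolates this product's workloads inside the shared cluster.
core.create_namespace(
    client.V1Namespace(metadata=client.V1ObjectMeta(name=namespace))
)

# Kubernetes service account annotated for Workload Identity, so pods in this
# namespace can call GCP APIs in the product's own project without key files.
core.create_namespaced_service_account(
    namespace=namespace,
    body=client.V1ServiceAccount(
        metadata=client.V1ObjectMeta(
            name="league-alpha-app",
            annotations={"iam.gke.io/gcp-service-account": gcp_sa},
        )
    ),
)
```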
Once all applications are migrated into shared clusters, existing k8s clusters will be deprovisioned.
The user experience for migrating a service into a shared GKE cluster is documented here.
Until this ADR is implemented, services can be migrated between the existing clusters to reduce the number of clusters and the cost. The approach for migrating between clusters is documented here.
Reliability
A LimitRange policy will be implemented cluster-wide to preset resource requests/limits when workloads do not configure them.
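A minimal sketch of such a default, using the Kubernetes Python client; the CPU/memory values and namespace name are illustrative assumptions, not agreed defaults. Note that a LimitRange is a namespaced object, so "cluster-wide" in practice means creating one in each product namespace (for example from the post-creation setup repository).

```python
# Sketch: a default LimitRange so workloads that omit resources.requests/limits
# still get sane values. The concrete numbers are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

limit_range = client.V1LimitRange(
    metadata=client.V1ObjectMeta(name="default-requests-limits"),
    spec=client.V1LimitRangeSpec(
        limits=[
            client.V1LimitRangeItem(
                type="Container",
                default_request={"cpu": "100m", "memory": "128Mi"},  # used when requests are missing
                default={"cpu": "500m", "memory": "512Mi"},          # used when limits are missing
            )
        ]
    ),
)

# LimitRange is namespaced: repeat per product namespace ("league-alpha" is hypothetical).
core.create_namespaced_limit_range(namespace="league-alpha", body=limit_range)
```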
Security
Security concerns to be addressed (k8s and GCP level):
- mTLS (k8s) - needs to be enabled on the cluster via the service mesh, so that all traffic inside the cluster is encrypted
- Zero trust - do not rely on networking and firewalls alone; instead, have each application trust no one and require authentication/authorization, so the destination where an application is deployed no longer matters
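As a hedged illustration of the zero-trust point, the sketch below verifies the caller's Google-signed ID token inside the application instead of trusting the network path the request arrived on. It assumes services authenticate to each other with ID tokens minted for their Workload Identity service accounts; the audience URL and the allow-list of calling service accounts are hypothetical.

```python
# Sketch: application-level "trust no one" check for an incoming request,
# assuming callers present a Google-signed ID token. Audience and allow-list
# are hypothetical placeholders.
import google.auth.transport.requests
from google.oauth2 import id_token

EXPECTED_AUDIENCE = "https://orders.league-alpha.internal"  # hypothetical audience


def caller_is_authorized(bearer_token: str, allowed_service_accounts: set[str]) -> bool:
    """Verify the caller's ID token and check the calling identity explicitly,
    instead of trusting the network path or source IP."""
    request = google.auth.transport.requests.Request()
    try:
        claims = id_token.verify_oauth2_token(bearer_token, request, EXPECTED_AUDIENCE)
    except ValueError:
        return False  # bad signature, expired token, or wrong audience
    return claims.get("email") in allowed_service_accounts
```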
Consequences
Pros:
- A limited number of clusters reduces the maintenance overhead of keeping them all in sync and up to date
- A limited number of clusters reduces the cost of the GKE control plane
- A limited number of clusters reduces compute cost, because nodes can be shared and scaled more effectively
- Services in the same shared cluster will reach each other over a single private subnet, without leaving GCP or requiring additional firewall setup
- mTLS with Linkerd between all services in the cluster will provide built-in encryption, with less overhead than setting up a cross-cluster service mesh (see the sketch after this list)
- LimitRanges will ensure that all workloads have resource requests/limits set, so that heavy services will not overload neighbors on the same node
- IAM separation between products
- The current standard GCP project setup stays as it is and doesn't need rework/migration
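A minimal sketch of how the Linkerd-based mTLS could be switched on for a product namespace: with Linkerd, meshed pods get mTLS between each other by default, so the main step is marking the namespace for proxy injection. The namespace name is hypothetical, and in practice the annotation would be managed by the cluster setup Terraform rather than an ad-hoc script.

```python
# Sketch: mark a product namespace for Linkerd proxy injection, so that traffic
# between injected pods is encrypted with mTLS by default. Namespace is hypothetical.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

core.patch_namespace(
    name="league-alpha",
    body={"metadata": {"annotations": {"linkerd.io/inject": "enabled"}}},
)
```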
Cons:
- First-time IAM setup and Workload Identity deployment can be challenging, since an application must be given permission to call a GCP service in a different project (e.g. GKE pod -> Pub/Sub topic); see the sketch after this list
- Cross-project infrastructure deployments (e.g. GKE in one project, databases/topics/etc. in another) can make troubleshooting more time-consuming, but storing all logs and metrics in a central place reduces confusion and makes it easier to connect the dots
- Cross-namespace communication between services will exist and is allowed by default (restricting it with authorization policies is out of scope, see below)
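A sketch of the cross-project call pattern from the first con above: a pod in the shared GKE cluster publishing to a Pub/Sub topic that lives in the product's own project. With Workload Identity in place, Application Default Credentials resolve to the pod's GCP service account, which must be granted roles/pubsub.publisher on the topic in the other project; the project and topic names are hypothetical.

```python
# Sketch: publish from a pod in the shared GKE project to a Pub/Sub topic in the
# product's own project. Project and topic names are hypothetical placeholders.
from google.cloud import pubsub_v1

PRODUCT_PROJECT = "league-alpha-prod"  # product project, not the shared GKE project
TOPIC_ID = "order-events"

publisher = pubsub_v1.PublisherClient()  # picks up Workload Identity credentials
topic_path = publisher.topic_path(PRODUCT_PROJECT, TOPIC_ID)

# The publish call itself is identical to the single-project case; only the IAM
# binding (roles/pubsub.publisher on the topic for the pod's GSA) is cross-project.
future = publisher.publish(topic_path, b'{"order_id": "123"}')
print(future.result())  # message id; raises if the cross-project IAM is missing
```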
Alternatives Considered
The considered alternatives, with their pros and cons, are documented in the previous ADR (ADR20: GCP resources and environments).
Out of scope
- Investigating/improving how services connect to their corresponding Cloud SQL databases, using the private network to lower cost/latency
- Cross-cluster connectivity from the current clusters to the new shared ones
- Multiple node pools in shared clusters with labels and selectors to separate workloads
- Allowing/blocklisting connections across namespaces in the shared clusters with authorization policies; all connections will be allowed by default