GCP GKE Architecture
0006-GCP-GKE-ARCHITECTURE (v2)
Context and Problem Statement
The goal is to build and support cost-efficient, scalable, resilient, and fault-tolerant infrastructure that can be used to host different solutions and products.
Our current architecture is documented in the following [ADR](confluence-title://PE/ADR20: GCP resources and environments) and implements the many-to-many option, where each product/league has its own set of GCP projects and GKE clusters (dev/prod).
Our experience with this approach has identified the following areas which require improvement:
- The number of clusters to maintain (operational overhead)
- The GKE control-plane fee paid for every cluster (direct GCP spend)
- The cost of underutilized compute resources (direct GCP spend)
- Services experience avoidable latency in the existing multi-cluster architecture
Our experience with this approach has benefitted DroneUp in the following areas:
- Our security posture is improved by isolating resources in separate GCP projects
- Our security posture is improved because teams can configure IAM quickly in separate GCP projects
- Billing can be tracked for each project/product separately
- All Terraform resources for one product are controlled in one place
Decision Outcome
The many-to-one option is selected, where:
- All GKE workloads are deployed into one set of shared, centralized clusters
- All other resources remain controlled within the existing product projects
(Architecture diagram: a shared GCP project with the shared GKE clusters and supporting resources on the left; the existing per-product projects on the right.)
The left part of the diagram represents a shared GCP project that hosts the shared GKE clusters for dev and prod, including shared supporting resources:
- Central observability stack:
  - Logging
  - Metrics
  - Tracing
  - Alerting
- Artifact Registry
- Networking:
  - Cato integration
  - Firewall
- Backup of sunsetted tool (?)
Each cluster will contain a number of namespaces to separate workloads for different products, plus namespaces for maintenance workloads (service mesh, cert-manager, external-dns, ingress controllers).
Cluster creation and post-creation setup will be controlled by two separate repositories (four TFC workspaces) to reduce overhead and dependency-management issues.
Existing product infrastructure repositories will stay as they are (right part of the diagram), but will be updated with an option to deploy resources into the shared clusters in the product's individual namespace, for example the Cloud SQL proxy or Workload Identity service accounts.
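As an illustration, a per-product namespace plus a Workload Identity-annotated service account could look like the minimal sketch below, written with the Kubernetes Python client. The namespace, service-account, and project names are hypothetical, and in practice these objects would be created by the product's Terraform rather than a script.

```python
# Sketch: a per-product namespace in the shared cluster, plus a Kubernetes service
# account bound to a GCP service account via Workload Identity.
# "league-alpha" and the GSA email are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run from a pod
core = client.CoreV1Api()

namespace = "league-alpha"  # hypothetical product namespace
gcp_sa = "svc-league-alpha@product-project.iam.gserviceaccount.com"  # hypothetical GSA

# Namespace that isolates this product's workloads inside the shared cluster.
core.create_namespace(
    client.V1Namespace(metadata=client.V1ObjectMeta(name=namespace))
)

# Kubernetes service account annotated for Workload Identity, so pods in this
# namespace can call GCP APIs in the product's own project without key files.
core.create_namespaced_service_account(
    namespace=namespace,
    body=client.V1ServiceAccount(
        metadata=client.V1ObjectMeta(
            name="league-alpha-app",
            annotations={"iam.gke.io/gcp-service-account": gcp_sa},
        )
    ),
)
```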
Once all applications are migrated into shared clusters, existing k8s clusters will be deprovisioned.
The user experience for migrating a service into a shared GKE cluster is documented here.
Until this ADR is implemented, services can be migrated between the existing clusters to reduce the number of clusters and the cost. The approach for migrating between clusters is documented here.
Reliability
A LimitRange policy will be implemented cluster-wide to preset resource requests/limits when workloads do not configure them.
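A minimal sketch of such a default, using the Kubernetes Python client; the CPU/memory values and namespace name are illustrative assumptions, not agreed defaults. Note that a LimitRange is a namespaced object, so "cluster-wide" in practice means creating one in each product namespace (for example from the post-creation setup repository).

```python
# Sketch: a default LimitRange so workloads that omit resources.requests/limits
# still get sane values. The concrete numbers are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

limit_range = client.V1LimitRange(
    metadata=client.V1ObjectMeta(name="default-requests-limits"),
    spec=client.V1LimitRangeSpec(
        limits=[
            client.V1LimitRangeItem(
                type="Container",
                default_request={"cpu": "100m", "memory": "128Mi"},  # used when requests are missing
                default={"cpu": "500m", "memory": "512Mi"},          # used when limits are missing
            )
        ]
    ),
)

# LimitRange is namespaced: repeat per product namespace ("league-alpha" is hypothetical).
core.create_namespaced_limit_range(namespace="league-alpha", body=limit_range)
```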
Security
Security concerns to be addressed (k8s and GCP level):
- mTLS (k8s) - needs to be enabled on the cluster via the service mesh, so that all traffic inside the cluster is encrypted
- Zero trust - do not rely on networking and firewalls alone; instead, have each application trust no one and require authentication/authorization, so the destination where an application is deployed no longer matters
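As a hedged illustration of the zero-trust point, the sketch below verifies the caller's Google-signed ID token inside the application instead of trusting the network path the request arrived on. It assumes services authenticate to each other with ID tokens minted for their Workload Identity service accounts; the audience URL and the allow-list of calling service accounts are hypothetical.

```python
# Sketch: application-level "trust no one" check for an incoming request,
# assuming callers present a Google-signed ID token. Audience and allow-list
# are hypothetical placeholders.
import google.auth.transport.requests
from google.oauth2 import id_token

EXPECTED_AUDIENCE = "https://orders.league-alpha.internal"  # hypothetical audience


def caller_is_authorized(bearer_token: str, allowed_service_accounts: set[str]) -> bool:
    """Verify the caller's ID token and check the calling identity explicitly,
    instead of trusting the network path or source IP."""
    request = google.auth.transport.requests.Request()
    try:
        claims = id_token.verify_oauth2_token(bearer_token, request, EXPECTED_AUDIENCE)
    except ValueError:
        return False  # bad signature, expired token, or wrong audience
    return claims.get("email") in allowed_service_accounts
```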
Consequences
Pros:
- A limited number of clusters reduces the maintenance overhead of keeping them all in sync and up to date
- A limited number of clusters reduces the cost of the GKE control plane
- A limited number of clusters reduces compute cost, because nodes can be shared and scaled more effectively
- Services in the same shared cluster will reach each other over a single private subnet, without leaving GCP or requiring additional firewall setup
- mTLS with Linkerd between all services in the cluster will provide built-in encryption, with less overhead than setting up a cross-cluster service mesh (see the sketch after this list)
- LimitRanges will ensure that all workloads have resource requests/limits set, so that heavy services will not overload neighbors on the same node
- IAM separation between products
- The current standard GCP project setup stays as it is and doesn't need rework/migration
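A minimal sketch of how the Linkerd-based mTLS could be switched on for a product namespace: with Linkerd, meshed pods get mTLS between each other by default, so the main step is marking the namespace for proxy injection. The namespace name is hypothetical, and in practice the annotation would be managed by the cluster setup Terraform rather than an ad-hoc script.

```python
# Sketch: mark a product namespace for Linkerd proxy injection, so that traffic
# between injected pods is encrypted with mTLS by default. Namespace is hypothetical.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

core.patch_namespace(
    name="league-alpha",
    body={"metadata": {"annotations": {"linkerd.io/inject": "enabled"}}},
)
```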
Cons:
- First-time IAM setup and Workload Identity deployment can be challenging, since an application must be given permission to call a GCP service in a different project (e.g. GKE pod -> Pub/Sub topic); see the sketch after this list
- Cross-project infrastructure deployments (e.g. GKE in one project, databases/topics/etc. in another) can make troubleshooting more time-consuming, but storing all logs and metrics in a central place reduces confusion and makes it easier to connect the dots
- Cross-namespace communication between services will exist and is allowed by default (restricting it with authorization policies is out of scope, see below)
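A sketch of the cross-project call pattern from the first con above: a pod in the shared GKE cluster publishing to a Pub/Sub topic that lives in the product's own project. With Workload Identity in place, Application Default Credentials resolve to the pod's GCP service account, which must be granted roles/pubsub.publisher on the topic in the other project; the project and topic names are hypothetical.

```python
# Sketch: publish from a pod in the shared GKE project to a Pub/Sub topic in the
# product's own project. Project and topic names are hypothetical placeholders.
from google.cloud import pubsub_v1

PRODUCT_PROJECT = "league-alpha-prod"  # product project, not the shared GKE project
TOPIC_ID = "order-events"

publisher = pubsub_v1.PublisherClient()  # picks up Workload Identity credentials
topic_path = publisher.topic_path(PRODUCT_PROJECT, TOPIC_ID)

# The publish call itself is identical to the single-project case; only the IAM
# binding (roles/pubsub.publisher on the topic for the pod's GSA) is cross-project.
future = publisher.publish(topic_path, b'{"order_id": "123"}')
print(future.result())  # message id; raises if the cross-project IAM is missing
```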
Alternatives Considered
The considered alternatives, with their pros and cons, are documented in the previous ADR (ADR20: GCP resources and environments).
Out of scope
- Investigating/improving how services connect to their corresponding Cloud SQL databases, using the private network to lower cost/latency
- Cross-cluster connectivity from the current clusters to the new shared ones
- Multiple node pools in shared clusters with labels and selectors to separate workloads
- Allowing/blocklisting connections across namespaces in the shared clusters with authorization policies; all connections will be allowed by default