
ArgoCD Powered Multi Cluster Kubernetes Architecture


You’ve likely seen a similar architecture: one management cluster, multiple workload clusters, and GitOps for everything.




But here’s the real question. What happens when you try this in production?


Let me give you a quick walkthrough of what worked, what broke, and what actually helped when we ran this setup across 20+ clusters and 3 cloud providers.


Why we built this

We needed a way to offer isolated Kubernetes clusters for each team.


Not just VPC-level isolation, but cluster-level, app-level, and access-level isolation.


We didn’t want to babysit clusters.


So we wired it like this:


Cluster creation through Git using CAPI


Rancher for policy and access control


ArgoCD to bootstrap clusters and sync app workloads


Git as the control surface
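
To make the CAPI piece concrete, here’s roughly what a cluster definition committed to Git looks like. This is a minimal sketch; the names, namespace, labels, and the AWS provider are illustrative assumptions, not our actual spec.

```yaml
# Minimal sketch of a CAPI cluster defined in Git
# (names and the AWS provider are illustrative placeholders).
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: team-xyz-prod
  namespace: clusters
  labels:
    environment: prod
    team: xyz
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: team-xyz-prod-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: team-xyz-prod
```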


Sounds good? It was. Until things got real.


What Goes Wrong (And How to Prevent It)

1. Cluster Drift Between Repo and Reality

The Problem:

Clusters often diverge from the spec defined in Git due to manual patching or cloud-specific quirks (e.g., Azure API differences vs. AWS).


Fix:


Use CAPI + GitOps continuously, not just for provisioning.


Add periodic drift detection. Tools like Cluster API Provider GCP + Kyverno policies help lock things down.
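
One concrete way to keep cluster specs pinned to Git is to let ArgoCD reconcile them with automated sync and self-heal, so out-of-band patches get reverted instead of quietly accumulating. A hedged sketch, assuming the cluster manifests live in a `clusters/prod` folder of a hypothetical platform repo:

```yaml
# Sketch: ArgoCD Application that continuously reconciles CAPI cluster
# specs from Git. Repo URL, project, and paths are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: capi-clusters
  namespace: argocd
spec:
  project: platform
  source:
    repoURL: https://git.example.com/platform/clusters.git
    targetRevision: main
    path: clusters/prod
  destination:
    server: https://kubernetes.default.svc   # the management cluster
    namespace: clusters
  syncPolicy:
    automated:
      prune: true      # remove resources that were deleted from Git
      selfHeal: true   # revert manual, out-of-band changes
    syncOptions:
      - CreateNamespace=true
```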


2. ArgoCD in Each Cluster Becomes a Management Nightmare

The Problem:

If every workload cluster has its own ArgoCD, upgrades and credential rotations can snowball.


Fix:


Run ArgoCD in a central cluster and register workload clusters through external cluster secrets (ArgoCD cluster secrets plus project-scoped access).


Only use in-cluster ArgoCDs if tenancy or network boundaries force it.
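
For reference, registering a workload cluster with a central ArgoCD can be done declaratively with a labelled Secret. This is a sketch: the cluster name, server URL, and auth block are placeholders, and the real credentials should be injected by your secrets tooling rather than committed to Git.

```yaml
# Sketch of a declarative ArgoCD cluster registration.
# Values are placeholders; do not commit real credentials.
apiVersion: v1
kind: Secret
metadata:
  name: team-xyz-prod-cluster
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster   # tells ArgoCD this Secret is a cluster
    environment: prod
    team: xyz
type: Opaque
stringData:
  name: team-xyz-prod
  server: https://team-xyz-prod.example.com:6443
  config: |
    {
      "bearerToken": "<injected by your secrets tooling>",
      "tlsClientConfig": {
        "caData": "<base64 CA bundle>"
      }
    }
```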


3. Secrets Management Breaks GitOps

The Problem:

Application teams need secrets, but storing them in Git is a no-go. Centralized secrets engines don’t scale easily across multiple clouds.


Fix:

Integrate External Secrets Operator (ESO) with ArgoCD. Define secrets as resources in Git, but source their values from Vault, SSM, or Secret Manager per cloud.
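
A minimal ESO sketch of what that looks like: only this reference lives in Git, while the value stays in Vault (or SSM / Secret Manager via a different SecretStore). The store name, secret path, and keys here are assumptions.

```yaml
# Sketch: an ExternalSecret whose value is pulled from a backend store.
# Store name, path, and keys are illustrative placeholders.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-db-credentials
  namespace: team-xyz
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend          # a (Cluster)SecretStore defined per cloud
    kind: ClusterSecretStore
  target:
    name: app-db-credentials     # Kubernetes Secret created by ESO
  data:
    - secretKey: password
      remoteRef:
        key: teams/xyz/prod/db
        property: password
```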


4. Version Skews Break Cluster Creation

The Problem:

Upgrading CAPI controllers or Rancher while maintaining backward compatibility is... painful.


Fix:

Test infra components like Rancher, CAPI, and ArgoCD on dedicated ephemeral clusters before pushing specs to production. Maintain staging cluster groups per cloud.


Tip for Scaling: Label Everything

Label clusters with environment=prod|dev, team=xyz, cost-center=abc.


ArgoCD projects and apps can then use selectors to auto-target environments.


Rancher can leverage these for policy scoping too.
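
Here’s a hedged example of how those labels pay off: an ApplicationSet that uses the cluster generator to target only clusters labelled environment=prod. The repo URL, paths, and the add-on itself are illustrative.

```yaml
# Sketch: deploy an add-on to every registered cluster labelled environment=prod.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: monitoring-stack
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            environment: prod
  template:
    metadata:
      name: 'monitoring-{{name}}'
    spec:
      project: platform
      source:
        repoURL: https://git.example.com/platform/addons.git
        targetRevision: main
        path: monitoring
      destination:
        server: '{{server}}'
        namespace: monitoring
      syncPolicy:
        automated:
          selfHeal: true
```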


My personal experience says that not everything that looks good on paper works that well in practice, at least not on its own.


To scale platform engineering in a multi-cloud world, separate your concerns:


Control Plane (cluster management, policies)


Data Plane (app workloads)


Delivery Plane (ArgoCD and GitOps pipelines)
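
One way (not the only one) to encode the delivery-plane boundary is an ArgoCD AppProject that pins a team’s apps to specific repos, destination clusters, and namespaces. All names here are placeholders.

```yaml
# Sketch: an AppProject that keeps a team's delivery scope narrow.
# Repo URLs, cluster URL, and namespaces are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: team-xyz
  namespace: argocd
spec:
  description: Workloads owned by team xyz
  sourceRepos:
    - https://git.example.com/team-xyz/*
  destinations:
    - server: https://team-xyz-prod.example.com:6443
      namespace: 'team-xyz-*'
  clusterResourceWhitelist: []        # no cluster-scoped resources from app repos
  namespaceResourceBlacklist:
    - group: ''
      kind: ResourceQuota             # quotas stay with the control plane
```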


Get these boundaries right, and the setup becomes a force multiplier.

 
 
 
