GitOps at Scale: Managing 50 Clusters with Flux

When you're managing a handful of Kubernetes clusters, ad-hoc kubectl apply commands and shell scripts can get you surprisingly far. At five clusters, things start to feel fragile. At fifty, they fall apart entirely.

This is the story of how we moved from a state of configuration drift and deployment anxiety to a fully reconciled GitOps model using Flux v2, a single monorepo, and a small amount of discipline.

The Problem with Imperative Infrastructure

Before GitOps, our cluster state lived in people's heads and in CI pipeline scripts scattered across a dozen repositories. Deployments were events, not facts. You couldn't look at a cluster and know with certainty what was supposed to be running there, let alone why.

The classic symptoms were all present. "Works on staging, broken in prod" happened because someone had applied a hotfix directly without committing it. Clusters had been "temporarily" patched six months earlier, and nobody remembered how to reconcile them. Nobody wanted to touch anything because the blast radius was unknowable.

Choosing Flux v2

We evaluated both Argo CD and Flux before committing. Argo CD has a better UI and a stronger RBAC story, but Flux's pull-based architecture and first-class support for Kustomize overlays made it a better fit for our monorepo strategy.

Flux treats your Git repository as the source of truth and continuously reconciles cluster state toward what's declared there. If someone manually changes a resource that Flux manages, Flux reverts the change on its next reconciliation. This isn't just convenient; it's a cultural forcing function.
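To make the reconciliation idea concrete, here is a toy model in Python. It is not Flux's implementation: real Flux applies manifests through the Kubernetes API and only prunes objects it previously applied. The resource names and the dictionary shapes are hypothetical, chosen only for illustration.

```python
from copy import deepcopy


def reconcile(desired: dict, live: dict) -> list[str]:
    """Make `live` match `desired` and return a log of the actions taken."""
    actions = []
    for name, spec in desired.items():
        if name not in live:
            live[name] = deepcopy(spec)
            actions.append(f"create {name}")
        elif live[name] != spec:
            # Drift: someone changed the object outside of Git.
            live[name] = deepcopy(spec)
            actions.append(f"revert {name}")
    # Simplified pruning: drop anything that is not declared in Git.
    for name in sorted(set(live) - set(desired)):
        del live[name]
        actions.append(f"prune {name}")
    return actions


# Desired state as declared in Git (hypothetical workloads).
desired = {
    "deploy/api": {"image": "api:1.4.2", "replicas": 3},
    "deploy/web": {"image": "web:2.0.0", "replicas": 2},
}

# Live state after a manual hotfix and a forgotten debug deployment.
live = {
    "deploy/api": {"image": "api:1.4.2-hotfix", "replicas": 3},
    "deploy/debug-shell": {"image": "busybox", "replicas": 1},
}

actions = reconcile(desired, live)
print(actions)
```

Running the function a second time against the same inputs returns an empty list, which is the property that matters: once the cluster matches Git, reconciliation is a no-op.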

The Monorepo Structure

Our repository structure follows a "fleet" pattern:

infrastructure/
├── base/               # Shared Kustomize bases
│   ├── cert-manager/
│   ├── ingress-nginx/
│   └── monitoring/
├── clusters/           # Per-cluster overlays
│   ├── prod-us-east/
│   ├── prod-eu-west/
│   └── staging/
└── apps/               # Application workloads
    ├── base/
    └── overlays/

Each cluster directory contains the flux-system bootstrap manifests and the Flux Kustomization resources that point at the appropriate app overlays. Cluster-specific values (resource limits, replica counts, regional endpoints) live in the overlay.
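As a rough sketch of what one of those per-cluster Kustomization resources might look like, the Python below builds the manifest as a dictionary. Several things here are assumptions rather than details from our repository: that infrastructure/ is the repository root, that overlays are organized as apps/overlays/<cluster>/<app>, the app name "checkout", and the 10m interval. The GitRepository named flux-system is the one that flux bootstrap creates by default.

```python
import json


def flux_kustomization(cluster: str, app: str, interval: str = "10m") -> dict:
    """Build a Flux Kustomization that points one cluster at one app overlay."""
    return {
        "apiVersion": "kustomize.toolkit.fluxcd.io/v1",
        "kind": "Kustomization",
        "metadata": {"name": app, "namespace": "flux-system"},
        "spec": {
            "interval": interval,
            # Assumed layout: overlays are grouped per cluster, then per app.
            "path": f"./apps/overlays/{cluster}/{app}",
            # Delete objects from the cluster when they are removed from Git.
            "prune": True,
            "sourceRef": {"kind": "GitRepository", "name": "flux-system"},
        },
    }


# "checkout" is a hypothetical application name.
manifest = flux_kustomization("prod-us-east", "checkout")
print(json.dumps(manifest, indent=2))
```

The manifest is printed as JSON only to keep the example within the standard library; in the repository it would normally be committed as YAML alongside the cluster's other files.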

Progressive Rollouts

One of the less-discussed benefits of GitOps is that it makes progressive rollouts easy to reason about. We use Flux's image automation controllers to update image tags in Git automatically after a merge to main, then rely on Flagger for canary analysis before full rollout.

The entire state machine is visible: the Git history shows exactly when an image tag was bumped, and Flagger's Canary status and events show when analysis started and whether it promoted or rolled back.
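The promote-or-roll-back decision can be sketched as a small state machine. The Python below is a simplified model, not Flagger's code: the parameter names mirror Flagger's step weight, max weight, and failure threshold settings, but the default values and the list of check results are made up for illustration.

```python
def run_canary(
    checks: list[bool],
    step_weight: int = 10,
    max_weight: int = 50,
    threshold: int = 2,
) -> tuple[str, list[int]]:
    """Walk a canary through analysis.

    `checks` holds one metric-check result per analysis interval.
    Returns the outcome and the traffic weights the canary reached.
    """
    weight, failures, history = 0, 0, []
    for passed in checks:
        if passed:
            # A healthy interval shifts more traffic to the canary.
            weight += step_weight
            history.append(weight)
            if weight >= max_weight:
                return "promoted", history
        else:
            # Too many failed checks abort the rollout.
            failures += 1
            if failures >= threshold:
                return "rolled back", history
    return "in progress", history


print(run_canary([True, True, True, True, True]))
print(run_canary([True, False, True, False]))
```

With these defaults, five passing intervals take the canary to the maximum weight and promote it, while two failed checks trigger a rollback regardless of how much traffic had already been shifted.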

What We Learned

After six months of operating this way across 50 clusters, a few things stand out. First, the initial investment in structuring the monorepo correctly is significant, so don't rush it. Second, Flux's alerting is underrated; getting Slack notifications when reconciliation fails is genuinely useful. Third, secret management remains the hardest problem. We use Sealed Secrets but are evaluating External Secrets Operator.

The biggest win isn't operational—it's cognitive. New engineers can understand the entire cluster estate by reading files in a single repository. That alone has been worth the migration cost.