Entry 002 · 02/08/2024 · 3 min read
Writing Production-Grade Kubernetes Operators
Lessons from building and running three custom operators in production, including the parts the tutorials skip.
The Kubernetes operator pattern is one of the most powerful extension mechanisms in the ecosystem. It's also one of the most frequently misunderstood—at least by people who've only read blog posts about it and haven't yet debugged one at 2am.
This post covers what the tutorials leave out.
What an Operator Actually Is
At its core, an operator is a controller that watches Kubernetes resources and reconciles cluster state toward some desired condition. The "operator pattern" combines a custom resource definition (CRD) with a controller that understands the operational logic for your application.
The key insight is that you're encoding human operational knowledge into software. An operator shouldn't just deploy your application—it should know how to scale it, handle failures, perform upgrades, and take backups. If your operator just translates a CRD into a Deployment, you've built a complicated YAML template engine.
The Reconciliation Loop
Every controller follows the same basic pattern: watch for changes, enqueue the affected object, dequeue and reconcile. The reconcile function receives a request, fetches the current state, computes the desired state, and applies the diff.
func (r *MyAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := log.FromContext(ctx)

    // Fetch current state. The object may already be gone by the time the
    // event is dequeued; not-found is normal, not an error.
    var myApp apiv1.MyApp
    if err := r.Get(ctx, req.NamespacedName, &myApp); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    log.Info("reconciling", "name", req.NamespacedName)

    // Reconcile logic here: compute desired state and apply the diff.
    if err := r.reconcileDeployment(ctx, &myApp); err != nil {
        return ctrl.Result{}, err
    }
    return ctrl.Result{}, nil
}

The function must be idempotent. It will be called repeatedly—on every change, on controller restart, and periodically via re-queue. If your reconcile function isn't safe to call multiple times with the same state, you have a bug.
Status Conditions Are Not Optional
A common mistake in operator development is treating the .status subresource as an afterthought. In production, status conditions are your primary debugging interface.
Use the standard condition types where they apply: Ready, Progressing, Degraded. Write human-readable messages. Include the reason codes that on-call engineers will actually search for in runbooks.
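As a concrete sketch, assuming MyApp's status struct carries the conventional Conditions []metav1.Condition field (meta here is k8s.io/apimachinery/pkg/api/meta, metav1 is k8s.io/apimachinery/pkg/apis/meta/v1), the apimachinery helpers handle the last-transition-time bookkeeping for you:

// At the end of a successful reconcile:
meta.SetStatusCondition(&myApp.Status.Conditions, metav1.Condition{
    Type:               "Ready",
    Status:             metav1.ConditionTrue,
    Reason:             "DeploymentAvailable", // the string on-call will grep for
    Message:            "all replicas are available",
    ObservedGeneration: myApp.Generation,
})
if err := r.Status().Update(ctx, &myApp); err != nil {
    return ctrl.Result{}, err
}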
Finalizers and Deletion
If your operator creates resources outside the cluster—cloud storage buckets, database instances, DNS records—you need finalizers to handle cleanup on deletion. Without them, you'll leak resources every time someone deletes a custom resource.
The pattern is straightforward: add a finalizer on creation, run cleanup logic before removing it. The tricky part is making cleanup idempotent and handling partial cleanup failures gracefully.
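Here's a sketch of that flow using controller-runtime's controllerutil helpers (recent versions, where AddFinalizer and RemoveFinalizer report whether they changed anything); the finalizer name and cleanupExternalResources are placeholders for your own logic:

const myAppFinalizer = "myapp.example.com/finalizer" // placeholder name

// Inside Reconcile, right after fetching myApp:
if myApp.DeletionTimestamp.IsZero() {
    // Live object: attach our finalizer before creating any external state.
    if controllerutil.AddFinalizer(&myApp, myAppFinalizer) {
        if err := r.Update(ctx, &myApp); err != nil {
            return ctrl.Result{}, err
        }
    }
} else if controllerutil.ContainsFinalizer(&myApp, myAppFinalizer) {
    // Being deleted: clean up external state first. cleanupExternalResources
    // must itself be idempotent; on error we return and retry with backoff.
    if err := r.cleanupExternalResources(ctx, &myApp); err != nil {
        return ctrl.Result{}, err
    }
    controllerutil.RemoveFinalizer(&myApp, myAppFinalizer)
    if err := r.Update(ctx, &myApp); err != nil {
        return ctrl.Result{}, err
    }
    return ctrl.Result{}, nil // nothing left to reconcile
}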
Rate Limiting and Backoff
The default controller-runtime rate limiter is fine for development. In production, with hundreds of custom resources, you need to tune it. We've found that exponential backoff with a base of 5s and a max of 300s works well for most operators.
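Wiring those numbers in looks roughly like this, assuming controller-runtime v0.19+ and client-go v0.31+ where the workqueue types are generic (older releases take the untyped workqueue.NewItemExponentialFailureRateLimiter instead):

import (
    "time"

    "k8s.io/client-go/util/workqueue"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/controller"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func (r *MyAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&apiv1.MyApp{}).
        WithOptions(controller.Options{
            // Failed reconciles retry at 5s, doubling up to a 300s ceiling.
            RateLimiter: workqueue.NewTypedItemExponentialFailureRateLimiter[reconcile.Request](
                5*time.Second, 300*time.Second),
        }).
        Complete(r)
}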
For operators that interact with external APIs, implement per-resource rate limiting to avoid hitting provider limits. This is especially important for operators that manage cloud resources.
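One shape this can take is a token bucket per object, keyed by namespaced name. The perResourceLimiter below is a hypothetical helper built on golang.org/x/time/rate, not part of controller-runtime:

import (
    "sync"

    "golang.org/x/time/rate"
    "k8s.io/apimachinery/pkg/types"
)

// perResourceLimiter keeps one token bucket per custom resource so a single
// hot object can't burn the whole provider quota.
type perResourceLimiter struct {
    mu       sync.Mutex
    limiters map[types.NamespacedName]*rate.Limiter
    rps      rate.Limit
    burst    int
}

func newPerResourceLimiter(rps rate.Limit, burst int) *perResourceLimiter {
    return &perResourceLimiter{
        limiters: make(map[types.NamespacedName]*rate.Limiter),
        rps:      rps,
        burst:    burst,
    }
}

func (p *perResourceLimiter) get(key types.NamespacedName) *rate.Limiter {
    p.mu.Lock()
    defer p.mu.Unlock()
    l, ok := p.limiters[key]
    if !ok {
        l = rate.NewLimiter(p.rps, p.burst)
        p.limiters[key] = l
    }
    return l
}

In Reconcile, gate each external call with get(req.NamespacedName).Wait(ctx), or call Allow() and return a RequeueAfter result when the bucket is empty.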
Testing That Actually Catches Bugs
Unit tests with a fake client are necessary but not sufficient. The bugs that matter happen in the interaction between controller restarts, concurrent reconciliations, and slow external systems.
We run integration tests using envtest, which spins up a real kube-apiserver and etcd locally, and chaos tests that randomly delete and recreate resources. The chaos tests have caught more bugs than all our unit tests combined.
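A minimal envtest harness, assuming the standard kubebuilder layout where CRD manifests live under config/crd/bases:

import (
    "path/filepath"
    "testing"

    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/envtest"
)

func TestReconcile(t *testing.T) {
    // No nodes run here, but API semantics (validation, conflicts,
    // watches) are the real thing.
    testEnv := &envtest.Environment{
        CRDDirectoryPaths: []string{filepath.Join("..", "config", "crd", "bases")},
    }
    cfg, err := testEnv.Start()
    if err != nil {
        t.Fatal(err)
    }
    defer testEnv.Stop()

    // In real tests, pass a Scheme with apiv1 registered so the client
    // understands MyApp objects.
    k8sClient, err := client.New(cfg, client.Options{})
    if err != nil {
        t.Fatal(err)
    }
    _ = k8sClient // create a MyApp here, run the controller, assert on results
}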