Entry 004 · 03/20/2024 · 3 min read
Progressive Delivery with Argo Rollouts
Implementing canary deployments and automated rollbacks using Argo Rollouts with Prometheus analysis templates.
The fundamental promise of continuous deployment is that you can release frequently without increasing risk. The catch is that this only works if your rollout strategy is sophisticated enough to catch problems before they affect all users.
Kubernetes Deployments give you rolling updates. That's a start, but it's not enough. You need canary releases, traffic splitting, and automated rollback—and you need them tied to real production metrics, not just liveness probes.
Why Standard Rolling Updates Aren't Enough
A standard rolling update replaces pods one by one, waiting for each to become healthy before proceeding. "Healthy" means the pod passes its readiness probe—which typically just means the HTTP server is responding.
But a pod can be responding to readiness probes while serving elevated error rates, higher latency, or subtly wrong responses. By the time your monitoring alerts, you might be 80% into a bad rollout.
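For context, a typical readiness probe looks something like the following (endpoint, port, and timing values are illustrative): it confirms the server answers with a 2xx, and nothing more.

```yaml
readinessProbe:
  httpGet:
    path: /healthz     # illustrative health endpoint
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
# Passing this probe only proves the process responds on /healthz.
# It says nothing about 5xx rates or latency on real request paths.
```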
Argo Rollouts Architecture
Argo Rollouts replaces the standard Deployment controller with a Rollout resource that supports weighted traffic shifting, analysis runs, and pause/resume gates.
A typical canary configuration looks like:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
      - setWeight: 5
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: error-rate
      - setWeight: 25
      - pause: {duration: 10m}
      - setWeight: 100
```

The key is the analysis step. Before advancing from 5% to 25% traffic, the controller runs an AnalysisTemplate that queries Prometheus and evaluates whether the canary is healthy.
Writing Good Analysis Templates
The default success rate template from the documentation is a reasonable starting point, but production templates need more nuance:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
spec:
  args:
  - name: service-name
  - name: canary-hash
  metrics:
  - name: error-rate
    successCondition: result[0] <= 0.01
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(http_requests_total{
            job="{{args.service-name}}",
            status=~"5..",
            version="{{args.canary-hash}}"
          }[5m])) /
          sum(rate(http_requests_total{
            job="{{args.service-name}}",
            version="{{args.canary-hash}}"
          }[5m]))
```

Note the failureLimit: 2. This tolerates a couple of anomalous measurements without triggering a rollback (outright query errors, such as Prometheus being unreachable, are governed separately by consecutiveErrorLimit). In practice, Prometheus availability is rarely 100%, and you don't want rollback logic that's brittle to your metrics infrastructure.
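Building on that point, a metric can be evaluated as a series of measurements rather than a single sample, so no one flaky scrape decides the outcome. A sketch of the relevant fields (values are illustrative, not our production settings, and the query is a simplified version without args):

```yaml
metrics:
- name: error-rate
  interval: 1m              # re-run the query every minute
  count: 5                  # take five measurements per analysis run
  failureLimit: 2           # tolerate two measurements breaching the condition
  consecutiveErrorLimit: 4  # tolerate query errors (e.g. Prometheus briefly down)
  successCondition: result[0] <= 0.01
  provider:
    prometheus:
      address: http://prometheus:9090
      query: |
        sum(rate(http_requests_total{status=~"5.."}[5m]))
        /
        sum(rate(http_requests_total[5m]))
```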
The Rollback Story
Automated rollback is the feature people want most and trust least. The concern is valid: you don't want a spurious alert to roll back a deployment at 3am.
Our approach is to make rollback automatic for clear failures (error rate > 5%, p99 latency > 3x baseline) and require human intervention for ambiguous signals. The Argo Rollouts dashboard provides one-click rollback for on-call engineers when the automated analysis says "degraded but not failed."
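One way to express "degraded but not failed" in an AnalysisTemplate is to set both a successCondition and a failureCondition: a measurement satisfying neither is marked Inconclusive, which pauses the rollout for a human decision instead of rolling back. A sketch, with a hypothetical latency metric and illustrative absolute thresholds standing in for the baseline-relative ones described above:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p99-latency        # hypothetical template name
spec:
  args:
  - name: canary-hash
  metrics:
  - name: p99-latency
    successCondition: result[0] <= 1.5  # clearly healthy: keep promoting
    failureCondition: result[0] > 3.0   # clear failure: roll back automatically
    # Between 1.5s and 3.0s the measurement is Inconclusive: the rollout
    # pauses and the on-call engineer decides via the dashboard.
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          histogram_quantile(0.99, sum(rate(
            http_request_duration_seconds_bucket{version="{{args.canary-hash}}"}[5m]
          )) by (le))
```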
We've done approximately 400 deployments with this setup. There have been 12 automated rollbacks, all of which were correct. There have been three false negatives where a bad deploy made it to full rollout before being caught by alerting—all three happened because the failure mode wasn't captured in our analysis templates. We added the missing metrics each time.
Traffic Splitting at the Ingress Layer
Argo Rollouts supports multiple traffic routing providers. We use nginx-ingress with header-based routing for internal testing before any percentage-based canary promotion. This lets engineers verify the canary against production data without routing any real user traffic to it.
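With the nginx provider, this kind of header gate can be wired up via additionalIngressAnnotations, which Argo Rollouts copies onto the canary Ingress it manages. A sketch, with hypothetical Service and Ingress names:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      canaryService: my-svc-canary      # hypothetical canary Service
      stableService: my-svc-stable      # hypothetical stable Service
      trafficRouting:
        nginx:
          stableIngress: my-svc-ingress # existing Ingress for the stable service
          additionalIngressAnnotations:
            # Requests sending "X-Canary: always" reach the canary even at
            # 0% weight, so engineers can verify it against production
            # traffic patterns before any users are routed to it.
            canary-by-header: X-Canary
```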
The combination of header routing for verification and percentage-based canary for gradual rollout has been the most reliable pattern we've found.