
Entry 005 · 05/01/2024 · 3 min read

Data Pipelines on Kubernetes: Lessons from Airflow to Argo

A migration story from Airflow on VMs to cloud-native data pipelines on Argo Workflows, and what we'd do differently.

Our data pipeline story starts where many do: Airflow running on a couple of VMs, a tangle of DAGs written by three different people with three different opinions about how DAGs should be written, and a lingering anxiety about what happens when the scheduler dies.

Two years later, we run all production data pipelines on Kubernetes using Argo Workflows. This is the migration story, including the parts we got wrong.

What Was Wrong with Airflow (For Us)

To be clear: Airflow is a mature, capable orchestrator. Our issues were specific to how we were running it, not fundamental problems with the software.

The core problem was that Airflow's execution model doesn't map cleanly to containerized workloads. The default executor runs tasks as processes on the scheduler node. The Kubernetes executor runs each task as its own pod, which is better, but Airflow's scheduler and webserver still have to be kept running and highly available, and that was operational complexity we weren't staffed to handle properly.

We also had persistent state issues. Airflow's metadata database became a source of truth for pipeline state that was hard to reason about and harder to recover from when things went wrong.

Why Argo Workflows

Argo Workflows is Kubernetes-native in a way that Airflow is not. Each workflow is a Kubernetes resource. Each step is a pod. Workflow state lives in the cluster itself, in etcd, which you're already operating anyway.
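
To make that concrete, here's roughly what a minimal workflow looks like. The names and image are illustrative, not one of our actual pipelines:

```yaml
# A minimal Workflow: the resource is submitted to the cluster like any other
# manifest, and the "hello" template runs as its own pod.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-pipeline-    # illustrative name
  namespace: data-pipelines        # illustrative namespace
spec:
  entrypoint: hello
  templates:
    - name: hello
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo 'this step ran as a pod'"]
```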

This meant we could apply the same operational patterns we used for application workloads—GitOps, RBAC, namespace isolation, resource quotas—to our data pipelines. The cognitive overhead dropped significantly.
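
For example, the same ResourceQuota we'd put on any application namespace applies unchanged to the pipelines namespace. The limits below are made up:

```yaml
# Illustrative ResourceQuota: caps the aggregate resources that workflow pods
# in the data-pipelines namespace can request, exactly as for an app namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pipeline-quota
  namespace: data-pipelines
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 160Gi
    pods: "200"
```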

The Migration

We ran Airflow and Argo in parallel for four months. Every pipeline that was scheduled in Airflow had a corresponding Argo WorkflowTemplate that ran on the same schedule via Argo's CronWorkflow resource.
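
The pairing looked roughly like this, with the CronWorkflow carrying the old Airflow schedule and deferring everything else to the template. Names and schedule are illustrative:

```yaml
# A CronWorkflow that runs an existing WorkflowTemplate on the same schedule
# the corresponding Airflow DAG used.
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: nightly-orders-export      # illustrative
  namespace: data-pipelines
spec:
  schedule: "0 2 * * *"            # same cron expression as the old DAG
  concurrencyPolicy: Forbid        # don't start a new run if one is still going
  workflowSpec:
    workflowTemplateRef:
      name: orders-export          # the WorkflowTemplate migrated from Airflow
```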

We monitored both for correctness over a two-week window before cutting over. For pipelines with external dependencies or complex fan-out logic, we extended the parallel run period.

The hardest pipelines to migrate were the ones that relied on XCom to pass state between tasks. Argo's equivalent—output parameters and artifacts—requires more explicit data flow specification, which is architecturally better but more work to implement.
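
For reference, the Argo equivalent of a small XCom hand-off looks roughly like this: the producing step writes a value to a file, Argo captures it as an output parameter, and the consuming step takes it as an input. Step and parameter names are invented:

```yaml
# Two-step workflow: "produce" emits an output parameter read from a file,
# and "consume" receives it as an input parameter.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: param-passing-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: produce
            template: produce
        - - name: consume
            template: consume
            arguments:
              parameters:
                - name: row-count
                  value: "{{steps.produce.outputs.parameters.row-count}}"
    - name: produce
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo 12345 > /tmp/row-count"]
      outputs:
        parameters:
          - name: row-count
            valueFrom:
              path: /tmp/row-count
    - name: consume
      inputs:
        parameters:
          - name: row-count
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo processed {{inputs.parameters.row-count}} rows"]
```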

What We'd Do Differently

The migration took longer than expected because we underinvested in a proper Argo Workflow template library. We ended up with a lot of duplicated boilerplate before we standardized on reusable component templates.
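
What we eventually converged on was a shared WorkflowTemplate of component steps that individual pipelines pull in via templateRef instead of copy-pasting. A rough sketch, with illustrative names and image:

```yaml
# A reusable component library: a WorkflowTemplate holding shared steps.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: pipeline-components
  namespace: data-pipelines
spec:
  templates:
    - name: load-to-warehouse
      inputs:
        parameters:
          - name: table
      container:
        image: ghcr.io/example/loader:1.0   # illustrative image
        args: ["--table", "{{inputs.parameters.table}}"]
---
# An individual pipeline reuses the shared step via templateRef rather than
# duplicating the boilerplate.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: orders-load-
  namespace: data-pipelines
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: load
            templateRef:
              name: pipeline-components
              template: load-to-warehouse
            arguments:
              parameters:
                - name: table
                  value: orders
```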

We'd also invest earlier in Argo's artifact management. Using S3 as the artifact repository from day one, rather than the default in-cluster MinIO instance, would have saved us a migration headache later.
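
The change itself is small; roughly, it's a matter of pointing the workflow controller's artifact repository configuration at a bucket. Bucket and secret names below are assumptions:

```yaml
# Workflow-controller configmap sketch: use an S3 bucket as the default
# artifact repository instead of the quick-start MinIO deployment.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  artifactRepository: |
    s3:
      bucket: my-pipeline-artifacts        # illustrative bucket
      endpoint: s3.amazonaws.com
      region: us-east-1
      keyFormat: "{{workflow.name}}/{{pod.name}}"
      accessKeySecret:
        name: s3-artifact-credentials      # illustrative secret
        key: accessKey
      secretKeySecret:
        name: s3-artifact-credentials
        key: secretKey
```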

Observability and Alerting

Argo Workflows emits metrics in Prometheus format. We track workflow success rates, step durations, and pod failure reasons. We've set up alerts for workflows that fail three times consecutively and for workflows that exceed their expected duration by more than 50%.
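
The alert rules themselves live on the Prometheus side, but the per-pipeline signals they key on can be emitted straight from the workflow spec via Argo's custom metrics support. A sketch of the kind of metrics block we attach, where metric and label names are ours rather than Argo defaults:

```yaml
# A WorkflowTemplate that emits a failure counter and a realtime duration
# gauge; the Prometheus alert rules are built on these series.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: orders-export              # illustrative
  namespace: data-pipelines
spec:
  entrypoint: main
  metrics:
    prometheus:
      - name: pipeline_failed_total
        help: "Count of failed runs, labeled by pipeline"
        labels:
          - key: pipeline
            value: orders-export
        when: "{{status}} == Failed"
        counter:
          value: "1"
      - name: pipeline_duration_seconds
        help: "Realtime duration of the current run"
        labels:
          - key: pipeline
            value: orders-export
        gauge:
          realtime: true
          value: "{{workflow.duration}}"
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo export step"]
```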

The visual workflow graph in the Argo UI is genuinely useful for debugging—you can see exactly which step failed, drill into the pod logs, and understand the data flow. It's one of the places where Argo's Kubernetes-native model pays real dividends.