Entry 003 · 03/05/2024 · 3 min read
LLMOps in Production: What Nobody Tells You
Running large language model inference workloads on Kubernetes at scale—the infrastructure problems that emerge past the prototype stage.
There's a comfortable fiction in the AI tooling space that deploying LLMs is mostly a software problem. Pick your framework, wrap it in an API, deploy it. The infrastructure "just works."
It doesn't. Once you move past single-instance demos and into production inference with real traffic patterns, you encounter a class of problems that have no clean software solution. They're fundamentally operational.
GPU Scheduling Is a First-Class Problem
Oversubscription is a standard tool for CPU workloads: you schedule more requests than you have capacity for, because CPU time is fungible and requests can queue. With GPU inference, you can't meaningfully oversubscribe. A model that needs 24GB of VRAM needs 24GB, period.
This means your cluster autoscaler needs to understand GPU resources, your pod scheduling needs topology awareness, and your capacity planning has to be far more precise than you're used to.
We run NVIDIA's GPU Operator on every inference cluster. Combined with the Kubernetes device plugin, it gives us proper resource accounting. But you still have a bin-packing problem to solve: how do you fill GPU nodes efficiently when different models have different memory footprints?
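For concreteness, here's roughly what a single-GPU request looks like through the official Python kubernetes client. The image name, labels, and node selector are illustrative, not our actual manifests:

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="llama-7b-inference",
        labels={"team": "search"},  # reused later for cost attribution
    ),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="inference",
                image="registry.example.com/llm-inference:latest",  # hypothetical
                resources=client.V1ResourceRequirements(
                    # GPUs are extended resources: no fractional requests
                    # without MIG or time-slicing, and no oversubscription.
                    limits={"nvidia.com/gpu": "1", "memory": "32Gi"},
                ),
            )
        ],
        # Node label set by GPU Feature Discovery (ships with the GPU
        # Operator); keeps this pod off CPU-only nodes.
        node_selector={"nvidia.com/gpu.present": "true"},
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="inference", body=pod)
```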
The Cold Start Problem Is Real
Model loading times are not like container startup times. A 7B parameter model in FP16 is about 14GB. Loading that from S3 into GPU memory takes 30-90 seconds depending on instance type and network throughput.
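The arithmetic is worth making explicit. A quick sketch, with throughput figures that are illustrative assumptions rather than benchmarks:

```python
# Back-of-envelope model load time: weight bytes / effective throughput.
params = 7e9
size_gb = params * 2 / 1e9  # FP16 = 2 bytes per parameter -> ~14 GB

# Effective S3-to-GPU throughput varies with instance type, network,
# and download parallelism. Assumed figures for illustration:
for label, gb_per_s in [("modest", 0.15), ("well-tuned", 0.45)]:
    print(f"{label}: ~{size_gb / gb_per_s:.0f} s")
# modest: ~93 s, well-tuned: ~31 s -- the 30-90 second range above
```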
Those load times break reactive auto-scaling: by the time a new instance is ready, the traffic spike has either passed or the queue has grown so long that users have given up.
Our current solution is predictive pre-warming based on time-of-day traffic patterns and a minimum floor of pre-warmed instances for our most critical endpoints. It's not elegant, but it works.
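A stripped-down version of that pre-warming loop, run on a schedule, might look like the following. The traffic profile, per-replica capacity, and deployment name are all hypothetical:

```python
import math
from datetime import datetime, timezone
from kubernetes import client, config

# Hypothetical hourly traffic profile (requests/s) from last week's data.
HOURLY_RPS = {h: 10 for h in range(24)} | {9: 80, 10: 120, 11: 110, 14: 90}
RPS_PER_REPLICA = 25  # measured capacity of one warm instance (assumed)
MIN_REPLICAS = 2      # floor of pre-warmed instances for critical endpoints

def desired_replicas(now: datetime) -> int:
    # Scale for the *next* hour so new instances finish loading the model
    # before the spike arrives, not after.
    next_hour = (now.hour + 1) % 24
    return max(MIN_REPLICAS, math.ceil(HOURLY_RPS[next_hour] / RPS_PER_REPLICA))

config.load_kube_config()
client.AppsV1Api().patch_namespaced_deployment_scale(
    name="llama-7b-inference",
    namespace="inference",
    body={"spec": {"replicas": desired_replicas(datetime.now(timezone.utc))}},
)
```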
Observability for LLMs Is Different
Standard API metrics—latency, error rate, throughput—are necessary but insufficient for LLM workloads. You also need:
- Token generation rate (tokens/second per request and aggregate)
- Time to first token (critical for streaming endpoints)
- GPU utilization and memory pressure (separate from CPU metrics)
- Prompt and completion token counts (for cost attribution)
- Per-model latency distributions (an aggregate p99 hides regressions in individual models)
We've built a sidecar that intercepts responses and emits these metrics to Prometheus. It's become one of the most-queried dashboards in Grafana.
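The sidecar itself is more involved than this, but the emitting side reduces to a sketch like the one below with prometheus_client; metric and label names are illustrative:

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

TTFT = Histogram(
    "llm_time_to_first_token_seconds", "Time to first streamed token",
    ["model"], buckets=(0.05, 0.1, 0.25, 0.5, 1, 2, 5),
)
TOKENS = Counter(
    "llm_tokens_total", "Prompt and completion tokens",
    ["model", "kind", "team"],  # kind: prompt | completion
)
TOK_RATE = Histogram(
    "llm_tokens_per_second", "Completion tokens/s per request", ["model"],
)

def observe(model, team, prompt_tokens, stream):
    """Wrap a token stream and emit the metrics above as it is consumed."""
    TOKENS.labels(model, "prompt", team).inc(prompt_tokens)
    start, first, n = time.monotonic(), None, 0
    for tok in stream:
        if first is None:
            first = time.monotonic()
            TTFT.labels(model).observe(first - start)
        n += 1
        TOKENS.labels(model, "completion", team).inc()
        yield tok
    elapsed = time.monotonic() - (first or start)
    if n and elapsed > 0:
        TOK_RATE.labels(model).observe(n / elapsed)

start_http_server(9100)  # expose /metrics for the Prometheus scraper
```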
Prompt Engineering Is an Infrastructure Problem
When your model starts returning degraded responses in production, what's the debugging path? Without prompt versioning and evaluation infrastructure, you're flying blind.
We treat prompts as artifacts: versioned, tested against a golden dataset before deployment, and rolled back if evaluation scores drop below threshold. The tooling for this is still immature, but the practice is essential.
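Concretely, the gate looks something like this minimal sketch, assuming a dict-shaped prompt artifact and a pluggable score_fn (exact match, model-graded, whatever your eval harness provides):

```python
import statistics

# A versioned prompt artifact (hypothetical shape; ours live in git).
PROMPT = {
    "id": "support-triage",
    "version": "2.3.0",
    "template": "You are a support triage assistant.\n\nTicket:\n{ticket}",
}

THRESHOLD = 0.85  # minimum mean eval score to allow rollout (assumed)

def evaluate(prompt, golden_set, score_fn):
    """Mean score of the candidate prompt over the golden dataset."""
    return statistics.mean(
        score_fn(prompt["template"].format(**case["inputs"]), case["expected"])
        for case in golden_set
    )

def deploy_gate(prompt, golden_set, score_fn):
    score = evaluate(prompt, golden_set, score_fn)
    if score < THRESHOLD:
        # Fail the deploy; the previous version stays live.
        raise RuntimeError(
            f"{prompt['id']}@{prompt['version']} scored {score:.2f} < {THRESHOLD}"
        )
    return score
```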
Cost Attribution
LLM inference is expensive. At scale, "who is spending what" becomes a critical operational question. We use Kubernetes labels and namespaces to attribute GPU time and API costs to individual teams, and we've surfaced this into our internal cost dashboard with weekly per-team reports.
The reports change behavior. Teams start optimizing prompt lengths, implementing caching, and choosing smaller models where task requirements allow.
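The attribution pass itself is simple. A minimal sketch using pod labels, where the hourly rate and label scheme are illustrative and the real report integrates usage over the week rather than snapshotting:

```python
from collections import defaultdict
from kubernetes import client, config

GPU_HOUR_USD = 2.50  # blended hourly rate for our GPU nodes (assumed)

config.load_kube_config()
pods = client.CoreV1Api().list_namespaced_pod(namespace="inference")

gpus_by_team = defaultdict(int)
for pod in pods.items:
    team = (pod.metadata.labels or {}).get("team", "unattributed")
    for container in pod.spec.containers:
        limits = container.resources.limits or {}
        gpus_by_team[team] += int(limits.get("nvidia.com/gpu", 0))

# Hourly burn per team at this instant; the weekly report sums snapshots.
for team, gpus in sorted(gpus_by_team.items()):
    print(f"{team:>16}: {gpus} GPU(s) -> ${gpus * GPU_HOUR_USD:.2f}/hour")
```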