
Entry 007 · 06/10/2024 · 8 min read

Running LLMs on GKE: what breaks before you find the Inference Gateway

When someone first asks to run an LLM on Kubernetes, the instinct is reasonable: it's a containerised workload, it exposes an HTTP API, it needs to scale. Deploy it like anything else — a `Deployment`, a `Service`, maybe an `HPA` on CPU. That instinct gets you surprisingly far. Until it doesn't.

When I first tried running an LLM on GKE, I approached it like any other service I've deployed on Kubernetes: wrap the model behind an API, deploy it as a Deployment, expose it with a Service, and let the platform do its job. I expected the usual knobs—replicas, autoscaling, resource limits—to behave in predictable ways.

That assumption holds just long enough to be misleading.

I was able to get something working quickly, but under even moderate load, the system behaved in ways that didn't match my expectations. Latency was inconsistent. Some requests were fast, others were unexpectedly slow. Scaling the number of pods didn't reliably improve things.

Nothing was obviously "broken." The cluster was healthy. The pods were running. CPU utilisation sat around 40%, memory was fine, and `kubectl top` showed nothing alarming.

The issue turned out to be more fundamental. I was treating LLM inference like a stateless web workload, and it isn't one.

What follows is a set of notes to recalibrate that mental model and to document how GKE's inference-specific components change the picture.

What inference actually is, in one page

The simplest useful way to think about LLM inference is this: even though the API looks stateless, the execution model is not. The state just happens to live in GPU memory rather than in an external store. That mismatch is where most of the friction comes from.

To understand why, it helps to know about the two internal phases of LLM inference.

Prefill is where the model processes the input prompt. If the request contains a large number of tokens, this phase is expensive. The model has to read and encode the entire input before producing any output.

Decode is the generation phase. Tokens are produced one at a time, and each step depends on the full context accumulated so far. This phase can run for a long time depending on the output length.

The key structure underpinning both is the KV cache. As the model processes tokens, it stores intermediate state in GPU memory. This cache allows the model to avoid recomputing previous steps during decoding. It's essential for performance, but it introduces state that lives outside the usual request-response boundary.
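
To put a rough number on that state, a back-of-envelope calculation (assuming a Llama-2-7B-class model: 32 layers, 32 KV heads, head dimension 128, fp16 precision, and no grouped-query attention, which would shrink this considerably):

  KV cache per token ≈ 2 (K and V) × 32 layers × 32 heads × 128 dims × 2 bytes ≈ 0.5 MB
  A 2,000-token prompt therefore pins roughly 1 GB of GPU memory for the lifetime of the request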

This has two important consequences for how you should think about routing:

Requests are not uniform in cost. A short prompt with a short response is cheap. A long prompt with a long response is expensive. The difference is significant enough that "one request" is not a meaningful unit of load.

There is locality. If a follow-up request can reuse an existing KV cache, it's much cheaper. If it lands on a different pod, the cache is effectively lost and has to be rebuilt from scratch.

What breaks when you use a standard Service

My first setup used a standard Kubernetes Service in front of a set of LLM-serving pods.

That gives you basic load balancing, typically round-robin. It works fine for stateless services, but here it starts to show cracks pretty quickly.
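
Roughly, that first setup is just a plain Service in front of the pods. A sketch (names are illustrative and match the Deployment shown later in this post):

apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-deployment
  ports:
    - port: 80
      targetPort: 8000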

Issue 1: Uneven load distribution. If one pod is already handling a large request—say a long prompt in the decode phase—it's effectively "busy" in a way that Kubernetes doesn't understand. Round-robin will still send new requests to it without considering that load. At the same time, another pod might be relatively idle. The result is higher latency and lower overall throughput than expected. The system looks balanced from the outside, but internally it isn't.

Issue 2: KV cache locality misses. A user sends a request that lands on Pod A. The KV cache for that session is now built up on that pod. If the next request from the same user gets routed to Pod B, that cache is not available. The model has to recompute everything from scratch. On a 2,000-token system prompt—common in production chatbots—that recomputation can add several seconds to time-to-first-token, every single time. From the user's perspective, this shows up as inconsistent response times.

Issue 3: Autoscaling on the wrong signal. The default approach is to use HPA based on CPU. In this setup, CPU is not the bottleneck—GPU utilisation and VRAM are. It's possible to have low CPU usage while the GPU is fully saturated. Even when switching to GPU metrics, it's still not a complete signal. Two pods with similar GPU utilisation can be in very different states depending on the mix of requests they're handling and the size of their KV caches.
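
For reference, the naive version of that autoscaling setup is an HPA keyed to CPU utilisation. A sketch (name and thresholds are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60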

At this point, it became clear that the problem isn't just scaling or resource allocation. It's that the routing layer has no awareness of what's actually happening inside the pods.

The layering: Gateway API → GAIE → GKE Inference Gateway

The approach GKE takes is not to replace existing Kubernetes networking primitives, but to extend them. This matters practically: if your team already has Gateway API resources in production, you don't need to rip them out. You extend them.

The base layer is the Gateway API — the standard Kubernetes way of defining ingress and routing using resources like `Gateway` and `HTTPRoute`.

On top of that sits the Gateway API Inference Extension (GAIE) — an official Kubernetes project (kubernetes-sigs) that introduces inference-specific routing concepts. This is not a GKE-specific thing; it works with Envoy Gateway, kgateway, Istio, and others.

Then there is the GKE Inference Gateway — Google's managed implementation of GAIE, integrated into GKE with Cloud Monitoring dashboards, Cloud Armor support, and managed components so you don't have to operate the extension yourself.

Standard Gateway API
        ↓  extends
Gateway API Inference Extension (GAIE)
        ↓  implements
GKE Inference Gateway

The flow in practice:

  1. A request hits the Gateway
  2. The HTTPRoute determines the backend
  3. Instead of a Service, the backend is an inference-aware resource (InferencePool)
  4. That layer decides which pod should handle the request based on runtime GPU state

This introduces a decision point that can account for GPU load, memory pressure, and KV cache locality — things a standard load balancer ignores entirely.

The new primitives: InferencePool and InferenceModel

Two new CRDs appear in this model.

InferencePool groups a set of pods that can serve a model. At a glance it looks similar to a Service, but it behaves differently. When traffic is routed to an InferencePool, the system doesn't pick an endpoint randomly or round-robin. It uses an internal component — the Endpoint Picker (EPP) — to select a pod for each request. That selection accounts for current load, available GPU memory, and whether a relevant KV cache is already resident on that pod. This is the first place where routing decisions start to look more like scheduling decisions.

InferenceModel represents a logical model identity from the client's perspective. It can map to one or more pools, and — critically — it can map logical model names to specific LoRA adapters loaded on pods. For example, you might expose gemma-customer-support and gemma-code as two InferenceModel resources, both backed by the same base model pool but routing to pods with different LoRA adapters preloaded. The client just names the model; the gateway handles the rest.
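
As a sketch, one of those could look like this, assuming the v1alpha2 schema (the adapter name is illustrative and has to correspond to a LoRA adapter actually loaded on the serving pods):

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: gemma-customer-support
spec:
  modelName: gemma-customer-support
  criticality: Critical
  poolRef:
    name: my-inference-pool
  targetModels:
    - name: customer-support-lora
      weight: 100

A client then puts "model": "gemma-customer-support" in the request body, and the gateway resolves that name to the right pool and adapter.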

The key difference from standard Kubernetes abstractions is that these resources carry semantic meaning about inference workloads, not just connectivity.

A minimal working setup

A minimal configuration still uses familiar objects. You define a Gateway:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: llm-gateway
spec:
  gatewayClassName: gke-l7-global-external-managed
  listeners:
    - protocol: HTTP
      port: 80
      name: http

Then an HTTPRoute — the key difference is routing to an InferencePool instead of a Service:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
    - name: llm-gateway
  rules:
    - backendRefs:
        - name: my-inference-pool
          group: inference.networking.x-k8s.io
          kind: InferencePool

The InferencePool selects pods by label:

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: my-inference-pool
spec:
  # Plain key/value selector (not matchLabels) — must match the serving pods' labels.
  selector:
    app: vllm-deployment
  targetPortNumber: 8000

And the vLLM Deployment — note the readiness probe, which is load-bearing here. The inference gateway uses readiness to know when a pod is actually serving (model fully loaded), not just when the container started. Without it, the gateway may route to pods that are still pulling a multi-gigabyte model into GPU memory:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-deployment
  template:
    metadata:
      labels:
        app: vllm-deployment
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          # Model name is illustrative — substitute whatever you intend to serve.
          args: ["--model", "google/gemma-2b"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
            failureThreshold: 3

The `initialDelaySeconds` on the readiness probe matters — vLLM won't respond to `/health` until the model is loaded, which can take several minutes depending on model size.

What this unlocks

After switching to this model, a few things started to make more sense.

Routing can be aligned with how inference actually behaves. Requests are no longer distributed blindly, which reduces the latency variability that comes from KV cache misses and load imbalance.

It also opens the door to more advanced patterns: prefix-aware routing that reuses cached computation, LoRA adapter routing behind a single endpoint, and autoscaling on KV cache utilisation rather than CPU, since cache utilisation is the signal that actually reflects whether your inference pods are under pressure.
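
To give a flavour of that last point: if vLLM's metrics are scraped into Cloud Monitoring (for example via Google Managed Prometheus, with the custom metrics adapter installed), the HPA can target an external metric instead of CPU. A sketch, assuming the vllm:gpu_cache_usage_perc gauge is being collected and exported under the usual prometheus.googleapis.com naming:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: External
      external:
        metric:
          name: prometheus.googleapis.com|vllm:gpu_cache_usage_perc|gauge
        target:
          type: AverageValue
          averageValue: "800m"   # i.e. 0.8 — scale out when average KV cache usage passes ~80%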

There's still a lot that feels early. The abstractions are evolving, and the operational patterns aren't as well established as standard Kubernetes workloads. But the direction is clear: once you treat inference as its own class of workload, the rest of the system starts to fall into place.

In the next post, I'll walk through what actually changes once you wire up KV-cache-aware routing and swap your HPA target to inference metrics — with numbers from a real deployment.