Production-grade distributed LLM inference.

llm-d is a distributed inference stack that orchestrates vLLM and SGLang across your cluster with LLM-aware routing, disaggregated serving, and tiered KV caching — built on the Kubernetes primitives you already operate.

Key capabilities

LLM-Aware Load Balancing

Route every request to the replica that will serve it fastest.

llm-d's endpoint picker scores each replica in real time across four signals: prefix cache locality, KV-cache utilization, queue depth, and predicted latency. Each request is dispatched to the replica with the lowest expected tail latency — delivering order-of-magnitude p99 improvements over round-robin routing, with no additional hardware.
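The selection step can be pictured as a cost function over live per-replica metrics: penalize load, reward prefix locality, pick the minimum. The sketch below is a minimal Python illustration; the metric names, weights, and formula are illustrative assumptions, not llm-d's actual endpoint-picker implementation.

```python
from dataclasses import dataclass

@dataclass
class ReplicaMetrics:
    prefix_cache_hit: float      # fraction of the prompt already cached (0..1)
    kv_utilization: float        # KV-cache utilization (0..1)
    queue_depth: int             # requests waiting on this replica
    predicted_latency_ms: float  # model-based latency estimate

def score(m: ReplicaMetrics) -> float:
    """Lower is better. Weights are illustrative, not llm-d's configuration."""
    return (
        m.predicted_latency_ms
        + 50.0 * m.queue_depth        # penalize queued work
        + 200.0 * m.kv_utilization    # penalize memory pressure
        - 300.0 * m.prefix_cache_hit  # reward prefix cache locality
    )

def pick_endpoint(replicas: dict[str, ReplicaMetrics]) -> str:
    # Dispatch to the replica with the lowest expected cost.
    return min(replicas, key=lambda name: score(replicas[name]))

replicas = {
    "pod-a": ReplicaMetrics(0.9, 0.6, 2, 120.0),
    "pod-b": ReplicaMetrics(0.0, 0.2, 0, 80.0),
}
print(pick_endpoint(replicas))  # pod-a: prefix locality outweighs its queue
```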

Explore LLM-aware routing

Prefill / Decode Disaggregation

Scale prompt processing and token generation independently.

Prefill and decode have fundamentally different resource profiles. llm-d splits them across dedicated worker pools and transfers KV-cache between phases over RDMA via NIXL. The result is faster TTFT, more predictable TPOT, and better GPU utilization across the cluster.
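To make the split concrete, here is a minimal Python sketch of the two-phase handoff. The function names are hypothetical and the NIXL transfer is stubbed out; this shows only the shape of the flow, not llm-d's actual interfaces.

```python
def prefill(prompt_tokens: list[int]) -> dict:
    """Compute-bound phase: process the full prompt once, build the KV-cache."""
    return {"layers": f"<kv for {len(prompt_tokens)} tokens>"}

def transfer_kv(kv_cache: dict, decode_worker: str) -> None:
    """In llm-d this hop is an RDMA transfer via NIXL; stubbed here."""
    print(f"shipping KV-cache to {decode_worker}")

def decode(kv_cache: dict, max_new_tokens: int) -> list[int]:
    """Memory-bandwidth-bound phase: generate tokens autoregressively."""
    return [0] * max_new_tokens  # stand-in token ids

# The prefill pool handles the prompt; a separate decode pool handles
# generation, so each pool can be sized and scheduled for its own profile.
kv = prefill(list(range(2048)))
transfer_kv(kv, decode_worker="decode-pool-3")
tokens = decode(kv, max_new_tokens=16)
```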

See how disaggregation works

Wide Expert Parallelism

Serve frontier MoE models that don't fit on a single node.

llm-d combines data parallelism and expert parallelism across nodes to deploy large mixture-of-experts models like DeepSeek-R1. This pattern maximizes KV-cache space, enables long-context online serving, and supports high-throughput generation for batch and RL workloads.
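A minimal Python sketch of the placement idea, assuming a simple round-robin sharding of experts across the ranks of the expert-parallel group; the expert counts, group sizes, and dispatch helper are illustrative, not llm-d's or DeepSeek-R1's actual configuration:

```python
NUM_EXPERTS = 256      # experts per MoE layer (illustrative)
NUM_NODES = 4
GPUS_PER_NODE = 8
EP_SIZE = NUM_NODES * GPUS_PER_NODE  # 32 ranks in the EP group

def expert_to_rank(expert_id: int) -> int:
    # Round-robin sharding: each rank hosts NUM_EXPERTS // EP_SIZE experts.
    return expert_id % EP_SIZE

def dispatch(token_routes: dict[int, list[int]]) -> dict[int, list[tuple[int, int]]]:
    """Group (token, expert) pairs by owning rank; in practice this
    grouping feeds an all-to-all exchange across nodes."""
    per_rank: dict[int, list[tuple[int, int]]] = {}
    for token, experts in token_routes.items():
        for e in experts:
            per_rank.setdefault(expert_to_rank(e), []).append((token, e))
    return per_rank

# Token 0 routed to experts 3 and 40 (top-2 gating, illustrative).
print(dispatch({0: [3, 40]}))  # {3: [(0, 3)], 8: [(0, 40)]}
```

Because each GPU holds only its shard of experts rather than full replicas, the freed memory goes to KV-cache, which is what makes the long-context and high-concurrency serving above possible.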

Deploy wide-EP models

Tiered KV Prefix Caching

Cache at memory speed. Spill at storage cost.

llm-d extends KV-cache beyond accelerator HBM through a configurable storage hierarchy: HBM, CPU memory, local SSD, and shared remote storage (in progress). Hot prefixes stay close to the accelerator; cold prefixes spill to cheaper tiers automatically. You serve longer contexts and higher concurrency without adding GPUs.
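The hierarchy behaves like a multi-level LRU: a lookup promotes a prefix toward HBM, and a full tier spills its coldest entry down one level. A minimal Python sketch under those assumptions (the capacities, keys, and eviction policy are illustrative, not llm-d's implementation):

```python
from collections import OrderedDict

class Tier:
    def __init__(self, name: str, capacity: int):
        self.name, self.capacity = name, capacity
        self.entries: OrderedDict[str, bytes] = OrderedDict()  # LRU order

    def put(self, key: str, kv: bytes) -> None:
        self.entries[key] = kv
        self.entries.move_to_end(key)  # mark as most recently used

class TieredKVCache:
    def __init__(self):
        # Fastest to slowest; capacities in cache blocks, illustrative.
        self.tiers = [Tier("HBM", 4), Tier("CPU", 16), Tier("SSD", 64)]

    def get(self, prefix_hash: str) -> bytes | None:
        for tier in self.tiers:
            if prefix_hash in tier.entries:
                kv = tier.entries.pop(prefix_hash)
                self.put(prefix_hash, kv)  # promote hot prefix toward HBM
                return kv
        return None

    def put(self, prefix_hash: str, kv: bytes) -> None:
        for tier in self.tiers:
            tier.put(prefix_hash, kv)
            if len(tier.entries) <= tier.capacity:
                return
            # Tier over capacity: spill its coldest entry to the next tier.
            prefix_hash, kv = tier.entries.popitem(last=False)
        # Fell off the slowest tier: the coldest prefix is evicted entirely.
```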

Configure tiered caching

Workload Autoscaling

Scale for the load you have, on the hardware you have.

Two complementary patterns, both built on Kubernetes primitives. HPA scales replicas using live inference signals — queue depth and request counts from the endpoint picker. The Workload Variant Autoscaler scales model variants across heterogeneous hardware to meet SLOs at the lowest cost.
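The variant autoscaler's core trade-off can be sketched as a small cost minimization: for a given request rate, choose the variant whose replica count is cheapest while each replica stays within its SLO-compliant throughput. The numbers and greedy policy below are illustrative assumptions, not the actual algorithm:

```python
import math
from dataclasses import dataclass

@dataclass
class Variant:
    name: str                # model variant pinned to one accelerator type
    rps_per_replica: float   # throughput one replica sustains within the SLO
    cost_per_hour: float     # price of the accelerator backing one replica

def cheapest_plan(variants: list[Variant], demand_rps: float) -> tuple[str, int]:
    """Pick the variant whose replica count for this demand costs least."""
    def plan_cost(v: Variant) -> float:
        return math.ceil(demand_rps / v.rps_per_replica) * v.cost_per_hour
    best = min(variants, key=plan_cost)
    return best.name, math.ceil(demand_rps / best.rps_per_replica)

variants = [
    Variant("llama-70b-h100", rps_per_replica=40, cost_per_hour=12.0),
    Variant("llama-70b-a100", rps_per_replica=18, cost_per_hour=6.0),
]
print(cheapest_plan(variants, demand_rps=90.0))
# ('llama-70b-a100', 5): five A100 replicas at $30/h beat three H100s at $36/h
```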

Set up autoscaling