LLM-Aware Load Balancing
Route every request to the replica that will serve it fastest.
llm-d's endpoint picker scores each replica in real time across four signals: prefix cache locality, KV-cache utilization, queue depth, and predicted latency. Each request is dispatched to the replica with the lowest expected tail latency — delivering order-of-magnitude p99 improvements over round-robin routing, with no additional hardware.
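A minimal sketch of the idea in Python: the signal names, weights, and replica snapshot below are illustrative assumptions, not llm-d's actual endpoint picker implementation.

```python
from dataclasses import dataclass

@dataclass
class ReplicaSnapshot:
    """Illustrative per-replica signals; field names are assumptions, not llm-d's API."""
    name: str
    prefix_cache_hit_ratio: float  # 0.0-1.0, share of the prompt already cached on this replica
    kv_cache_utilization: float    # 0.0-1.0, fraction of KV blocks currently in use
    queue_depth: int               # requests waiting ahead of this one
    predicted_latency_ms: float    # model-based latency estimate for this request

def score(replica: ReplicaSnapshot) -> float:
    """Lower is better; the weights are invented for illustration."""
    return (
        replica.predicted_latency_ms
        + 50.0 * replica.queue_depth
        + 200.0 * replica.kv_cache_utilization
        - 300.0 * replica.prefix_cache_hit_ratio
    )

def pick_replica(replicas: list[ReplicaSnapshot]) -> ReplicaSnapshot:
    """Dispatch to the replica with the lowest expected cost."""
    return min(replicas, key=score)

replicas = [
    ReplicaSnapshot("pod-a", 0.9, 0.7, 2, 180.0),  # warm prefix cache, busier
    ReplicaSnapshot("pod-b", 0.1, 0.3, 0, 150.0),  # idle, but cold cache
]
print(pick_replica(replicas).name)  # prefix locality outweighs the shorter queue here
```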
Explore LLM-aware routing →
Prefill / Decode Disaggregation
Scale prompt processing and token generation independently.
Prefill and decode have fundamentally different resource profiles. llm-d splits them across dedicated worker pools and transfers KV-cache between phases over RDMA via NIXL. The result is faster time to first token (TTFT), more predictable time per output token (TPOT), and better GPU utilization across the cluster.
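A toy sketch of the request flow under disaggregation; the pool names and handle format are invented for illustration, and the real KV-cache transfer happens over RDMA via NIXL between pods, not as a Python value.

```python
import random

# Illustrative worker pools; in llm-d these are separate Kubernetes worker pods.
PREFILL_POOL = ["prefill-0", "prefill-1"]
DECODE_POOL = ["decode-0", "decode-1", "decode-2"]

def prefill(prompt: str) -> dict:
    """Process the whole prompt in one pass and return a handle to the KV cache it produced."""
    worker = random.choice(PREFILL_POOL)
    return {"kv_handle": f"kv://{worker}/{abs(hash(prompt))}", "prompt_tokens": len(prompt.split())}

def decode(kv: dict, max_new_tokens: int) -> list[str]:
    """Generate output tokens one at a time on a decode worker, reusing the transferred KV cache."""
    worker = random.choice(DECODE_POOL)
    return [f"token_{i}@{worker}" for i in range(max_new_tokens)]

kv = prefill("Explain prefill/decode disaggregation in one sentence.")
print(decode(kv, max_new_tokens=4))
```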
See how disaggregation works →
Wide Expert Parallelism
Serve frontier MoE models that don't fit on a single node.
llm-d combines data parallelism and expert parallelism across nodes to deploy large mixture-of-experts models like DeepSeek-R1. This pattern maximizes KV-cache space, enables long-context online serving, and supports high-throughput generation for batch and RL workloads.
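An illustrative sketch of the per-layer dispatch step in expert parallelism: experts are sharded across nodes, and each token's routed experts determine which nodes receive its work. The expert counts and routing table here are toy values, not a DeepSeek-R1 configuration.

```python
from collections import defaultdict

# Toy sizes purely for illustration: 16 experts sharded evenly across 4 nodes.
NUM_EXPERTS = 16
NUM_NODES = 4
EXPERTS_PER_NODE = NUM_EXPERTS // NUM_NODES

def expert_to_node(expert_id: int) -> int:
    """Contiguous sharding: experts 0-3 on node 0, 4-7 on node 1, and so on."""
    return expert_id // EXPERTS_PER_NODE

def dispatch(token_to_experts: dict[int, list[int]]) -> dict[int, list[tuple[int, int]]]:
    """Group (token, expert) pairs by destination node, mimicking the all-to-all
    exchange an MoE layer performs under expert parallelism."""
    per_node: dict[int, list[tuple[int, int]]] = defaultdict(list)
    for token, experts in token_to_experts.items():
        for expert in experts:
            per_node[expert_to_node(expert)].append((token, expert))
    return dict(per_node)

# Each token is routed to its top-2 experts (router output invented for the example).
routing = {0: [3, 9], 1: [3, 14], 2: [0, 7]}
for node, work in sorted(dispatch(routing).items()):
    print(f"node {node}: {work}")
```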
Deploy wide-EP models →
Tiered KV Prefix Caching
Cache at memory speed. Spill at storage cost.
llm-d extends KV-cache beyond accelerator HBM through a configurable storage hierarchy: HBM, CPU memory, local SSD, and shared remote storage (in progress). Hot prefixes stay close to the accelerator; cold prefixes spill to cheaper tiers automatically. You serve longer contexts and higher concurrency without adding GPUs.
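A toy sketch of the spill behavior, assuming invented tier names and capacities rather than actual llm-d configuration: lookups walk from the fastest tier down, and the least-recently-used prefix spills to the next tier when one fills up.

```python
from collections import OrderedDict

class TieredPrefixCache:
    """Toy tiered prefix cache; tier names and capacities are illustrative only."""

    def __init__(self):
        # (tier name, capacity in entries, LRU-ordered store), fastest first.
        self.tiers = [("hbm", 2, OrderedDict()), ("cpu", 4, OrderedDict()), ("ssd", 8, OrderedDict())]

    def get(self, prefix_hash: str):
        """Search tiers from fastest to slowest; a hit marks the entry as recently used."""
        for name, _, store in self.tiers:
            if prefix_hash in store:
                store.move_to_end(prefix_hash)
                return name, store[prefix_hash]
        return None

    def put(self, prefix_hash: str, kv_blocks, tier_index: int = 0):
        """Insert into a tier, spilling the coldest entry down a tier on overflow."""
        if tier_index >= len(self.tiers):
            return  # evicted entirely once the slowest tier overflows
        name, capacity, store = self.tiers[tier_index]
        store[prefix_hash] = kv_blocks
        store.move_to_end(prefix_hash)
        if len(store) > capacity:
            cold_hash, cold_blocks = store.popitem(last=False)
            self.put(cold_hash, cold_blocks, tier_index + 1)

cache = TieredPrefixCache()
for i in range(5):
    cache.put(f"prefix-{i}", kv_blocks=f"blocks-{i}")
print(cache.get("prefix-0"))  # the coldest prefix has spilled from HBM to the CPU tier
```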
Configure tiered caching →
Workload Autoscaling
Scale for the load you have, on the hardware you have.
Two complementary patterns, both built on Kubernetes primitives. The Horizontal Pod Autoscaler (HPA) scales replicas using live inference signals: queue depth and request counts from the endpoint picker. The Workload Variant Autoscaler routes across model variants on heterogeneous hardware to meet SLOs at the lowest cost.
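For intuition, here is the standard Kubernetes HPA scaling rule applied to a per-replica queue-depth metric; the metric choice and replica limits below are illustrative assumptions.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Kubernetes HPA rule: desired = ceil(current * currentMetric / targetMetric),
    clamped to the configured replica bounds. The metric here is average queue depth
    per replica, chosen for illustration."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# 4 replicas averaging 12 queued requests each, against a target of 4 per replica:
print(desired_replicas(current_replicas=4, current_metric=12, target_metric=4))  # -> 12
```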
Set up autoscaling →