Glossary
Quick-reference definitions for terms used throughout the llm-d documentation. For a high-level overview of how these pieces fit together, see the Architecture Overview.
Aggregated Serving — The default serving mode where a single Model Server handles both Prefill and Decode for each request, as opposed to Disaggregated Serving.
Consultant — An optional sidecar component that the EPP queries for additional scoring signals beyond built-in metrics. Examples include the Latency Predictor and the KV-Cache Indexer.
Data Parallelism (DP) — Running independent model replicas on separate GPUs within the same node, each handling different requests for higher aggregate throughput. See Model Servers.
Decode — The second phase of LLM inference that generates output tokens one at a time, each depending on the previous token's KV Cache state. Decode throughput is measured by TPOT. See Architecture Overview.
Disaggregated Serving — A deployment pattern that separates Prefill and Decode into dedicated, independently scalable pools of Model Servers, connected by NIXL for KV-cache transfer. See Disaggregation.
Envoy — A high-performance L7 GAIE-conformant proxy that can be used with llm-d as a data-plane Proxy. It delegates per-request routing decisions to the EPP via ext-proc. See Proxy.
Expert Parallelism (EP) — Distributing the expert layers of MoE models across multiple GPUs, enabling large models like DeepSeek-R1 to be served across nodes. See Model Servers.
ext-proc (External Processing) — An Envoy filter protocol that offloads per-request routing decisions to an external gRPC service — in llm-d, the EPP. This is the communication channel between the Proxy and the routing logic. See EPP.
Flow Control — The EPP subsystem that manages admission, queuing, and dispatch of requests using a Priority, Fairness, and Ordering hierarchy to prevent backend overload. See EPP.
Gateway API — The Kubernetes-native API for configuring L7 traffic routing, succeeding Ingress. llm-d uses Gateway API resources (HTTPRoute, Gateway) to route external traffic to InferencePools. See Proxy.
Gateway API Inference Extension (GAIE) — An extension to Gateway API that adds inference-aware routing via ext-proc. Defines the InferencePool CRD. See Proxy.
InferencePool — A Kubernetes custom resource, defined by the Gateway API Inference Extension, that represents the set of Model Server pods an EPP considers when routing a request. See InferencePool.
KV Cache — Key-value tensor cache storing intermediate attention states during LLM inference. Reusing cached entries for shared prompt prefixes (Prefix Caching) avoids redundant computation and reduces latency. See Architecture Overview.
KV Cache Management — A comprehensive ecosystem for managing and reusing the KV cache across the inference pool. It includes Prefix-Cache Aware Routing, KV-Cache Indexing, and KV Offloading to scale effective cache capacity beyond hardware limits. See KV Cache Management.
KV-Cache Indexer — A Consultant component that maintains a globally consistent view of KV-cache block distribution across Model Servers using KV-Events, enabling precise prefix cache-aware routing. See KV-Cache Indexer.
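The core idea can be sketched as a small in-memory index keyed by KV-block hash. The sketch below is illustrative only, not the real indexer (which is its own component with its own event schema); the `KVBlockIndex` class, event field names, and `BlockStored`/`BlockRemoved` event types are assumptions for the example.

```python
from collections import defaultdict

class KVBlockIndex:
    """Toy in-memory index: KV-block hash -> set of pods that currently hold it."""

    def __init__(self):
        self.block_to_pods = defaultdict(set)

    def apply_event(self, event):
        # event is a dict like {"type": "BlockStored", "pod": "...", "block_hashes": [...]}
        if event["type"] == "BlockStored":
            for h in event["block_hashes"]:
                self.block_to_pods[h].add(event["pod"])
        elif event["type"] == "BlockRemoved":
            for h in event["block_hashes"]:
                self.block_to_pods[h].discard(event["pod"])

    def longest_prefix_hits(self, prompt_block_hashes):
        """Per-pod length of the contiguous prefix of prompt blocks it already caches."""
        hits = {}
        alive = None  # pods that have matched every block seen so far
        for h in prompt_block_hashes:
            holders = self.block_to_pods.get(h, set())
            alive = holders if alive is None else alive & holders
            if not alive:
                break
            for pod in alive:
                hits[pod] = hits.get(pod, 0) + 1
        return hits
```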
KV-Events — Events emitted by Model Servers (via ZeroMQ) when KV-cache blocks are created or evicted. Consumed by the KV-Cache Indexer for real-time cache state tracking.
Latency Predictor — A Consultant that uses XGBoost quantile regression models trained on live traffic to predict per-endpoint TTFT and TPOT, enabling SLO-aware routing. See Latency Predictor.
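A minimal sketch of the underlying modeling technique, quantile regression with XGBoost. The features, data, and hyperparameters below are invented for illustration; the real component trains on live per-endpoint traffic.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
# Hypothetical features: queue depth, KV-cache utilization, normalized prompt length.
X = rng.uniform(size=(1000, 3))
# Synthetic TTFT labels in milliseconds, just to make the example runnable.
y = 50 + 400 * X[:, 0] + 200 * X[:, 1] + rng.gamma(2.0, 20.0, size=1000)

model = xgb.XGBRegressor(
    objective="reg:quantileerror",  # pinball loss (XGBoost >= 2.0)
    quantile_alpha=0.9,             # predict the 90th-percentile TTFT
    n_estimators=200,
    max_depth=4,
)
model.fit(X, y)

# Predicted p90 TTFT for one candidate endpoint's current state.
p90_ttft_ms = model.predict(np.array([[0.3, 0.6, 0.2]]))[0]
```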
llm-d — A distributed inference serving stack that adds intelligent, KV Cache-aware routing, Disaggregated Serving, and autoscaling on top of existing Model Servers. See Introduction.
llm-d Async Processor — A lightweight dispatch agent that pulls individual inference requests from message queues (such as Redis and Google Pub/Sub) and sends them to the llm-d Router. It adjusts the dispatch rate based on system metrics to protect interactive traffic. See Batch Inference.
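A hedged sketch of the throttled-dispatch idea: pull queued batch requests and slow down as the pool gets busy so interactive traffic keeps headroom. The function names, the metric used, and the back-off formula are assumptions, not the component's actual logic.

```python
import time

def dispatch_loop(queue, send_to_router, get_kv_utilization, max_rps=20.0):
    """Toy dispatch loop: queue.get() blocks until a batch request is available."""
    while True:
        request = queue.get()
        send_to_router(request)        # forward to the llm-d Router

        utilization = get_kv_utilization()   # e.g. pool-wide KV-cache utilization, 0.0-1.0
        # Back off as the pool saturates: full rate when idle, near zero when full.
        effective_rps = max_rps * max(0.05, 1.0 - utilization)
        time.sleep(1.0 / effective_rps)
```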
llm-d Batch Gateway — An OpenAI-compatible API server (/v1/batches, /v1/files) for submitting, tracking, and managing batch inference jobs. It coordinates with the llm-d Async Processor for throttled request dispatch. See Batch Inference.
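Because the surface is OpenAI-compatible, a standard OpenAI client can target it. The base URL below is a placeholder, and whether every OpenAI Batch parameter is supported is an assumption; only the /v1/files and /v1/batches shapes come from the description above.

```python
from openai import OpenAI

client = OpenAI(base_url="http://llm-d-batch-gateway.example/v1", api_key="unused")

# Upload a JSONL file of requests, then submit it as a batch job.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(job.id, job.status)   # poll later with client.batches.retrieve(job.id)
```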
llm-d Endpoint Picker (EPP) — The central routing component of llm-d. Receives ext-proc callbacks from the Proxy, evaluates candidate Model Servers through a Plugin Pipeline of filters, scorers, and pickers, and returns the address of the optimal backend. See EPP.
llm-d Router — The intelligent entry point for inference requests. It provides LLM-aware load balancing (e.g., prefix-cache and load-aware routing) and request queuing, and manages disaggregated serving. It is composed of two functional parts: a data-plane Proxy (e.g., Envoy) and the Endpoint Picker (EPP). See Architecture Overview.
llm-d Well-Lit Path — A pre-validated, end-to-end deployment recipe (model + hardware + Helm values + benchmarks) that the llm-d community tests and supports as a first-class configuration. See Introduction.
MoE (Mixture of Experts) — A model architecture where only a subset of "expert" sub-networks activate per token, enabling very large models (e.g., DeepSeek-R1) to run efficiently. llm-d supports MoE serving via Wide Expert Parallelism.
Model Server — The inference engine (e.g., vLLM, SGLang) that loads model weights, runs inference on hardware accelerators, and manages a local KV Cache. The EPP routes requests to the optimal server instance. See Model Servers.
NIXL — NVIDIA Inference Xfer Library for high-speed GPU-to-GPU KV-cache transfer over InfiniBand, RoCE, EFA, and TCP. Used between Prefill and Decode workers in Disaggregated Serving.
Plugin Pipeline — The modular Filter, Score, Pick architecture inside the EPP that evaluates and selects Model Server endpoints for each request. Filters narrow candidates, scorers rank them, and pickers make the final selection. See EPP.
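A conceptual sketch of the filter-score-pick shape only; the real EPP is written in Go and its plugin interfaces differ, and the weighted-sum scoring and random tie-break here are assumptions for illustration.

```python
import random

def pick_endpoint(endpoints, request, filters, scorers, weights):
    # Filters narrow the candidate set (e.g. drop saturated pods).
    candidates = [e for e in endpoints if all(f(e, request) for f in filters)]
    if not candidates:
        return None

    # Scorers each rate a candidate; a weighted sum combines their opinions.
    def total_score(e):
        return sum(w * s(e, request) for s, w in zip(scorers, weights))

    # Picker: choose the best-scoring endpoint, breaking ties randomly.
    scored = [(total_score(e), e) for e in candidates]
    best = max(s for s, _ in scored)
    return random.choice([e for s, e in scored if s == best])
```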
Prefix Caching — A technique where the EPP routes requests to Model Servers that already hold matching KV Cache entries for the prompt prefix, eliminating redundant Prefill computation and reducing TTFT. See Architecture Overview.
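A toy scorer illustrating the intuition: given how many leading KV blocks of the prompt each pod already caches (for example, from a KV-cache index like the sketch above), favor the pod that can reuse the most. The function and names are illustrative, not the EPP's actual scorer.

```python
def prefix_cache_scores(prefix_hits: dict[str, int], total_blocks: int) -> dict[str, float]:
    """Score each pod 0.0-1.0 by the fraction of the prompt's KV blocks it can reuse."""
    total = max(total_blocks, 1)
    return {pod: matched / total for pod, matched in prefix_hits.items()}

# Example: a 10-block prompt; pod-a caches 8 leading blocks, pod-b caches 2.
print(prefix_cache_scores({"pod-a": 8, "pod-b": 2}, total_blocks=10))
# {'pod-a': 0.8, 'pod-b': 0.2} -> routing to pod-a skips most of the Prefill work
```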
Prefill — The first phase of LLM inference that processes all input tokens in parallel to populate the KV Cache. Prefill latency is the dominant component of TTFT. See Architecture Overview.
Proxy — The L7 data-plane component that accepts client requests and delegates routing decisions to the EPP via ext-proc. Can be deployed via Gateway API or in Standalone mode with a GAIE-conformant proxy (such as Envoy) running as a sidecar to the EPP. See Proxy.
Saturation Detector — A safety mechanism in the EPP that evaluates whether the backend InferencePool is overloaded based on queue depth and KV-cache utilization, triggering Flow Control or request shedding.
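A minimal sketch of such a check, assuming per-endpoint queue-depth and KV-utilization metrics; the thresholds, metric names, and "all endpoints busy" rule are assumptions, not the detector's actual configuration.

```python
def pool_is_saturated(metrics, max_queue_depth=5, max_kv_utilization=0.9):
    """metrics: list of per-endpoint dicts with 'queue_depth' and 'kv_utilization'.

    The pool counts as saturated only when no endpoint has spare capacity.
    """
    return all(
        m["queue_depth"] > max_queue_depth or m["kv_utilization"] > max_kv_utilization
        for m in metrics
    )
```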
SGLang — An open-source LLM serving engine that can be used as a Model Server backend in llm-d, providing RadixAttention-based Prefix Caching and disaggregation support. See Model Servers.
Tensor Parallelism (TP) — Sharding model layers across multiple GPUs within a node to serve models that exceed single-GPU memory. See Model Servers.
TPOT (Time Per Output Token) — The average latency to generate each subsequent token during Decode. A key metric for streaming response quality. See Architecture Overview.
TTFT (Time To First Token) — The latency from request arrival to the first generated output token, dominated by Prefill time. Prefix Caching is the primary optimization for reducing TTFT. See Architecture Overview.
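A worked example of both metrics, computed from hypothetical token-arrival timestamps for one streamed response (seconds, relative to when the request was sent).

```python
token_times = [0.42, 0.47, 0.52, 0.58, 0.63]   # made-up arrival times of 5 output tokens

ttft = token_times[0]                           # Time To First Token: 0.42 s
# Time Per Output Token: average gap between subsequent tokens.
gaps = [b - a for a, b in zip(token_times, token_times[1:])]
tpot = sum(gaps) / len(gaps)                    # (0.63 - 0.42) / 4 = 0.0525 s per token
```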
vLLM — An open-source high-throughput LLM serving engine and the default Model Server in llm-d. Provides PagedAttention, continuous batching, Prefix Caching, and KV-Events for cache-aware routing. See Model Servers.
Wide Expert Parallelism — A deployment pattern for large MoE models that combines Data Parallelism and Expert Parallelism across multiple nodes, maximizing KV-cache space for long-context serving. See Introduction.
Workload Variant Autoscaler (WVA) — A multi-model, SLO-aware autoscaler that optimizes cost on heterogeneous hardware by measuring instance capacity, deriving load functions, and calculating the optimal mix of model variants. See Architecture Overview.