Architecture

High-level guide to llm-d architecture. Start here, then dive into specific guides.

Core Components

The llm-d architecture is built around three primary concepts: the Router, the InferencePool, and the Model Server.

  • llm-d Router - The intelligent entry point for inference requests. It provides LLM-aware load balancing, request queuing, and policy enforcement. It is composed of two functional parts:

    • Proxy: A high-performance L7 proxy (typically Envoy) that accepts user requests and consults the EPP via the ext-proc protocol to determine the optimal destination.
    • Endpoint Picker (EPP): The routing engine that scores and selects model server pods based on real-time metrics, KV-cache affinity, and configured policies (see the scoring sketch after this list).
  • InferencePool - The API that defines a group of Model Server Pods sharing the same model and compute configuration. Conceptualized as an "LLM-optimized Service", it serves as the discovery target for the Router.

  • Model Server - The inference engine (such as vLLM or SGLang) that executes the model on hardware accelerators (GPUs, TPUs, HPUs).
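
For intuition, the following is a minimal Python sketch of the kind of scoring the Endpoint Picker performs. The field names, weights, and selection rule are illustrative assumptions, not the actual EPP implementation or its plugin interface.

```python
# Minimal sketch of EPP-style endpoint scoring (illustrative only; the real
# EPP uses its own plugins and policies).
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    queue_depth: int              # pending requests reported by the model server
    kv_cache_utilization: float   # 0.0 - 1.0
    prefix_hit_ratio: float       # estimated KV-cache affinity for this prompt

def score(ep: Endpoint) -> float:
    """Higher is better: prefer endpoints with cache affinity and spare capacity."""
    # Hypothetical weights; a real deployment would tune these via policy.
    return 2.0 * ep.prefix_hit_ratio - 1.0 * ep.kv_cache_utilization - 0.1 * ep.queue_depth

def pick_endpoint(endpoints: list[Endpoint]) -> Endpoint:
    return max(endpoints, key=score)

pool = [
    Endpoint("pod-a", queue_depth=4, kv_cache_utilization=0.7, prefix_hit_ratio=0.9),
    Endpoint("pod-b", queue_depth=1, kv_cache_utilization=0.2, prefix_hit_ratio=0.1),
]
print(pick_endpoint(pool).name)
```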

Figure: Basic llm-d architecture.

Advanced Patterns

llm-d's core design can be extended with optional advanced patterns:

KV Cache Management

llm-d provides a comprehensive ecosystem for managing and reusing the KV cache across the inference pool.

See KV Cache Management for an overview of the components involved and how they compose.
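
To make KV-cache affinity concrete, here is a hedged sketch of block-level prefix hashing, one common way to estimate how much of a prompt's KV cache a given pod already holds. The block size, hash chaining, and in-memory index below are assumptions for illustration, not llm-d's actual indexer.

```python
# Illustrative sketch of block-level prefix hashing for KV-cache affinity.
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (hypothetical)

def block_hashes(token_ids: list[int]) -> list[str]:
    """Chain-hash full blocks so each hash identifies the entire prefix up to that block."""
    hashes, prev = [], ""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        prev = hashlib.sha256((prev + ",".join(map(str, block))).encode()).hexdigest()
        hashes.append(prev)
    return hashes

def prefix_hit_ratio(index: dict[str, set[str]], pod: str, token_ids: list[int]) -> float:
    """index maps pod name -> block hashes it is believed to hold."""
    hashes = block_hashes(token_ids)
    if not hashes:
        return 0.0
    held = index.get(pod, set())
    hits = 0
    for h in hashes:          # reuse stops at the first missing block
        if h not in held:
            break
        hits += 1
    return hits / len(hashes)
```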

Disaggregated Serving

In disaggregated serving, a single inference request is split into multiple phases (e.g., Prefill and Decode) handled by specialized workers. The llm-d Router orchestrates this flow by selecting both a prefill and a decode endpoint and coordinating the KV-cache transfer between them.
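
The sketch below illustrates this control flow under simplifying assumptions: the worker classes and the cache-handle handoff are placeholders rather than llm-d's actual transfer protocol, which typically moves the KV cache over high-speed interconnects.

```python
# Illustrative control flow for disaggregated serving (placeholder classes).
from dataclasses import dataclass

@dataclass
class PrefillWorker:
    name: str
    def run_prefill(self, prompt: str) -> dict:
        # Process the full prompt once and expose the resulting KV cache via a handle.
        return {"source": self.name, "prompt_len": len(prompt)}

@dataclass
class DecodeWorker:
    name: str
    def run_decode(self, kv_handle: dict, max_tokens: int) -> str:
        # Pull the KV cache referenced by the handle and generate tokens
        # without recomputing the prompt.
        return f"<{max_tokens} tokens decoded on {self.name} from {kv_handle['source']}>"

def serve_disaggregated(prompt: str, prefill: PrefillWorker, decode: DecodeWorker) -> str:
    kv_handle = prefill.run_prefill(prompt)               # phase 1: prompt processing
    return decode.run_decode(kv_handle, max_tokens=128)   # phase 2: token generation

print(serve_disaggregated("Hello, llm-d!", PrefillWorker("prefill-0"), DecodeWorker("decode-0")))
```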

See Disaggregation for complete details.

Predicted Latency-Based Routing

The llm-d Router can be extended with "consultant" sidecars that provide advanced signals for routing decisions. The primary implementation is the Latency Predictor, which enables routing based on predicted inter-token latency (ITL) and time to first token (TTFT).

  • Latency Predictor: Trains an XGBoost model online to predict request latency for better endpoint scoring and SLO enforcement.
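
As a rough illustration, the sketch below approximates online training by warm-starting XGBoost from the previous booster on each new batch of observations. The feature set, hyperparameters, and update cadence are assumptions, not the Latency Predictor's actual design.

```python
# Minimal sketch of online TTFT prediction via warm-started XGBoost training.
import numpy as np
import xgboost as xgb

FEATURES = ["prompt_tokens", "queue_depth", "kv_cache_utilization", "prefix_hit_ratio"]
booster = None  # current model; extended as new observations arrive

def update(observations: np.ndarray, ttft_ms: np.ndarray) -> None:
    """Continue boosting from the previous model on newly observed (features, TTFT) pairs."""
    global booster
    dtrain = xgb.DMatrix(observations, label=ttft_ms, feature_names=FEATURES)
    booster = xgb.train(
        {"objective": "reg:squarederror", "max_depth": 4, "eta": 0.1},
        dtrain,
        num_boost_round=10,
        xgb_model=booster,  # warm-start from the existing trees
    )

def predict_ttft(features: np.ndarray) -> np.ndarray:
    return booster.predict(xgb.DMatrix(features, feature_names=FEATURES))

# Toy usage: learn from a batch of observations, then score a candidate endpoint.
rng = np.random.default_rng(0)
X = rng.random((64, 4)) * [2048, 16, 1.0, 1.0]
y = 50 + 0.05 * X[:, 0] + 30 * X[:, 1] + rng.normal(0, 5, 64)
update(X, y)
print(predict_ttft(np.array([[512, 2, 0.4, 0.8]])))
```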

Batch Inference

Batch and offline inference workloads are handled by two modules that can be deployed independently or together. The Batch Gateway provides an OpenAI-compatible Batch API for job management, while the Async Processor dispatches queued requests with flow-control gating. When composed, the Batch Gateway delegates dispatch to the Async Processor.
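
The snippet below sketches how a client might submit a job through the OpenAI-compatible Batch API using the standard OpenAI Python SDK; the gateway URL, API key, and model name are placeholders, not values defined by llm-d.

```python
# Sketch of submitting a batch job to an OpenAI-compatible Batch API.
import json
from openai import OpenAI

client = OpenAI(base_url="http://batch-gateway.example.svc/v1", api_key="unused")  # hypothetical endpoint

# 1. Write the requests as JSONL, one OpenAI-style request per line.
with open("requests.jsonl", "w") as f:
    for i in range(3):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "my-model", "messages": [{"role": "user", "content": f"Question {i}"}]},
        }) + "\n")

# 2. Upload the file and create the batch job.
input_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Poll for completion; dispatch happens asynchronously (via the Async
#    Processor when both modules are deployed together).
print(client.batches.retrieve(batch.id).status)
```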

See Batch Inference for details on the batch inference design.

Autoscaling

llm-d supports proactive, SLO-aware autoscaling through two complementary approaches:

  • HPA/KEDA: Standard Kubernetes-native scaling using metrics exported by the EPP, such as queue depth (see the worked example after this list).
  • Workload Variant Autoscaler (WVA): Globally optimized scaling that minimizes cost by routing traffic across different model variants (e.g., different hardware or quantization) while meeting latency targets.
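
As a worked example of the first approach, the standard HPA formula applied to an EPP-exported queue-depth metric looks like this; the metric choice and target value are illustrative.

```python
# Worked example of the standard HPA scaling formula on a queue-depth metric.
import math

def desired_replicas(current_replicas: int, current_queue_depth: float, target_queue_depth: float) -> int:
    # HPA: desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
    return max(1, math.ceil(current_replicas * current_queue_depth / target_queue_depth))

# e.g., 4 replicas averaging 12 queued requests each against a target of 5
# scale out to ceil(4 * 12 / 5) = 10 replicas.
print(desired_replicas(current_replicas=4, current_queue_depth=12, target_queue_depth=5))
```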

See Autoscaling for complete details.

Figure: Advanced llm-d architecture.