Optimized Baseline
Traditional HTTP requests are fast, uniform, and cheap. Standard round-robin scheduling balances this load well.
LLM requests break all three assumptions. They are:
- Multi-turn - conversations and agentic tool loops send the same growing prefix repeatedly
- Slow - a single request can take over a minute generating tokens
- Non-uniform - requests range from thousands of reasoning tokens to 100k+ context tokens
The llm-d Router injects LLM-workload awareness into the load-balancing layer, considering both prefix-cache affinity and server load metrics.
This guide demonstrates one approach to prefix- and load-aware routing. The llm-d Router supports other options as well, including session affinity and active-request-based routing, which make no assumptions about the router's ability to parse the request or probe the servers. See configuration for more details on the available scorers, or precise prefix-cache-aware routing for KV-event-driven scoring.
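To make the scorer composition concrete, here is a minimal sketch in Go of how per-request scores from multiple scorers could be blended by weight. The `scorer` type, the scorer names, and the weights are illustrative assumptions for this sketch, not the actual EPP plugin API; the real plugin names and weighting schema are covered in the configuration reference.

```go
package main

import "fmt"

// scorer rates an endpoint for a request; higher is better. This mirrors
// the scorer-plugin idea in EPP, but the signature here is an assumption
// for illustration, not the real plugin interface.
type scorer func(endpoint, prompt string) float64

// weightedPick blends each scorer's output by its configured weight and
// returns the highest-scoring endpoint.
func weightedPick(endpoints []string, prompt string, scorers map[string]scorer, weights map[string]float64) string {
	best, bestScore := endpoints[0], -1.0
	for _, ep := range endpoints {
		total := 0.0
		for name, s := range scorers {
			total += weights[name] * s(ep, prompt)
		}
		if total > bestScore {
			best, bestScore = ep, total
		}
	}
	return best
}

func main() {
	// Stub scorers standing in for the prefix-cache and load scorers
	// described under Architecture below.
	scorers := map[string]scorer{
		"prefix-cache": func(ep, _ string) float64 {
			if ep == "pod-b" {
				return 1.0 // pod-b already holds the prompt prefix
			}
			return 0.0
		},
		"load": func(ep, _ string) float64 {
			if ep == "pod-b" {
				return 0.3 // pod-b is busier right now
			}
			return 0.8
		},
	}
	weights := map[string]float64{"prefix-cache": 2.0, "load": 1.0}
	fmt.Println(weightedPick([]string{"pod-a", "pod-b"}, "...", scorers, weights)) // pod-b
}
```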
Deploy
See the optimized baseline guide for manifests and step-by-step deployment.
Architecture
Prefix-Aware Scheduling
EPP maintains an in-memory view of each endpoint's prefix-cache state. When a request arrives, it identifies which pod already holds the matching prefix in its KV-cache and routes the request there. For multi-turn workloads, this optimization is critical for avoiding excessive recomputation in a scale-out setting.
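A minimal sketch of the idea, assuming an illustrative block size and hash scheme (the real scorer's chunking, hashing, and data structures differ in detail): the prompt is hashed in fixed-size blocks, and each endpoint is scored by how many leading blocks it is believed to already hold.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// blockSize is an illustrative chunk size in bytes; the real scorer's
// chunking granularity differs.
const blockSize = 16

// hashBlocks splits the prompt into fixed-size blocks and hashes each
// block chained with its predecessor, so a single block hash identifies
// the entire prefix up to that point.
func hashBlocks(prompt string) []uint64 {
	var hashes []uint64
	var prev uint64
	for i := 0; i+blockSize <= len(prompt); i += blockSize {
		h := fnv.New64a()
		fmt.Fprintf(h, "%d:%s", prev, prompt[i:i+blockSize])
		prev = h.Sum64()
		hashes = append(hashes, prev)
	}
	return hashes
}

// endpointCache is the in-memory view of which prefix blocks each
// endpoint is believed to hold in KV-cache.
type endpointCache map[string]map[uint64]bool

// prefixScore counts how many leading prompt blocks are already cached
// on the endpoint; the scheduler favors the longest match.
func prefixScore(cache endpointCache, endpoint, prompt string) int {
	blocks := cache[endpoint]
	score := 0
	for _, h := range hashBlocks(prompt) {
		if !blocks[h] {
			break // the prefix match ends at the first missing block
		}
		score++
	}
	return score
}

func main() {
	cache := endpointCache{"pod-a": {}, "pod-b": {}}

	turn1 := "You are a helpful assistant. User: What is llm-d? Assistant: ..."
	// Pretend pod-b served turn 1, so it holds the turn-1 prefix blocks.
	for _, h := range hashBlocks(turn1) {
		cache["pod-b"][h] = true
	}

	// Turn 2 resends the whole conversation with new text appended.
	turn2 := turn1 + " User: And how does prefix-aware routing help?"
	fmt.Println("pod-a score:", prefixScore(cache, "pod-a", turn2)) // no cached prefix
	fmt.Println("pod-b score:", prefixScore(cache, "pod-b", turn2)) // reuses turn-1 blocks
}
```

Routing turn 2 to pod-b lets the server reuse the turn-1 KV-cache entries instead of recomputing them from scratch.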
Load-Aware Scheduling
EPP continuously probes each endpoint's metrics by scraping /metrics at a regular interval (50ms by default). It scores endpoints on queue depth, running requests, and KV-cache utilization, scheduling each request to the endpoint with the lowest load and avoiding hotspots caused by heterogeneous request patterns.
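A minimal sketch of load-based selection, assuming illustrative metric field names and hand-picked weights; in EPP each signal is a separate, configurable scorer rather than a single hard-coded formula.

```go
package main

import "fmt"

// podMetrics holds the load signals scraped from each endpoint's
// /metrics; the field names here are illustrative, not the real
// metric names.
type podMetrics struct {
	Name            string
	QueueDepth      int     // requests waiting in the pod's queue
	RunningRequests int     // requests currently generating tokens
	KVCacheUsage    float64 // fraction of KV-cache blocks in use, 0.0-1.0
}

// loadScore combines the signals into a single cost; lower is better.
// The weights are arbitrary assumptions for this sketch.
func loadScore(m podMetrics) float64 {
	return 1.0*float64(m.QueueDepth) +
		0.5*float64(m.RunningRequests) +
		2.0*m.KVCacheUsage
}

// pickEndpoint returns the least-loaded pod, steering traffic away from
// hotspots created by heterogeneous request sizes.
func pickEndpoint(pods []podMetrics) podMetrics {
	best := pods[0]
	for _, p := range pods[1:] {
		if loadScore(p) < loadScore(best) {
			best = p
		}
	}
	return best
}

func main() {
	pods := []podMetrics{
		{Name: "pod-a", QueueDepth: 3, RunningRequests: 8, KVCacheUsage: 0.9},
		{Name: "pod-b", QueueDepth: 0, RunningRequests: 5, KVCacheUsage: 0.4},
	}
	fmt.Println("routing to:", pickEndpoint(pods).Name) // pod-b
}
```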
Further Reading
See EPP Architecture for more details.