Async Processor Architecture

The Async Processor is a lightweight dispatch agent that pulls inference requests from message queues and forwards them to the llm-d Router. It uses dispatch gates to regulate the dispatch rate based on system metrics, ensuring that dispatched workloads don't overwhelm the inference servers.

How It Works

  1. Poll — workers pull requests from one or more message queues.
  2. Gate — before dispatching, each request passes through a dispatch gate that checks whether the system has capacity. If the gate is closed (budget = 0), the request waits.
  3. Dispatch — the worker sends an HTTP request to the llm-d Router with deadline propagation.
  4. Result — on success, results are written back to a queue. On retryable failure (rate limiting, transient errors), the request is re-queued with exponential backoff.
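The four steps above can be sketched as a minimal worker loop. The `queue`, `gate`, and `router` objects here are hypothetical stand-ins for the real message-queue, dispatch-gate, and llm-d Router clients, and `RetryableError` is an illustrative exception type, not the processor's actual API.

```python
import time


class RetryableError(Exception):
    """Illustrative marker for failures worth re-queuing (e.g. rate limits)."""


class Worker:
    """Sketch of the poll -> gate -> dispatch -> result loop."""

    def __init__(self, queue, gate, router):
        self.queue, self.gate, self.router = queue, gate, router

    def run_once(self):
        request = self.queue.pull()                 # 1. Poll
        if request is None:
            return
        while not self.gate.allow():                # 2. Gate: wait while budget is 0
            time.sleep(0.1)
        try:
            result = self.router.dispatch(          # 3. Dispatch, propagating deadline
                request.payload, deadline=request.deadline)
        except RetryableError:
            self.queue.requeue(request)             # 4. Retryable failure: re-queue
        else:
            self.queue.publish_result(result)       # 4. Success: write result back
```

In a real deployment many such workers run concurrently, each repeating `run_once` until shutdown.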

Dispatch Gates

The dispatch gate controls the rate at which the processor dispatches requests. Each queue can have its own gate, allowing independent dispatch control per workload.

  • constant — Always open; no throttling.
  • redis — Reads a budget value from a Redis key, allowing external systems to control the dispatch rate.
  • prometheus-saturation — Queries Prometheus for model server saturation metrics. Dispatches only while saturation is below a configurable threshold.
  • prometheus-budget — Computes available capacity from downstream metrics.
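As an illustration, a saturation-style gate reduces to a single threshold check. The `query_saturation` callable below is a hypothetical hook (e.g. a Prometheus instant query returning a 0.0–1.0 saturation value); the real metric names and query are deployment-specific.

```python
class SaturationGate:
    """Sketch of a prometheus-saturation style dispatch gate."""

    def __init__(self, query_saturation, threshold: float = 0.8):
        # query_saturation: hypothetical callable returning current
        # model-server saturation in [0.0, 1.0].
        self.query_saturation = query_saturation
        self.threshold = threshold

    def allow(self) -> bool:
        # Open only while saturation is below the configured threshold.
        return self.query_saturation() < self.threshold
```

A `constant` gate is the degenerate case whose `allow()` always returns True.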

Message Queue Integrations

  • Redis Sorted Set — Persisted, priority-ordered by deadline. Supports per-queue gate configuration.
  • Redis Pub/Sub — Ephemeral, fan-out delivery. Single global gate.
  • GCP Pub/Sub — Cloud-native, scalable. Supports per-subscription gating.
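The deadline-ordered semantics of the sorted-set queue can be sketched in memory with a heap. In the Redis-backed version the enqueue would be a ZADD with the deadline as the score and the pull a ZPOPMIN; the heap here is only a stand-in to make the ordering behavior visible.

```python
import heapq
import json


class DeadlineQueue:
    """In-memory sketch of a deadline-ordered request queue.

    Redis analog (assumed): ZADD key deadline payload to push,
    ZPOPMIN key to pull the most urgent request.
    """

    def __init__(self):
        self._heap = []

    def push(self, deadline: float, payload: dict) -> None:
        # Lower deadline = more urgent = popped first.
        heapq.heappush(self._heap, (deadline, json.dumps(payload)))

    def pull(self):
        if not self._heap:
            return None
        deadline, raw = heapq.heappop(self._heap)
        return deadline, json.loads(raw)
```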

Concurrency and Retries

  • Worker pool — configurable number of concurrent workers (default 8) process requests in parallel.
  • Deadline enforcement — each request carries a deadline from the queue message. Workers abandon requests that cannot complete before their deadline.
  • Exponential backoff — retryable failures are re-queued with backoff (base 2s, max 60s, with jitter). Fatal errors (bad payload, unrecoverable failures) are not retried.
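With the stated parameters (base 2 s, max 60 s), the retry delay can be computed as below. The full-jitter strategy is an assumption; the processor may use a different jitter scheme.

```python
import random


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter (assumed scheme).

    attempt 0 -> up to 2 s, attempt 1 -> up to 4 s, ...,
    capped at 60 s regardless of attempt count.
    """
    delay = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, delay)
```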

Observability

Prometheus metrics include request totals, success/failure counts, retry counts, deadline-exceeded counts, shed request counts, and request latency histograms.