Async Processor Architecture

The Async Processor is a lightweight dispatch agent that pulls inference requests from message queues and forwards them to the llm-d Router. It uses dispatch gates to regulate the dispatch rate based on system metrics, ensuring that dispatched workloads don't overwhelm the inference servers.

How It Works

  1. Poll — workers pull requests from one or more message queues.
  2. Gate — before dispatching, each request passes through a dispatch gate that checks whether the system has capacity. If the gate is closed (budget = 0), the request waits.
  3. Dispatch — the worker sends an HTTP request to the llm-d Router with deadline propagation.
  4. Result — on success, results are written back to a queue. On retryable failure (rate limiting, transient errors), the request is re-queued with exponential backoff.
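The four steps above can be sketched as a minimal worker loop. The `queue`, `gate`, and `router` objects here are hypothetical stand-ins for the real message-queue, dispatch-gate, and llm-d Router clients, and `RetryableError` is an illustrative exception type, not the processor's actual API.

```python
import time


class RetryableError(Exception):
    """Illustrative marker for failures worth re-queuing (e.g. rate limits)."""


class Worker:
    """Sketch of the poll -> gate -> dispatch -> result loop."""

    def __init__(self, queue, gate, router):
        self.queue, self.gate, self.router = queue, gate, router

    def run_once(self):
        request = self.queue.pull()                 # 1. Poll
        if request is None:
            return
        while not self.gate.allow():                # 2. Gate: wait while budget is 0
            time.sleep(0.1)
        try:
            result = self.router.dispatch(          # 3. Dispatch, propagating deadline
                request.payload, deadline=request.deadline)
        except RetryableError:
            self.queue.requeue(request)             # 4. Retryable failure: re-queue
        else:
            self.queue.publish_result(result)       # 4. Success: write result back
```

In a real deployment many such workers run concurrently, each repeating `run_once` until shutdown.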

Dispatch Gates

The dispatch gate controls the rate at which the processor dispatches requests. Each queue can have its own gate, allowing independent dispatch control per workload.

  • constant — Always open; no throttling.
  • redis — Reads a budget value from a Redis key, allowing external systems to control the dispatch rate.
  • prometheus-saturation — Queries Prometheus for model server saturation metrics. Dispatches only while saturation is below a configurable threshold.
  • prometheus-budget — Computes available capacity from downstream metrics.
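As an illustration, a saturation-style gate reduces to a single threshold check. The `query_saturation` callable below is a hypothetical hook (e.g. a Prometheus instant query returning a 0.0–1.0 saturation value); the real metric names and query are deployment-specific.

```python
class SaturationGate:
    """Sketch of a prometheus-saturation style dispatch gate."""

    def __init__(self, query_saturation, threshold: float = 0.8):
        # query_saturation: hypothetical callable returning current
        # model-server saturation in [0.0, 1.0].
        self.query_saturation = query_saturation
        self.threshold = threshold

    def allow(self) -> bool:
        # Open only while saturation is below the configured threshold.
        return self.query_saturation() < self.threshold
```

A `constant` gate is the degenerate case whose `allow()` always returns True.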

Message Queue Integrations

  • Redis Sorted Set — Persisted, priority-ordered by deadline. Supports per-queue gate configuration.
  • Redis Pub/Sub — Ephemeral, fan-out delivery. Single global gate.
  • GCP Pub/Sub — Cloud-native, scalable. Supports per-subscription gating.
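The deadline-ordered semantics of the sorted-set queue can be sketched in memory with a heap. In the Redis-backed version the enqueue would be a ZADD with the deadline as the score and the pull a ZPOPMIN; the heap here is only a stand-in to make the ordering behavior visible.

```python
import heapq
import json


class DeadlineQueue:
    """In-memory sketch of a deadline-ordered request queue.

    Redis analog (assumed): ZADD key deadline payload to push,
    ZPOPMIN key to pull the most urgent request.
    """

    def __init__(self):
        self._heap = []

    def push(self, deadline: float, payload: dict) -> None:
        # Lower deadline = more urgent = popped first.
        heapq.heappush(self._heap, (deadline, json.dumps(payload)))

    def pull(self):
        if not self._heap:
            return None
        deadline, raw = heapq.heappop(self._heap)
        return deadline, json.loads(raw)
```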

Concurrency and Retries

  • Worker pool — configurable number of concurrent workers (default 8) process requests in parallel.
  • Deadline enforcement — each request carries a deadline from the queue message. Workers abandon requests that cannot complete before their deadline.
  • Exponential backoff — retryable failures are re-queued with backoff (base 2s, max 60s, with jitter). Fatal errors (bad payload, unrecoverable failures) are not retried.
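With the stated parameters (base 2 s, max 60 s), the retry delay can be computed as below. The full-jitter strategy is an assumption; the processor may use a different jitter scheme.

```python
import random


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter (assumed scheme).

    attempt 0 -> up to 2 s, attempt 1 -> up to 4 s, ...,
    capped at 60 s regardless of attempt count.
    """
    delay = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, delay)
```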

Observability

Prometheus metrics include request totals, success/failure counts, retry counts, deadline-exceeded counts, shed request counts, and request latency histograms.