Autoscaling

With autoscaling, model servers are added or removed automatically to keep serving capacity aligned with inference demand. llm-d autoscalers consume three categories of scaling signals — supply-side, demand-side, and SLO-driven — surfaced through two complementary systems:

  • HPA/KEDA - Uses demand-side signals (EPP queue depth and active request counts) to scale model server replicas via Kubernetes HPA or KEDA. Well-suited for homogeneous deployments where each model scales independently.

    See HPA/KEDA for complete details on the HPA/KEDA design.

  • Workload Variant Autoscaler (WVA) - A global optimizer that, given an inventory of available accelerators, determines how to optimally place model servers — potentially serving different base models — onto those accelerators. WVA consumes supply-side signals (KV cache utilization, model server queue depth) or SLO-driven signals to proactively meet latency targets specified in its configuration. It accounts for heterogeneous hardware, disaggregated serving roles (prefill, decode, or both), and changing traffic patterns. When the accelerator inventory is insufficient to meet all targets, WVA degrades gracefully by prioritizing placement decisions that maximize overall SLO attainment.

    See Workload Variant Autoscaler (WVA) for complete details on the WVA design.
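For the HPA/KEDA path, scaling on a demand-side signal can be sketched as a KEDA ScaledObject that queries a queue-depth metric from Prometheus. This is an illustrative sketch only: the Deployment name, Prometheus address, and metric query below are placeholders, not values shipped by llm-d.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-server-scaler
spec:
  scaleTargetRef:
    name: llm-server            # hypothetical model server Deployment
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090   # placeholder address
        # Hypothetical query for waiting requests across replicas;
        # substitute the queue-depth metric your stack actually exposes.
        query: sum(vllm:num_requests_waiting)
        threshold: "10"         # add a replica per ~10 queued requests
```

KEDA translates this into an HPA under the hood, so each model scales independently on its own demand signal, matching the homogeneous-deployment case described above.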

Features Matrix

|                        | HPA/KEDA | WVA |
|------------------------|----------|-----|
| Scaling Signals        | IGW queue depth and running request count | KV cache utilization, model server queue depth, SLO targets (Experimental), IGW queue size (Experimental) |
| Multiple Variants      | Unsupported | Supported — optimally places across models and topologies to minimize cost |
| Limited Accelerators   | First come, first served | Fair-share allocation |
| Scale to Zero          | Supported | Supported |
| Strong Latency SLOs    | Not guaranteed | Supported by learning supply/demand dynamics and scaling proactively to meet targets (Experimental) |
| Pending Pods Awareness | Unsupported — external metrics do not account for pending (unscheduled) pods | Supported — incorporates pending pod state into scaling decisions |
| Operational Complexity | Low — standard Kubernetes HPA/KEDA only | Medium — requires the WVA controller and the VariantAutoscaling CRD |
Note: Native Kubernetes HPA scale-to-zero requires the cluster to enable the HPAScaleToZero feature gate. KEDA-based scale-to-zero is an alternative when that feature gate is not enabled. For WVA-specific requirements, see the linked design documentation.
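When KEDA is in use, scale-to-zero does not depend on the HPAScaleToZero feature gate: KEDA itself handles the zero-to-one activation and only delegates one-to-N scaling to the underlying HPA. A minimal sketch of the relevant fields (the rest of the ScaledObject is elided):

```yaml
spec:
  minReplicaCount: 0   # KEDA scales the workload to zero when triggers are idle
  maxReplicaCount: 8   # and reactivates it when a trigger fires
```

The trade-off is a cold start on the first request after idling, so scale-to-zero suits infrequently used models rather than latency-sensitive ones.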

Choosing an Approach

  • HPA/KEDA - Homogeneous hardware, independent per-model scaling, demand-side signals only.
  • WVA - Heterogeneous hardware, multiple serving variants, supply-constrained environments, and/or SLO-driven scaling.
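To make the WVA option concrete, opting a model into WVA-managed scaling means creating a VariantAutoscaling resource for it. The shape below is an assumption for illustration only — the API group, version, and every field name are hypothetical, not the actual schema; consult the WVA design documentation for the real API.

```yaml
# Illustrative only: all field names here are assumptions, not the
# actual VariantAutoscaling schema — see the WVA design docs.
apiVersion: llmd.ai/v1alpha1          # hypothetical group/version
kind: VariantAutoscaling
metadata:
  name: llama-3-8b
spec:
  modelID: meta-llama/Llama-3.1-8B-Instruct   # example model
  sloTargets:                 # hypothetical SLO-driven signal fields
    ttftSeconds: 0.5
    itlSeconds: 0.05
  accelerators:               # hypothetical inventory constraint
    - type: nvidia-l40s
      max: 4
```

The key contrast with HPA/KEDA is that this object declares targets and an accelerator budget rather than a metric threshold; WVA's global optimizer decides the replica counts and placements needed to meet them.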