Remote Direct Memory Access (RDMA) and Networking Configuration

Why Networking Matters

In prefill/decode disaggregation, the Key-Value (KV) Cache must transfer from prefill to decode workers before the first token can be generated. This transfer time lands directly on Time to First Token (TTFT) — and the cost grows with context length and model size.

For Wide Expert Parallelism, all-to-all GPU communication across nodes is on the critical path for every token generated.

Networking is a first-order concern for distributed inference latency.

The Networking Stack

llm-d uses a layered networking stack for KV Cache transfers and inter-node communication:

[Figure: Networking Stack]

NIXL

NIXL (NVIDIA Inference Xfer Library) is the transfer library used by vLLM to move KV Cache between GPUs. It abstracts the underlying transport behind a unified API, so vLLM can initiate transfers without knowledge of the network fabric.

NIXL operates in a pull-based model: the decode pod fetches KV Cache blocks directly from the prefill pod's GPU memory using one-sided RDMA reads, so the transfer completes without active participation from the prefill pod's CPU. This reduces synchronization overhead.

Key capabilities:

  • Works across InfiniBand, RDMA over Converged Ethernet (RoCE), Elastic Fabric Adapter (EFA), and TCP
  • Supports GPU memory (VRAM), CPU Dynamic RAM (DRAM), and storage backends
  • Plugin architecture for adding new transport backends
  • Supports Tensor Parallel (TP) heterogeneity (prefill and decode can use different tensor-parallel sizes)

UCX

UCX (Unified Communication X) is the default transport backend for NIXL. It is a mature, open-source communication framework with broad adoption across High-Performance Computing (HPC) clusters. UCX abstracts RDMA transports (InfiniBand, RoCE), shared memory, and TCP behind a single API.

UCX is a good default: it is battle-tested, widely supported, and works across most hardware. However, it was designed for HPC workloads and carries complexity that can make it harder to tune for AI inference traffic patterns.

UCCL

UCCL (Unified Cloud Communication Library) is a newer transport backend integrated into NIXL as of llm-d v0.5. It implements a CPU-managed software transport stack — managing transport logic on the CPU rather than relying solely on network interface card (NIC) hardware offload. This enables fine-grained flow splitting and adaptive congestion control.

UCCL currently supports:

  • Native RDMA (InfiniBand/RoCE)
  • GPUDirect TCP-X (Google Cloud)
  • TCP
  • EFA (AWS)

Currently, UCCL must be built for a specific transport with the USE_TCPX/USE_TCP/USE_EFA flag (refer to the build instructions); runtime transport selection is planned for a future release. UCCL automatically discovers NICs based on PCIe proximity during memory registration, removing the need for manual NIC-to-GPU mapping in most cases.

libfabric

On AWS, NIXL uses libfabric as the transport backend. EFA (Elastic Fabric Adapter) requires OpenFabrics Interfaces (OFI) — UCX does not support EFA natively. The libfabric plugin provides multi-rail RDMA (using multiple network paths simultaneously for higher bandwidth) with topology-aware GPU-to-EFA mapping via hwloc.
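
Before debugging at the NIXL layer on AWS, it can help to confirm that libfabric sees the EFA provider from inside the pod. Assuming the libfabric utilities are installed in the image (they ship with the AWS EFA installer), the check is:

# List libfabric providers; EFA devices should appear under the "efa" provider
fi_info -p efa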

Choosing a Transport Backend

| Environment | Backend | Rationale |
| --- | --- | --- |
| On-premise InfiniBand / RoCE | UCX | Mature, battle-tested on HPC fabrics with dedicated, uncongested paths |
| Cloud with RoCE (GKE, Azure, etc.) | UCCL | Software packet spraying avoids single-path congestion on shared fabric |
| GKE with GPUDirect TCP-X | UCCL | Native support for Google's GPU-initiated TCP transport |
| AWS with EFA | libfabric/UCCL | EFA requires OFI/libfabric; UCX doesn't support EFA |
| TCP-only (XPU, HPU, CPU) | UCX/UCCL | Simplest configuration for non-RDMA environments |

The core tradeoff:

  • UCX offloads transport to NIC hardware — it works best when the network fabric has dedicated, uncongested paths, typical of on-premise HPC clusters with InfiniBand.
  • UCCL manages transport in software on the CPU — it splits traffic across up to 256 network paths with adaptive congestion control. This matters in cloud environments where network paths are shared and individual paths may be congested.
  • libfabric is the default option for AWS EFA. UCCL also supports EFA but requires compilation with the USE_EFA flag. UCX does not support EFA.

NIXL selects the backend based on what is available and the memory types involved. You control which backends are loaded at agent creation time.

Configuration

vLLM KV Transfer

Enable NIXL-based KV Cache transfer via the --kv-transfer-config flag:

vllm serve <model> \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

The kv_role is kv_both for both prefill and decode pods — each pod can both send and receive KV Cache.

For XPU and HPU devices where KV transfer happens via CPU memory, add:

--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device":"cpu"}'

Backend Selection

NIXL uses the UCX backend by default. The transport backend can be changed via kv_connector_extra_config. For example, to configure NIXL with the UCCL backend:

vllm serve <model> \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_connector_extra_config":{"backends":["UCCL"]}}'

NIXL Side Channel

NIXL uses a side channel for metadata exchange between pods. Configure with:

| Variable | Description | Default |
| --- | --- | --- |
| VLLM_NIXL_SIDE_CHANNEL_HOST | Pod IP (use status.podIP fieldRef) | Required |
| VLLM_NIXL_SIDE_CHANNEL_PORT | Metadata exchange port | 5557 |
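
A minimal sketch of how these could be set in the model server container spec, using the Kubernetes downward API to inject the pod IP (the port value shown is the default from the table):

env:
  - name: VLLM_NIXL_SIDE_CHANNEL_HOST
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  - name: VLLM_NIXL_SIDE_CHANNEL_PORT
    value: "5557"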

UCX Transport Selection

UCX transport is configured via environment variables:

| Variable | Description | Example |
| --- | --- | --- |
| UCX_TLS | Transport layers (TLS) to use | sm,cuda_ipc,cuda_copy,rc,tcp |
| UCX_SOCKADDR_TLS_PRIORITY | Priority for socket-based transport layers | tcp |
| UCX_PROTO_INFO | Print transport selection details for verification | y |
| UCX_NET_DEVICES | Network devices to use for transport | mlx5_0:1, mlx5_1:1 |

For RDMA-capable clusters, UCX will automatically use RDMA verbs when available. For TCP-only clusters (XPU, HPU), set UCX_TLS=tcp.
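
As an illustration, a container env block for an RDMA-capable cluster might look like the following sketch (the device names are examples; adjust to the NICs in your nodes):

env:
  - name: UCX_TLS
    value: "sm,cuda_ipc,cuda_copy,rc,tcp"
  - name: UCX_NET_DEVICES
    value: "mlx5_0:1,mlx5_1:1"
  - name: UCX_PROTO_INFO
    value: "y"    # prints transport selection at startup for verification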

RDMA Resources and Capabilities

RDMA requires device resources and elevated capabilities in the pod spec:

resources:
  limits:
    rdma/roce_gdr: "2"
  requests:
    rdma/roce_gdr: "2"
securityContext:
  capabilities:
    add:
      - IPC_LOCK
      - SYS_RAWIO
      - NET_ADMIN
      - NET_RAW

NIC Selection

Use NCCL_EXCLUDE_IB_HCA to exclude specific Host Channel Adapters (HCAs) from NVIDIA Collective Communications Library (NCCL) traffic (e.g., management NICs):

- name: NCCL_EXCLUDE_IB_HCA
value: "mlx5_0,mlx5_2,mlx5_4,mlx5_8"

For Wide Expert Parallelism, map GPUs to specific HCAs for optimal topology:

- name: DEEP_EP_DEVICE_TO_HCA_MAPPING
value: "0:mlx5_0:1,1:mlx5_1:1,2:mlx5_2:1,3:mlx5_3:1,4:mlx5_4:1,5:mlx5_5:1,6:mlx5_6:1,7:mlx5_7:1"

Platform-Specific Notes

GKE

  • Use GKE multi-NIC annotations for RDMA interfaces:
    annotations:
      networking.gke.io/default-interface: eth0
      networking.gke.io/interfaces: '[{"interfaceName":"eth0","network":"default"}, ...]'
  • Source set_nccl_env.sh from /usr/local/gib/scripts/ at container startup
  • Set NVSHMEM_DISABLED_GDRCOPY=true (GKE recommendation)
  • Use pod affinity on cloud.google.com/gce-topology-block for topology-aware placement (see the sketch after this list)
  • GPU-initiated RDMA requires privileged: true in the security context
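
A minimal sketch of the topology-aware affinity rule mentioned above, assuming a hypothetical label (app: llm-d-decode) shared by the pods that should land in the same topology block:

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: llm-d-decode        # hypothetical label on the co-located pods
        topologyKey: cloud.google.com/gce-topology-block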

OpenShift / OCP

  • Use Multus CNI for secondary RDMA networks:
    annotations:
      k8s.v1.cni.cncf.io/networks: "multi-nic-compute"
  • Request rdma/roce_gdr device resources as shown above

AWS (EFA)

  • EFA support is built into the llm-d CUDA image when ENABLE_EFA=true
  • NIXL uses the libfabric backend (not UCX) — see Choosing a Transport Backend
  • Requires libfabric v1.21.0+ (or latest AWS EFA installer)
  • The libfabric plugin auto-discovers GPU-to-EFA topology via hwloc for optimal multi-rail placement
  • The UCCL backend also supports EFA; however, it must be compiled with the USE_EFA option (see UCCL)

Verifying Network Performance

After deploying model servers, verify two things:

1. GPU Topology

Confirm GPUs within each pod are optimally connected:

# NVIDIA
nvidia-smi topo -m # Look for NV/PIX, not SYS or PHB
nvidia-smi nvlink --status # Verify NVLink is active

# AMD
rocm-smi --showtopo # Confirm Infinity Fabric connectivity

GPUs showing SYS or PHB topology are communicating over PCIe across Non-Uniform Memory Access (NUMA) nodes — this adds latency, especially for collective operations.

2. Inter-Pod Network

Verify RDMA connectivity and bandwidth between pods:

# Check RDMA devices are available
ibv_devinfo
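
For a raw point-to-point bandwidth check that is independent of NIXL, the standard perftest tools can be used if they are available in the image (device name is illustrative):

# On the first pod (acts as server)
ib_write_bw -d mlx5_0 --report_gbits

# On the second pod, pointing at the first pod's IP
ib_write_bw -d mlx5_0 --report_gbits <SERVER_POD_IP>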

Run the NIXL benchmark, nixlbench, between the prefill and decode pods. nixlbench requires an etcd server for peer coordination when using network backends.

Start a standalone etcd (e.g., in one of the pods or as a separate pod):

etcd --listen-client-urls http://0.0.0.0:2379 \
--advertise-client-urls http://$(hostname -i):2379 \
--listen-peer-urls http://0.0.0.0:2380 \
--initial-advertise-peer-urls http://$(hostname -i):2380 \
--initial-cluster "default=http://$(hostname -i):2380" &

From the prefill/decode pods, run:

nixlbench --etcd_endpoints http://<ETCD_SERVER_IP>:2379 \
--backend <UCX/UCCL/LIBFABRIC> \
--op_type=READ --check-consistency \
--start_batch_size=100 --max_batch_size=100 \
--max-block-size=85899340

The test above runs the NIXL benchmark with the specified backend for message sizes of 1 GB to 8 GB and reports throughput, latency, and related metrics.

If throughput is significantly below the expected line rate for your fabric, check NIC affinity, Maximum Transmission Unit (MTU) settings, and whether traffic is falling back to TCP (set UCX_PROTO_INFO=y for the UCX backend to verify transport selection).

If vLLM is already running, GPU memory may be insufficient for in-pod benchmarks. Add a pre-start script that runs tests before vLLM launches and blocks until a condition is met (e.g., removal of a sentinel file). In the future, this diagnostic will be automated as runtime scripts.
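
A minimal sketch of such a pre-start gate, assuming a hypothetical sentinel file path; it runs the diagnostics while GPU memory is still free, then blocks until the sentinel file is removed before handing off to vLLM:

#!/bin/bash
# Hypothetical pre-start gate (sketch): run diagnostics, then wait for a sentinel file to be removed.
SENTINEL=/tmp/hold-before-vllm
touch "$SENTINEL"

ibv_devinfo || true            # record RDMA device state in the pod logs
# ... run nixlbench here while GPU memory is still free ...

# Block until an operator removes the sentinel, e.g. kubectl exec <pod> -- rm /tmp/hold-before-vllm
while [ -f "$SENTINEL" ]; do sleep 5; done

exec vllm serve "$@"           # hand off to the regular entrypoint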

Further Reading