
NixlConnector Compatibility Matrix

This page documents the feature compatibility of disaggregated prefilling with the NixlConnector. For general usage instructions, see the NixlConnector Usage Guide. For an overview of disaggregated prefilling, see Disaggregated Prefilling.

Note

This page reflects the current state of the codebase and is subject to change as features evolve. Entries marked 🟠 or ❌ may link to tracking issues. See the NIXL connector roadmap for upcoming feature development.

Legend:

  • ✅ = Fully supported
  • 🟠 = Partial support (see footnotes)
  • ❌ = Not supported
  • ❔ = Unknown / not yet validated
  • 🚧 = Work in progress

Universally supported features

The following features work with all model architectures when using NixlConnector PD disaggregated serving:

  • Chunked Prefill
  • APC (Prefix Caching)
  • Data Parallel
  • CUDA graph
  • Logprobs
  • Prompt Logprobs
  • Prompt Embeds
  • Multiple NIXL backends (UCX, GDS, LIBFABRIC, etc.)
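As a minimal sketch of a NixlConnector P/D pair (the model name, ports, and kv_role value here are placeholders, not prescriptions; see the NixlConnector Usage Guide for authoritative flags):

```shell
# Prefill (P) instance; model and port are placeholders
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8100 \
  --kv-transfer-config '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'

# Decode (D) instance, typically on a separate GPU or host
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8200 \
  --kv-transfer-config '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
```

A proxy or router in front of the two instances then directs prefill and decode traffic; that component is outside the scope of this page.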

Model Architecture x Capability

| Model type | Basic PD | Spec Decode | Hetero TP | Cross-layer blocks | SWA | Host buffer | Hetero block size |
|---|---|---|---|---|---|---|---|
| Dense Transformers | ✅ | ✅¹ | ✅ | ✅² | ✅ | ✅ | 🟠³ |
| MLA (e.g. DeepSeek-V2/V3) | ✅ | ✅¹ | 🟠⁴ | ✅² | ✅ | ✅ | 🟠³ |
| Sparse MLA (e.g. DeepSeek-V3.2) | ✅ | ✅¹ | 🟠⁴ | ✅² | ✅ | ✅ | 🟠³ |
| Hybrid SSM / Mamba | ✅ | ❔ | 🚧⁵ | ❌ | ✅ | ✅ | ❌⁶ |
| MoE | ✅ | ✅¹ | ✅ | ✅² | ✅ | ✅ | 🟠³ |
| Multimodal | ❔ | ❔ | ❔ | ❔ | ❔ | ❔ | ❔ |
| Encoder-Decoder | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |

¹ P and D instances must use the same speculation configuration.

² Requires FLASH_ATTN or FLASHINFER backend and HND KV cache layout. Enable via --kv-transfer-config '{"kv_connector_extra_config": {"enable_cross_layers_blocks": "True"}}'.

³ Supported only when HMA is not required (i.e., non-hybrid models). Block IDs are remapped automatically. Only P block size < D block size is supported.

⁴ MLA KV cache is replicated across TP workers, so heterogeneous TP works but there is no head-splitting. When P TP > D TP, only a single read is executed (redundant ranks are skipped). D TP > P TP also works.

⁵ Hybrid SSM (Mamba) models require homogeneous TP (P TP == D TP). Heterogeneous TP is not yet supported for Mamba layers.

⁶ HMA (required by hybrid models) does not support different remote block sizes.
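The cross-layer blocks toggle from footnote 2 goes through the connector's kv_connector_extra_config. A hedged example combining it with a NixlConnector launch (the model name and kv_role are placeholders; any other flags your deployment needs are omitted):

```shell
# Cross-layer blocks: requires FLASH_ATTN or FLASHINFER and HND layout
vllm serve <model> \
  --kv-transfer-config '{"kv_connector": "NixlConnector",
                         "kv_role": "kv_both",
                         "kv_connector_extra_config": {"enable_cross_layers_blocks": "True"}}'
```

Use the same attention backend on both P and D, since the backend is part of the compatibility hash described under Configuration Notes.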

Configuration Notes

What must match between P and D

By default, a compatibility hash is checked during handshake. P and D instances must agree on:

  • vLLM version and NIXL connector version
  • Model (architecture, dtype, number of KV heads, head size, number of hidden layers)
  • Attention backend
  • KV cache dtype (cache_dtype)

Warning

Disable the hash check with --kv-transfer-config '{"kv_connector_extra_config": {"enforce_handshake_compat": false}}' at your own risk.
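For completeness, a full invocation with the check disabled might look like this (the model name and kv_role value are placeholders; the JSON shape follows the flag shown in the warning above):

```shell
# At your own risk: skips the P/D compatibility hash check at handshake
vllm serve <model> \
  --kv-transfer-config '{"kv_connector": "NixlConnector",
                         "kv_role": "kv_both",
                         "kv_connector_extra_config": {"enforce_handshake_compat": false}}'
```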

What can safely differ between P and D

  • tensor-parallel-size (heterogeneous TP, subject to model restrictions above)
  • block-size (heterogeneous block size, subject to restrictions above)
  • Number of KV cache blocks (determined by available memory on each instance)
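Putting those together, a heterogeneous P/D pair could be launched roughly as follows (the model name and the specific TP/block-size values are illustrative; per footnote 3, the P block size must be smaller than the D block size):

```shell
# P instance: TP=4, block size 16
vllm serve <model> --tensor-parallel-size 4 --block-size 16 \
  --kv-transfer-config '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'

# D instance: TP=2, block size 32 (P block size < D block size)
vllm serve <model> --tensor-parallel-size 2 --block-size 32 \
  --kv-transfer-config '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
```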

KV cache layout

  • NixlConnector defaults to HND layout for optimal transfer performance (non-MLA models).
  • NHD layout is supported but does not allow heterogeneous TP head splitting.
  • Experimental HND ↔ NHD permute: enable via --kv-transfer-config '{"enable_permute_local_kv": true}'. Not supported with HMA.
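A sketch of the experimental permute toggle, with the flag placed at the top level of the transfer config as the bullet above shows (model name and kv_role are placeholders):

```shell
# Experimental: permute local KV between HND and NHD layouts; not supported with HMA
vllm serve <model> \
  --kv-transfer-config '{"kv_connector": "NixlConnector",
                         "kv_role": "kv_both",
                         "enable_permute_local_kv": true}'
```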

Quantized KV cache

Quantized KV cache (e.g., FP8) requires both P and D instances to use the same cache_dtype. Mismatched cache dtypes will fail the compatibility hash check during handshake.

  • Static quantization (scales loaded from checkpoint): โœ… Supported. Scales are loaded independently by each instance from the model checkpoint.
  • Dynamic quantization (scales computed at runtime): โŒ Not supported. Per-block scales are not transferred alongside KV cache data.
  • Packed-layout scales (scales stored inline with weights): โœ… Supported. Scales are transferred together with the KV cache blocks.
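For static FP8 quantization, both instances simply pass the same cache dtype; a hedged example (the model name is a placeholder, and the checkpoint must provide its own scales):

```shell
# Run the same flags on both the P and the D instance;
# a cache_dtype mismatch fails the handshake compatibility check
vllm serve <model> --kv-cache-dtype fp8 \
  --kv-transfer-config '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
```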