# NixlConnector Compatibility Matrix
This page documents the feature compatibility of disaggregated prefilling with the NixlConnector. For general usage instructions, see the NixlConnector Usage Guide. For an overview of disaggregated prefilling, see Disaggregated Prefilling.
Note
This page reflects the current state of the codebase and is subject to change as features evolve. Entries marked 🟡 or ❌ may link to tracking issues. See the NIXL connector roadmap for upcoming feature development.
Legend:
- ✅ = Fully supported
- 🟡 = Partial support (see footnotes)
- ❌ = Not supported
- ❓ = Unknown / not yet validated
- 🚧 = Work in progress
## Universally supported features
The following features work with all model architectures when using NixlConnector PD disaggregated serving:
- Chunked Prefill
- APC (Prefix Caching)
- Data Parallel
- CUDA graph
- Logprobs
- Prompt Logprobs
- Prompt Embeds
- Multiple NIXL backends (UCX, GDS, LIBFABRIC, etc.)
## Model Architecture x Capability
| Model type | Basic PD | Spec Decode | Hetero TP | Cross-layer blocks | SWA | Host buffer | Hetero block size |
|---|---|---|---|---|---|---|---|
| Dense Transformers | ✅ | ✅ 1 | ✅ | ✅ 2 | ✅ | ✅ | 🟡 3 |
| MLA (e.g. DeepSeek-V2/V3) | ✅ | ✅ 1 | 🟡 4 | ✅ 2 | ✅ | ✅ | 🟡 3 |
| Sparse MLA (e.g. DeepSeek-V3.2) | ✅ | ✅ 1 | 🟡 4 | ✅ 2 | ✅ | ✅ | 🟡 3 |
| Hybrid SSM / Mamba | ✅ | ❓ | 🚧 5 | ❌ | ✅ | ✅ | ❌ 6 |
| MoE | ✅ | ✅ 1 | ✅ | ✅ 2 | ✅ | ✅ | 🟡 3 |
| Multimodal | ✅ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Encoder-Decoder | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
1 P and D instances must use the same speculation configuration.
2 Requires FLASH_ATTN or FLASHINFER backend and HND KV cache layout. Enable via `--kv-transfer-config '{"kv_connector_extra_config": {"enable_cross_layers_blocks": "True"}}'`.
3 Supported only when HMA is not required (i.e., non-hybrid models). Block IDs are remapped automatically. Only P block size < D block size is supported.
4 MLA KV cache is replicated across TP workers, so heterogeneous TP works but there is no head-splitting. When P TP > D TP, only a single read is executed (redundant ranks are skipped). D TP > P TP also works.
5 Hybrid SSM (Mamba) models require homogeneous TP (P TP == D TP). Heterogeneous TP is not yet supported for Mamba layers.
6 HMA (required by hybrid models) does not support different remote block sizes.
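The block-ID remapping described in footnote 3 can be sketched as follows. This is illustrative only: the function name, its arguments, and the assumption of contiguous logical block indices are ours, not the connector's actual implementation, which remaps entries in the request's block table.

```python
def remap_block_ids(local_block_ids, local_block_size, remote_block_size):
    """Map decode-side (D) block IDs to prefill-side (P) block IDs.

    Illustrative sketch assuming contiguous logical blocks. Only
    remote (P) block size < local (D) block size is supported, so
    each larger D block spans a fixed number of smaller P blocks.
    """
    assert local_block_size % remote_block_size == 0
    ratio = local_block_size // remote_block_size
    # Each local block b covers remote blocks [b*ratio, (b+1)*ratio).
    return [b * ratio + i for b in local_block_ids for i in range(ratio)]

# A decode block of 32 tokens spans two prefill blocks of 16 tokens:
print(remap_block_ids([0, 2], local_block_size=32, remote_block_size=16))
# -> [0, 1, 4, 5]
```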
## Configuration Notes
### What must match between P and D
By default, a compatibility hash is checked during handshake. P and D instances must agree on:
- vLLM version and NIXL connector version
- Model (architecture, dtype, number of KV heads, head size, number of hidden layers)
- Attention backend
- KV cache dtype (`cache_dtype`)
Warning
Disable the hash check with `--kv-transfer-config '{"kv_connector_extra_config": {"enforce_handshake_compat": false}}'` at your own risk.
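The idea behind the handshake check can be sketched as follows. The field names and values below are hypothetical and the real connector hashes its own metadata structure; the point is only that P and D must produce identical digests over the fields listed above for the transfer to proceed.

```python
import hashlib
import json

def compat_hash(info: dict) -> str:
    """Illustrative compatibility hash over instance metadata.

    Sketch only: canonicalize the metadata, then digest it. Any
    mismatch in a hashed field yields a different digest, and the
    handshake is rejected.
    """
    canonical = json.dumps(info, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

prefill = {                            # hypothetical values
    "vllm_version": "x.y.z",
    "connector_version": 1,
    "model_arch": "LlamaForCausalLM",
    "dtype": "bfloat16",
    "num_kv_heads": 8,
    "head_size": 128,
    "num_hidden_layers": 32,
    "attn_backend": "FLASH_ATTN",
    "cache_dtype": "auto",
}
decode = dict(prefill, cache_dtype="fp8")  # mismatched KV cache dtype

print(compat_hash(prefill) == compat_hash(decode))  # False: handshake fails
```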
### What can safely differ between P and D
- `tensor-parallel-size` (heterogeneous TP, subject to the model restrictions above)
- `block-size` (heterogeneous block size, subject to the restrictions above)
- Number of KV cache blocks (determined by available memory on each instance)
### KV cache layout
- NixlConnector defaults to the `HND` layout for optimal transfer performance (non-MLA models). The `NHD` layout is supported but does not allow heterogeneous TP head splitting.
- Experimental `HND`→`NHD` permute: enable via `--kv-transfer-config '{"enable_permute_local_kv": true}'`. Not supported with HMA.
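The difference between the two layouts is only the ordering of the heads and tokens axes within each block. The toy sketch below (our own nested-list model, not the connector's tensor code) shows the permute; in practice this is a device-side tensor transpose.

```python
def hnd_to_nhd(cache):
    """Permute a KV cache from HND to NHD layout.

    Toy sketch with nested lists, indexed as
    cache[block][head][token][dim] -> out[block][token][head][dim].
    The real connector permutes the device tensor instead.
    """
    return [
        [[head[t] for head in block] for t in range(len(block[0]))]
        for block in cache
    ]

# One block, 2 KV heads, 3 tokens, head_size 1:
hnd = [[[["h0t0"], ["h0t1"], ["h0t2"]],
        [["h1t0"], ["h1t1"], ["h1t2"]]]]
nhd = hnd_to_nhd(hnd)
print(nhd[0][0])  # ->  [['h0t0'], ['h1t0']]  (both heads for token 0)
```

Token-major (`NHD`) storage keeps each token's heads adjacent, which is why `HND`, with all of a head's tokens contiguous, is the better fit for bulk per-head transfers and for splitting heads across heterogeneous TP ranks.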
### Quantized KV cache
Quantized KV cache (e.g., FP8) requires both P and D instances to use the same cache_dtype. Mismatched cache dtypes will fail the compatibility hash check during handshake.
- Static quantization (scales loaded from checkpoint): ✅ Supported. Scales are loaded independently by each instance from the model checkpoint.
- Dynamic quantization (scales computed at runtime): ❌ Not supported. Per-block scales are not transferred alongside KV cache data.
- Packed-layout scales (scales stored inline with weights): ✅ Supported. Scales are transferred together with the KV cache blocks.