vllm.v1.attention.ops.merge_attn_states ¶
merge_attn_states ¶
```python
merge_attn_states(
    output: Tensor,
    prefix_output: Tensor,
    prefix_lse: Tensor,
    suffix_output: Tensor,
    suffix_lse: Tensor,
    output_lse: Tensor | None = None,
    prefill_tokens_with_context: int | None = None,
) -> None
```
Merge partial attention outputs from prefix (KV cache) and suffix (new tokens) into a single output tensor using the log-sum-exp (LSE) rescaling method described in section 2.2 of https://www.arxiv.org/pdf/2501.01005.
For tokens that have prefix context (token index < prefill_tokens_with_context), the prefix and suffix partial outputs are combined as a weighted sum. For tokens without prefix context, the suffix output is copied directly.
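To make the LSE rescaling concrete, here is a minimal PyTorch sketch of the merge for the shapes documented below. The name `merge_attn_states_ref` is illustrative only and is not part of vLLM; the real kernel operates in place on `output`, while this sketch returns new tensors.

```python
import torch


def merge_attn_states_ref(prefix_output, prefix_lse, suffix_output, suffix_lse):
    # prefix_output / suffix_output: [NUM_TOKENS, NUM_HEADS, HEAD_SIZE]
    # prefix_lse / suffix_lse:       [NUM_HEADS, NUM_TOKENS]
    max_lse = torch.maximum(prefix_lse, suffix_lse)      # [H, T]
    p_w = torch.exp(prefix_lse - max_lse)                # [H, T], numerically safe
    s_w = torch.exp(suffix_lse - max_lse)
    denom = p_w + s_w
    # Reshape the per-(head, token) weights to [T, H, 1] so they
    # broadcast over HEAD_SIZE.
    p_scale = (p_w / denom).transpose(0, 1).unsqueeze(-1)
    s_scale = (s_w / denom).transpose(0, 1).unsqueeze(-1)
    output = p_scale * prefix_output + s_scale * suffix_output
    output_lse = max_lse + torch.log(denom)              # merged LSE, [H, T]
    return output, output_lse
```

When the prefix and suffix LSE values are equal, both partial outputs carry equal attention mass and the merge reduces to a plain average.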
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output | Tensor | Output tensor of shape [NUM_TOKENS, NUM_HEADS, HEAD_SIZE]. | required |
| prefix_output | Tensor | Partial attention output over the prefix (KV cache), shape [NUM_TOKENS, NUM_HEADS, HEAD_SIZE]. | required |
| prefix_lse | Tensor | Log-sum-exp values for the prefix attention, shape [NUM_HEADS, NUM_TOKENS]. | required |
| suffix_output | Tensor | Partial attention output over the suffix (new KV), shape [NUM_TOKENS, NUM_HEADS, HEAD_SIZE]. | required |
| suffix_lse | Tensor | Log-sum-exp values for the suffix attention, shape [NUM_HEADS, NUM_TOKENS]. | required |
| output_lse | Tensor \| None | Optional tensor to store the merged LSE values, shape [NUM_HEADS, NUM_TOKENS]. If None, the merged LSE is not written out. | None |
| prefill_tokens_with_context | int \| None | Number of prefill tokens that have prefix context and therefore require merging; tokens at indices at or beyond this value have their suffix output copied directly. | None |
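The correctness of the merge can be checked against unsplit attention: attending over the full key/value set in one pass must give the same result as attending over a prefix chunk and a suffix chunk separately and then merging with the LSE weights. The sketch below does this for a single query vector and a single head; `partial_attn` is a hypothetical helper, not a vLLM API.

```python
import torch

torch.manual_seed(0)
D, N_PREFIX, N_SUFFIX = 8, 5, 3
q = torch.randn(D)
k = torch.randn(N_PREFIX + N_SUFFIX, D)
v = torch.randn(N_PREFIX + N_SUFFIX, D)
scale = D ** -0.5


def partial_attn(q, k, v):
    # Attention over one KV chunk: return the partial output and the
    # log-sum-exp of the (scaled, unnormalized) scores.
    s = (k @ q) * scale
    lse = torch.logsumexp(s, dim=0)
    o = torch.softmax(s, dim=0) @ v
    return o, lse


# Partial attention over the prefix (cached KV) and the suffix (new KV).
po, pl = partial_attn(q, k[:N_PREFIX], v[:N_PREFIX])
so, sl = partial_attn(q, k[N_PREFIX:], v[N_PREFIX:])

# LSE rescaling merge (max-subtracted for numerical stability).
m = torch.maximum(pl, sl)
pw, sw = torch.exp(pl - m), torch.exp(sl - m)
merged = (pw * po + sw * so) / (pw + sw)

# Reference: one attention pass over all keys and values.
full = torch.softmax((k @ q) * scale, dim=0) @ v
assert torch.allclose(merged, full, atol=1e-5)
```

The weights `exp(lse - m)` recover each chunk's total unnormalized attention mass, so the weighted sum renormalizes the two partial softmaxes into the global one.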