vllm.v1.simple_kv_offload.cuda_mem_ops ¶
Low-level CUDA memory helpers: pinning and batch DMA transfers.
_resolve_batch_memcpy ¶
Resolve cuMemcpyBatchAsync via cuGetProcAddress and cache the resulting function pointer (the lookup happens only once).
Source code in vllm/v1/simple_kv_offload/cuda_mem_ops.py
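A minimal ctypes sketch of what such a one-time resolution could look like, assuming a CUDA 12.x driver that exports `cuGetProcAddress_v2`. The function name, the `12080` version argument, and the `functools.cache` memoization are illustrative assumptions, not vLLM's actual implementation:

```python
import ctypes
import functools

@functools.cache
def resolve_batch_memcpy() -> ctypes.c_void_p:
    """Resolve cuMemcpyBatchAsync from the CUDA driver once and cache it."""
    lib = ctypes.CDLL("libcuda.so.1")  # raises OSError if no driver is present
    fn_ptr = ctypes.c_void_p()
    status = ctypes.c_int()
    # CUDA 12+ driver entry point:
    #   cuGetProcAddress_v2(symbol, &pfn, cudaVersion, flags, &symbolStatus)
    rc = lib.cuGetProcAddress_v2(
        b"cuMemcpyBatchAsync",
        ctypes.byref(fn_ptr),
        ctypes.c_int(12080),     # driver version gate (assumed value)
        ctypes.c_uint64(0),      # CU_GET_PROC_ADDRESS_DEFAULT
        ctypes.byref(status),
    )
    if rc != 0 or not fn_ptr.value:
        raise RuntimeError("cuMemcpyBatchAsync is unavailable in this driver")
    return fn_ptr
```

Caching the pointer avoids repeating the dlopen and symbol lookup on every batched copy.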
copy_blocks ¶
copy_blocks(
src_block_ids: list[int],
dst_block_ids: list[int],
params: BatchMemcpyParams,
) -> None
Copy the given blocks in a single batched cuMemcpyBatchAsync call.
Source code in vllm/v1/simple_kv_offload/cuda_mem_ops.py
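A batched copy needs per-block source/destination addresses and sizes in flat arrays before the single driver call. The sketch below shows one plausible way to build them from block ids; the `BatchMemcpyParams` fields (`src_base`, `dst_base`, `block_nbytes`) and the helper names are assumptions for illustration, not the actual vLLM definitions:

```python
import ctypes
from dataclasses import dataclass

@dataclass
class BatchMemcpyParams:
    """Illustrative stand-in for the real params object."""
    src_base: int      # base address of the source block pool
    dst_base: int      # base address of the destination block pool
    block_nbytes: int  # bytes per KV block

def build_copy_lists(src_block_ids, dst_block_ids, params):
    """Turn block ids into flat (srcs, dsts, sizes) lists for one batch call."""
    assert len(src_block_ids) == len(dst_block_ids)
    srcs = [params.src_base + b * params.block_nbytes for b in src_block_ids]
    dsts = [params.dst_base + b * params.block_nbytes for b in dst_block_ids]
    sizes = [params.block_nbytes] * len(src_block_ids)
    return srcs, dsts, sizes

def to_ctypes_arrays(srcs, dsts, sizes):
    """Pack the lists into C arrays as a driver call would consume them."""
    n = len(sizes)
    return ((ctypes.c_void_p * n)(*srcs),
            (ctypes.c_void_p * n)(*dsts),
            (ctypes.c_size_t * n)(*sizes))
```

Submitting all blocks in one call amortizes the per-copy launch overhead that a loop of individual `cudaMemcpyAsync` calls would pay.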
pin_tensor ¶
pin_tensor(tensor: Tensor) -> None
Pin a CPU tensor via cudaHostRegister.
This bypasses PyTorch's CUDACachingHostAllocator, which rounds every pin_memory=True allocation up to the next power of two (e.g. a 100 GB request reserves 128 GB of pinned memory).
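Registering the tensor's existing pages pins exactly the bytes it occupies, with no rounding. A hedged sketch of the idea, using the real `cudaHostRegister(void*, size_t, unsigned int)` runtime signature; the tensor argument is duck-typed (anything with `data_ptr`/`numel`/`element_size`, e.g. a `torch.Tensor`) and the injectable `cudart` parameter is an assumption added so the sketch can run without a CUDA runtime:

```python
import ctypes

def pin_tensor(tensor, cudart=None) -> None:
    """Pin the exact pages backing a contiguous CPU tensor in place."""
    if cudart is None:
        cudart = ctypes.CDLL("libcudart.so")  # raises OSError without CUDA
    nbytes = tensor.numel() * tensor.element_size()
    rc = cudart.cudaHostRegister(
        ctypes.c_void_p(tensor.data_ptr()),
        ctypes.c_size_t(nbytes),
        ctypes.c_uint(0),  # cudaHostRegisterDefault: plain page pinning
    )
    if rc != 0:
        raise RuntimeError(f"cudaHostRegister failed with error code {rc}")
```

The matching `cudaHostUnregister` call would be needed before freeing the tensor, since the driver keeps the registered range locked until then.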