Skip to content

Add SYCL implementation of multi_layer_block_kv_transfer for Intel XPU#367

Draft
Copilot wants to merge 4 commits into
ww26_PR_sycl_mpfrom
copilot/implement-multi-layer-block-kv-transfer-sycl
Draft

Add SYCL implementation of multi_layer_block_kv_transfer for Intel XPU#367
Copilot wants to merge 4 commits into
ww26_PR_sycl_mpfrom
copilot/implement-multi-layer-block-kv-transfer-sycl

Conversation

Copilot AI commented Jun 12, 2026

Copy link
Copy Markdown

Ports the block-level multi-layer KV transfer kernel (multi_layer_block_kv_transfer) to SYCL so it runs on Intel XPU without any CUDA dependency.

New files

  • csrc/sycl/mp_mem_kernels_sycl.h — Standalone header with PageBufferShapeDesc (plain inline, no __host__ __device__), MemoryObj4<T>, and the function declaration. Compiles with icpx alone; deleting all CUDA sources still builds.

  • csrc/sycl/mp_mem_kernels_sycl.cpp — SYCL kernel implementation:

    • nd_range<3> grid (kv_size, total_blocks, nl) with flat 1D local range; [[sycl::reqd_sub_group_size(16)]]; WG size rounded to SG-16 multiple, capped at 256
    • Vectorization: int64_tint32_tint16_t selected by head_bytes alignment (no uint4)
    • Compile-time LMCACHE_TO_ENGINE and FORMAT template params eliminate runtime branches
    • Supports 5 NHD + MLA formats: NB_NL_TWO_BS_NH_HS, NL_X_TWO_NB_BS_NH_HS, NL_X_NB_TWO_BS_NH_HS, NL_X_NB_BS_HS, NL_X_NBBS_ONE_HS; unsupported formats throw std::runtime_error
  • tests/v1/test_mp_mem_kernels_sycl.py — D2H→H2D roundtrip and skip_prefix_n_blocks tests for all 5 formats on XPU

Modified files

  • csrc/sycl/pybind_sycl.cpp — Adds #include <pybind11/stl.h>, includes mp_mem_kernels_sycl.h, registers PageBufferShapeDesc class and multi_layer_block_kv_transfer in xpu_ops (matching the c_ops binding structure)
  • setup.py — Adds mp_mem_kernels_sycl.cpp to sycl_sources

Usage

import lmcache.xpu_ops as xpu_ops

shape_desc = xpu_ops.PageBufferShapeDesc()
shape_desc.kv_size = 2
shape_desc.nl, shape_desc.nb, shape_desc.bs = nl, nb, bs
shape_desc.nh, shape_desc.hs, shape_desc.element_size = nh, hs, 2
shape_desc.block_stride_elems = 0

xpu_ops.multi_layer_block_kv_transfer(
    paged_buffer_ptrs_tensor, lmcache_objects_ptrs,
    block_ids, device,
    xpu_ops.TransferDirection.D2H,
    shape_desc, lmcache_chunk_size,
    xpu_ops.GPUKVFormat.NL_X_TWO_NB_BS_NH_HS,
    skip_prefix_n_blocks=0,
)

Copilot AI changed the title [WIP] Implement SYCL version of multi_layer_block_kv_transfer Add SYCL implementation of multi_layer_block_kv_transfer for Intel XPU Jun 12, 2026
Copilot AI requested a review from hlin99 June 12, 2026 05:38

@hlin99 hlin99 left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-cross_layer] FAILED [ 10%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-normal] FAILED [ 20%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-flash_infer] FAILED [ 30%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-mla] FAILED [ 40%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-sglang_mla] FAILED [ 50%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-cross_layer] FAILED [ 60%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-normal] FAILED [ 70%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-flash_infer] FAILED [ 80%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-mla] FAILED [ 90%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-sglang_mla] FAILED [100%]

========================================================================================================== FAILURES ==========================================================================================================
______________________________________________________________________________________ test_block_transfer_roundtrip[bf16-cross_layer] _______________________________________________________________________________________

gpu_kv_format = <GPUKVFormat.NB_NL_TWO_BS_NH_HS: 0>, nl = 4, nh = 8, hs = 128, is_mla = False, dtype = torch.bfloat16

@pytest.mark.parametrize(
    "gpu_kv_format,nl,nh,hs,is_mla",
    FORMAT_PARAMS,
    ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_roundtrip(gpu_kv_format, nl, nh, hs, is_mla, dtype):
    """
    D2H → H2D roundtrip: data written via D2H must be recoverable via H2D.

    Uses disjoint source and target block IDs so the result is unambiguous.
    """
    device = torch.device("xpu")
    kv_dim = 1 if is_mla else 2
    hidden_dim = nh * hs

    source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
    target_vllm = create_zero_vllm_tensors(
        gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
    )
    mem_objects = create_memory_objects(
        kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
    )

    rng_d2h = random.Random(42)
    block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
    excluded = set(block_ids_d2h)
    available = [i for i in range(NB) if i not in excluded]
    rng_h2d = random.Random(123)
  block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/v1/test_mp_mem_kernels_sycl.py:218:


self = <random.Random object at 0x2e662100>, population = [1, 8, 9, 10, 15, 21, ...], k = 64

def sample(self, population, k, *, counts=None):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    Repeated elements can be specified one at a time or with the optional
    counts parameter.  For example:

        sample(['red', 'blue'], counts=[4, 2], k=5)

    is equivalent to:

        sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

    To choose a sample from a range of integers, use range() for the
    population argument.  This is especially fast and space efficient
    for sampling from a large population:

        sample(range(10000000), 60)

    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    # The number of calls to _randbelow() is kept at or near k, the
    # theoretical minimum.  This is important because running time
    # is dominated by _randbelow() and because it extracts the
    # least entropy from the underlying random number generators.

    # Memory requirements are kept to the smaller of a k-length
    # set or an n-length list.

    # There are other sampling algorithms that do not require
    # auxiliary memory, but they were rejected because they made
    # too many calls to _randbelow(), making them slower and
    # causing them to eat more entropy than necessary.

    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence.  "
                        "For dicts or sets, use sorted(d).")
    n = len(population)
    if counts is not None:
        cum_counts = list(_accumulate(counts))
        if len(cum_counts) != n:
            raise ValueError('The number of counts does not match the population')
        total = cum_counts.pop()
        if not isinstance(total, int):
            raise TypeError('Counts must be integers')
        if total <= 0:
            raise ValueError('Total of counts must be greater than zero')
        selections = self.sample(range(total), k=k)
        bisect = _bisect
        return [population[bisect(cum_counts, s)] for s in selections]
    randbelow = self._randbelow
    if not 0 <= k <= n:
      raise ValueError("Sample larger than population or is negative")

E ValueError: Sample larger than population or is negative

/usr/lib/python3.12/random.py:430: ValueError
_________________________________________________________________________________________ test_block_transfer_roundtrip[bf16-normal] _________________________________________________________________________________________

gpu_kv_format = <GPUKVFormat.NL_X_TWO_NB_BS_NH_HS: 1>, nl = 4, nh = 8, hs = 128, is_mla = False, dtype = torch.bfloat16

@pytest.mark.parametrize(
    "gpu_kv_format,nl,nh,hs,is_mla",
    FORMAT_PARAMS,
    ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_roundtrip(gpu_kv_format, nl, nh, hs, is_mla, dtype):
    """
    D2H → H2D roundtrip: data written via D2H must be recoverable via H2D.

    Uses disjoint source and target block IDs so the result is unambiguous.
    """
    device = torch.device("xpu")
    kv_dim = 1 if is_mla else 2
    hidden_dim = nh * hs

    source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
    target_vllm = create_zero_vllm_tensors(
        gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
    )
    mem_objects = create_memory_objects(
        kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
    )

    rng_d2h = random.Random(42)
    block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
    excluded = set(block_ids_d2h)
    available = [i for i in range(NB) if i not in excluded]
    rng_h2d = random.Random(123)
  block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/v1/test_mp_mem_kernels_sycl.py:218:


self = <random.Random object at 0x2e684800>, population = [1, 8, 9, 10, 15, 21, ...], k = 64

def sample(self, population, k, *, counts=None):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    Repeated elements can be specified one at a time or with the optional
    counts parameter.  For example:

        sample(['red', 'blue'], counts=[4, 2], k=5)

    is equivalent to:

        sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

    To choose a sample from a range of integers, use range() for the
    population argument.  This is especially fast and space efficient
    for sampling from a large population:

        sample(range(10000000), 60)

    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    # The number of calls to _randbelow() is kept at or near k, the
    # theoretical minimum.  This is important because running time
    # is dominated by _randbelow() and because it extracts the
    # least entropy from the underlying random number generators.

    # Memory requirements are kept to the smaller of a k-length
    # set or an n-length list.

    # There are other sampling algorithms that do not require
    # auxiliary memory, but they were rejected because they made
    # too many calls to _randbelow(), making them slower and
    # causing them to eat more entropy than necessary.

    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence.  "
                        "For dicts or sets, use sorted(d).")
    n = len(population)
    if counts is not None:
        cum_counts = list(_accumulate(counts))
        if len(cum_counts) != n:
            raise ValueError('The number of counts does not match the population')
        total = cum_counts.pop()
        if not isinstance(total, int):
            raise TypeError('Counts must be integers')
        if total <= 0:
            raise ValueError('Total of counts must be greater than zero')
        selections = self.sample(range(total), k=k)
        bisect = _bisect
        return [population[bisect(cum_counts, s)] for s in selections]
    randbelow = self._randbelow
    if not 0 <= k <= n:
      raise ValueError("Sample larger than population or is negative")

E ValueError: Sample larger than population or is negative

/usr/lib/python3.12/random.py:430: ValueError
______________________________________________________________________________________ test_block_transfer_roundtrip[bf16-flash_infer] _______________________________________________________________________________________

gpu_kv_format = <GPUKVFormat.NL_X_NB_TWO_BS_NH_HS: 2>, nl = 4, nh = 8, hs = 128, is_mla = False, dtype = torch.bfloat16

@pytest.mark.parametrize(
    "gpu_kv_format,nl,nh,hs,is_mla",
    FORMAT_PARAMS,
    ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_roundtrip(gpu_kv_format, nl, nh, hs, is_mla, dtype):
    """
    D2H → H2D roundtrip: data written via D2H must be recoverable via H2D.

    Uses disjoint source and target block IDs so the result is unambiguous.
    """
    device = torch.device("xpu")
    kv_dim = 1 if is_mla else 2
    hidden_dim = nh * hs

    source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
    target_vllm = create_zero_vllm_tensors(
        gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
    )
    mem_objects = create_memory_objects(
        kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
    )

    rng_d2h = random.Random(42)
    block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
    excluded = set(block_ids_d2h)
    available = [i for i in range(NB) if i not in excluded]
    rng_h2d = random.Random(123)
  block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/v1/test_mp_mem_kernels_sycl.py:218:


self = <random.Random object at 0x2e6a9c10>, population = [1, 8, 9, 10, 15, 21, ...], k = 64

def sample(self, population, k, *, counts=None):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    Repeated elements can be specified one at a time or with the optional
    counts parameter.  For example:

        sample(['red', 'blue'], counts=[4, 2], k=5)

    is equivalent to:

        sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

    To choose a sample from a range of integers, use range() for the
    population argument.  This is especially fast and space efficient
    for sampling from a large population:

        sample(range(10000000), 60)

    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    # The number of calls to _randbelow() is kept at or near k, the
    # theoretical minimum.  This is important because running time
    # is dominated by _randbelow() and because it extracts the
    # least entropy from the underlying random number generators.

    # Memory requirements are kept to the smaller of a k-length
    # set or an n-length list.

    # There are other sampling algorithms that do not require
    # auxiliary memory, but they were rejected because they made
    # too many calls to _randbelow(), making them slower and
    # causing them to eat more entropy than necessary.

    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence.  "
                        "For dicts or sets, use sorted(d).")
    n = len(population)
    if counts is not None:
        cum_counts = list(_accumulate(counts))
        if len(cum_counts) != n:
            raise ValueError('The number of counts does not match the population')
        total = cum_counts.pop()
        if not isinstance(total, int):
            raise TypeError('Counts must be integers')
        if total <= 0:
            raise ValueError('Total of counts must be greater than zero')
        selections = self.sample(range(total), k=k)
        bisect = _bisect
        return [population[bisect(cum_counts, s)] for s in selections]
    randbelow = self._randbelow
    if not 0 <= k <= n:
      raise ValueError("Sample larger than population or is negative")

E ValueError: Sample larger than population or is negative

/usr/lib/python3.12/random.py:430: ValueError
__________________________________________________________________________________________ test_block_transfer_roundtrip[bf16-mla] ___________________________________________________________________________________________

gpu_kv_format = <GPUKVFormat.NL_X_NB_BS_HS: 3>, nl = 4, nh = 1, hs = 576, is_mla = True, dtype = torch.bfloat16

@pytest.mark.parametrize(
    "gpu_kv_format,nl,nh,hs,is_mla",
    FORMAT_PARAMS,
    ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_roundtrip(gpu_kv_format, nl, nh, hs, is_mla, dtype):
    """
    D2H → H2D roundtrip: data written via D2H must be recoverable via H2D.

    Uses disjoint source and target block IDs so the result is unambiguous.
    """
    device = torch.device("xpu")
    kv_dim = 1 if is_mla else 2
    hidden_dim = nh * hs

    source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
    target_vllm = create_zero_vllm_tensors(
        gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
    )
    mem_objects = create_memory_objects(
        kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
    )

    rng_d2h = random.Random(42)
    block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
    excluded = set(block_ids_d2h)
    available = [i for i in range(NB) if i not in excluded]
    rng_h2d = random.Random(123)
  block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/v1/test_mp_mem_kernels_sycl.py:218:


self = <random.Random object at 0x2e6d93a0>, population = [1, 8, 9, 10, 15, 21, ...], k = 64

def sample(self, population, k, *, counts=None):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    Repeated elements can be specified one at a time or with the optional
    counts parameter.  For example:

        sample(['red', 'blue'], counts=[4, 2], k=5)

    is equivalent to:

        sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

    To choose a sample from a range of integers, use range() for the
    population argument.  This is especially fast and space efficient
    for sampling from a large population:

        sample(range(10000000), 60)

    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    # The number of calls to _randbelow() is kept at or near k, the
    # theoretical minimum.  This is important because running time
    # is dominated by _randbelow() and because it extracts the
    # least entropy from the underlying random number generators.

    # Memory requirements are kept to the smaller of a k-length
    # set or an n-length list.

    # There are other sampling algorithms that do not require
    # auxiliary memory, but they were rejected because they made
    # too many calls to _randbelow(), making them slower and
    # causing them to eat more entropy than necessary.

    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence.  "
                        "For dicts or sets, use sorted(d).")
    n = len(population)
    if counts is not None:
        cum_counts = list(_accumulate(counts))
        if len(cum_counts) != n:
            raise ValueError('The number of counts does not match the population')
        total = cum_counts.pop()
        if not isinstance(total, int):
            raise TypeError('Counts must be integers')
        if total <= 0:
            raise ValueError('Total of counts must be greater than zero')
        selections = self.sample(range(total), k=k)
        bisect = _bisect
        return [population[bisect(cum_counts, s)] for s in selections]
    randbelow = self._randbelow
    if not 0 <= k <= n:
      raise ValueError("Sample larger than population or is negative")

E ValueError: Sample larger than population or is negative

/usr/lib/python3.12/random.py:430: ValueError
_______________________________________________________________________________________ test_block_transfer_roundtrip[bf16-sglang_mla] _______________________________________________________________________________________

gpu_kv_format = <GPUKVFormat.NL_X_NBBS_ONE_HS: 5>, nl = 4, nh = 1, hs = 576, is_mla = True, dtype = torch.bfloat16

@pytest.mark.parametrize(
    "gpu_kv_format,nl,nh,hs,is_mla",
    FORMAT_PARAMS,
    ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_roundtrip(gpu_kv_format, nl, nh, hs, is_mla, dtype):
    """
    D2H → H2D roundtrip: data written via D2H must be recoverable via H2D.

    Uses disjoint source and target block IDs so the result is unambiguous.
    """
    device = torch.device("xpu")
    kv_dim = 1 if is_mla else 2
    hidden_dim = nh * hs

    source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
    target_vllm = create_zero_vllm_tensors(
        gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
    )
    mem_objects = create_memory_objects(
        kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
    )

    rng_d2h = random.Random(42)
    block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
    excluded = set(block_ids_d2h)
    available = [i for i in range(NB) if i not in excluded]
    rng_h2d = random.Random(123)
  block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/v1/test_mp_mem_kernels_sycl.py:218:


self = <random.Random object at 0x2e6b59f0>, population = [1, 8, 9, 10, 15, 21, ...], k = 64

def sample(self, population, k, *, counts=None):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    Repeated elements can be specified one at a time or with the optional
    counts parameter.  For example:

        sample(['red', 'blue'], counts=[4, 2], k=5)

    is equivalent to:

        sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

    To choose a sample from a range of integers, use range() for the
    population argument.  This is especially fast and space efficient
    for sampling from a large population:

        sample(range(10000000), 60)

    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    # The number of calls to _randbelow() is kept at or near k, the
    # theoretical minimum.  This is important because running time
    # is dominated by _randbelow() and because it extracts the
    # least entropy from the underlying random number generators.

    # Memory requirements are kept to the smaller of a k-length
    # set or an n-length list.

    # There are other sampling algorithms that do not require
    # auxiliary memory, but they were rejected because they made
    # too many calls to _randbelow(), making them slower and
    # causing them to eat more entropy than necessary.

    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence.  "
                        "For dicts or sets, use sorted(d).")
    n = len(population)
    if counts is not None:
        cum_counts = list(_accumulate(counts))
        if len(cum_counts) != n:
            raise ValueError('The number of counts does not match the population')
        total = cum_counts.pop()
        if not isinstance(total, int):
            raise TypeError('Counts must be integers')
        if total <= 0:
            raise ValueError('Total of counts must be greater than zero')
        selections = self.sample(range(total), k=k)
        bisect = _bisect
        return [population[bisect(cum_counts, s)] for s in selections]
    randbelow = self._randbelow
    if not 0 <= k <= n:
      raise ValueError("Sample larger than population or is negative")

E ValueError: Sample larger than population or is negative

/usr/lib/python3.12/random.py:430: ValueError
_____________________________________________________________________________________ test_block_transfer_skip_prefix[bf16-cross_layer] ______________________________________________________________________________________

gpu_kv_format = <GPUKVFormat.NB_NL_TWO_BS_NH_HS: 0>, nl = 4, nh = 8, hs = 128, is_mla = False, dtype = torch.bfloat16

@pytest.mark.parametrize(
    "gpu_kv_format,nl,nh,hs,is_mla",
    FORMAT_PARAMS,
    ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_skip_prefix(gpu_kv_format, nl, nh, hs, is_mla, dtype):
    """Verify skip_prefix_n_blocks skips the first N blocks globally."""
    device = torch.device("xpu")
    kv_dim = 1 if is_mla else 2
    hidden_dim = nh * hs
    skip = 4

    source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
    target_vllm = create_zero_vllm_tensors(
        gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
    )
    mem_objects = create_memory_objects(
        kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
    )

    rng_d2h = random.Random(42)
    block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
    excluded = set(block_ids_d2h)
    available = [i for i in range(NB) if i not in excluded]
    rng_h2d = random.Random(123)
  block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/v1/test_mp_mem_kernels_sycl.py:290:


self = <random.Random object at 0x2e6df070>, population = [1, 8, 9, 10, 15, 21, ...], k = 64

def sample(self, population, k, *, counts=None):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    Repeated elements can be specified one at a time or with the optional
    counts parameter.  For example:

        sample(['red', 'blue'], counts=[4, 2], k=5)

    is equivalent to:

        sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

    To choose a sample from a range of integers, use range() for the
    population argument.  This is especially fast and space efficient
    for sampling from a large population:

        sample(range(10000000), 60)

    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    # The number of calls to _randbelow() is kept at or near k, the
    # theoretical minimum.  This is important because running time
    # is dominated by _randbelow() and because it extracts the
    # least entropy from the underlying random number generators.

    # Memory requirements are kept to the smaller of a k-length
    # set or an n-length list.

    # There are other sampling algorithms that do not require
    # auxiliary memory, but they were rejected because they made
    # too many calls to _randbelow(), making them slower and
    # causing them to eat more entropy than necessary.

    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence.  "
                        "For dicts or sets, use sorted(d).")
    n = len(population)
    if counts is not None:
        cum_counts = list(_accumulate(counts))
        if len(cum_counts) != n:
            raise ValueError('The number of counts does not match the population')
        total = cum_counts.pop()
        if not isinstance(total, int):
            raise TypeError('Counts must be integers')
        if total <= 0:
            raise ValueError('Total of counts must be greater than zero')
        selections = self.sample(range(total), k=k)
        bisect = _bisect
        return [population[bisect(cum_counts, s)] for s in selections]
    randbelow = self._randbelow
    if not 0 <= k <= n:
      raise ValueError("Sample larger than population or is negative")

E ValueError: Sample larger than population or is negative

/usr/lib/python3.12/random.py:430: ValueError
________________________________________________________________________________________ test_block_transfer_skip_prefix[bf16-normal] ________________________________________________________________________________________

gpu_kv_format = <GPUKVFormat.NL_X_TWO_NB_BS_NH_HS: 1>, nl = 4, nh = 8, hs = 128, is_mla = False, dtype = torch.bfloat16

@pytest.mark.parametrize(
    "gpu_kv_format,nl,nh,hs,is_mla",
    FORMAT_PARAMS,
    ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_skip_prefix(gpu_kv_format, nl, nh, hs, is_mla, dtype):
    """Verify skip_prefix_n_blocks skips the first N blocks globally."""
    device = torch.device("xpu")
    kv_dim = 1 if is_mla else 2
    hidden_dim = nh * hs
    skip = 4

    source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
    target_vllm = create_zero_vllm_tensors(
        gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
    )
    mem_objects = create_memory_objects(
        kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
    )

    rng_d2h = random.Random(42)
    block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
    excluded = set(block_ids_d2h)
    available = [i for i in range(NB) if i not in excluded]
    rng_h2d = random.Random(123)
  block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/v1/test_mp_mem_kernels_sycl.py:290:


self = <random.Random object at 0x2e6835f0>, population = [1, 8, 9, 10, 15, 21, ...], k = 64

def sample(self, population, k, *, counts=None):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    Repeated elements can be specified one at a time or with the optional
    counts parameter.  For example:

        sample(['red', 'blue'], counts=[4, 2], k=5)

    is equivalent to:

        sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

    To choose a sample from a range of integers, use range() for the
    population argument.  This is especially fast and space efficient
    for sampling from a large population:

        sample(range(10000000), 60)

    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    # The number of calls to _randbelow() is kept at or near k, the
    # theoretical minimum.  This is important because running time
    # is dominated by _randbelow() and because it extracts the
    # least entropy from the underlying random number generators.

    # Memory requirements are kept to the smaller of a k-length
    # set or an n-length list.

    # There are other sampling algorithms that do not require
    # auxiliary memory, but they were rejected because they made
    # too many calls to _randbelow(), making them slower and
    # causing them to eat more entropy than necessary.

    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence.  "
                        "For dicts or sets, use sorted(d).")
    n = len(population)
    if counts is not None:
        cum_counts = list(_accumulate(counts))
        if len(cum_counts) != n:
            raise ValueError('The number of counts does not match the population')
        total = cum_counts.pop()
        if not isinstance(total, int):
            raise TypeError('Counts must be integers')
        if total <= 0:
            raise ValueError('Total of counts must be greater than zero')
        selections = self.sample(range(total), k=k)
        bisect = _bisect
        return [population[bisect(cum_counts, s)] for s in selections]
    randbelow = self._randbelow
    if not 0 <= k <= n:
      raise ValueError("Sample larger than population or is negative")

E ValueError: Sample larger than population or is negative

/usr/lib/python3.12/random.py:430: ValueError
_____________________________________________________________________________________ test_block_transfer_skip_prefix[bf16-flash_infer] ______________________________________________________________________________________

gpu_kv_format = <GPUKVFormat.NL_X_NB_TWO_BS_NH_HS: 2>, nl = 4, nh = 8, hs = 128, is_mla = False, dtype = torch.bfloat16

@pytest.mark.parametrize(
    "gpu_kv_format,nl,nh,hs,is_mla",
    FORMAT_PARAMS,
    ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_skip_prefix(gpu_kv_format, nl, nh, hs, is_mla, dtype):
    """Verify skip_prefix_n_blocks skips the first N blocks globally."""
    device = torch.device("xpu")
    kv_dim = 1 if is_mla else 2
    hidden_dim = nh * hs
    skip = 4

    source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
    target_vllm = create_zero_vllm_tensors(
        gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
    )
    mem_objects = create_memory_objects(
        kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
    )

    rng_d2h = random.Random(42)
    block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
    excluded = set(block_ids_d2h)
    available = [i for i in range(NB) if i not in excluded]
    rng_h2d = random.Random(123)
  block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/v1/test_mp_mem_kernels_sycl.py:290:


self = <random.Random object at 0x2e6b5100>, population = [1, 8, 9, 10, 15, 21, ...], k = 64

def sample(self, population, k, *, counts=None):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    Repeated elements can be specified one at a time or with the optional
    counts parameter.  For example:

        sample(['red', 'blue'], counts=[4, 2], k=5)

    is equivalent to:

        sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

    To choose a sample from a range of integers, use range() for the
    population argument.  This is especially fast and space efficient
    for sampling from a large population:

        sample(range(10000000), 60)

    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    # The number of calls to _randbelow() is kept at or near k, the
    # theoretical minimum.  This is important because running time
    # is dominated by _randbelow() and because it extracts the
    # least entropy from the underlying random number generators.

    # Memory requirements are kept to the smaller of a k-length
    # set or an n-length list.

    # There are other sampling algorithms that do not require
    # auxiliary memory, but they were rejected because they made
    # too many calls to _randbelow(), making them slower and
    # causing them to eat more entropy than necessary.

    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence.  "
                        "For dicts or sets, use sorted(d).")
    n = len(population)
    if counts is not None:
        cum_counts = list(_accumulate(counts))
        if len(cum_counts) != n:
            raise ValueError('The number of counts does not match the population')
        total = cum_counts.pop()
        if not isinstance(total, int):
            raise TypeError('Counts must be integers')
        if total <= 0:
            raise ValueError('Total of counts must be greater than zero')
        selections = self.sample(range(total), k=k)
        bisect = _bisect
        return [population[bisect(cum_counts, s)] for s in selections]
    randbelow = self._randbelow
    if not 0 <= k <= n:
      raise ValueError("Sample larger than population or is negative")

E ValueError: Sample larger than population or is negative

/usr/lib/python3.12/random.py:430: ValueError
_________________________________________________________________________________________ test_block_transfer_skip_prefix[bf16-mla] __________________________________________________________________________________________

gpu_kv_format = <GPUKVFormat.NL_X_NB_BS_HS: 3>, nl = 4, nh = 1, hs = 576, is_mla = True, dtype = torch.bfloat16

@pytest.mark.parametrize(
    "gpu_kv_format,nl,nh,hs,is_mla",
    FORMAT_PARAMS,
    ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_skip_prefix(gpu_kv_format, nl, nh, hs, is_mla, dtype):
    """Verify skip_prefix_n_blocks skips the first N blocks globally."""
    device = torch.device("xpu")
    kv_dim = 1 if is_mla else 2
    hidden_dim = nh * hs
    skip = 4

    source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
    target_vllm = create_zero_vllm_tensors(
        gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
    )
    mem_objects = create_memory_objects(
        kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
    )

    rng_d2h = random.Random(42)
    block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
    excluded = set(block_ids_d2h)
    available = [i for i in range(NB) if i not in excluded]
    rng_h2d = random.Random(123)
  block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/v1/test_mp_mem_kernels_sycl.py:290:


self = <random.Random object at 0x2e6d7b30>, population = [1, 8, 9, 10, 15, 21, ...], k = 64

def sample(self, population, k, *, counts=None):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    Repeated elements can be specified one at a time or with the optional
    counts parameter.  For example:

        sample(['red', 'blue'], counts=[4, 2], k=5)

    is equivalent to:

        sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

    To choose a sample from a range of integers, use range() for the
    population argument.  This is especially fast and space efficient
    for sampling from a large population:

        sample(range(10000000), 60)

    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    # The number of calls to _randbelow() is kept at or near k, the
    # theoretical minimum.  This is important because running time
    # is dominated by _randbelow() and because it extracts the
    # least entropy from the underlying random number generators.

    # Memory requirements are kept to the smaller of a k-length
    # set or an n-length list.

    # There are other sampling algorithms that do not require
    # auxiliary memory, but they were rejected because they made
    # too many calls to _randbelow(), making them slower and
    # causing them to eat more entropy than necessary.

    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence.  "
                        "For dicts or sets, use sorted(d).")
    n = len(population)
    if counts is not None:
        cum_counts = list(_accumulate(counts))
        if len(cum_counts) != n:
            raise ValueError('The number of counts does not match the population')
        total = cum_counts.pop()
        if not isinstance(total, int):
            raise TypeError('Counts must be integers')
        if total <= 0:
            raise ValueError('Total of counts must be greater than zero')
        selections = self.sample(range(total), k=k)
        bisect = _bisect
        return [population[bisect(cum_counts, s)] for s in selections]
    randbelow = self._randbelow
    if not 0 <= k <= n:
      raise ValueError("Sample larger than population or is negative")

E ValueError: Sample larger than population or is negative

/usr/lib/python3.12/random.py:430: ValueError
______________________________________________________________________________________ test_block_transfer_skip_prefix[bf16-sglang_mla] ______________________________________________________________________________________

gpu_kv_format = <GPUKVFormat.NL_X_NBBS_ONE_HS: 5>, nl = 4, nh = 1, hs = 576, is_mla = True, dtype = torch.bfloat16

@pytest.mark.parametrize(
    "gpu_kv_format,nl,nh,hs,is_mla",
    FORMAT_PARAMS,
    ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_skip_prefix(gpu_kv_format, nl, nh, hs, is_mla, dtype):
    """Verify skip_prefix_n_blocks skips the first N blocks globally."""
    device = torch.device("xpu")
    kv_dim = 1 if is_mla else 2
    hidden_dim = nh * hs
    skip = 4

    source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
    target_vllm = create_zero_vllm_tensors(
        gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
    )
    mem_objects = create_memory_objects(
        kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
    )

    rng_d2h = random.Random(42)
    block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
    excluded = set(block_ids_d2h)
    available = [i for i in range(NB) if i not in excluded]
    rng_h2d = random.Random(123)
  block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/v1/test_mp_mem_kernels_sycl.py:290:


self = <random.Random object at 0x2e6e0c80>, population = [1, 8, 9, 10, 15, 21, ...], k = 64

def sample(self, population, k, *, counts=None):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    Repeated elements can be specified one at a time or with the optional
    counts parameter.  For example:

        sample(['red', 'blue'], counts=[4, 2], k=5)

    is equivalent to:

        sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

    To choose a sample from a range of integers, use range() for the
    population argument.  This is especially fast and space efficient
    for sampling from a large population:

        sample(range(10000000), 60)

    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    # The number of calls to _randbelow() is kept at or near k, the
    # theoretical minimum.  This is important because running time
    # is dominated by _randbelow() and because it extracts the
    # least entropy from the underlying random number generators.

    # Memory requirements are kept to the smaller of a k-length
    # set or an n-length list.

    # There are other sampling algorithms that do not require
    # auxiliary memory, but they were rejected because they made
    # too many calls to _randbelow(), making them slower and
    # causing them to eat more entropy than necessary.

    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence.  "
                        "For dicts or sets, use sorted(d).")
    n = len(population)
    if counts is not None:
        cum_counts = list(_accumulate(counts))
        if len(cum_counts) != n:
            raise ValueError('The number of counts does not match the population')
        total = cum_counts.pop()
        if not isinstance(total, int):
            raise TypeError('Counts must be integers')
        if total <= 0:
            raise ValueError('Total of counts must be greater than zero')
        selections = self.sample(range(total), k=k)
        bisect = _bisect
        return [population[bisect(cum_counts, s)] for s in selections]
    randbelow = self._randbelow
    if not 0 <= k <= n:
      raise ValueError("Sample larger than population or is negative")

E ValueError: Sample larger than population or is negative

/usr/lib/python3.12/random.py:430: ValueError
====================================================================================================== warnings summary ======================================================================================================
../../../opt/venv/lib/python3.12/site-packages/opentelemetry/sdk/_logs/_internal/init.py:531: 50 warnings
/opt/venv/lib/python3.12/site-packages/opentelemetry/sdk/_logs/_internal/init.py:531: DeprecationWarning: LoggingHandler in opentelemetry-sdk is deprecated. Use the handler from opentelemetry-instrumentation-logging instead.
warnings.warn(

:488
:488: DeprecationWarning: builtin type SwigPyPacked has no module attribute

:488
:488: DeprecationWarning: builtin type SwigPyObject has no module attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================================================================================== short test summary info ===================================================================================================
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-cross_layer] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-normal] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-flash_infer] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-mla] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-sglang_mla] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-cross_layer] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-normal] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-flash_infer] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-mla] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-sglang_mla] - ValueError: Sample larger than population or is negative
============================================================================================== 10 failed, 52 warnings in 1.55s ===============================================================================================
root@b60-8-hq-1:~/hlin76/LMCache#
@copilot

Copilot AI commented Jun 15, 2026

Copy link
Copy Markdown
Author

tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-cross_layer] FAILED ...

Fixed in commit b1d9cc7. The root cause was a ValueError: Sample larger than population at:

block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)  # line 218

With NB=100 and TOTAL_BLOCKS=64, sampling 64 D2H block IDs from range(100) leaves only 100 - 64 = 36 blocks in available, but we then tried to sample 64 H2D block IDs from those 36 — impossible. Changed NB from 100 to 200 so available always contains 200 - 64 = 136 ≥ 64 entries.

Copilot AI requested a review from hlin99 June 15, 2026 02:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants