Add SYCL implementation of multi_layer_block_kv_transfer for Intel XPU by Copilot · Pull Request #367 · hlin99/LMCache

Copilot · 2026-06-12T05:25:25Z

Ports the block-level multi-layer KV transfer kernel (multi_layer_block_kv_transfer) to SYCL so it runs on Intel XPU without any CUDA dependency.

New files

csrc/sycl/mp_mem_kernels_sycl.h — Standalone header with PageBufferShapeDesc (plain inline, no __host__ __device__), MemoryObj4<T>, and the function declaration. Compiles with icpx alone; deleting all CUDA sources still builds.
csrc/sycl/mp_mem_kernels_sycl.cpp — SYCL kernel implementation:
- nd_range<3> grid (kv_size, total_blocks, nl) with flat 1D local range; [[sycl::reqd_sub_group_size(16)]]; WG size rounded to SG-16 multiple, capped at 256
- Vectorization: int64_t → int32_t → int16_t selected by head_bytes alignment (no uint4)
- Compile-time LMCACHE_TO_ENGINE and FORMAT template params eliminate runtime branches
- Supports 5 NHD + MLA formats: NB_NL_TWO_BS_NH_HS, NL_X_TWO_NB_BS_NH_HS, NL_X_NB_TWO_BS_NH_HS, NL_X_NB_BS_HS, NL_X_NBBS_ONE_HS; unsupported formats throw std::runtime_error
tests/v1/test_mp_mem_kernels_sycl.py — D2H→H2D roundtrip and skip_prefix_n_blocks tests for all 5 formats on XPU

Modified files

csrc/sycl/pybind_sycl.cpp — Adds #include <pybind11/stl.h>, includes mp_mem_kernels_sycl.h, registers PageBufferShapeDesc class and multi_layer_block_kv_transfer in xpu_ops (matching the c_ops binding structure)
setup.py — Adds mp_mem_kernels_sycl.cpp to sycl_sources

Usage

import lmcache.xpu_ops as xpu_ops

shape_desc = xpu_ops.PageBufferShapeDesc()
shape_desc.kv_size = 2
shape_desc.nl, shape_desc.nb, shape_desc.bs = nl, nb, bs
shape_desc.nh, shape_desc.hs, shape_desc.element_size = nh, hs, 2
shape_desc.block_stride_elems = 0

xpu_ops.multi_layer_block_kv_transfer(
    paged_buffer_ptrs_tensor, lmcache_objects_ptrs,
    block_ids, device,
    xpu_ops.TransferDirection.D2H,
    shape_desc, lmcache_chunk_size,
    xpu_ops.GPUKVFormat.NL_X_TWO_NB_BS_NH_HS,
    skip_prefix_n_blocks=0,
)

hlin99

tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-cross_layer] FAILED [ 10%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-normal] FAILED [ 20%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-flash_infer] FAILED [ 30%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-mla] FAILED [ 40%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-sglang_mla] FAILED [ 50%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-cross_layer] FAILED [ 60%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-normal] FAILED [ 70%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-flash_infer] FAILED [ 80%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-mla] FAILED [ 90%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-sglang_mla] FAILED [100%]

========================================================================================================== FAILURES ==========================================================================================================
______________________________________________________________________________________ test_block_transfer_roundtrip[bf16-cross_layer] _______________________________________________________________________________________

gpu_kv_format = <GPUKVFormat.NB_NL_TWO_BS_NH_HS: 0>, nl = 4, nh = 8, hs = 128, is_mla = False, dtype = torch.bfloat16

@pytest.mark.parametrize(
    "gpu_kv_format,nl,nh,hs,is_mla",
    FORMAT_PARAMS,
    ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_roundtrip(gpu_kv_format, nl, nh, hs, is_mla, dtype):
    """
    D2H → H2D roundtrip: data written via D2H must be recoverable via H2D.

    Uses disjoint source and target block IDs so the result is unambiguous.
    """
    device = torch.device("xpu")
    kv_dim = 1 if is_mla else 2
    hidden_dim = nh * hs

    source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
    target_vllm = create_zero_vllm_tensors(
        gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
    )
    mem_objects = create_memory_objects(
        kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
    )

    rng_d2h = random.Random(42)
    block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
    excluded = set(block_ids_d2h)
    available = [i for i in range(NB) if i not in excluded]
    rng_h2d = random.Random(123)

  block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/v1/test_mp_mem_kernels_sycl.py:218:

self = <random.Random object at 0x2e662100>, population = [1, 8, 9, 10, 15, 21, ...], k = 64

def sample(self, population, k, *, counts=None):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    Repeated elements can be specified one at a time or with the optional
    counts parameter.  For example:

        sample(['red', 'blue'], counts=[4, 2], k=5)

    is equivalent to:

        sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

    To choose a sample from a range of integers, use range() for the
    population argument.  This is especially fast and space efficient
    for sampling from a large population:

        sample(range(10000000), 60)

    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    # The number of calls to _randbelow() is kept at or near k, the
    # theoretical minimum.  This is important because running time
    # is dominated by _randbelow() and because it extracts the
    # least entropy from the underlying random number generators.

    # Memory requirements are kept to the smaller of a k-length
    # set or an n-length list.

    # There are other sampling algorithms that do not require
    # auxiliary memory, but they were rejected because they made
    # too many calls to _randbelow(), making them slower and
    # causing them to eat more entropy than necessary.

    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence.  "
                        "For dicts or sets, use sorted(d).")
    n = len(population)
    if counts is not None:
        cum_counts = list(_accumulate(counts))
        if len(cum_counts) != n:
            raise ValueError('The number of counts does not match the population')
        total = cum_counts.pop()
        if not isinstance(total, int):
            raise TypeError('Counts must be integers')
        if total <= 0:
            raise ValueError('Total of counts must be greater than zero')
        selections = self.sample(range(total), k=k)
        bisect = _bisect
        return [population[bisect(cum_counts, s)] for s in selections]
    randbelow = self._randbelow
    if not 0 <= k <= n:

      raise ValueError("Sample larger than population or is negative")

E ValueError: Sample larger than population or is negative

/usr/lib/python3.12/random.py:430: ValueError
_________________________________________________________________________________________ test_block_transfer_roundtrip[bf16-normal] _________________________________________________________________________________________

gpu_kv_format = <GPUKVFormat.NL_X_TWO_NB_BS_NH_HS: 1>, nl = 4, nh = 8, hs = 128, is_mla = False, dtype = torch.bfloat16

@pytest.mark.parametrize(
    "gpu_kv_format,nl,nh,hs,is_mla",
    FORMAT_PARAMS,
    ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_roundtrip(gpu_kv_format, nl, nh, hs, is_mla, dtype):
    """
    D2H → H2D roundtrip: data written via D2H must be recoverable via H2D.

    Uses disjoint source and target block IDs so the result is unambiguous.
    """
    device = torch.device("xpu")
    kv_dim = 1 if is_mla else 2
    hidden_dim = nh * hs

    source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
    target_vllm = create_zero_vllm_tensors(
        gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
    )
    mem_objects = create_memory_objects(
        kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
    )

    rng_d2h = random.Random(42)
    block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
    excluded = set(block_ids_d2h)
    available = [i for i in range(NB) if i not in excluded]
    rng_h2d = random.Random(123)

  block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/v1/test_mp_mem_kernels_sycl.py:218:

self = <random.Random object at 0x2e684800>, population = [1, 8, 9, 10, 15, 21, ...], k = 64

def sample(self, population, k, *, counts=None):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    Repeated elements can be specified one at a time or with the optional
    counts parameter.  For example:

        sample(['red', 'blue'], counts=[4, 2], k=5)

    is equivalent to:

        sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

    To choose a sample from a range of integers, use range() for the
    population argument.  This is especially fast and space efficient
    for sampling from a large population:

        sample(range(10000000), 60)

    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    # The number of calls to _randbelow() is kept at or near k, the
    # theoretical minimum.  This is important because running time
    # is dominated by _randbelow() and because it extracts the
    # least entropy from the underlying random number generators.

    # Memory requirements are kept to the smaller of a k-length
    # set or an n-length list.

    # There are other sampling algorithms that do not require
    # auxiliary memory, but they were rejected because they made
    # too many calls to _randbelow(), making them slower and
    # causing them to eat more entropy than necessary.

    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence.  "
                        "For dicts or sets, use sorted(d).")
    n = len(population)
    if counts is not None:
        cum_counts = list(_accumulate(counts))
        if len(cum_counts) != n:
            raise ValueError('The number of counts does not match the population')
        total = cum_counts.pop()
        if not isinstance(total, int):
            raise TypeError('Counts must be integers')
        if total <= 0:
            raise ValueError('Total of counts must be greater than zero')
        selections = self.sample(range(total), k=k)
        bisect = _bisect
        return [population[bisect(cum_counts, s)] for s in selections]
    randbelow = self._randbelow
    if not 0 <= k <= n:

      raise ValueError("Sample larger than population or is negative")

E ValueError: Sample larger than population or is negative

/usr/lib/python3.12/random.py:430: ValueError
______________________________________________________________________________________ test_block_transfer_roundtrip[bf16-flash_infer] _______________________________________________________________________________________

gpu_kv_format = <GPUKVFormat.NL_X_NB_TWO_BS_NH_HS: 2>, nl = 4, nh = 8, hs = 128, is_mla = False, dtype = torch.bfloat16

@pytest.mark.parametrize(
    "gpu_kv_format,nl,nh,hs,is_mla",
    FORMAT_PARAMS,
    ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_roundtrip(gpu_kv_format, nl, nh, hs, is_mla, dtype):
    """
    D2H → H2D roundtrip: data written via D2H must be recoverable via H2D.

    Uses disjoint source and target block IDs so the result is unambiguous.
    """
    device = torch.device("xpu")
    kv_dim = 1 if is_mla else 2
    hidden_dim = nh * hs

    source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
    target_vllm = create_zero_vllm_tensors(
        gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
    )
    mem_objects = create_memory_objects(
        kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
    )

    rng_d2h = random.Random(42)
    block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
    excluded = set(block_ids_d2h)
    available = [i for i in range(NB) if i not in excluded]
    rng_h2d = random.Random(123)

  block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/v1/test_mp_mem_kernels_sycl.py:218:

self = <random.Random object at 0x2e6a9c10>, population = [1, 8, 9, 10, 15, 21, ...], k = 64

def sample(self, population, k, *, counts=None):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    Repeated elements can be specified one at a time or with the optional
    counts parameter.  For example:

        sample(['red', 'blue'], counts=[4, 2], k=5)

    is equivalent to:

        sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

    To choose a sample from a range of integers, use range() for the
    population argument.  This is especially fast and space efficient
    for sampling from a large population:

        sample(range(10000000), 60)

    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    # The number of calls to _randbelow() is kept at or near k, the
    # theoretical minimum.  This is important because running time
    # is dominated by _randbelow() and because it extracts the
    # least entropy from the underlying random number generators.

    # Memory requirements are kept to the smaller of a k-length
    # set or an n-length list.

    # There are other sampling algorithms that do not require
    # auxiliary memory, but they were rejected because they made
    # too many calls to _randbelow(), making them slower and
    # causing them to eat more entropy than necessary.

    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence.  "
                        "For dicts or sets, use sorted(d).")
    n = len(population)
    if counts is not None:
        cum_counts = list(_accumulate(counts))
        if len(cum_counts) != n:
            raise ValueError('The number of counts does not match the population')
        total = cum_counts.pop()
        if not isinstance(total, int):
            raise TypeError('Counts must be integers')
        if total <= 0:
            raise ValueError('Total of counts must be greater than zero')
        selections = self.sample(range(total), k=k)
        bisect = _bisect
        return [population[bisect(cum_counts, s)] for s in selections]
    randbelow = self._randbelow
    if not 0 <= k <= n:

      raise ValueError("Sample larger than population or is negative")

E ValueError: Sample larger than population or is negative

/usr/lib/python3.12/random.py:430: ValueError
__________________________________________________________________________________________ test_block_transfer_roundtrip[bf16-mla] ___________________________________________________________________________________________

gpu_kv_format = <GPUKVFormat.NL_X_NB_BS_HS: 3>, nl = 4, nh = 1, hs = 576, is_mla = True, dtype = torch.bfloat16

@pytest.mark.parametrize(
    "gpu_kv_format,nl,nh,hs,is_mla",
    FORMAT_PARAMS,
    ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_roundtrip(gpu_kv_format, nl, nh, hs, is_mla, dtype):
    """
    D2H → H2D roundtrip: data written via D2H must be recoverable via H2D.

    Uses disjoint source and target block IDs so the result is unambiguous.
    """
    device = torch.device("xpu")
    kv_dim = 1 if is_mla else 2
    hidden_dim = nh * hs

    source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
    target_vllm = create_zero_vllm_tensors(
        gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
    )
    mem_objects = create_memory_objects(
        kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
    )

    rng_d2h = random.Random(42)
    block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
    excluded = set(block_ids_d2h)
    available = [i for i in range(NB) if i not in excluded]
    rng_h2d = random.Random(123)

  block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/v1/test_mp_mem_kernels_sycl.py:218:

self = <random.Random object at 0x2e6d93a0>, population = [1, 8, 9, 10, 15, 21, ...], k = 64

def sample(self, population, k, *, counts=None):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    Repeated elements can be specified one at a time or with the optional
    counts parameter.  For example:

        sample(['red', 'blue'], counts=[4, 2], k=5)

    is equivalent to:

        sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

    To choose a sample from a range of integers, use range() for the
    population argument.  This is especially fast and space efficient
    for sampling from a large population:

        sample(range(10000000), 60)

    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    # The number of calls to _randbelow() is kept at or near k, the
    # theoretical minimum.  This is important because running time
    # is dominated by _randbelow() and because it extracts the
    # least entropy from the underlying random number generators.

    # Memory requirements are kept to the smaller of a k-length
    # set or an n-length list.

    # There are other sampling algorithms that do not require
    # auxiliary memory, but they were rejected because they made
    # too many calls to _randbelow(), making them slower and
    # causing them to eat more entropy than necessary.

    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence.  "
                        "For dicts or sets, use sorted(d).")
    n = len(population)
    if counts is not None:
        cum_counts = list(_accumulate(counts))
        if len(cum_counts) != n:
            raise ValueError('The number of counts does not match the population')
        total = cum_counts.pop()
        if not isinstance(total, int):
            raise TypeError('Counts must be integers')
        if total <= 0:
            raise ValueError('Total of counts must be greater than zero')
        selections = self.sample(range(total), k=k)
        bisect = _bisect
        return [population[bisect(cum_counts, s)] for s in selections]
    randbelow = self._randbelow
    if not 0 <= k <= n:

      raise ValueError("Sample larger than population or is negative")

E ValueError: Sample larger than population or is negative

/usr/lib/python3.12/random.py:430: ValueError
_______________________________________________________________________________________ test_block_transfer_roundtrip[bf16-sglang_mla] _______________________________________________________________________________________

gpu_kv_format = <GPUKVFormat.NL_X_NBBS_ONE_HS: 5>, nl = 4, nh = 1, hs = 576, is_mla = True, dtype = torch.bfloat16

@pytest.mark.parametrize(
    "gpu_kv_format,nl,nh,hs,is_mla",
    FORMAT_PARAMS,
    ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_roundtrip(gpu_kv_format, nl, nh, hs, is_mla, dtype):
    """
    D2H → H2D roundtrip: data written via D2H must be recoverable via H2D.

    Uses disjoint source and target block IDs so the result is unambiguous.
    """
    device = torch.device("xpu")
    kv_dim = 1 if is_mla else 2
    hidden_dim = nh * hs

    source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
    target_vllm = create_zero_vllm_tensors(
        gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
    )
    mem_objects = create_memory_objects(
        kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
    )

    rng_d2h = random.Random(42)
    block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
    excluded = set(block_ids_d2h)
    available = [i for i in range(NB) if i not in excluded]
    rng_h2d = random.Random(123)

  block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/v1/test_mp_mem_kernels_sycl.py:218:

self = <random.Random object at 0x2e6b59f0>, population = [1, 8, 9, 10, 15, 21, ...], k = 64

def sample(self, population, k, *, counts=None):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    Repeated elements can be specified one at a time or with the optional
    counts parameter.  For example:

        sample(['red', 'blue'], counts=[4, 2], k=5)

    is equivalent to:

        sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

    To choose a sample from a range of integers, use range() for the
    population argument.  This is especially fast and space efficient
    for sampling from a large population:

        sample(range(10000000), 60)

    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    # The number of calls to _randbelow() is kept at or near k, the
    # theoretical minimum.  This is important because running time
    # is dominated by _randbelow() and because it extracts the
    # least entropy from the underlying random number generators.

    # Memory requirements are kept to the smaller of a k-length
    # set or an n-length list.

    # There are other sampling algorithms that do not require
    # auxiliary memory, but they were rejected because they made
    # too many calls to _randbelow(), making them slower and
    # causing them to eat more entropy than necessary.

    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence.  "
                        "For dicts or sets, use sorted(d).")
    n = len(population)
    if counts is not None:
        cum_counts = list(_accumulate(counts))
        if len(cum_counts) != n:
            raise ValueError('The number of counts does not match the population')
        total = cum_counts.pop()
        if not isinstance(total, int):
            raise TypeError('Counts must be integers')
        if total <= 0:
            raise ValueError('Total of counts must be greater than zero')
        selections = self.sample(range(total), k=k)
        bisect = _bisect
        return [population[bisect(cum_counts, s)] for s in selections]
    randbelow = self._randbelow
    if not 0 <= k <= n:

      raise ValueError("Sample larger than population or is negative")

E ValueError: Sample larger than population or is negative

/usr/lib/python3.12/random.py:430: ValueError
_____________________________________________________________________________________ test_block_transfer_skip_prefix[bf16-cross_layer] ______________________________________________________________________________________

gpu_kv_format = <GPUKVFormat.NB_NL_TWO_BS_NH_HS: 0>, nl = 4, nh = 8, hs = 128, is_mla = False, dtype = torch.bfloat16

@pytest.mark.parametrize(
    "gpu_kv_format,nl,nh,hs,is_mla",
    FORMAT_PARAMS,
    ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_skip_prefix(gpu_kv_format, nl, nh, hs, is_mla, dtype):
    """Verify skip_prefix_n_blocks skips the first N blocks globally."""
    device = torch.device("xpu")
    kv_dim = 1 if is_mla else 2
    hidden_dim = nh * hs
    skip = 4

    source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
    target_vllm = create_zero_vllm_tensors(
        gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
    )
    mem_objects = create_memory_objects(
        kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
    )

    rng_d2h = random.Random(42)
    block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
    excluded = set(block_ids_d2h)
    available = [i for i in range(NB) if i not in excluded]
    rng_h2d = random.Random(123)

  block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/v1/test_mp_mem_kernels_sycl.py:290:

self = <random.Random object at 0x2e6df070>, population = [1, 8, 9, 10, 15, 21, ...], k = 64

def sample(self, population, k, *, counts=None):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    Repeated elements can be specified one at a time or with the optional
    counts parameter.  For example:

        sample(['red', 'blue'], counts=[4, 2], k=5)

    is equivalent to:

        sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

    To choose a sample from a range of integers, use range() for the
    population argument.  This is especially fast and space efficient
    for sampling from a large population:

        sample(range(10000000), 60)

    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    # The number of calls to _randbelow() is kept at or near k, the
    # theoretical minimum.  This is important because running time
    # is dominated by _randbelow() and because it extracts the
    # least entropy from the underlying random number generators.

    # Memory requirements are kept to the smaller of a k-length
    # set or an n-length list.

    # There are other sampling algorithms that do not require
    # auxiliary memory, but they were rejected because they made
    # too many calls to _randbelow(), making them slower and
    # causing them to eat more entropy than necessary.

    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence.  "
                        "For dicts or sets, use sorted(d).")
    n = len(population)
    if counts is not None:
        cum_counts = list(_accumulate(counts))
        if len(cum_counts) != n:
            raise ValueError('The number of counts does not match the population')
        total = cum_counts.pop()
        if not isinstance(total, int):
            raise TypeError('Counts must be integers')
        if total <= 0:
            raise ValueError('Total of counts must be greater than zero')
        selections = self.sample(range(total), k=k)
        bisect = _bisect
        return [population[bisect(cum_counts, s)] for s in selections]
    randbelow = self._randbelow
    if not 0 <= k <= n:

      raise ValueError("Sample larger than population or is negative")

E ValueError: Sample larger than population or is negative

/usr/lib/python3.12/random.py:430: ValueError
________________________________________________________________________________________ test_block_transfer_skip_prefix[bf16-normal] ________________________________________________________________________________________

gpu_kv_format = <GPUKVFormat.NL_X_TWO_NB_BS_NH_HS: 1>, nl = 4, nh = 8, hs = 128, is_mla = False, dtype = torch.bfloat16

@pytest.mark.parametrize(
    "gpu_kv_format,nl,nh,hs,is_mla",
    FORMAT_PARAMS,
    ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_skip_prefix(gpu_kv_format, nl, nh, hs, is_mla, dtype):
    """Verify skip_prefix_n_blocks skips the first N blocks globally."""
    device = torch.device("xpu")
    kv_dim = 1 if is_mla else 2
    hidden_dim = nh * hs
    skip = 4

    source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
    target_vllm = create_zero_vllm_tensors(
        gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
    )
    mem_objects = create_memory_objects(
        kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
    )

    rng_d2h = random.Random(42)
    block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
    excluded = set(block_ids_d2h)
    available = [i for i in range(NB) if i not in excluded]
    rng_h2d = random.Random(123)

  block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/v1/test_mp_mem_kernels_sycl.py:290:

self = <random.Random object at 0x2e6835f0>, population = [1, 8, 9, 10, 15, 21, ...], k = 64

def sample(self, population, k, *, counts=None):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    Repeated elements can be specified one at a time or with the optional
    counts parameter.  For example:

        sample(['red', 'blue'], counts=[4, 2], k=5)

    is equivalent to:

        sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

    To choose a sample from a range of integers, use range() for the
    population argument.  This is especially fast and space efficient
    for sampling from a large population:

        sample(range(10000000), 60)

    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    # The number of calls to _randbelow() is kept at or near k, the
    # theoretical minimum.  This is important because running time
    # is dominated by _randbelow() and because it extracts the
    # least entropy from the underlying random number generators.

    # Memory requirements are kept to the smaller of a k-length
    # set or an n-length list.

    # There are other sampling algorithms that do not require
    # auxiliary memory, but they were rejected because they made
    # too many calls to _randbelow(), making them slower and
    # causing them to eat more entropy than necessary.

    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence.  "
                        "For dicts or sets, use sorted(d).")
    n = len(population)
    if counts is not None:
        cum_counts = list(_accumulate(counts))
        if len(cum_counts) != n:
            raise ValueError('The number of counts does not match the population')
        total = cum_counts.pop()
        if not isinstance(total, int):
            raise TypeError('Counts must be integers')
        if total <= 0:
            raise ValueError('Total of counts must be greater than zero')
        selections = self.sample(range(total), k=k)
        bisect = _bisect
        return [population[bisect(cum_counts, s)] for s in selections]
    randbelow = self._randbelow
    if not 0 <= k <= n:

      raise ValueError("Sample larger than population or is negative")

E ValueError: Sample larger than population or is negative

/usr/lib/python3.12/random.py:430: ValueError
_____________________________________________________________________________________ test_block_transfer_skip_prefix[bf16-flash_infer] ______________________________________________________________________________________

gpu_kv_format = <GPUKVFormat.NL_X_NB_TWO_BS_NH_HS: 2>, nl = 4, nh = 8, hs = 128, is_mla = False, dtype = torch.bfloat16

@pytest.mark.parametrize(
    "gpu_kv_format,nl,nh,hs,is_mla",
    FORMAT_PARAMS,
    ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_skip_prefix(gpu_kv_format, nl, nh, hs, is_mla, dtype):
    """Verify skip_prefix_n_blocks skips the first N blocks globally."""
    device = torch.device("xpu")
    kv_dim = 1 if is_mla else 2
    hidden_dim = nh * hs
    skip = 4

    source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
    target_vllm = create_zero_vllm_tensors(
        gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
    )
    mem_objects = create_memory_objects(
        kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
    )

    rng_d2h = random.Random(42)
    block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
    excluded = set(block_ids_d2h)
    available = [i for i in range(NB) if i not in excluded]
    rng_h2d = random.Random(123)

  block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/v1/test_mp_mem_kernels_sycl.py:290:

self = <random.Random object at 0x2e6b5100>, population = [1, 8, 9, 10, 15, 21, ...], k = 64

def sample(self, population, k, *, counts=None):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    Repeated elements can be specified one at a time or with the optional
    counts parameter.  For example:

        sample(['red', 'blue'], counts=[4, 2], k=5)

    is equivalent to:

        sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

    To choose a sample from a range of integers, use range() for the
    population argument.  This is especially fast and space efficient
    for sampling from a large population:

        sample(range(10000000), 60)

    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    # The number of calls to _randbelow() is kept at or near k, the
    # theoretical minimum.  This is important because running time
    # is dominated by _randbelow() and because it extracts the
    # least entropy from the underlying random number generators.

    # Memory requirements are kept to the smaller of a k-length
    # set or an n-length list.

    # There are other sampling algorithms that do not require
    # auxiliary memory, but they were rejected because they made
    # too many calls to _randbelow(), making them slower and
    # causing them to eat more entropy than necessary.

    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence.  "
                        "For dicts or sets, use sorted(d).")
    n = len(population)
    if counts is not None:
        cum_counts = list(_accumulate(counts))
        if len(cum_counts) != n:
            raise ValueError('The number of counts does not match the population')
        total = cum_counts.pop()
        if not isinstance(total, int):
            raise TypeError('Counts must be integers')
        if total <= 0:
            raise ValueError('Total of counts must be greater than zero')
        selections = self.sample(range(total), k=k)
        bisect = _bisect
        return [population[bisect(cum_counts, s)] for s in selections]
    randbelow = self._randbelow
    if not 0 <= k <= n:

      raise ValueError("Sample larger than population or is negative")

E ValueError: Sample larger than population or is negative

/usr/lib/python3.12/random.py:430: ValueError
_________________________________________________________________________________________ test_block_transfer_skip_prefix[bf16-mla] __________________________________________________________________________________________

gpu_kv_format = <GPUKVFormat.NL_X_NB_BS_HS: 3>, nl = 4, nh = 1, hs = 576, is_mla = True, dtype = torch.bfloat16

@pytest.mark.parametrize(
    "gpu_kv_format,nl,nh,hs,is_mla",
    FORMAT_PARAMS,
    ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_skip_prefix(gpu_kv_format, nl, nh, hs, is_mla, dtype):
    """Verify skip_prefix_n_blocks skips the first N blocks globally."""
    device = torch.device("xpu")
    kv_dim = 1 if is_mla else 2
    hidden_dim = nh * hs
    skip = 4

    source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
    target_vllm = create_zero_vllm_tensors(
        gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
    )
    mem_objects = create_memory_objects(
        kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
    )

    rng_d2h = random.Random(42)
    block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
    excluded = set(block_ids_d2h)
    available = [i for i in range(NB) if i not in excluded]
    rng_h2d = random.Random(123)

  block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/v1/test_mp_mem_kernels_sycl.py:290:

self = <random.Random object at 0x2e6d7b30>, population = [1, 8, 9, 10, 15, 21, ...], k = 64

def sample(self, population, k, *, counts=None):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    Repeated elements can be specified one at a time or with the optional
    counts parameter.  For example:

        sample(['red', 'blue'], counts=[4, 2], k=5)

    is equivalent to:

        sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

    To choose a sample from a range of integers, use range() for the
    population argument.  This is especially fast and space efficient
    for sampling from a large population:

        sample(range(10000000), 60)

    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    # The number of calls to _randbelow() is kept at or near k, the
    # theoretical minimum.  This is important because running time
    # is dominated by _randbelow() and because it extracts the
    # least entropy from the underlying random number generators.

    # Memory requirements are kept to the smaller of a k-length
    # set or an n-length list.

    # There are other sampling algorithms that do not require
    # auxiliary memory, but they were rejected because they made
    # too many calls to _randbelow(), making them slower and
    # causing them to eat more entropy than necessary.

    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence.  "
                        "For dicts or sets, use sorted(d).")
    n = len(population)
    if counts is not None:
        cum_counts = list(_accumulate(counts))
        if len(cum_counts) != n:
            raise ValueError('The number of counts does not match the population')
        total = cum_counts.pop()
        if not isinstance(total, int):
            raise TypeError('Counts must be integers')
        if total <= 0:
            raise ValueError('Total of counts must be greater than zero')
        selections = self.sample(range(total), k=k)
        bisect = _bisect
        return [population[bisect(cum_counts, s)] for s in selections]
    randbelow = self._randbelow
    if not 0 <= k <= n:

      raise ValueError("Sample larger than population or is negative")

E ValueError: Sample larger than population or is negative

/usr/lib/python3.12/random.py:430: ValueError
______________________________________________________________________________________ test_block_transfer_skip_prefix[bf16-sglang_mla] ______________________________________________________________________________________

gpu_kv_format = <GPUKVFormat.NL_X_NBBS_ONE_HS: 5>, nl = 4, nh = 1, hs = 576, is_mla = True, dtype = torch.bfloat16

@pytest.mark.parametrize(
    "gpu_kv_format,nl,nh,hs,is_mla",
    FORMAT_PARAMS,
    ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_skip_prefix(gpu_kv_format, nl, nh, hs, is_mla, dtype):
    """Verify skip_prefix_n_blocks skips the first N blocks globally."""
    device = torch.device("xpu")
    kv_dim = 1 if is_mla else 2
    hidden_dim = nh * hs
    skip = 4

    source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
    target_vllm = create_zero_vllm_tensors(
        gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
    )
    mem_objects = create_memory_objects(
        kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
    )

    rng_d2h = random.Random(42)
    block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
    excluded = set(block_ids_d2h)
    available = [i for i in range(NB) if i not in excluded]
    rng_h2d = random.Random(123)

  block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/v1/test_mp_mem_kernels_sycl.py:290:

self = <random.Random object at 0x2e6e0c80>, population = [1, 8, 9, 10, 15, 21, ...], k = 64

def sample(self, population, k, *, counts=None):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    Repeated elements can be specified one at a time or with the optional
    counts parameter.  For example:

        sample(['red', 'blue'], counts=[4, 2], k=5)

    is equivalent to:

        sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

    To choose a sample from a range of integers, use range() for the
    population argument.  This is especially fast and space efficient
    for sampling from a large population:

        sample(range(10000000), 60)

    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    # The number of calls to _randbelow() is kept at or near k, the
    # theoretical minimum.  This is important because running time
    # is dominated by _randbelow() and because it extracts the
    # least entropy from the underlying random number generators.

    # Memory requirements are kept to the smaller of a k-length
    # set or an n-length list.

    # There are other sampling algorithms that do not require
    # auxiliary memory, but they were rejected because they made
    # too many calls to _randbelow(), making them slower and
    # causing them to eat more entropy than necessary.

    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence.  "
                        "For dicts or sets, use sorted(d).")
    n = len(population)
    if counts is not None:
        cum_counts = list(_accumulate(counts))
        if len(cum_counts) != n:
            raise ValueError('The number of counts does not match the population')
        total = cum_counts.pop()
        if not isinstance(total, int):
            raise TypeError('Counts must be integers')
        if total <= 0:
            raise ValueError('Total of counts must be greater than zero')
        selections = self.sample(range(total), k=k)
        bisect = _bisect
        return [population[bisect(cum_counts, s)] for s in selections]
    randbelow = self._randbelow
    if not 0 <= k <= n:

      raise ValueError("Sample larger than population or is negative")

E ValueError: Sample larger than population or is negative

/usr/lib/python3.12/random.py:430: ValueError
====================================================================================================== warnings summary ======================================================================================================
../../../opt/venv/lib/python3.12/site-packages/opentelemetry/sdk/_logs/_internal/init.py:531: 50 warnings
/opt/venv/lib/python3.12/site-packages/opentelemetry/sdk/_logs/_internal/init.py:531: DeprecationWarning: LoggingHandler in opentelemetry-sdk is deprecated. Use the handler from opentelemetry-instrumentation-logging instead.
warnings.warn(

:488
:488: DeprecationWarning: builtin type SwigPyPacked has no module attribute

:488
:488: DeprecationWarning: builtin type SwigPyObject has no module attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================================================================================== short test summary info ===================================================================================================
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-cross_layer] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-normal] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-flash_infer] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-mla] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-sglang_mla] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-cross_layer] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-normal] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-flash_infer] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-mla] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-sglang_mla] - ValueError: Sample larger than population or is negative
============================================================================================== 10 failed, 52 warnings in 1.55s ===============================================================================================
root@b60-8-hq-1:~/hlin76/LMCache#
@copilot

Copilot · 2026-06-15T02:28:02Z

tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-cross_layer] FAILED ...

Fixed in commit b1d9cc7. The root cause was a ValueError: Sample larger than population at:

block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)  # line 218

With NB=100 and TOTAL_BLOCKS=64, sampling 64 D2H block IDs from range(100) leaves only 100 - 64 = 36 blocks in available, but we then tried to sample 64 H2D block IDs from those 36 — impossible. Changed NB from 100 to 200 so available always contains 200 - 64 = 136 ≥ 64 entries.

Initial plan

780ce1d

Copilot AI assigned Copilot and hlin99 Jun 12, 2026

Copilot started work on behalf of hlin99 June 12, 2026 05:25 View session

Copilot AI added 2 commits June 12, 2026 05:34

Implement SYCL multi_layer_block_kv_transfer kernel

8124b89

Fix test helper: create tensors directly with target dtype

997a38d

Copilot AI changed the title ~~[WIP] Implement SYCL version of multi_layer_block_kv_transfer~~ Add SYCL implementation of multi_layer_block_kv_transfer for Intel XPU Jun 12, 2026

Copilot finished work on behalf of hlin99 June 12, 2026 05:38

Copilot AI requested a review from hlin99 June 12, 2026 05:38

hlin99 requested changes Jun 15, 2026

View reviewed changes

Copilot started work on behalf of hlin99 June 15, 2026 02:16 View session

Fix test: increase NB to 200 so disjoint D2H/H2D block ID sets fit

e9052be

Copilot finished work on behalf of hlin99 June 15, 2026 02:28

Copilot AI requested a review from hlin99 June 15, 2026 02:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add SYCL implementation of multi_layer_block_kv_transfer for Intel XPU#367

Add SYCL implementation of multi_layer_block_kv_transfer for Intel XPU#367
Copilot wants to merge 4 commits into
ww26_PR_sycl_mpfrom
copilot/implement-multi-layer-block-kv-transfer-sycl

Copilot AI commented Jun 12, 2026 •

edited

Loading

Uh oh!

hlin99 left a comment

Uh oh!

Copilot AI commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Copilot AI commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

New files

Modified files

Usage

Uh oh!

hlin99 left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Jun 12, 2026 •

edited

Loading