Add SYCL implementation of multi_layer_block_kv_transfer for Intel XPU#367
Add SYCL implementation of multi_layer_block_kv_transfer for Intel XPU#367Copilot wants to merge 4 commits into
Conversation
hlin99
left a comment
There was a problem hiding this comment.
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-cross_layer] FAILED [ 10%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-normal] FAILED [ 20%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-flash_infer] FAILED [ 30%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-mla] FAILED [ 40%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-sglang_mla] FAILED [ 50%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-cross_layer] FAILED [ 60%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-normal] FAILED [ 70%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-flash_infer] FAILED [ 80%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-mla] FAILED [ 90%]
tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-sglang_mla] FAILED [100%]
========================================================================================================== FAILURES ==========================================================================================================
______________________________________________________________________________________ test_block_transfer_roundtrip[bf16-cross_layer] _______________________________________________________________________________________
gpu_kv_format = <GPUKVFormat.NB_NL_TWO_BS_NH_HS: 0>, nl = 4, nh = 8, hs = 128, is_mla = False, dtype = torch.bfloat16
@pytest.mark.parametrize(
"gpu_kv_format,nl,nh,hs,is_mla",
FORMAT_PARAMS,
ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_roundtrip(gpu_kv_format, nl, nh, hs, is_mla, dtype):
"""
D2H → H2D roundtrip: data written via D2H must be recoverable via H2D.
Uses disjoint source and target block IDs so the result is unambiguous.
"""
device = torch.device("xpu")
kv_dim = 1 if is_mla else 2
hidden_dim = nh * hs
source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
target_vllm = create_zero_vllm_tensors(
gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
)
mem_objects = create_memory_objects(
kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
)
rng_d2h = random.Random(42)
block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
excluded = set(block_ids_d2h)
available = [i for i in range(NB) if i not in excluded]
rng_h2d = random.Random(123)
block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tests/v1/test_mp_mem_kernels_sycl.py:218:
self = <random.Random object at 0x2e662100>, population = [1, 8, 9, 10, 15, 21, ...], k = 64
def sample(self, population, k, *, counts=None):
"""Chooses k unique random elements from a population sequence.
Returns a new list containing elements from the population while
leaving the original population unchanged. The resulting list is
in selection order so that all sub-slices will also be valid random
samples. This allows raffle winners (the sample) to be partitioned
into grand prize and second place winners (the subslices).
Members of the population need not be hashable or unique. If the
population contains repeats, then each occurrence is a possible
selection in the sample.
Repeated elements can be specified one at a time or with the optional
counts parameter. For example:
sample(['red', 'blue'], counts=[4, 2], k=5)
is equivalent to:
sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)
To choose a sample from a range of integers, use range() for the
population argument. This is especially fast and space efficient
for sampling from a large population:
sample(range(10000000), 60)
"""
# Sampling without replacement entails tracking either potential
# selections (the pool) in a list or previous selections in a set.
# When the number of selections is small compared to the
# population, then tracking selections is efficient, requiring
# only a small set and an occasional reselection. For
# a larger number of selections, the pool tracking method is
# preferred since the list takes less space than the
# set and it doesn't suffer from frequent reselections.
# The number of calls to _randbelow() is kept at or near k, the
# theoretical minimum. This is important because running time
# is dominated by _randbelow() and because it extracts the
# least entropy from the underlying random number generators.
# Memory requirements are kept to the smaller of a k-length
# set or an n-length list.
# There are other sampling algorithms that do not require
# auxiliary memory, but they were rejected because they made
# too many calls to _randbelow(), making them slower and
# causing them to eat more entropy than necessary.
if not isinstance(population, _Sequence):
raise TypeError("Population must be a sequence. "
"For dicts or sets, use sorted(d).")
n = len(population)
if counts is not None:
cum_counts = list(_accumulate(counts))
if len(cum_counts) != n:
raise ValueError('The number of counts does not match the population')
total = cum_counts.pop()
if not isinstance(total, int):
raise TypeError('Counts must be integers')
if total <= 0:
raise ValueError('Total of counts must be greater than zero')
selections = self.sample(range(total), k=k)
bisect = _bisect
return [population[bisect(cum_counts, s)] for s in selections]
randbelow = self._randbelow
if not 0 <= k <= n:
raise ValueError("Sample larger than population or is negative")
E ValueError: Sample larger than population or is negative
/usr/lib/python3.12/random.py:430: ValueError
_________________________________________________________________________________________ test_block_transfer_roundtrip[bf16-normal] _________________________________________________________________________________________
gpu_kv_format = <GPUKVFormat.NL_X_TWO_NB_BS_NH_HS: 1>, nl = 4, nh = 8, hs = 128, is_mla = False, dtype = torch.bfloat16
@pytest.mark.parametrize(
"gpu_kv_format,nl,nh,hs,is_mla",
FORMAT_PARAMS,
ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_roundtrip(gpu_kv_format, nl, nh, hs, is_mla, dtype):
"""
D2H → H2D roundtrip: data written via D2H must be recoverable via H2D.
Uses disjoint source and target block IDs so the result is unambiguous.
"""
device = torch.device("xpu")
kv_dim = 1 if is_mla else 2
hidden_dim = nh * hs
source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
target_vllm = create_zero_vllm_tensors(
gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
)
mem_objects = create_memory_objects(
kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
)
rng_d2h = random.Random(42)
block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
excluded = set(block_ids_d2h)
available = [i for i in range(NB) if i not in excluded]
rng_h2d = random.Random(123)
block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tests/v1/test_mp_mem_kernels_sycl.py:218:
self = <random.Random object at 0x2e684800>, population = [1, 8, 9, 10, 15, 21, ...], k = 64
def sample(self, population, k, *, counts=None):
"""Chooses k unique random elements from a population sequence.
Returns a new list containing elements from the population while
leaving the original population unchanged. The resulting list is
in selection order so that all sub-slices will also be valid random
samples. This allows raffle winners (the sample) to be partitioned
into grand prize and second place winners (the subslices).
Members of the population need not be hashable or unique. If the
population contains repeats, then each occurrence is a possible
selection in the sample.
Repeated elements can be specified one at a time or with the optional
counts parameter. For example:
sample(['red', 'blue'], counts=[4, 2], k=5)
is equivalent to:
sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)
To choose a sample from a range of integers, use range() for the
population argument. This is especially fast and space efficient
for sampling from a large population:
sample(range(10000000), 60)
"""
# Sampling without replacement entails tracking either potential
# selections (the pool) in a list or previous selections in a set.
# When the number of selections is small compared to the
# population, then tracking selections is efficient, requiring
# only a small set and an occasional reselection. For
# a larger number of selections, the pool tracking method is
# preferred since the list takes less space than the
# set and it doesn't suffer from frequent reselections.
# The number of calls to _randbelow() is kept at or near k, the
# theoretical minimum. This is important because running time
# is dominated by _randbelow() and because it extracts the
# least entropy from the underlying random number generators.
# Memory requirements are kept to the smaller of a k-length
# set or an n-length list.
# There are other sampling algorithms that do not require
# auxiliary memory, but they were rejected because they made
# too many calls to _randbelow(), making them slower and
# causing them to eat more entropy than necessary.
if not isinstance(population, _Sequence):
raise TypeError("Population must be a sequence. "
"For dicts or sets, use sorted(d).")
n = len(population)
if counts is not None:
cum_counts = list(_accumulate(counts))
if len(cum_counts) != n:
raise ValueError('The number of counts does not match the population')
total = cum_counts.pop()
if not isinstance(total, int):
raise TypeError('Counts must be integers')
if total <= 0:
raise ValueError('Total of counts must be greater than zero')
selections = self.sample(range(total), k=k)
bisect = _bisect
return [population[bisect(cum_counts, s)] for s in selections]
randbelow = self._randbelow
if not 0 <= k <= n:
raise ValueError("Sample larger than population or is negative")
E ValueError: Sample larger than population or is negative
/usr/lib/python3.12/random.py:430: ValueError
______________________________________________________________________________________ test_block_transfer_roundtrip[bf16-flash_infer] _______________________________________________________________________________________
gpu_kv_format = <GPUKVFormat.NL_X_NB_TWO_BS_NH_HS: 2>, nl = 4, nh = 8, hs = 128, is_mla = False, dtype = torch.bfloat16
@pytest.mark.parametrize(
"gpu_kv_format,nl,nh,hs,is_mla",
FORMAT_PARAMS,
ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_roundtrip(gpu_kv_format, nl, nh, hs, is_mla, dtype):
"""
D2H → H2D roundtrip: data written via D2H must be recoverable via H2D.
Uses disjoint source and target block IDs so the result is unambiguous.
"""
device = torch.device("xpu")
kv_dim = 1 if is_mla else 2
hidden_dim = nh * hs
source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
target_vllm = create_zero_vllm_tensors(
gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
)
mem_objects = create_memory_objects(
kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
)
rng_d2h = random.Random(42)
block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
excluded = set(block_ids_d2h)
available = [i for i in range(NB) if i not in excluded]
rng_h2d = random.Random(123)
block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tests/v1/test_mp_mem_kernels_sycl.py:218:
self = <random.Random object at 0x2e6a9c10>, population = [1, 8, 9, 10, 15, 21, ...], k = 64
def sample(self, population, k, *, counts=None):
"""Chooses k unique random elements from a population sequence.
Returns a new list containing elements from the population while
leaving the original population unchanged. The resulting list is
in selection order so that all sub-slices will also be valid random
samples. This allows raffle winners (the sample) to be partitioned
into grand prize and second place winners (the subslices).
Members of the population need not be hashable or unique. If the
population contains repeats, then each occurrence is a possible
selection in the sample.
Repeated elements can be specified one at a time or with the optional
counts parameter. For example:
sample(['red', 'blue'], counts=[4, 2], k=5)
is equivalent to:
sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)
To choose a sample from a range of integers, use range() for the
population argument. This is especially fast and space efficient
for sampling from a large population:
sample(range(10000000), 60)
"""
# Sampling without replacement entails tracking either potential
# selections (the pool) in a list or previous selections in a set.
# When the number of selections is small compared to the
# population, then tracking selections is efficient, requiring
# only a small set and an occasional reselection. For
# a larger number of selections, the pool tracking method is
# preferred since the list takes less space than the
# set and it doesn't suffer from frequent reselections.
# The number of calls to _randbelow() is kept at or near k, the
# theoretical minimum. This is important because running time
# is dominated by _randbelow() and because it extracts the
# least entropy from the underlying random number generators.
# Memory requirements are kept to the smaller of a k-length
# set or an n-length list.
# There are other sampling algorithms that do not require
# auxiliary memory, but they were rejected because they made
# too many calls to _randbelow(), making them slower and
# causing them to eat more entropy than necessary.
if not isinstance(population, _Sequence):
raise TypeError("Population must be a sequence. "
"For dicts or sets, use sorted(d).")
n = len(population)
if counts is not None:
cum_counts = list(_accumulate(counts))
if len(cum_counts) != n:
raise ValueError('The number of counts does not match the population')
total = cum_counts.pop()
if not isinstance(total, int):
raise TypeError('Counts must be integers')
if total <= 0:
raise ValueError('Total of counts must be greater than zero')
selections = self.sample(range(total), k=k)
bisect = _bisect
return [population[bisect(cum_counts, s)] for s in selections]
randbelow = self._randbelow
if not 0 <= k <= n:
raise ValueError("Sample larger than population or is negative")
E ValueError: Sample larger than population or is negative
/usr/lib/python3.12/random.py:430: ValueError
__________________________________________________________________________________________ test_block_transfer_roundtrip[bf16-mla] ___________________________________________________________________________________________
gpu_kv_format = <GPUKVFormat.NL_X_NB_BS_HS: 3>, nl = 4, nh = 1, hs = 576, is_mla = True, dtype = torch.bfloat16
@pytest.mark.parametrize(
"gpu_kv_format,nl,nh,hs,is_mla",
FORMAT_PARAMS,
ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_roundtrip(gpu_kv_format, nl, nh, hs, is_mla, dtype):
"""
D2H → H2D roundtrip: data written via D2H must be recoverable via H2D.
Uses disjoint source and target block IDs so the result is unambiguous.
"""
device = torch.device("xpu")
kv_dim = 1 if is_mla else 2
hidden_dim = nh * hs
source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
target_vllm = create_zero_vllm_tensors(
gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
)
mem_objects = create_memory_objects(
kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
)
rng_d2h = random.Random(42)
block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
excluded = set(block_ids_d2h)
available = [i for i in range(NB) if i not in excluded]
rng_h2d = random.Random(123)
block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tests/v1/test_mp_mem_kernels_sycl.py:218:
self = <random.Random object at 0x2e6d93a0>, population = [1, 8, 9, 10, 15, 21, ...], k = 64
def sample(self, population, k, *, counts=None):
"""Chooses k unique random elements from a population sequence.
Returns a new list containing elements from the population while
leaving the original population unchanged. The resulting list is
in selection order so that all sub-slices will also be valid random
samples. This allows raffle winners (the sample) to be partitioned
into grand prize and second place winners (the subslices).
Members of the population need not be hashable or unique. If the
population contains repeats, then each occurrence is a possible
selection in the sample.
Repeated elements can be specified one at a time or with the optional
counts parameter. For example:
sample(['red', 'blue'], counts=[4, 2], k=5)
is equivalent to:
sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)
To choose a sample from a range of integers, use range() for the
population argument. This is especially fast and space efficient
for sampling from a large population:
sample(range(10000000), 60)
"""
# Sampling without replacement entails tracking either potential
# selections (the pool) in a list or previous selections in a set.
# When the number of selections is small compared to the
# population, then tracking selections is efficient, requiring
# only a small set and an occasional reselection. For
# a larger number of selections, the pool tracking method is
# preferred since the list takes less space than the
# set and it doesn't suffer from frequent reselections.
# The number of calls to _randbelow() is kept at or near k, the
# theoretical minimum. This is important because running time
# is dominated by _randbelow() and because it extracts the
# least entropy from the underlying random number generators.
# Memory requirements are kept to the smaller of a k-length
# set or an n-length list.
# There are other sampling algorithms that do not require
# auxiliary memory, but they were rejected because they made
# too many calls to _randbelow(), making them slower and
# causing them to eat more entropy than necessary.
if not isinstance(population, _Sequence):
raise TypeError("Population must be a sequence. "
"For dicts or sets, use sorted(d).")
n = len(population)
if counts is not None:
cum_counts = list(_accumulate(counts))
if len(cum_counts) != n:
raise ValueError('The number of counts does not match the population')
total = cum_counts.pop()
if not isinstance(total, int):
raise TypeError('Counts must be integers')
if total <= 0:
raise ValueError('Total of counts must be greater than zero')
selections = self.sample(range(total), k=k)
bisect = _bisect
return [population[bisect(cum_counts, s)] for s in selections]
randbelow = self._randbelow
if not 0 <= k <= n:
raise ValueError("Sample larger than population or is negative")
E ValueError: Sample larger than population or is negative
/usr/lib/python3.12/random.py:430: ValueError
_______________________________________________________________________________________ test_block_transfer_roundtrip[bf16-sglang_mla] _______________________________________________________________________________________
gpu_kv_format = <GPUKVFormat.NL_X_NBBS_ONE_HS: 5>, nl = 4, nh = 1, hs = 576, is_mla = True, dtype = torch.bfloat16
@pytest.mark.parametrize(
"gpu_kv_format,nl,nh,hs,is_mla",
FORMAT_PARAMS,
ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_roundtrip(gpu_kv_format, nl, nh, hs, is_mla, dtype):
"""
D2H → H2D roundtrip: data written via D2H must be recoverable via H2D.
Uses disjoint source and target block IDs so the result is unambiguous.
"""
device = torch.device("xpu")
kv_dim = 1 if is_mla else 2
hidden_dim = nh * hs
source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
target_vllm = create_zero_vllm_tensors(
gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
)
mem_objects = create_memory_objects(
kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
)
rng_d2h = random.Random(42)
block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
excluded = set(block_ids_d2h)
available = [i for i in range(NB) if i not in excluded]
rng_h2d = random.Random(123)
block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tests/v1/test_mp_mem_kernels_sycl.py:218:
self = <random.Random object at 0x2e6b59f0>, population = [1, 8, 9, 10, 15, 21, ...], k = 64
def sample(self, population, k, *, counts=None):
"""Chooses k unique random elements from a population sequence.
Returns a new list containing elements from the population while
leaving the original population unchanged. The resulting list is
in selection order so that all sub-slices will also be valid random
samples. This allows raffle winners (the sample) to be partitioned
into grand prize and second place winners (the subslices).
Members of the population need not be hashable or unique. If the
population contains repeats, then each occurrence is a possible
selection in the sample.
Repeated elements can be specified one at a time or with the optional
counts parameter. For example:
sample(['red', 'blue'], counts=[4, 2], k=5)
is equivalent to:
sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)
To choose a sample from a range of integers, use range() for the
population argument. This is especially fast and space efficient
for sampling from a large population:
sample(range(10000000), 60)
"""
# Sampling without replacement entails tracking either potential
# selections (the pool) in a list or previous selections in a set.
# When the number of selections is small compared to the
# population, then tracking selections is efficient, requiring
# only a small set and an occasional reselection. For
# a larger number of selections, the pool tracking method is
# preferred since the list takes less space than the
# set and it doesn't suffer from frequent reselections.
# The number of calls to _randbelow() is kept at or near k, the
# theoretical minimum. This is important because running time
# is dominated by _randbelow() and because it extracts the
# least entropy from the underlying random number generators.
# Memory requirements are kept to the smaller of a k-length
# set or an n-length list.
# There are other sampling algorithms that do not require
# auxiliary memory, but they were rejected because they made
# too many calls to _randbelow(), making them slower and
# causing them to eat more entropy than necessary.
if not isinstance(population, _Sequence):
raise TypeError("Population must be a sequence. "
"For dicts or sets, use sorted(d).")
n = len(population)
if counts is not None:
cum_counts = list(_accumulate(counts))
if len(cum_counts) != n:
raise ValueError('The number of counts does not match the population')
total = cum_counts.pop()
if not isinstance(total, int):
raise TypeError('Counts must be integers')
if total <= 0:
raise ValueError('Total of counts must be greater than zero')
selections = self.sample(range(total), k=k)
bisect = _bisect
return [population[bisect(cum_counts, s)] for s in selections]
randbelow = self._randbelow
if not 0 <= k <= n:
raise ValueError("Sample larger than population or is negative")
E ValueError: Sample larger than population or is negative
/usr/lib/python3.12/random.py:430: ValueError
_____________________________________________________________________________________ test_block_transfer_skip_prefix[bf16-cross_layer] ______________________________________________________________________________________
gpu_kv_format = <GPUKVFormat.NB_NL_TWO_BS_NH_HS: 0>, nl = 4, nh = 8, hs = 128, is_mla = False, dtype = torch.bfloat16
@pytest.mark.parametrize(
"gpu_kv_format,nl,nh,hs,is_mla",
FORMAT_PARAMS,
ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_skip_prefix(gpu_kv_format, nl, nh, hs, is_mla, dtype):
"""Verify skip_prefix_n_blocks skips the first N blocks globally."""
device = torch.device("xpu")
kv_dim = 1 if is_mla else 2
hidden_dim = nh * hs
skip = 4
source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
target_vllm = create_zero_vllm_tensors(
gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
)
mem_objects = create_memory_objects(
kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
)
rng_d2h = random.Random(42)
block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
excluded = set(block_ids_d2h)
available = [i for i in range(NB) if i not in excluded]
rng_h2d = random.Random(123)
block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tests/v1/test_mp_mem_kernels_sycl.py:290:
self = <random.Random object at 0x2e6df070>, population = [1, 8, 9, 10, 15, 21, ...], k = 64
def sample(self, population, k, *, counts=None):
"""Chooses k unique random elements from a population sequence.
Returns a new list containing elements from the population while
leaving the original population unchanged. The resulting list is
in selection order so that all sub-slices will also be valid random
samples. This allows raffle winners (the sample) to be partitioned
into grand prize and second place winners (the subslices).
Members of the population need not be hashable or unique. If the
population contains repeats, then each occurrence is a possible
selection in the sample.
Repeated elements can be specified one at a time or with the optional
counts parameter. For example:
sample(['red', 'blue'], counts=[4, 2], k=5)
is equivalent to:
sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)
To choose a sample from a range of integers, use range() for the
population argument. This is especially fast and space efficient
for sampling from a large population:
sample(range(10000000), 60)
"""
# Sampling without replacement entails tracking either potential
# selections (the pool) in a list or previous selections in a set.
# When the number of selections is small compared to the
# population, then tracking selections is efficient, requiring
# only a small set and an occasional reselection. For
# a larger number of selections, the pool tracking method is
# preferred since the list takes less space than the
# set and it doesn't suffer from frequent reselections.
# The number of calls to _randbelow() is kept at or near k, the
# theoretical minimum. This is important because running time
# is dominated by _randbelow() and because it extracts the
# least entropy from the underlying random number generators.
# Memory requirements are kept to the smaller of a k-length
# set or an n-length list.
# There are other sampling algorithms that do not require
# auxiliary memory, but they were rejected because they made
# too many calls to _randbelow(), making them slower and
# causing them to eat more entropy than necessary.
if not isinstance(population, _Sequence):
raise TypeError("Population must be a sequence. "
"For dicts or sets, use sorted(d).")
n = len(population)
if counts is not None:
cum_counts = list(_accumulate(counts))
if len(cum_counts) != n:
raise ValueError('The number of counts does not match the population')
total = cum_counts.pop()
if not isinstance(total, int):
raise TypeError('Counts must be integers')
if total <= 0:
raise ValueError('Total of counts must be greater than zero')
selections = self.sample(range(total), k=k)
bisect = _bisect
return [population[bisect(cum_counts, s)] for s in selections]
randbelow = self._randbelow
if not 0 <= k <= n:
raise ValueError("Sample larger than population or is negative")
E ValueError: Sample larger than population or is negative
/usr/lib/python3.12/random.py:430: ValueError
________________________________________________________________________________________ test_block_transfer_skip_prefix[bf16-normal] ________________________________________________________________________________________
gpu_kv_format = <GPUKVFormat.NL_X_TWO_NB_BS_NH_HS: 1>, nl = 4, nh = 8, hs = 128, is_mla = False, dtype = torch.bfloat16
@pytest.mark.parametrize(
"gpu_kv_format,nl,nh,hs,is_mla",
FORMAT_PARAMS,
ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_skip_prefix(gpu_kv_format, nl, nh, hs, is_mla, dtype):
"""Verify skip_prefix_n_blocks skips the first N blocks globally."""
device = torch.device("xpu")
kv_dim = 1 if is_mla else 2
hidden_dim = nh * hs
skip = 4
source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
target_vllm = create_zero_vllm_tensors(
gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
)
mem_objects = create_memory_objects(
kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
)
rng_d2h = random.Random(42)
block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
excluded = set(block_ids_d2h)
available = [i for i in range(NB) if i not in excluded]
rng_h2d = random.Random(123)
block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tests/v1/test_mp_mem_kernels_sycl.py:290:
self = <random.Random object at 0x2e6835f0>, population = [1, 8, 9, 10, 15, 21, ...], k = 64
def sample(self, population, k, *, counts=None):
"""Chooses k unique random elements from a population sequence.
Returns a new list containing elements from the population while
leaving the original population unchanged. The resulting list is
in selection order so that all sub-slices will also be valid random
samples. This allows raffle winners (the sample) to be partitioned
into grand prize and second place winners (the subslices).
Members of the population need not be hashable or unique. If the
population contains repeats, then each occurrence is a possible
selection in the sample.
Repeated elements can be specified one at a time or with the optional
counts parameter. For example:
sample(['red', 'blue'], counts=[4, 2], k=5)
is equivalent to:
sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)
To choose a sample from a range of integers, use range() for the
population argument. This is especially fast and space efficient
for sampling from a large population:
sample(range(10000000), 60)
"""
# Sampling without replacement entails tracking either potential
# selections (the pool) in a list or previous selections in a set.
# When the number of selections is small compared to the
# population, then tracking selections is efficient, requiring
# only a small set and an occasional reselection. For
# a larger number of selections, the pool tracking method is
# preferred since the list takes less space than the
# set and it doesn't suffer from frequent reselections.
# The number of calls to _randbelow() is kept at or near k, the
# theoretical minimum. This is important because running time
# is dominated by _randbelow() and because it extracts the
# least entropy from the underlying random number generators.
# Memory requirements are kept to the smaller of a k-length
# set or an n-length list.
# There are other sampling algorithms that do not require
# auxiliary memory, but they were rejected because they made
# too many calls to _randbelow(), making them slower and
# causing them to eat more entropy than necessary.
if not isinstance(population, _Sequence):
raise TypeError("Population must be a sequence. "
"For dicts or sets, use sorted(d).")
n = len(population)
if counts is not None:
cum_counts = list(_accumulate(counts))
if len(cum_counts) != n:
raise ValueError('The number of counts does not match the population')
total = cum_counts.pop()
if not isinstance(total, int):
raise TypeError('Counts must be integers')
if total <= 0:
raise ValueError('Total of counts must be greater than zero')
selections = self.sample(range(total), k=k)
bisect = _bisect
return [population[bisect(cum_counts, s)] for s in selections]
randbelow = self._randbelow
if not 0 <= k <= n:
raise ValueError("Sample larger than population or is negative")
E ValueError: Sample larger than population or is negative
/usr/lib/python3.12/random.py:430: ValueError
_____________________________________________________________________________________ test_block_transfer_skip_prefix[bf16-flash_infer] ______________________________________________________________________________________
gpu_kv_format = <GPUKVFormat.NL_X_NB_TWO_BS_NH_HS: 2>, nl = 4, nh = 8, hs = 128, is_mla = False, dtype = torch.bfloat16
@pytest.mark.parametrize(
"gpu_kv_format,nl,nh,hs,is_mla",
FORMAT_PARAMS,
ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_skip_prefix(gpu_kv_format, nl, nh, hs, is_mla, dtype):
"""Verify skip_prefix_n_blocks skips the first N blocks globally."""
device = torch.device("xpu")
kv_dim = 1 if is_mla else 2
hidden_dim = nh * hs
skip = 4
source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
target_vllm = create_zero_vllm_tensors(
gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
)
mem_objects = create_memory_objects(
kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
)
rng_d2h = random.Random(42)
block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
excluded = set(block_ids_d2h)
available = [i for i in range(NB) if i not in excluded]
rng_h2d = random.Random(123)
block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tests/v1/test_mp_mem_kernels_sycl.py:290:
self = <random.Random object at 0x2e6b5100>, population = [1, 8, 9, 10, 15, 21, ...], k = 64
def sample(self, population, k, *, counts=None):
"""Chooses k unique random elements from a population sequence.
Returns a new list containing elements from the population while
leaving the original population unchanged. The resulting list is
in selection order so that all sub-slices will also be valid random
samples. This allows raffle winners (the sample) to be partitioned
into grand prize and second place winners (the subslices).
Members of the population need not be hashable or unique. If the
population contains repeats, then each occurrence is a possible
selection in the sample.
Repeated elements can be specified one at a time or with the optional
counts parameter. For example:
sample(['red', 'blue'], counts=[4, 2], k=5)
is equivalent to:
sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)
To choose a sample from a range of integers, use range() for the
population argument. This is especially fast and space efficient
for sampling from a large population:
sample(range(10000000), 60)
"""
# Sampling without replacement entails tracking either potential
# selections (the pool) in a list or previous selections in a set.
# When the number of selections is small compared to the
# population, then tracking selections is efficient, requiring
# only a small set and an occasional reselection. For
# a larger number of selections, the pool tracking method is
# preferred since the list takes less space than the
# set and it doesn't suffer from frequent reselections.
# The number of calls to _randbelow() is kept at or near k, the
# theoretical minimum. This is important because running time
# is dominated by _randbelow() and because it extracts the
# least entropy from the underlying random number generators.
# Memory requirements are kept to the smaller of a k-length
# set or an n-length list.
# There are other sampling algorithms that do not require
# auxiliary memory, but they were rejected because they made
# too many calls to _randbelow(), making them slower and
# causing them to eat more entropy than necessary.
if not isinstance(population, _Sequence):
raise TypeError("Population must be a sequence. "
"For dicts or sets, use sorted(d).")
n = len(population)
if counts is not None:
cum_counts = list(_accumulate(counts))
if len(cum_counts) != n:
raise ValueError('The number of counts does not match the population')
total = cum_counts.pop()
if not isinstance(total, int):
raise TypeError('Counts must be integers')
if total <= 0:
raise ValueError('Total of counts must be greater than zero')
selections = self.sample(range(total), k=k)
bisect = _bisect
return [population[bisect(cum_counts, s)] for s in selections]
randbelow = self._randbelow
if not 0 <= k <= n:
raise ValueError("Sample larger than population or is negative")
E ValueError: Sample larger than population or is negative
/usr/lib/python3.12/random.py:430: ValueError
_________________________________________________________________________________________ test_block_transfer_skip_prefix[bf16-mla] __________________________________________________________________________________________
gpu_kv_format = <GPUKVFormat.NL_X_NB_BS_HS: 3>, nl = 4, nh = 1, hs = 576, is_mla = True, dtype = torch.bfloat16
@pytest.mark.parametrize(
"gpu_kv_format,nl,nh,hs,is_mla",
FORMAT_PARAMS,
ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_skip_prefix(gpu_kv_format, nl, nh, hs, is_mla, dtype):
"""Verify skip_prefix_n_blocks skips the first N blocks globally."""
device = torch.device("xpu")
kv_dim = 1 if is_mla else 2
hidden_dim = nh * hs
skip = 4
source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
target_vllm = create_zero_vllm_tensors(
gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
)
mem_objects = create_memory_objects(
kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
)
rng_d2h = random.Random(42)
block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
excluded = set(block_ids_d2h)
available = [i for i in range(NB) if i not in excluded]
rng_h2d = random.Random(123)
block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tests/v1/test_mp_mem_kernels_sycl.py:290:
self = <random.Random object at 0x2e6d7b30>, population = [1, 8, 9, 10, 15, 21, ...], k = 64
def sample(self, population, k, *, counts=None):
"""Chooses k unique random elements from a population sequence.
Returns a new list containing elements from the population while
leaving the original population unchanged. The resulting list is
in selection order so that all sub-slices will also be valid random
samples. This allows raffle winners (the sample) to be partitioned
into grand prize and second place winners (the subslices).
Members of the population need not be hashable or unique. If the
population contains repeats, then each occurrence is a possible
selection in the sample.
Repeated elements can be specified one at a time or with the optional
counts parameter. For example:
sample(['red', 'blue'], counts=[4, 2], k=5)
is equivalent to:
sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)
To choose a sample from a range of integers, use range() for the
population argument. This is especially fast and space efficient
for sampling from a large population:
sample(range(10000000), 60)
"""
# Sampling without replacement entails tracking either potential
# selections (the pool) in a list or previous selections in a set.
# When the number of selections is small compared to the
# population, then tracking selections is efficient, requiring
# only a small set and an occasional reselection. For
# a larger number of selections, the pool tracking method is
# preferred since the list takes less space than the
# set and it doesn't suffer from frequent reselections.
# The number of calls to _randbelow() is kept at or near k, the
# theoretical minimum. This is important because running time
# is dominated by _randbelow() and because it extracts the
# least entropy from the underlying random number generators.
# Memory requirements are kept to the smaller of a k-length
# set or an n-length list.
# There are other sampling algorithms that do not require
# auxiliary memory, but they were rejected because they made
# too many calls to _randbelow(), making them slower and
# causing them to eat more entropy than necessary.
if not isinstance(population, _Sequence):
raise TypeError("Population must be a sequence. "
"For dicts or sets, use sorted(d).")
n = len(population)
if counts is not None:
cum_counts = list(_accumulate(counts))
if len(cum_counts) != n:
raise ValueError('The number of counts does not match the population')
total = cum_counts.pop()
if not isinstance(total, int):
raise TypeError('Counts must be integers')
if total <= 0:
raise ValueError('Total of counts must be greater than zero')
selections = self.sample(range(total), k=k)
bisect = _bisect
return [population[bisect(cum_counts, s)] for s in selections]
randbelow = self._randbelow
if not 0 <= k <= n:
raise ValueError("Sample larger than population or is negative")
E ValueError: Sample larger than population or is negative
/usr/lib/python3.12/random.py:430: ValueError
______________________________________________________________________________________ test_block_transfer_skip_prefix[bf16-sglang_mla] ______________________________________________________________________________________
gpu_kv_format = <GPUKVFormat.NL_X_NBBS_ONE_HS: 5>, nl = 4, nh = 1, hs = 576, is_mla = True, dtype = torch.bfloat16
@pytest.mark.parametrize(
"gpu_kv_format,nl,nh,hs,is_mla",
FORMAT_PARAMS,
ids=["cross_layer", "normal", "flash_infer", "mla", "sglang_mla"],
)
@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
def test_block_transfer_skip_prefix(gpu_kv_format, nl, nh, hs, is_mla, dtype):
"""Verify skip_prefix_n_blocks skips the first N blocks globally."""
device = torch.device("xpu")
kv_dim = 1 if is_mla else 2
hidden_dim = nh * hs
skip = 4
source_vllm = create_vllm_tensors(gpu_kv_format, nl, NB, BS, nh, hs, dtype, device)
target_vllm = create_zero_vllm_tensors(
gpu_kv_format, nl, NB, BS, nh, hs, dtype, device
)
mem_objects = create_memory_objects(
kv_dim, nl, TOKENS_PER_OBJECT, hidden_dim, NUM_MEMORY_OBJECTS, dtype, device
)
rng_d2h = random.Random(42)
block_ids_d2h = rng_d2h.sample(range(NB), TOTAL_BLOCKS)
excluded = set(block_ids_d2h)
available = [i for i in range(NB) if i not in excluded]
rng_h2d = random.Random(123)
block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tests/v1/test_mp_mem_kernels_sycl.py:290:
self = <random.Random object at 0x2e6e0c80>, population = [1, 8, 9, 10, 15, 21, ...], k = 64
def sample(self, population, k, *, counts=None):
"""Chooses k unique random elements from a population sequence.
Returns a new list containing elements from the population while
leaving the original population unchanged. The resulting list is
in selection order so that all sub-slices will also be valid random
samples. This allows raffle winners (the sample) to be partitioned
into grand prize and second place winners (the subslices).
Members of the population need not be hashable or unique. If the
population contains repeats, then each occurrence is a possible
selection in the sample.
Repeated elements can be specified one at a time or with the optional
counts parameter. For example:
sample(['red', 'blue'], counts=[4, 2], k=5)
is equivalent to:
sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)
To choose a sample from a range of integers, use range() for the
population argument. This is especially fast and space efficient
for sampling from a large population:
sample(range(10000000), 60)
"""
# Sampling without replacement entails tracking either potential
# selections (the pool) in a list or previous selections in a set.
# When the number of selections is small compared to the
# population, then tracking selections is efficient, requiring
# only a small set and an occasional reselection. For
# a larger number of selections, the pool tracking method is
# preferred since the list takes less space than the
# set and it doesn't suffer from frequent reselections.
# The number of calls to _randbelow() is kept at or near k, the
# theoretical minimum. This is important because running time
# is dominated by _randbelow() and because it extracts the
# least entropy from the underlying random number generators.
# Memory requirements are kept to the smaller of a k-length
# set or an n-length list.
# There are other sampling algorithms that do not require
# auxiliary memory, but they were rejected because they made
# too many calls to _randbelow(), making them slower and
# causing them to eat more entropy than necessary.
if not isinstance(population, _Sequence):
raise TypeError("Population must be a sequence. "
"For dicts or sets, use sorted(d).")
n = len(population)
if counts is not None:
cum_counts = list(_accumulate(counts))
if len(cum_counts) != n:
raise ValueError('The number of counts does not match the population')
total = cum_counts.pop()
if not isinstance(total, int):
raise TypeError('Counts must be integers')
if total <= 0:
raise ValueError('Total of counts must be greater than zero')
selections = self.sample(range(total), k=k)
bisect = _bisect
return [population[bisect(cum_counts, s)] for s in selections]
randbelow = self._randbelow
if not 0 <= k <= n:
raise ValueError("Sample larger than population or is negative")
E ValueError: Sample larger than population or is negative
/usr/lib/python3.12/random.py:430: ValueError
====================================================================================================== warnings summary ======================================================================================================
../../../opt/venv/lib/python3.12/site-packages/opentelemetry/sdk/_logs/_internal/init.py:531: 50 warnings
/opt/venv/lib/python3.12/site-packages/opentelemetry/sdk/_logs/_internal/init.py:531: DeprecationWarning: LoggingHandler in opentelemetry-sdk is deprecated. Use the handler from opentelemetry-instrumentation-logging instead.
warnings.warn(
:488
:488: DeprecationWarning: builtin type SwigPyPacked has no module attribute
:488
:488: DeprecationWarning: builtin type SwigPyObject has no module attribute
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================================================================================== short test summary info ===================================================================================================
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-cross_layer] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-normal] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-flash_infer] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-mla] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_roundtrip[bf16-sglang_mla] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-cross_layer] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-normal] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-flash_infer] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-mla] - ValueError: Sample larger than population or is negative
FAILED tests/v1/test_mp_mem_kernels_sycl.py::test_block_transfer_skip_prefix[bf16-sglang_mla] - ValueError: Sample larger than population or is negative
============================================================================================== 10 failed, 52 warnings in 1.55s ===============================================================================================
root@b60-8-hq-1:~/hlin76/LMCache#
@copilot
Fixed in commit b1d9cc7. The root cause was a block_ids_h2d = rng_h2d.sample(available, TOTAL_BLOCKS) # line 218With |
Ports the block-level multi-layer KV transfer kernel (
multi_layer_block_kv_transfer) to SYCL so it runs on Intel XPU without any CUDA dependency.New files
csrc/sycl/mp_mem_kernels_sycl.h— Standalone header withPageBufferShapeDesc(plaininline, no__host__ __device__),MemoryObj4<T>, and the function declaration. Compiles withicpxalone; deleting all CUDA sources still builds.csrc/sycl/mp_mem_kernels_sycl.cpp— SYCL kernel implementation:nd_range<3>grid(kv_size, total_blocks, nl)with flat 1D local range;[[sycl::reqd_sub_group_size(16)]]; WG size rounded to SG-16 multiple, capped at 256int64_t→int32_t→int16_tselected byhead_bytesalignment (nouint4)LMCACHE_TO_ENGINEandFORMATtemplate params eliminate runtime branchesNB_NL_TWO_BS_NH_HS,NL_X_TWO_NB_BS_NH_HS,NL_X_NB_TWO_BS_NH_HS,NL_X_NB_BS_HS,NL_X_NBBS_ONE_HS; unsupported formats throwstd::runtime_errortests/v1/test_mp_mem_kernels_sycl.py— D2H→H2D roundtrip andskip_prefix_n_blockstests for all 5 formats on XPUModified files
csrc/sycl/pybind_sycl.cpp— Adds#include <pybind11/stl.h>, includesmp_mem_kernels_sycl.h, registersPageBufferShapeDescclass andmulti_layer_block_kv_transferinxpu_ops(matching thec_opsbinding structure)setup.py— Addsmp_mem_kernels_sycl.cpptosycl_sourcesUsage