I'm running the MLPerf inference benchmark with Llama2-70B on a server equipped with a B300 GPU. However, when I run the command below, processing stops.
make run_harness RUN_ARGS="--benchmarks=llama2-70b --scenarios=Offline"
[TensorRT-LLM][INFO] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 1, GPU 16574 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 5. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.92 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.92 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.92 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.92 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 24.54 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 94.97 GiB, available: 71.12 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 24.54 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 94.97 GiB, available: 71.12 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 24.54 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 94.97 GiB, available: 71.12 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 24.54 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 94.97 GiB, available: 71.12 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 52436
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 52436
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 52436
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 52436
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 64.01 GiB for max tokens in paged KV cache (1677952).
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 64.01 GiB for max tokens in paged KV cache (1677952).
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 64.01 GiB for max tokens in paged KV cache (1677952).
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 64.01 GiB for max tokens in paged KV cache (1677952).
[TensorRT-LLM][INFO] Executor instance created by worker
[TensorRT-LLM][INFO] Executor instance created by worker
[TensorRT-LLM][INFO] Executor instance created by worker
[TensorRT-LLM][INFO] Executor instance created by worker
[2026-01-30 02:54:24,462 utils.py:76 INFO] [ core#1 ] [ core.py:125 init ] Executor Using Devices: #[4, 5, 6, 7].
[2026-01-30 02:54:24,464 utils.py:76 INFO] [ LLMSUT ] [ mlperf_frontend.py:77 init ] Initialized SUT.
[2026-01-30 02:54:24,464 runner.py:223 INFO] Start Warm Up.
[2026-01-30 02:54:24,464 runner.py:225 INFO] Warm Up Done.
[2026-01-30 02:54:24,464 runner.py:227 INFO] Start Test.
[TensorRT-LLM][WARNING] The 'numReturnSequences' in the Request class is deprecated and will be removed in a future release. Please set the number of return sequences directly in 'SamplingConfig'.
Processing: |          | 0/0 [00:00<?]
user@user:/work$ [TensorRT-LLM][WARNING] The 'numReturnSequences' in the Request class is deprecated and will be removed in a future release. Please set the number of return sequences directly in 'SamplingConfig'.