B300 MLPERF #1429

@ninano1208

Description

I'm running the MLPerf inference benchmark with Llama2-70B on a server equipped with a B300 GPU. However, when I run the command below, processing stops at the point shown in the log.

make run_harness RUN_ARGS="--benchmarks=llama2-70b --scenarios=Offline"

[TensorRT-LLM][INFO] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 1, GPU 16574 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 5. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.92 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.92 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.92 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.92 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 24.54 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 94.97 GiB, available: 71.12 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 24.54 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 94.97 GiB, available: 71.12 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 24.54 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 94.97 GiB, available: 71.12 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 24.54 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 94.97 GiB, available: 71.12 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 52436
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 52436
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 52436
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 52436
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 64.01 GiB for max tokens in paged KV cache (1677952).
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 64.01 GiB for max tokens in paged KV cache (1677952).
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 64.01 GiB for max tokens in paged KV cache (1677952).
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 64.01 GiB for max tokens in paged KV cache (1677952).
[TensorRT-LLM][INFO] Executor instance created by worker
[TensorRT-LLM][INFO] Executor instance created by worker
[TensorRT-LLM][INFO] Executor instance created by worker
[TensorRT-LLM][INFO] Executor instance created by worker
[2026-01-30 02:54:24,462 utils.py:76 INFO] [ core#1 ] [ core.py:125 init ] Executor Using Devices: #[4, 5, 6, 7].
[2026-01-30 02:54:24,464 utils.py:76 INFO] [ LLMSUT ] [ mlperf_frontend.py:77 init ] Initialized SUT.
[2026-01-30 02:54:24,464 runner.py:223 INFO] Start Warm Up.
[2026-01-30 02:54:24,464 runner.py:225 INFO] Warm Up Done.
[2026-01-30 02:54:24,464 runner.py:227 INFO] Start Test.
[TensorRT-LLM][WARNING] The 'numReturnSequences' in the Request class is deprecated and will be removed in a future release. Please set the number of return sequences directly in 'SamplingConfig'.
Processing: | | 0/0 00:00<? user@user:/work$ [TensorRT-LLM][WARNING] The 'numReturnSequences' in the Request class is deprecated and will be removed in a future release. Please set the number of return sequences directly in 'SamplingConfig'.
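
The stall is visible in the final progress line, which stays at 0/0: the harness never received any queries before the shell prompt reappeared. As a quick sanity check (a hypothetical helper, not part of the MLPerf harness), one can scan the harness log for the last progress update:

```python
import re

def last_progress(log_text: str):
    """Return (done, total) from the final 'Processing:' progress line, or None."""
    matches = re.findall(r"Processing:.*?(\d+)/(\d+)", log_text)
    if not matches:
        return None
    done, total = matches[-1]
    return int(done), int(total)

# The last line of the log above, verbatim:
log = "Processing: | | 0/0 00:00<? user@user:/work$"
print(last_progress(log))  # (0, 0): zero queries completed out of zero issued
```

A `total` of 0 suggests the LoadGen side never issued samples, rather than the executors hanging mid-inference, which may help narrow down where to look.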
