ShoppingBench is a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products.
- Various real-world shopping intents
- A large-scale shopping sandbox
- Comprehensive evaluation metrics
The ShoppingBench dataset includes:
- `documents.jsonl.gz`: A compressed file containing product documents (located in the `resources/` directory)
  - Size: ~1.4GB compressed, ~4.8GB uncompressed
  - To decompress:
    ```bash
    gunzip -c resources/documents.jsonl.gz > resources/documents.jsonl
    ```
- Test files: Located in the `data/` directory (a small loading sketch follows this list)
  - `synthesize_product_test.jsonl`: Product Intent test cases
  - `synthesize_shop_test.jsonl`: Shop Intent test cases
  - `synthesize_voucher_test.jsonl`: Voucher Intent test cases
  - `synthesize_web_simpleqa_test.jsonl`: Web search Intent test cases
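The test files are JSON Lines, with one test case per line. Below is a minimal loading sketch; the fields of each record are not documented here, so it only reports record counts and the keys of the first case:

```python
import json
from pathlib import Path

# Test files shipped in the data/ directory (see the list above).
test_files = [
    "data/synthesize_product_test.jsonl",
    "data/synthesize_shop_test.jsonl",
    "data/synthesize_voucher_test.jsonl",
    "data/synthesize_web_simpleqa_test.jsonl",
]

for path in test_files:
    with open(path, "r", encoding="utf-8") as f:
        cases = [json.loads(line) for line in f if line.strip()]
    if cases:
        print(f"{Path(path).name}: {len(cases)} cases; first-case fields: {sorted(cases[0].keys())}")
    else:
        print(f"{Path(path).name}: empty")
```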
- Install Java (JDK 21 recommended)
- Install uv
- Decompress `documents.jsonl.gz` to get `documents.jsonl` in the `resources` folder:
  ```bash
  gunzip -c resources/documents.jsonl.gz > resources/documents.jsonl
  ```
- Prepare the required API keys (a quick sanity check is sketched after this list):
  ```bash
  export OPENAI_API_KEY="your openai api key"
  export OPENAI_BASE_URL="your openai base url"
  export SERPER_KEY="your serper web search key"
  ```
- Run the initialization script to set up the Python environment and start the product search engine:
  ```bash
  ./init_env.sh
  ```
  After running the environment setup script, the search engine will be started automatically in the background.
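Before launching inference, it can help to confirm the keys above are actually exported. This is a minimal, hypothetical check (the variable names come from the setup steps above; nothing else is assumed about the scripts):

```python
import os
import sys

# Keys required by the setup steps above.
REQUIRED_VARS = ["OPENAI_API_KEY", "OPENAI_BASE_URL", "SERPER_KEY"]

missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
    sys.exit(1)
print("All required environment variables are set.")
```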
To run model inference on test data and evaluate the models for different intents:
- Run the inference scripts (taking gpt-4.1 as an example). The script automatically creates the necessary directories and validates the data folder structure before running model inference and evaluation:
  ```bash
  ./run.sh product rollout gpt-4.1
  ./run.sh shop rollout gpt-4.1
  ./run.sh voucher rollout gpt-4.1
  ./run.sh web simpleqa_rollout gpt-4.1
  ```
  The inference process runs in the background; you can check the logs in the `logs` folder (see the log-monitoring sketch below). You can uncomment the specific line in `run.sh` to evaluate the inference results, or kill the inference process.
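Since rollouts run in the background, a small helper can be handy for watching the most recent log. A minimal sketch, assuming logs are plain-text files written under `logs/` (the file names are whatever `run.sh` chooses):

```python
from pathlib import Path

# Print the tail of the most recently modified log file under logs/.
log_dir = Path("logs")
log_files = sorted(log_dir.glob("*"), key=lambda p: p.stat().st_mtime)

if not log_files:
    print("No log files found in logs/")
else:
    latest = log_files[-1]
    lines = latest.read_text(errors="replace").splitlines()
    print(f"=== {latest} (last 20 lines) ===")
    for line in lines[-20:]:
        print(line)
```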
- Run the evaluation scripts (taking gpt-4.1 as an example). Update `run.sh` by uncommenting the line for `run_evaluate.py` and commenting out the line for `run_rollout`, then rerun the scripts:
  ```bash
  ./run.sh product rollout gpt-4.1
  ./run.sh shop rollout gpt-4.1
  ./run.sh voucher rollout gpt-4.1
  ./run.sh web simpleqa_rollout gpt-4.1
  ```
Install the SFT environment and its dependencies (LLaMA-Factory):

```bash
cd src/sft/LLaMA-Factory
uv pip install -e ".[torch,metrics,deepspeed]" --no-build-isolation
```

Install the RL environment and its dependencies:

```bash
cd src/rl
USE_MEGATRON=0 bash install_vllm_sglang_mcore.sh
# verl
uv pip install -e .
```

To train the SFT and RL models:
For SFT training:

- Prepare the SFT data (an example `dataset_info.json` entry is sketched after this list):
  ```bash
  cd src/sft
  mkdir data
  ```
  Place `dataset_info.json` and the training data in the `data` directory.
- Run the SFT scripts:
  ```bash
  ./submit.sh yaml_file
  ```

For RL training:

- Run the RL scripts:
  ```bash
  cd src/rl
  ./run_grpo.sh
  ```
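LLaMA-Factory locates training data through `data/dataset_info.json`. The actual ShoppingBench SFT schema is not shown here, so the snippet below only writes a generic, hypothetical entry in LLaMA-Factory's alpaca-style format; the dataset name, file name, and column mapping are placeholders to illustrate the expected shape:

```python
import json
from pathlib import Path

# Hypothetical dataset_info.json entry (alpaca-style column mapping).
# The dataset name, file name, and columns below are placeholders, not
# the actual ShoppingBench SFT data schema.
dataset_info = {
    "shoppingbench_sft": {
        "file_name": "shoppingbench_sft.json",
        "columns": {
            "prompt": "instruction",
            "query": "input",
            "response": "output",
        },
    }
}

Path("data").mkdir(exist_ok=True)
Path("data/dataset_info.json").write_text(json.dumps(dataset_info, indent=2))
print("Wrote data/dataset_info.json")
```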
For more details about ShoppingBench, please refer to our paper.
