ShoppingBench is a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products.
- Various real-world shopping intents
- A large-scale shopping sandbox
- Comprehensive evaluation metrics
The ShoppingBench dataset includes:
- `documents.jsonl.gz`: A compressed file containing product documents (located in the `resources/` directory)
  - Size: ~1.4GB compressed, ~4.8GB uncompressed
  - To decompress:
    ```bash
    gunzip -c resources/documents.jsonl.gz > resources/documents.jsonl
    ```
- Test files: Located in the `data/` directory (a small loading sketch follows this list)
  - `synthesize_product_test.jsonl`: Product Intent test cases
  - `synthesize_shop_test.jsonl`: Shop Intent test cases
  - `synthesize_voucher_test.jsonl`: Voucher Intent test cases
  - `synthesize_web_simpleqa_test.jsonl`: Web search Intent test cases
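The test files are JSON Lines, with one test case per line. Below is a minimal loading sketch; the fields of each record are not documented here, so it only reports record counts and the keys of the first case:

```python
import json
from pathlib import Path

# Test files shipped in the data/ directory (see the list above).
test_files = [
    "data/synthesize_product_test.jsonl",
    "data/synthesize_shop_test.jsonl",
    "data/synthesize_voucher_test.jsonl",
    "data/synthesize_web_simpleqa_test.jsonl",
]

for path in test_files:
    with open(path, "r", encoding="utf-8") as f:
        cases = [json.loads(line) for line in f if line.strip()]
    if cases:
        print(f"{Path(path).name}: {len(cases)} cases; first-case fields: {sorted(cases[0].keys())}")
    else:
        print(f"{Path(path).name}: empty")
```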
- Install Java (JDK 21 recommended)
- Install uv
- Decompress `documents.jsonl.gz` to get `documents.jsonl` in the `resources` folder:
  ```bash
  gunzip -c resources/documents.jsonl.gz > resources/documents.jsonl
  ```
- Prepare the required API keys (a quick sanity check is sketched after this list):
  ```bash
  export OPENAI_API_KEY="your openai api key"
  export OPENAI_BASE_URL="your openai base url"
  export SERPER_KEY="your serper web search key"
  ```
- Run the initialization script to set up the Python environment and start the product search engine:
  ```bash
  ./init_env.sh
  ```
  After running the environment setup script, the search engine will be started automatically in the background.
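Before launching inference, it can help to confirm the keys above are actually exported. This is a minimal, hypothetical check (the variable names come from the setup steps above; nothing else is assumed about the scripts):

```python
import os
import sys

# Keys required by the setup steps above.
REQUIRED_VARS = ["OPENAI_API_KEY", "OPENAI_BASE_URL", "SERPER_KEY"]

missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
    sys.exit(1)
print("All required environment variables are set.")
```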
To run model inference on test data and evaluate the models for different intents:
- Run the inference scripts (taking gpt-4.1 as an example). The script automatically creates the necessary directories and validates the data folder structure before running model inference and evaluation:
  ```bash
  ./run.sh product rollout gpt-4.1
  ./run.sh shop rollout gpt-4.1
  ./run.sh voucher rollout gpt-4.1
  ./run.sh web simpleqa_rollout gpt-4.1
  ```
  The inference process runs in the background; you can check the logs in the `logs` folder (see the log-monitoring sketch below). You can uncomment the specific line in `run.sh` to evaluate the inference results, or kill the inference process.
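Since rollouts run in the background, a small helper can be handy for watching the most recent log. A minimal sketch, assuming logs are plain-text files written under `logs/` (the file names are whatever `run.sh` chooses):

```python
from pathlib import Path

# Print the tail of the most recently modified log file under logs/.
log_dir = Path("logs")
log_files = sorted(log_dir.glob("*"), key=lambda p: p.stat().st_mtime)

if not log_files:
    print("No log files found in logs/")
else:
    latest = log_files[-1]
    lines = latest.read_text(errors="replace").splitlines()
    print(f"=== {latest} (last 20 lines) ===")
    for line in lines[-20:]:
        print(line)
```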
- Run the evaluation scripts (taking gpt-4.1 as an example). Update `run.sh` by uncommenting the line for `run_evaluate.py` and commenting out the line for `run_rollout`, then rerun the scripts:
  ```bash
  ./run.sh product rollout gpt-4.1
  ./run.sh shop rollout gpt-4.1
  ./run.sh voucher rollout gpt-4.1
  ./run.sh web simpleqa_rollout gpt-4.1
  ```
Install the SFT environment and its dependencies (LLaMA-Factory):

```bash
cd src/sft/LLaMA-Factory
uv pip install -e ".[torch,metrics,deepspeed]" --no-build-isolation
```

Install the RL environment and its dependencies:

```bash
cd src/rl
USE_MEGATRON=0 bash install_vllm_sglang_mcore.sh
# verl
uv pip install -e .
```

To train the SFT and RL models:
For SFT training:

- Prepare the SFT data (an example `dataset_info.json` entry is sketched after this list):
  ```bash
  cd src/sft
  mkdir data
  ```
  Place `dataset_info.json` and the training data in the `data` directory.
- Run the SFT scripts:
  ```bash
  ./submit.sh yaml_file
  ```

For RL training:

- Run the RL scripts:
  ```bash
  cd src/rl
  ./run_grpo.sh
  ```
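LLaMA-Factory locates training data through `data/dataset_info.json`. The actual ShoppingBench SFT schema is not shown here, so the snippet below only writes a generic, hypothetical entry in LLaMA-Factory's alpaca-style format; the dataset name, file name, and column mapping are placeholders to illustrate the expected shape:

```python
import json
from pathlib import Path

# Hypothetical dataset_info.json entry (alpaca-style column mapping).
# The dataset name, file name, and columns below are placeholders, not
# the actual ShoppingBench SFT data schema.
dataset_info = {
    "shoppingbench_sft": {
        "file_name": "shoppingbench_sft.json",
        "columns": {
            "prompt": "instruction",
            "query": "input",
            "response": "output",
        },
    }
}

Path("data").mkdir(exist_ok=True)
Path("data/dataset_info.json").write_text(json.dumps(dataset_info, indent=2))
print("Wrote data/dataset_info.json")
```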
For more details about ShoppingBench, please refer to our paper.
