Minimal runner for executing ai-agentic-retailing-benchmark test scenarios against multiple AI platforms and saving a timestamped report.
- Python 3.9+ (standard library only; no external dependencies for REST platforms)
- Optional for GEMINI:
pip install -q -U google-genai
Run all scenarios in the default dataset setting and write a report:
python - <<'PY'
from test_runner import run_tests
run_tests()
PY
Or use the CLI:
python main.py --setting retailing-benchmark --env .env
CLI parameters:
- --setting: Dataset setting name that selects a bundle of inputs (tests XLSX, ground truth, scoring prompt). Default: retailing-benchmark.
- --env: Path to the env file with platform credentials. Default: .env.
- --platform: Optional comma-separated platform id(s) to run (e.g. GEMINI or GEMINI,CLAUDE). Case-insensitive.
- --exclude-platform: Optional comma-separated platform ids to skip (e.g. GEMINI,CLAUDE).
- --scenario-start: Optional scenario_id to start from (inclusive).
- --scenario-end: Optional scenario_id to stop at (inclusive).
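The parameters above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the actual contents of main.py: the helper names `build_parser` and `split_ids` are hypothetical, and treating scenario ids as integers is an assumption based on the numeric examples in this README.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Run benchmark scenarios")
    parser.add_argument("--setting", default="retailing-benchmark")
    parser.add_argument("--env", default=".env")
    parser.add_argument("--platform", default=None)
    parser.add_argument("--exclude-platform", default=None)
    # Assumption: scenario ids are integers, as in the examples (10, 20).
    parser.add_argument("--scenario-start", type=int, default=None)
    parser.add_argument("--scenario-end", type=int, default=None)
    return parser


def split_ids(value):
    """Normalize a comma-separated id string to upper-case ids; None gives []."""
    if not value:
        return []
    return [part.strip().upper() for part in value.split(",") if part.strip()]


args = build_parser().parse_args(["--platform", "gemini,Claude", "--scenario-start", "10"])
print(split_ids(args.platform), args.scenario_start, args.setting)
```

Upper-casing the ids before comparison is one simple way to get the case-insensitive matching described for --platform.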
Run only tests for a specific platform_id (case-insensitive):
python main.py --setting retailing-benchmark --env .env --platform GEMINI
Run only a range of scenarios (inclusive):
python main.py --setting retailing-benchmark --env .env --scenario-start 10 --scenario-end 20
By default, reports are written to reports/ with a timestamped filename like:
reports/test_report_20250101_120000.xlsx
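A filename in this format can be produced with `datetime.strftime`. The sketch below only illustrates the naming pattern; the function name `report_path` is hypothetical and the real reporter may build the path differently.

```python
from datetime import datetime


def report_path(now=None, directory="reports"):
    """Build a report path like reports/test_report_YYYYMMDD_HHMMSS.xlsx."""
    now = now or datetime.now()
    return f"{directory}/test_report_{now.strftime('%Y%m%d_%H%M%S')}.xlsx"


# A fixed timestamp reproduces the example filename above.
print(report_path(datetime(2025, 1, 1, 12, 0, 0)))
```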
Build the image:
docker build -t ai-agentic-retailing-benchmark .
Run with auto-start (default CMD runs main.py):
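docker run --rm -v "$PWD/.env:/app/.env" ai-agentic-retailing-benchmark
Without a mounted reports directory, reports stay inside the container. Inspect them while the container is still running (with --rm, they are discarded when the container exits):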
docker ps
docker exec -it <container_id> /bin/bash
ls /app/reports
Store reports on the host by mounting the reports directory:
mkdir -p reports
docker run --rm -v "$PWD/.env:/app/.env" -v "$PWD/reports:/app/reports" ai-agentic-retailing-benchmark
Override the default command (example: run a single platform):
docker run --rm -v "$PWD/.env:/app/.env" ai-agentic-retailing-benchmark python main.py --setting retailing-benchmark --env .env --platform GEMINI
- main.py: CLI entrypoint for running tests.
- test_runner.py: Test runner logic (grouping, execution, scoring, reporting).
- platform_clients.py: Platform-specific API calls for each model/provider.
- config.py: Loads env values and platform configuration.
- input_loader/: Input loading utilities.
- input_loader/test_loader.py: XLSX test case loader.
- input_loader/product_ground_truth_loader.py: Product ground truth loader.
- reporter/: Reporting package.
- reporter/reporting.py: Report assembly and XLSX writing orchestration.
- reporter/report_xlsx.py: Low-level XLSX writer.
- retailing-benchmark/: Test inputs and prompts for the retailing benchmark setting.
- retailing-benchmark/shopping_paper_tests.xlsx: Test scenarios and steps.
- retailing-benchmark/product_ground_truth.xlsx: Product ground truth data.
- retailing-benchmark/scoring_prompt.txt: Scoring prompt template.
- reports/: Output reports (timestamped XLSX files).
- results/: Benchmark artifacts (paper + scored XLSX) for 100 multi-step scenarios across common models.
- .env: Platform credentials (not committed).
Settings let you switch between different input bundles by changing only the --setting value. The mapping
lives in main.py under DATASET_CONFIGS.
To add a new setting:
- Create a new folder with your inputs (tests XLSX, optional ground truth XLSX, optional scoring prompt).
- Add a new entry in DATASET_CONFIGS with the three file paths.
- Run with --setting your-setting-name.
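As an illustration of the steps above, a new entry might look like the sketch below. The dictionary keys (`tests`, `ground_truth`, `scoring_prompt`), the file names under `your-setting-name/`, and the helper `scoring_enabled` are assumptions made for this example; check DATASET_CONFIGS in main.py for the actual structure.

```python
import os

# Hypothetical shape of the mapping: one entry per setting, three file paths each.
DATASET_CONFIGS = {
    "retailing-benchmark": {
        "tests": "retailing-benchmark/shopping_paper_tests.xlsx",
        "ground_truth": "retailing-benchmark/product_ground_truth.xlsx",
        "scoring_prompt": "retailing-benchmark/scoring_prompt.txt",
    },
    # New setting: the folder and file names here are placeholders.
    "your-setting-name": {
        "tests": "your-setting-name/tests.xlsx",
        "ground_truth": "your-setting-name/ground_truth.xlsx",
        "scoring_prompt": "your-setting-name/scoring_prompt.txt",
    },
}


def scoring_enabled(config):
    """Return True only if both optional scoring inputs are configured and exist."""
    return all(
        bool(config.get(key)) and os.path.isfile(config[key])
        for key in ("ground_truth", "scoring_prompt")
    )


print(sorted(DATASET_CONFIGS))
```

Because the ground truth and scoring prompt are optional, a check like `scoring_enabled` is one way to decide whether a run should include scoring.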
- Scoring is skipped if the scoring prompt or ground truth file is missing for the selected setting.
- API or model call failures are captured in the comments column for the affected test step.
- Scenarios are grouped by scenario_id and platform_id, and steps are executed in step_index order.
- The report preserves the input XLSX columns and fills full_model_response and text_model_response with the latest run outputs.
- Reports are updated after each step to preserve partial progress if a run fails.
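The grouping and ordering rule described in these notes can be sketched as follows. This is an illustrative version only, assuming each test row is a dict with `scenario_id`, `platform_id`, and `step_index` keys; the function name `group_steps` is hypothetical and the logic in test_runner.py may differ.

```python
from collections import defaultdict


def group_steps(rows):
    """Group rows by (scenario_id, platform_id) and sort each group by step_index."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["scenario_id"], row["platform_id"])].append(row)
    for steps in groups.values():
        steps.sort(key=lambda row: row["step_index"])
    return dict(groups)


rows = [
    {"scenario_id": 1, "platform_id": "GEMINI", "step_index": 2},
    {"scenario_id": 1, "platform_id": "CLAUDE", "step_index": 1},
    {"scenario_id": 1, "platform_id": "GEMINI", "step_index": 1},
]
for key, steps in group_steps(rows).items():
    print(key, [step["step_index"] for step in steps])
```

Each group can then be run as one multi-step conversation per platform, with the report rewritten after every step.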
The .env file is not committed. Create one locally with per-platform credentials:
{MODEL}_BASE_URL={url}
{MODEL}_API_KEY={api-key}
{MODEL}_MODEL={model_version}
# SmarterSorting enrichment (required when using enrichment features)
SMARTERSORTING_API_KEY={api-key}
SMARTERSORTING_URL={full-url-to-enrich-endpoint}
SMARTERSORTING_BASE_URL={base-url-for-enrichment-api}
Supported models: CHATGPT, PERPLEX, CLAUDE, GEMINI, COPILOT.
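For illustration, the variables above could be read with the standard library alone. This is a minimal sketch, not the actual config.py: the function names `parse_env` and `platform_settings` are hypothetical, and the sample values are placeholders rather than real endpoints or model names.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def platform_settings(values, platform):
    """Collect the three per-platform entries; missing ones are None."""
    prefix = platform.upper()
    return {
        "base_url": values.get(f"{prefix}_BASE_URL"),
        "api_key": values.get(f"{prefix}_API_KEY"),
        "model": values.get(f"{prefix}_MODEL"),
    }


# Placeholder values for demonstration only.
sample = """
# example entries
GEMINI_BASE_URL=https://example.invalid/v1
GEMINI_API_KEY=placeholder-key
GEMINI_MODEL=placeholder-model
"""
print(platform_settings(parse_env(sample), "gemini"))
```

To load a real file, pass its contents to `parse_env`, for example the text read from the path given by --env.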