ai-agentic-retailing-benchmark

Minimal runner for executing ai-agentic-retailing-benchmark test scenarios against multiple AI platforms and saving a timestamped report.

Requirements

Python 3.9+ (standard library only; no external dependencies for REST platforms)
Optional for GEMINI: pip install -q -U google-genai

Running Tests

Run all scenarios in the default dataset setting and write a report:

python - <<'PY'
from test_runner import run_tests

run_tests()
PY

Or use the CLI:

python main.py --setting retailing-benchmark --env .env

CLI parameters:

--setting: Dataset setting name that selects a bundle of inputs (tests XLSX, ground truth, scoring prompt). Default: retailing-benchmark.
--env: Path to the env file with platform credentials. Default: .env.
--platform: Optional comma-separated platform id(s) to run (e.g. GEMINI or GEMINI,CLAUDE). Case-insensitive.
--exclude-platform: Optional comma-separated platform ids to skip (e.g. GEMINI,CLAUDE).
--scenario-start: Optional scenario_id to start from (inclusive).
--scenario-end: Optional scenario_id to stop at (inclusive).
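
The flags above map naturally onto argparse. The following is a minimal sketch of such a parser, not the actual contents of main.py: the helper names build_parser and parse_platforms are hypothetical, and treating scenario ids as integers is an assumption based on the numeric values used in the range example.

```python
import argparse


def parse_platforms(value):
    """Split a comma-separated platform list and normalise case (hypothetical helper)."""
    return [p.strip().upper() for p in value.split(",") if p.strip()]


def build_parser():
    """Build a parser mirroring the documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Run benchmark scenarios")
    parser.add_argument("--setting", default="retailing-benchmark")
    parser.add_argument("--env", default=".env")
    parser.add_argument("--platform", type=parse_platforms, default=None)
    parser.add_argument("--exclude-platform", type=parse_platforms, default=None)
    # Assumption: scenario ids are integers.
    parser.add_argument("--scenario-start", type=int, default=None)
    parser.add_argument("--scenario-end", type=int, default=None)
    return parser


args = build_parser().parse_args(["--platform", "gemini,Claude", "--scenario-start", "10"])
print(args.platform, args.scenario_start, args.setting)
```

Normalising platform ids to upper case at parse time is one simple way to get the case-insensitive matching described above.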

Run only tests for a specific platform_id (case-insensitive):

python main.py --setting retailing-benchmark --env .env --platform GEMINI

Run only a range of scenarios (inclusive):

python main.py --setting retailing-benchmark --env .env --scenario-start 10 --scenario-end 20

By default, reports are written to reports/ with a timestamped filename like: reports/test_report_20250101_120000.xlsx
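
A filename of that shape can be produced with strftime. This is a small sketch under the assumption that the timestamp is local time at the start of the run; report_path is a hypothetical helper, not a function confirmed to exist in the reporter package.

```python
from datetime import datetime
from pathlib import Path


def report_path(reports_dir="reports", now=None):
    """Return a timestamped report path (hypothetical helper)."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(reports_dir) / f"test_report_{stamp}.xlsx"


# Fixed timestamp so the result is reproducible.
print(report_path(now=datetime(2025, 1, 1, 12, 0, 0)).as_posix())
# prints "reports/test_report_20250101_120000.xlsx"
```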

Docker

Build the image:

docker build -t ai-agentic-retailing-benchmark .

Run with auto-start (default CMD runs main.py):

docker run --rm -v "$PWD/.env:/app/.env" ai-agentic-retailing-benchmark

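Without a mounted reports directory, reports are stored inside the container. To inspect them while the container is still running (with --rm they are discarded when the container exits):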

docker ps
docker exec -it <container_id> /bin/bash
ls /app/reports

Store reports on the host by mounting the reports directory:

mkdir -p reports
docker run --rm -v "$PWD/.env:/app/.env" -v "$PWD/reports:/app/reports" ai-agentic-retailing-benchmark

Override the default command (example: run a single platform):

docker run --rm -v "$PWD/.env:/app/.env" ai-agentic-retailing-benchmark python main.py --setting retailing-benchmark --env .env --platform GEMINI

Project Structure

main.py: CLI entrypoint for running tests.
test_runner.py: Test runner logic (grouping, execution, scoring, reporting).
platform_clients.py: Platform-specific API calls for each model/provider.
config.py: Loads env values and platform configuration.
input_loader/: Input loading utilities.
input_loader/test_loader.py: XLSX test case loader.
input_loader/product_ground_truth_loader.py: Product ground truth loader.
reporter/: Reporting package.
reporter/reporting.py: Report assembly and XLSX writing orchestration.
reporter/report_xlsx.py: Low-level XLSX writer.
retailing-benchmark/: Test inputs and prompts for the retailing benchmark setting.
retailing-benchmark/shopping_paper_tests.xlsx: Test scenarios and steps.
retailing-benchmark/product_ground_truth.xlsx: Product ground truth data.
retailing-benchmark/scoring_prompt.txt: Scoring prompt template.
reports/: Output reports (timestamped XLSX files).
results/: Benchmark artifacts (paper + scored XLSX) for 100 multi-step scenarios across common models.
.env: Platform credentials (not committed).

Dataset Settings

Settings let you switch between input bundles with a single --setting flag instead of passing each file path separately. The mapping lives in main.py under DATASET_CONFIGS.

To add a new setting:

Create a new folder with your inputs (tests XLSX, optional ground truth XLSX, optional scoring prompt).
Add a new entry in DATASET_CONFIGS with the file paths for that folder (tests XLSX, ground truth, scoring prompt).
Run with --setting your-setting-name.
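
The steps above might look like the following sketch. The shape of DATASET_CONFIGS is an assumption: the key names tests, ground_truth, and scoring_prompt, the my-setting entry, and the scoring_enabled helper are all hypothetical, so check main.py for the real structure before copying this.

```python
from pathlib import Path

# Assumption: each setting maps to three paths; the key names are hypothetical.
DATASET_CONFIGS = {
    "retailing-benchmark": {
        "tests": "retailing-benchmark/shopping_paper_tests.xlsx",
        "ground_truth": "retailing-benchmark/product_ground_truth.xlsx",
        "scoring_prompt": "retailing-benchmark/scoring_prompt.txt",
    },
    # Hypothetical new setting that ships only a tests file.
    "my-setting": {
        "tests": "my-setting/tests.xlsx",
        "ground_truth": None,
        "scoring_prompt": None,
    },
}


def scoring_enabled(setting):
    """Scoring needs both the ground truth and the scoring prompt to be present."""
    cfg = DATASET_CONFIGS[setting]
    paths = (cfg["ground_truth"], cfg["scoring_prompt"])
    return all(p is not None and Path(p).is_file() for p in paths)


print(scoring_enabled("my-setting"))  # prints "False": both optional inputs are absent
```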

Notes

Scoring is skipped if the scoring prompt or ground truth file is missing for the selected setting.
API or model call failures are captured in the comments column for the affected test step.
Scenarios are grouped by scenario_id and platform_id, and steps are executed in step_index order.
The report preserves the input XLSX columns and fills full_model_response and text_model_response with the latest run outputs.
Reports are updated after each step to preserve partial progress if a run fails.
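
The grouping and error-capture behavior described in these notes can be illustrated with a short sketch. It is not the code in test_runner.py: group_steps, run_group, flaky_model, and the prompt column are hypothetical, and continuing with later steps after a failure is an assumption.

```python
from collections import defaultdict


def group_steps(rows):
    """Group rows by (scenario_id, platform_id) and order each group by step_index."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["scenario_id"], row["platform_id"].upper())].append(row)
    for steps in groups.values():
        steps.sort(key=lambda r: r["step_index"])
    return dict(groups)


def run_group(steps, call_model):
    """Run steps in order and record any failure in the comments column."""
    for step in steps:
        try:
            step["text_model_response"] = call_model(step["prompt"])
        except Exception as exc:
            # Assumption: later steps still run after a failure.
            step["comments"] = f"model call failed: {exc}"


def flaky_model(prompt):
    """Stand-in for a real platform client; fails on one prompt."""
    if prompt == "b":
        raise RuntimeError("timeout")
    return prompt.upper()


rows = [
    {"scenario_id": 1, "platform_id": "GEMINI", "step_index": 2, "prompt": "b"},
    {"scenario_id": 1, "platform_id": "CLAUDE", "step_index": 1, "prompt": "c"},
    {"scenario_id": 1, "platform_id": "GEMINI", "step_index": 1, "prompt": "a"},
]
groups = group_steps(rows)
for steps in groups.values():
    run_group(steps, flaky_model)
for key, steps in groups.items():
    print(key, [(s["step_index"], s.get("text_model_response"), s.get("comments")) for s in steps])
```

In a real run, the report would be rewritten after each step inside run_group so that partial results survive a crash.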

.env Configuration

The .env file is not committed. Create one locally with per-platform credentials:

{MODEL}_BASE_URL={url}
{MODEL}_API_KEY={api-key}
{MODEL}_MODEL={model_version}

# SmarterSorting enrichment (required when using enrichment features)
SMARTERSORTING_API_KEY={api-key}
SMARTERSORTING_URL={full-url-to-enrich-endpoint}
SMARTERSORTING_BASE_URL={base-url-for-enrichment-api}

Supported values for the {MODEL} prefix: CHATGPT, PERPLEX, CLAUDE, GEMINI, COPILOT.
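
Because the runner uses only the standard library, a file in this format can be read with a few lines of Python. The sketch below is an illustration of the naming scheme, not the logic in config.py: parse_env and platform_config are hypothetical helpers, and the sample values are placeholders.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments (hypothetical helper)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def platform_config(values, platform):
    """Collect the three per-platform settings by prefix (hypothetical helper)."""
    prefix = platform.upper()
    keys = ("BASE_URL", "API_KEY", "MODEL")
    return {k.lower(): values.get(f"{prefix}_{k}") for k in keys}


# Placeholder values for illustration only.
sample = """
# example values only
GEMINI_BASE_URL=https://example.invalid/v1
GEMINI_API_KEY=dummy-key
GEMINI_MODEL=example-model
"""
cfg = platform_config(parse_env(sample), "gemini")
print(cfg["model"])  # prints "example-model"
```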
