Use the implemented evaluation pipeline to run zero-shot prompts on the selected model set using the prepared question set.
This issue is about execution, not infrastructure design.
The run should specify:
- which models are included
- which question set version is used
- which prompt template is used
- where outputs are stored
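The four fields above could be captured in a small run manifest saved next to the outputs, so each baseline run is reproducible. This is only a sketch; every field name and value below is hypothetical, not the pipeline's actual schema:

```python
import json

# Hypothetical run manifest -- model names, version string, template name,
# and output path are all illustrative placeholders.
run_manifest = {
    "models": ["model-a", "model-b"],         # which models are included
    "question_set_version": "v1",             # which question set version is used
    "prompt_template": "zero_shot_default",   # which prompt template is used
    "output_dir": "runs/baseline-001/",       # where outputs are stored
}

def save_manifest(manifest: dict, path: str) -> None:
    """Write the run manifest to disk so the run can be reproduced later."""
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)
```

Storing the manifest in the same directory as the outputs keeps the run self-describing for later scoring and analysis.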
The goal is to produce the first baseline dataset for later scoring and analysis.
Acceptance criteria
- The selected models for this run are listed.
- The question set version used for the run is identified.
- Zero-shot prompts are executed on all selected models or a clearly stated subset.
- Outputs are stored in the agreed structured format.
- Any failed requests are preserved in the logs rather than silently dropped.
- A short result summary is added to the issue or repository notes.