Releases: SpringCare/VERA-MH
VERA-MH v1.1.1
VERA-MH v1.1.1 updates the recommended judge to GPT 5.4, supports latest models like Claude Opus 4.8, and adds tooling to turn judge output into more actionable improvement reports. The core scoring formula is unchanged from v1.1.
While this release includes incremental judge improvements, further work is underway to strengthen alignment with human clinician evaluations.
Highlights
- GPT 5.4 is now the recommended judge model (replacing the v1.1 dual-judge setup of GPT-4o + Claude Sonnet 4.5). Internal inter-rater reliability (IRR) research found GPT 5.4 most aligned with human clinician ratings on a set of 50 newly simulated conversations.
- Improvement reports: new `scripts/summarize_results.py` turns `results.csv` into a Markdown breakdown of where an AI provider failed rubric standards, grouped by dimension and question.
- More stable Gemini judging via `json_schema` structured outputs, with better failure diagnostics in judge logs.
- Improved logging for judging: judge logs now contain metadata such as token usage.
v1.1.1 Scores
Recommended evaluation profile
| Setting | v1.1.1 value |
|---|---|
| Personas | All 100 rows in data/personas.tsv |
| User agents | GPT 5.2 + Claude Opus 4.5 (100 conversations each) |
| Turns | 30 per conversation |
| Judge | GPT 5.4 (gpt-5.4) |
Run the full automated flow:

    ./scripts/run_recommended_vera_pipeline.sh <provider-agent>

Override the judge with `VERA_JUDGE` if needed (default: `gpt-5.4`). See `scripts/run_recommended_vera_pipeline.sh` for other environment variables.
What's new
Loading judge results
- Fix: `read_results_csv()` preserves persona risk level `None` when pooling or rescoring (pandas no longer coerces the literal string `"None"` to NA).
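The pitfall behind this fix is that pandas' default missing-value markers include the string `None`, so a column in which `None` is a real category can silently lose it on read. The following is a minimal, library-free sketch of the intended behavior, not the project's `read_results_csv()` implementation; the column names `persona` and `risk_level` and the sample rows are assumptions for illustration. Only an empty cell is treated as missing, and the literal `None` is kept. In pandas, the analogous controls on `read_csv` are `keep_default_na=False` together with an explicit `na_values` list.

```python
import csv
import io

# Hypothetical sample; the real results.csv columns may differ.
SAMPLE = "persona,risk_level\nP1,None\nP2,High\nP3,\n"


def read_risk_levels(text):
    """Map persona -> risk level, treating only empty cells as missing."""
    levels = {}
    for row in csv.DictReader(io.StringIO(text)):
        value = row["risk_level"]
        # Keep the literal string "None" as a category; only "" means missing.
        levels[row["persona"]] = value if value != "" else None
    return levels


levels = read_risk_levels(SAMPLE)
```

Here `levels["P1"]` stays the string `"None"`, while the empty cell for `P3` becomes a genuine missing value.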
Improvement reports
After judging, summarize where standards were not met:

    uv run python3 scripts/summarize_results.py \
      --results output/{YOUR_P_RUN}/evaluations/{YOUR_J_RUN}/results.csv \
      --rubric data/rubric.tsv \
      --out-stats output/{YOUR_J_RUN}/improvement_stats.json \
      --out-md output/{YOUR_J_RUN}/improvement_report.md

Runtime and reliability
- LLM params: unsupported model parameters are filtered before API calls. Claude Opus 4.8 rejects `temperature`; it is stripped via `_unsupported_model_params`.
- Gemini judges: LangSmith structured output now uses `json_schema` for improved stability.
- Judge logs: judge logs now include structured-output details, from tokens used to exceptions.
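The parameter filtering described above can be pictured as a lookup of rejected keys per model, applied just before the API call. This is only a sketch under stated assumptions: the table `UNSUPPORTED_PARAMS`, the helper `filter_params`, and the model identifier string are hypothetical, and the project's `_unsupported_model_params` may be structured quite differently.

```python
# Hypothetical table of parameters each model rejects; the project's
# `_unsupported_model_params` may be structured differently.
UNSUPPORTED_PARAMS = {
    "claude-opus-4.8": {"temperature"},  # assumed model identifier
}


def filter_params(model, params):
    """Return a copy of params without keys the given model rejects."""
    blocked = UNSUPPORTED_PARAMS.get(model, set())
    return {key: value for key, value in params.items() if key not in blocked}


requested = {"temperature": 0.0, "max_tokens": 1024}
opus_params = filter_params("claude-opus-4.8", requested)
other_params = filter_params("gpt-5.4", requested)
```

Returning a filtered copy rather than mutating the input keeps the caller's original settings intact, so the same request dictionary can be reused across models.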
Automation
- `scripts/run_recommended_vera_pipeline.sh`: single GPT 5.4 judge (`VERA_JUDGE`); dual user-agent suites unchanged. Pooled output uses `j_<judge>__p_...` naming instead of `j_pooled__...`.
Migration from v1.1
- Judge model: If you used the v1.1 recommended dual-judge flow, switch to GPT 5.4 for new runs. Replace `VERA_JUDGE_A`/`VERA_JUDGE_B` with `VERA_JUDGE`.
- Score labels: New score artifacts and comparison graphics are labeled v1.1.1; the underlying formula is the same as v1.1.
- Pooling: Rescoring or pooling existing `results.csv` files benefits from the `None` risk-level read fix; no action required beyond upgrading.
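When migrating wrapper scripts or CI configuration, it can help to check for leftover v1.1 variables. The sketch below is a hypothetical helper, not part of the repository: it resolves the judge from `VERA_JUDGE`, falls back to the documented default `gpt-5.4`, and reports any legacy dual-judge variables that are still set. The names `resolve_judge`, `DEFAULT_JUDGE`, and `LEGACY_VARS` are assumptions.

```python
import os

DEFAULT_JUDGE = "gpt-5.4"  # default stated in these release notes
LEGACY_VARS = ("VERA_JUDGE_A", "VERA_JUDGE_B")  # v1.1 dual-judge variables


def resolve_judge(env):
    """Return (judge, legacy_vars_still_set) for an environment mapping."""
    judge = env.get("VERA_JUDGE") or DEFAULT_JUDGE
    leftovers = [name for name in LEGACY_VARS if env.get(name)]
    return judge, leftovers


judge, leftovers = resolve_judge(os.environ)
if leftovers:
    print(f"Legacy variables still set: {', '.join(leftovers)}; use VERA_JUDGE instead.")
```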
Full changelog
See CHANGELOG.md for the complete list of changes.
VERA-MH v1.1.0
VERA-MH v1.1.0 — Release notes
VERA-MH helps teams simulate and evaluate their chatbot's mental-health conversations against a safety rubric. Version 1.1 expands the set of personas to simulate, refines how we score safety and care, and makes large evaluation runs easier to run, resume, and audit.
Latest Scores
What’s new
Richer, more realistic simulations
We increased the persona library from 10 to 100 personas, spanning a wider range of situations and risk levels. We also updated how the simulated user is instructed to behave, including a clearer emphasis on staying in the role of a real person seeking help, not a counselor, so conversations better reflect how people actually show up in a chat.
Clearer safety scoring
The evaluation rubric was updated based on feedback from many external stakeholders and clinicians.
In practice, examples of this include:
- Guides to human care scores consider context more carefully—for example, whether the person is already connected to crisis support, and whether the person is experiencing suicidal urges during the conversation.
- High potential for harm is distinguished more clearly from suboptimal responses (for example, omitting crisis resource information as high potential for harm versus not fully addressing barriers to using resources as suboptimal).
- Dimensions are less coupled: a serious miss in one area no longer automatically forces a miss of the same severity in another related dimension.
Because of the updates to the rubric and personas, aggregate scores are not directly comparable to v1.0. On average, general models may score slightly higher (typically on the order of a few points) than on the previous version.
More reliable long runs
Calls to AI solutions now use retries and timeouts by default (up to three retries, with a short wait between attempts). If a single conversation or evaluation fails, the run can continue instead of stopping the whole batch. You can also resume an interrupted simulation or judging run so that you do not redo work that has already finished.
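The behavior described above combines three ideas: bounded retries, continuing past a failed job, and skipping finished work on resume. The following is a minimal sketch of that pattern, not VERA-MH's actual code. The function names, the one-second wait, and the in-memory `done` mapping are assumptions, and per-call timeouts are omitted for brevity.

```python
import time


def call_with_retries(fn, max_retries=3, wait_seconds=1.0, sleep=time.sleep):
    """Call fn(); on failure retry up to max_retries times, then re-raise."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(wait_seconds)  # short wait between attempts (assumed value)


def run_batch(jobs, done=None, **retry_kwargs):
    """Run named jobs, skipping finished ones and recording failures."""
    results = dict(done or {})
    failures = {}
    for name, fn in jobs.items():
        if name in results:  # resume: already finished in an earlier run
            continue
        try:
            results[name] = call_with_retries(fn, **retry_kwargs)
        except Exception as exc:
            failures[name] = exc  # keep going instead of stopping the batch
    return results, failures
```

Passing the results of an earlier, interrupted run as `done` means only the missing jobs are executed, while `failures` collects the jobs that exhausted their retries so they can be inspected or rerun later.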
Clearer outputs for review
Judge activity is logged in a predictable, per-job layout (one log per conversation, judge model, and evaluation instance). The README describes a default layout that keeps each generation run’s transcripts, nested evaluations/ batches, scores, and related artifacts under a single parent (typically output/ with timestamped p_* run folders) so outputs are easier to find, resume, and share.
Reliable VERA Scoring
The README now describes recommended settings for a stable, comparable headline score (2 conversations for each of the 100 personas, dual user-agent simulators, 30 turns, dual judges, and optional pooling). It also includes a helper script, `scripts/run_recommended_vera_pipeline.sh`, that runs conversations, evaluations, and scoring, and produces the overall VERA score for the provided chatbot.
Where to learn more
- Getting started and recommended settings: repository README
- Custom providers (private APIs or unsupported models): README section Connecting your own LLM, Agent, or API and docs/evaluating.md
- Research context: links in the README Additional Resources section (including the reliability/validity preprint)
- General VERA-MH information: visit the website at VERA-MH.com
For a concise, technical list of changes, see CHANGELOG.md in the repository root.