
Releases: SpringCare/VERA-MH

VERA-MH v1.1.1

16 Jun 22:48
d4d1934

VERA-MH v1.1.1 updates the recommended judge to GPT 5.4, supports latest models like Claude Opus 4.8, and adds tooling to turn judge output into more actionable improvement reports. The core scoring formula is unchanged from v1.1.

While this release includes incremental judge improvements, further work is underway to strengthen alignment with human clinician evaluations.

Highlights

  • GPT 5.4 is now the recommended judge model (replacing the v1.1 dual-judge setup of GPT-4o + Claude Sonnet 4.5). Internal inter-rater reliability (IRR) research found GPT 5.4 most aligned with human clinician ratings on a set of 50 newly simulated conversations.
  • Improvement reports — new scripts/summarize_results.py turns results.csv into a Markdown breakdown of where an AI provider failed rubric standards, grouped by dimension and question.
  • More stable Gemini judging via json_schema structured outputs, with better failure diagnostics in judge logs.
  • Improved judge logging: judge logs now contain metadata such as token usage.

v1.1.1 Scores

[Figure: v1.1.1 VERA scores chart (image: vera_scores_v1 1_gpt5 4j_20260611_110001)]

Recommended evaluation profile

Setting      v1.1.1 value
Personas     All 100 rows in data/personas.tsv
User agents  GPT 5.2 + Claude Opus 4.5 (100 conversations each)
Turns        30 per conversation
Judge        GPT 5.4 (gpt-5.4)

Run the full automated flow:

./scripts/run_recommended_vera_pipeline.sh <provider-agent>

Override the judge with VERA_JUDGE if needed (default: gpt-5.4). See scripts/run_recommended_vera_pipeline.sh for other environment variables.

What's new

Loading judge results

  • Fix: read_results_csv() preserves persona risk level None when pooling or rescoring (pandas no longer coerces the literal string "None" to NA).
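
The coercion issue behind this fix can be reproduced in isolation. The following is a minimal sketch, not the actual read_results_csv() implementation: the column names (persona, risk_level) and the helper name are hypothetical, and the real fix may use different pandas options. It assumes pandas is installed.

```python
import io

import pandas as pd

# Hypothetical two-column extract; the real results.csv has a different schema.
CSV_TEXT = "persona,risk_level\nA,None\nB,High\nC,\n"


def read_preserving_none(buffer):
    """Read a CSV so the literal string "None" survives as text.

    keep_default_na=False switches off pandas' built-in list of NA tokens;
    na_values=[""] keeps treating genuinely empty cells as missing.
    """
    return pd.read_csv(buffer, keep_default_na=False, na_values=[""])


df = read_preserving_none(io.StringIO(CSV_TEXT))
print(df)
```

With these options, the risk level "None" stays a normal category value that can be grouped and pooled, while the empty cell is still reported as missing.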

Improvement reports

After judging, summarize where standards were not met:

uv run python3 scripts/summarize_results.py \
  --results output/{YOUR_P_RUN}/evaluations/{YOUR_J_RUN}/results.csv \
  --rubric data/rubric.tsv \
  --out-stats output/{YOUR_J_RUN}/improvement_stats.json \
  --out-md output/{YOUR_J_RUN}/improvement_report.md
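
To illustrate the kind of grouping such a report performs, here is a small sketch that counts unmet standards by dimension and question and renders them as a Markdown table. It is not the code in scripts/summarize_results.py: the column names (dimension, question, met), the sample rows, and both function names are hypothetical placeholders, and the real results.csv schema may differ.

```python
import csv
import io
from collections import Counter

# Toy data with HYPOTHETICAL column names; the real results.csv may differ.
SAMPLE_CSV = """dimension,question,met
dimension_a,q1,no
dimension_a,q1,no
dimension_a,q2,yes
dimension_b,q3,no
"""


def count_unmet(rows):
    """Count rows where the standard was not met, keyed by (dimension, question)."""
    counts = Counter()
    for row in rows:
        if row["met"].strip().lower() == "no":
            counts[(row["dimension"], row["question"])] += 1
    return counts


def to_markdown(counts):
    """Render the counts as a Markdown table, most frequent failures first."""
    lines = ["| Dimension | Question | Unmet |", "| --- | --- | --- |"]
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for (dimension, question), n in ordered:
        lines.append(f"| {dimension} | {question} | {n} |")
    return "\n".join(lines)


counts = count_unmet(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = to_markdown(counts)
print(report)
```

Sorting by descending count puts the most frequent failures at the top, which is one reasonable way to make a report actionable.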

Runtime and reliability

  • LLM params: unsupported model parameters are filtered before API calls. Claude Opus 4.8 rejects temperature; it is stripped via _unsupported_model_params.
  • Gemini judges: LangSmith structured output now uses json_schema for improved stability.
  • Judge logs: judge logs now include structured-output details, from tokens used to exceptions.
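
The parameter filtering described in the first bullet above can be sketched as a lookup followed by a dictionary filter. This is an illustrative assumption about the approach, not the project's _unsupported_model_params code: the table name, the function name, and the model identifier string "claude-opus-4-8" are all hypothetical.

```python
# HYPOTHETICAL lookup table: model identifier -> parameter names it rejects.
UNSUPPORTED_PARAMS = {
    "claude-opus-4-8": {"temperature"},
}


def filter_params(model, params):
    """Return a copy of params without the keys the given model rejects."""
    blocked = UNSUPPORTED_PARAMS.get(model, set())
    return {key: value for key, value in params.items() if key not in blocked}


requested = {"temperature": 0.0, "max_tokens": 1024}
filtered = filter_params("claude-opus-4-8", requested)
untouched = filter_params("some-other-model", requested)
print(filtered, untouched)
```

Filtering just before the API call keeps a single shared parameter set usable across models, at the cost of silently dropping a setting, so logging what was stripped is worth considering.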

Automation

  • scripts/run_recommended_vera_pipeline.sh — single GPT 5.4 judge (VERA_JUDGE); dual user-agent suites unchanged. Pooled output uses j_<judge>__p_... naming instead of j_pooled__....

Migration from v1.1

  1. Judge model — If you used the v1.1 recommended dual-judge flow, switch to GPT 5.4 for new runs. Replace VERA_JUDGE_A / VERA_JUDGE_B with VERA_JUDGE.
  2. Score labels — New score artifacts and comparison graphics are labeled v1.1.1; the underlying formula is the same as v1.1.
  3. Pooling — Rescoring or pooling existing results.csv files benefits from the None risk-level read fix; no action required beyond upgrading.

Full changelog

See CHANGELOG.md for the complete list of changes.

VERA-MH v1.1.0

29 Apr 20:39
dc9780e

VERA-MH v1.1.0 — Release notes

VERA-MH helps teams simulate and evaluate their chatbot's mental-health conversations against a safety rubric. Version 1.1 expands the set of personas to simulate, refines how we score safety and care, and makes large evaluation runs easier to run, resume, and audit.

Latest Scores

[Figure: latest VERA scores chart (image: scores_20260428_122725)]

What’s new

Richer, more realistic simulations

We increased the persona library from 10 to 100 personas, spanning a wider range of situations and risk levels. We also updated how the simulated user is instructed to behave, including a clearer emphasis on staying in the role of a real person seeking help, not a counselor, so conversations better reflect how people actually show up in a chat.

Clearer safety scoring

The evaluation rubric was updated based on feedback from many external stakeholders and clinicians.

In practice, examples of this include:

  • Guides to human care scores consider context more carefully—for example, whether the person is already connected to crisis support, and whether the person is experiencing suicidal urges during the conversation.
  • High potential for harm is distinguished more clearly from suboptimal responses (for example, omitting crisis resource information as high potential for harm versus not fully addressing barriers to using resources as suboptimal).
  • Dimensions are less coupled: a serious miss in one area no longer automatically forces a miss of the same severity in another related dimension.

Because the rubric and personas changed, aggregate scores are not directly comparable to v1.0. On average, models may score slightly higher (typically by a few points) than under the previous version.

More reliable long runs

Calls to AI solutions now use retries and timeouts by default (up to three retries, with a short wait between attempts). If a single conversation or evaluation fails, the run can continue instead of stopping the whole batch. Testers and developers can also resume an interrupted simulation or judging run, so work that has already finished is not redone.
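
The retry-and-continue behavior can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the project's implementation: the function names, the one-second default wait, and the injectable sleep argument are hypothetical, and per-call timeouts are left out for brevity.

```python
import time


def call_with_retries(fn, max_retries=3, wait_seconds=1.0, sleep=time.sleep):
    """Call fn(); on failure, retry up to max_retries times with a fixed wait."""
    attempts = max_retries + 1  # one initial attempt plus the retries
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # retries exhausted: let the caller record the failure
            sleep(wait_seconds)


def run_batch(jobs, **retry_kwargs):
    """Run every job; a job that keeps failing is recorded, not fatal."""
    results, failures = {}, {}
    for name, fn in jobs.items():
        try:
            results[name] = call_with_retries(fn, **retry_kwargs)
        except Exception as exc:
            failures[name] = exc
    return results, failures
```

Collecting failures separately from results lets the batch finish and leaves a clear list of jobs to rerun when the run is resumed.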

Clearer outputs for review

Judge activity is logged in a predictable, per-job layout (one log per conversation, judge model, and evaluation instance). The README describes a default layout that keeps each generation run’s transcripts, nested evaluations/ batches, scores, and related artifacts under a single parent (typically output/ with timestamped p_* run folders) so outputs are easier to find, resume, and share.

Reliable VERA Scoring

The README now describes recommended settings for a stable, comparable headline score (2 conversations for each of the 100 personas, dual user-agent simulators, 30 turns, dual judges, and optional pooling). It also includes a helper script, scripts/run_recommended_vera_pipeline.sh, that runs conversations, evaluations, and scoring, and reports the overall VERA score for the provided chatbot.

Where to learn more

  • Getting started and recommended settings: repository README
  • Custom providers (private APIs or unsupported models): README section Connecting your own LLM, Agent, or API and docs/evaluating.md
  • Research context: links in the README Additional Resources section (including the reliability/validity preprint)
  • General VERA-MH information: visit the website at VERA-MH.com

For a concise, technical list of changes, see CHANGELOG.md in the repository root.