Conversation

@mwxely (Collaborator) commented Jan 16, 2026

Summary

This PR adds model stability measurement capabilities to lmms-eval, enabling users to assess model consistency by running multiple samples per question.

Motivation

Same accuracy does not mean same reliability:

| Model   | EA  | CA  | IV   | Note       |
|---------|-----|-----|------|------------|
| Model A | 80% | 82% | 0.05 | ← Stable   |
| Model B | 80% | 81% | 0.15 | ← Unstable |

Models A and B have the same Expected Accuracy, but Model A is 3× more stable (IV of 0.05 vs 0.15).

Changes

New CLI Parameter

lmms-eval --model xxx --tasks xxx -n 5  # or --num_samples 5

When n > 1, k-samples mode is enabled and the stability metrics below are computed.

New Metrics

| Metric | Full Name | Description |
|--------|-----------|-------------|
| EA | Expected Accuracy | Mean accuracy across all k samples |
| CA | Consensus Accuracy | Accuracy after majority voting |
| IV | Internal Variance | Average per-question variance (lower = more stable) |
| CR | Consistency Rate | % of questions with identical answers across all k samples |

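To make the definitions concrete, here is a minimal, self-contained sketch of how the four metrics can be computed from k predicted answers per question. The function name, signature, and data layout are illustrative, not the actual implementations in lmms_eval/api/metrics.py:

```python
from collections import Counter
from statistics import pvariance


def stability_metrics(samples):
    """samples maps question_id -> (list of k predicted answers, gold answer)."""
    ea, ca, iv, cr = [], [], [], []
    for preds, gold in samples.values():
        scores = [float(p == gold) for p in preds]      # per-sample correctness (0/1)
        ea.append(sum(scores) / len(scores))            # mean accuracy over the k samples
        majority = Counter(preds).most_common(1)[0][0]  # majority-voted answer
        ca.append(float(majority == gold))              # consensus correctness
        iv.append(pvariance(scores))                    # within-question variance
        cr.append(float(len(set(preds)) == 1))          # all k answers identical?
    n = len(samples)
    return {"EA": sum(ea) / n, "CA": sum(ca) / n, "IV": sum(iv) / n, "CR": sum(cr) / n}


# Two questions, k = 3 samples each.
print(stability_metrics({
    "q1": (["A", "A", "A"], "A"),  # stable and correct
    "q2": (["B", "C", "B"], "B"),  # unstable but majority-correct
}))
# -> {'EA': 0.833..., 'CA': 1.0, 'IV': 0.111..., 'CR': 0.5}
```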
Example Output

|Task|Metric|Value|Stderr|Stderr_CLT|Stderr_Clustered|EA  |CA  |IV  |CR  |
|----|------|-----|------|----------|----------------|----|----|----|----|
|mme |score |85.0 |N/A   |0.0435    |0.0512          |0.80|0.82|0.05|0.75|

Files Changed

| File | Changes |
|------|---------|
| lmms_eval/api/metrics.py | Add expected_accuracy, consensus_accuracy, internal_variance, consistency_rate functions |
| lmms_eval/evaluator_utils.py | Add calculate_stability_metrics() method |
| lmms_eval/evaluator.py | Override task repeats when num_samples > 1, call stability calculation |
| lmms_eval/__main__.py | Add -n/--num_samples CLI parameter |
| lmms_eval/utils.py | Display EA, CA, IV, CR columns in output table |

Test Results

  • Full-set test with --num_samples 3 on VideoMME (screenshot attached)
  • Local CI/CD test (screenshot attached)

Add four new metrics for measuring model stability in k-samples mode:
- expected_accuracy: mean accuracy across all k samples
- consensus_accuracy: accuracy after majority voting
- internal_variance: average variance within each question (lower is better)
- consistency_rate: fraction of questions with consistent answers

Reference: HackMD v0.6 roadmap section 2.5 Model Stability Measurement

- Add calculate_stability_metrics() method to TaskOutput class
  Groups scores by question and computes EA, CA, IV, CR when repeats > 1
- Update consolidate_results() to include stability metrics in output

The metrics are only computed when num_samples > 1 (k-samples mode).
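As a rough illustration of the grouping step (field names like doc_id and score are assumptions for the sketch, not the actual TaskOutput internals):

```python
from collections import defaultdict


def group_scores_by_question(logged_samples, repeats):
    """Group per-sample scores by question before reducing to EA / IV / CR."""
    if repeats <= 1:
        return None  # stability metrics only apply in k-samples mode
    grouped = defaultdict(list)
    for sample in logged_samples:
        grouped[sample["doc_id"]].append(sample["score"])
    return grouped  # doc_id -> list of k per-sample scores
```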
- Override task repeats with num_samples when n > 1 for stability measurement
- Call calculate_stability_metrics() after aggregate metric calculation

When --num_samples is set > 1, the evaluator runs each question k times
to measure model consistency and stability.
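A hypothetical sketch of what this override amounts to; the real evaluator.py may apply it through Task/TaskConfig objects rather than a plain dict:

```python
# Illustrative only: force each task to repeat every question k times when
# the user requests num_samples > 1.
def apply_num_samples(task_config: dict, num_samples: int) -> dict:
    if num_samples > 1:
        task_config["repeats"] = num_samples  # k-samples mode: ask each question k times
    return task_config
```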

Add CLI argument to enable k-samples mode:
  -n, --num_samples: Number of samples per question (default: 1)

When n > 1, enables k-samples mode and computes stability metrics
(EA, CA, IV, CR) to measure model consistency.

Usage: lmms-eval --model xxx --tasks xxx -n 5
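For reference, a minimal argparse sketch of such a flag (the actual parser setup in lmms_eval/__main__.py may differ):

```python
import argparse

# Minimal sketch: register -n/--num_samples with a default of 1.
parser = argparse.ArgumentParser()
parser.add_argument(
    "-n",
    "--num_samples",
    type=int,
    default=1,
    help="Number of samples per question; values > 1 enable k-samples mode",
)

args = parser.parse_args(["-n", "5"])  # e.g. `lmms-eval ... -n 5`
assert args.num_samples == 5
```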

- Add EA, CA, IV, CR columns to make_table() output
- Skip stability metric variants in main metric loop (shown as columns)

Example output:
|Task|Metric|Value|Stderr|Stderr_CLT|Stderr_Clustered|EA  |CA  |IV  |CR  |
|----|------|-----|------|----------|----------------|----|----|----|----|
|mme |score |85.0 |N/A   |0.0435    |0.0512          |0.80|0.82|0.05|0.75|
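Roughly, the column handling amounts to the following sketch (illustrative only; the real make_table() in lmms_eval/utils.py may organize this differently, and the header names and values are taken from the example above):

```python
# Illustrative only: append the stability columns when the metrics are present.
headers = ["Task", "Metric", "Value", "Stderr", "Stderr_CLT", "Stderr_Clustered"]
stability = {"EA": 0.80, "CA": 0.82, "IV": 0.05, "CR": 0.75}  # example values from above

if stability:  # only shown in k-samples mode
    headers += list(stability)  # EA, CA, IV, CR
    row = ["mme", "score", 85.0, "N/A", 0.0435, 0.0512] + [f"{v:.2f}" for v in stability.values()]
```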
@mwxely requested review from Luodian and kcz358 on January 16, 2026 at 05:48
@mwxely changed the title from "# [feat] Add model stability measurement (EA, CA, IV, CR) for k-samples mode" to "[feat] Add model stability measurement (EA, CA, IV, CR) for k-samples mode" on Jan 16, 2026