Giovanni Cabral edited this page May 25, 2026 · 2 revisions

sqa-eval Wiki

Speech Quality Assessment — score your audio with neural MOS metrics and rank enhancement algorithms in minutes, not days.


Overview

sqa-eval is a Python library that wraps Uni-VERSA-Ext HuggingFace models into a simple API for scoring audio files, comparing enhancement algorithms, and exporting results as CSV, JSON, and plots.


API Reference

InferenceEngine — Raw Model Predictions

from sqa_eval import InferenceEngine

The lowest-level entry point. Loads a Uni-VERSA-Ext model and runs inference.

Constructor:

engine = InferenceEngine(model="5metric")

Parameter | Type | Description
model | str | Model alias ("5metric", "22metric") or a full HF repo ID

Device auto-detection: CUDA if available, CPU otherwise. To force CPU, set the environment variable CUDA_VISIBLE_DEVICES="" before starting Python.
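
A minimal sketch of applying the same override from inside a script instead of the shell. It assumes, as is usual for CUDA applications, that the variable is read when the CUDA runtime initializes, so it has to be set before the engine is imported. The commented-out lines only mark where the sqa-eval calls would go.

```python
import os

# Hide all GPUs from CUDA. This must run before anything initializes CUDA,
# so it is placed above the sqa_eval import.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# from sqa_eval import InferenceEngine
# engine = InferenceEngine(model="5metric")
# engine.device would then be expected to report "cpu"

forced_cpu = os.environ["CUDA_VISIBLE_DEVICES"] == ""
```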

predict(audio_path, ref_path=None) → dict[str, float]

Scores a single audio file. Returns raw per-metric scores keyed by metric name.

  • audio_path — path to degraded/test audio
  • ref_path — optional clean reference (required for 22metric)

Raises ValueError if the model requires a reference and none is given. Raises FileNotFoundError if either path does not exist.

predict_batch(pairs) → list[dict[str, float]]

Batch version. pairs is list[tuple[audio_path, ref_path_or_None]]. Each tuple mirrors predict()'s arguments. Returns a list of score dicts, one per pair.

Implementation detail: When a reference is provided, test and ref audio are resampled to 16 kHz, truncated to the same length, concatenated along the channel dimension, and passed jointly to the model. The model itself decides which metrics to output based on the number of input channels.
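
The truncate-and-stack step can be sketched in plain Python as follows. This is an illustration only: nested lists stand in for the audio tensors, both signals are assumed to be already resampled to 16 kHz, and the channel order (test first, reference second) and the helper names are assumptions, not the library's internals.

```python
def pair_channels(test, ref):
    """Truncate two mono signals to the same length and stack them as channels.

    Returns a nested list of shape [2][n]: channel 0 is the test signal,
    channel 1 the reference (assumed order).
    """
    n = min(len(test), len(ref))
    return [list(test[:n]), list(ref[:n])]


def as_model_input(test, ref=None):
    """One channel without a reference, two channels with one."""
    if ref is None:
        return [list(test)]
    return pair_channels(test, ref)


test_sig = [0.1, 0.2, 0.3, 0.4, 0.5]
ref_sig = [0.1, 0.2, 0.3]

stacked = as_model_input(test_sig, ref_sig)
single = as_model_input(test_sig)
print(len(stacked), len(stacked[0]))  # 2 3
print(len(single), len(single[0]))    # 1 5
```

The channel count of the result (1 or 2) is the signal the text describes the model as using to decide which metrics to output.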

Properties:

  • engine.device → "cuda" or "cpu"
  • engine.model_name → the alias or repo ID used at init
  • engine.loaded_metrics → list of metric names the loaded model supports

Evaluator — High-Level File & Directory Scoring

from sqa_eval import Evaluator

Wraps InferenceEngine with score aggregation. Designed for scoring one file or one folder of files.

Constructor:

evaluator = Evaluator(model="5metric", weights=None)

Parameter | Type | Description
model | str | "5metric", "22metric", "both", or a full HF repo ID
weights | dict[str, float] | None | Per-metric weight overrides (default: 1.0 each)

When model="both", two engines are loaded: a 22metric model for ref-based evaluation and a 5metric model as fallback when no reference is given.

evaluate_file(audio_path, ref_path=None, system="default") → AggregateResult

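Scores a single file and returns an AggregateResult holding the normalized aggregate scores (common_score, extended_score) alongside the raw per-metric scores.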

Parameter | Type | Description
audio_path | str | Path | Path to audio file
ref_path | str | Path | None | Clean reference (required when model="22metric"; optional when model="both")
system | str | Label for reports/plots (default: "default")

evaluate_directory(audio_dir, ref_dir=None, recursive=False) → list[AggregateResult]

Scores every .wav file in a directory. When ref_dir is given, each file is paired with the reference named REF_<stem>.wav in that directory.

to_csv(results, path) / to_json(results, path)

Exports a list of results to CSV or JSON.


Experiment — Multi-System Comparison

from sqa_eval import Experiment

Orchestrates scoring across multiple systems (subdirectories) and produces a full report.

Constructor:

exp = Experiment(
    name="denoiser-shootout",
    base_dir="./recordings",
    systems=["dnn_v1", "dnn_v2"],
    ref_dir="./clean_refs",
    model="both",
    weights=None,
    output_dir=None,
)

Parameter | Type | Default | Description
name | str | (required) | Experiment name (used as output subdirectory)
base_dir | str | Path | (required) | Parent dir containing one subdir per system
systems | list[str] | (required) | System subdirectory names
ref_dir | str | Path | None | None | Directory with REF_* clean reference files
model | str | "both" | Model alias or HF repo
weights | dict | None | None | Per-metric weight overrides
output_dir | str | Path | None | None | Defaults to <base_dir>/../results/<name>/

run() → list[AggregateResult]

Iterates every audio file across all systems, scores each one, and collects results. Prints progress every 10 files.

report()

Generates output files:

  • scores.csv — per-file scores
  • summary.csv — per-system statistics (mean, std, min, max)
  • ranking.csv — systems ordered by mean common score
  • results.json — full structured data
  • bar_common_score.png, box_common_score.png, radar.png, scatter_*_vs_*.png
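
The per-system statistics and ranking listed above can be approximated with the standard library. This is a sketch under assumptions: the helper names are hypothetical, input rows are simplified to (system, common_score) pairs, and the population standard deviation is used here, while the library may use the sample standard deviation.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize(rows):
    """rows: iterable of (system, common_score). Returns {system: stats}."""
    by_system = defaultdict(list)
    for system, score in rows:
        by_system[system].append(score)
    return {
        system: {
            "mean": mean(scores),
            "std": pstdev(scores),  # population std (assumption)
            "min": min(scores),
            "max": max(scores),
        }
        for system, scores in by_system.items()
    }


def rank(summary):
    """System names ordered from best to worst by mean score."""
    return sorted(summary, key=lambda s: summary[s]["mean"], reverse=True)


rows = [("dnn_v1", 0.5), ("dnn_v1", 0.7), ("dnn_v2", 0.8), ("dnn_v2", 0.6)]
summary = summarize(rows)
ranking = rank(summary)
print(ranking)
```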

AggregateResult — Score Output

@dataclass
class AggregateResult:
    file_name: str
    system: str
    model_used: str
    common_score: float      # weighted avg of 5 MOS metrics
    extended_score: float    # weighted avg of all metrics
    raw_scores: dict[str, float]  # per-metric raw scores
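
Because raw_scores is a nested dict, a result usually has to be flattened before it fits a tabular format. The sketch below uses a local copy of the dataclass so that it runs without the library; the to_row helper and the raw_ column prefix are illustrative assumptions, not the layout sqa-eval actually writes to scores.csv.

```python
from dataclasses import dataclass, asdict


@dataclass
class AggregateResult:  # local copy of the fields listed above
    file_name: str
    system: str
    model_used: str
    common_score: float
    extended_score: float
    raw_scores: dict


def to_row(result):
    """Flatten a result into a single-level dict, one key per column."""
    row = asdict(result)
    raw = row.pop("raw_scores")
    # Prefix raw metrics so they cannot collide with the fixed fields.
    row.update({f"raw_{name}": value for name, value in raw.items()})
    return row


example = AggregateResult(
    file_name="sample01.wav",
    system="dnn_v1",
    model_used="5metric",
    common_score=0.7,
    extended_score=0.7,
    raw_scores={"mos": 3.8},
)
row = to_row(example)
print(sorted(row))
```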

Supporting Modules

ScoreAggregator (sqa_eval.aggregator)

Normalizes raw scores to the [0, 1] range, flips lower-is-better metrics (MCD, LSD) so that 1.0 is always best, and computes weighted averages. Before scaling, SDR is clamped to [-30, 30] and MCD/LSD to [0, 20].

  • normalize(raw) — maps each known metric to [0, 1] via min-max scaling
  • compute(raw) — weighted average of normalized scores (direction-flipped for lower-is-better metrics)
  • compute_common(raw) — same as compute but only uses the 5 common MOS metrics

Plotter (sqa_eval.plotter)

Generates visualizations using matplotlib:

  • Bar chart — mean ± std per system
  • Box plot — distribution per system
  • Scatter pair — per-file dot plot comparing two systems
  • Radar chart — mean scores per metric across all systems

Reporter (sqa_eval.reporter)

Exports results to DataFrames, CSV, and JSON. Also provides summary_table() and ranking_table() for per-system statistics and merge_reports() to combine multiple experiment results.


Implementation Details

Score Normalization

Each raw metric is normalized to [0, 1] where 1.0 is always best:

  1. Bounded metrics (MOS [1,5]): (value - min) / (max - min)
  2. Unbounded metrics with known range:
    • SDR: clamped to [-30, 30], mapped as (clamped + 30) / 60
    • MCD/LSD: clamped to [0, 20], mapped as clamped / 20
  3. Fully unbounded: clamped directly to [0, 1]
  4. Lower-is-better (MCD, LSD): the normalized value is flipped → 1.0 - norm_val
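
The four rules above can be sketched as one function. The ranges come from the list above; the lowercase metric keys and the function name are assumptions made for illustration, and the real ScoreAggregator may organize this differently.

```python
# Ranges taken from the rules above; key spelling is an assumption.
RANGES = {
    "mos": (1.0, 5.0),
    "sdr": (-30.0, 30.0),
    "mcd": (0.0, 20.0),
    "lsd": (0.0, 20.0),
}
LOWER_IS_BETTER = {"mcd", "lsd"}


def normalize_metric(name, value):
    """Map one raw metric to [0, 1], where 1.0 is always best."""
    # Metrics without a known range are clamped directly to [0, 1].
    lo, hi = RANGES.get(name, (0.0, 1.0))
    clamped = min(max(value, lo), hi)
    norm = (clamped - lo) / (hi - lo)
    # Flip lower-is-better metrics so that a small distance scores high.
    return 1.0 - norm if name in LOWER_IS_BETTER else norm


print(normalize_metric("mos", 3.0))   # 0.5
print(normalize_metric("sdr", 45.0))  # 1.0 (clamped to 30 first)
print(normalize_metric("mcd", 5.0))   # 0.75 (5/20 = 0.25, then flipped)
```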

The Two Scores

Score | Metrics Used | Purpose
common_score | MOS, DNSMOS_OVRL, ScoreQ, UTMOS, NISQA_MOS | Reference-free quality estimate
extended_score | All metrics the model outputs (5 or 22) | Full diagnostic picture

With "5metric", the two scores are identical. With "22metric", extended_score additionally includes SDR, PESQ, PESQ-C2, MCD, LSD, ESTOI, LPS, speaker similarity, semantic similarity, and SIGMOS subscales.
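
A small sketch of how the two scores could differ for the same file. It assumes the weighted average divides by the sum of the weights, that unlisted metrics default to weight 1.0 as stated for the weights parameter, and that metric keys are lowercase. The key spelling and the helper name are hypothetical.

```python
# Assumed key spelling for the five common metrics.
COMMON = ("mos", "dnsmos_ovrl", "scoreq", "utmos", "nisqa_mos")


def weighted_average(normalized, weights=None, keys=None):
    """Weighted mean of already-normalized scores, default weight 1.0."""
    weights = weights or {}
    names = [k for k in (keys or normalized) if k in normalized]
    total = sum(weights.get(k, 1.0) for k in names)
    if total == 0:
        raise ValueError("no metrics with non-zero weight")
    return sum(weights.get(k, 1.0) * normalized[k] for k in names) / total


normalized = {
    "mos": 0.5, "dnsmos_ovrl": 0.5, "scoreq": 0.5,
    "utmos": 0.5, "nisqa_mos": 0.5, "estoi": 1.0,
}

common = weighted_average(normalized, keys=COMMON)            # 5 metrics only
extended = weighted_average(normalized)                       # all 6 metrics
favored = weighted_average(normalized, weights={"estoi": 3.0})
print(common, favored)  # 0.5 0.6875
```

With only the five common metrics present, as in "5metric", both calls see the same inputs and return the same value.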

Routing Logic in evaluate_file

The method has four branches:

engine_22 exists + ref provided  → predict with ref using 22metric engine
engine_22 exists + no ref       → fallback to 5metric engine (no-ref metrics only)
single model + ref provided     → predict with ref using loaded model
single model + no ref           → predict without ref using loaded model

For the "22metric" model, calling predict() without a reference raises ValueError.
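
The routing rules can be restated as a small pure function. This is a sketch of the decision table only, not the library's code: the route name and the (engine, use_ref) return shape are hypothetical.

```python
def route(mode, has_ref):
    """Return (engine_name, use_ref) for an Evaluator mode and input."""
    if mode == "both":
        # Two engines loaded: prefer 22metric when a reference is available.
        return ("22metric", True) if has_ref else ("5metric", False)
    if mode == "22metric" and not has_ref:
        # Mirrors the ValueError that predict() raises for this case.
        raise ValueError("22metric requires a reference")
    # Single model: use whatever was loaded, with or without a reference.
    return (mode, has_ref)


print(route("both", True))      # ('22metric', True)
print(route("both", False))     # ('5metric', False)
print(route("5metric", False))  # ('5metric', False)
```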

Reference Convention

Reference files must follow the naming convention REF_<stem>.wav. For a test file sample01.wav, the library looks for REF_sample01.wav in the reference directory. Matching is case-insensitive.

When using Experiment with ref_dir, files without a matching reference are scored with no-ref metrics only (fallback).
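
The naming convention can be sketched with pathlib. The find_reference helper is hypothetical and is not the library's match_references. It assumes case-insensitive matching means comparing lowercased file names, and it returns None when no reference exists so the caller can fall back to no-ref scoring.

```python
import tempfile
from pathlib import Path


def find_reference(test_file, ref_dir, prefix="REF_"):
    """Return the reference matching test_file, or None if there is none."""
    wanted = f"{prefix}{Path(test_file).stem}.wav".lower()
    for candidate in sorted(Path(ref_dir).iterdir()):
        if candidate.is_file() and candidate.name.lower() == wanted:
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    ref_dir = Path(tmp)
    # Deliberately mixed case to exercise case-insensitive matching.
    (ref_dir / "ref_Sample01.WAV").touch()

    found = find_reference("recordings/algorithm_A/sample01.wav", ref_dir)
    missing = find_reference("recordings/algorithm_A/sample02.wav", ref_dir)
    found_name = found.name if found else None

print(found_name, missing)
```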

File I/O (sqa_eval.io)

Function | Purpose
scan_audio(dir, pattern, recursive) | Finds all .wav files in a directory
match_references(test_files, ref_dir, prefix) | Pairs test files with REF_* references
resolve_experiment(base_dir, systems) | Maps system names to their audio files
match_experiment_refs(system_files, ref_dir) | Multi-system version of match_references

Use Cases

1. Quick MOS score for a single file

from sqa_eval import Evaluator

e = Evaluator("5metric")
result = e.evaluate_file("call_recording.wav")
print(f"Speech quality: {result.common_score:.3f}")

2. Compare two denoising algorithms

recordings/
├── algorithm_A/
│   ├── sample01.wav
│   └── sample02.wav
└── algorithm_B/
    ├── sample01.wav
    └── sample02.wav
clean_refs/
├── REF_sample01.wav
└── REF_sample02.wav

from sqa_eval import Experiment

exp = Experiment(
    name="denoiser-comparison",
    base_dir="./recordings",
    systems=["algorithm_A", "algorithm_B"],
    ref_dir="./clean_refs",
    model="both",
)
exp.run()
exp.report()

3. Batch inference with raw scores

from sqa_eval import InferenceEngine

engine = InferenceEngine("22metric")
pairs = [
    ("output/dnn_v1/sample.wav", "clean/REF_sample.wav"),
    ("output/dnn_v2/sample.wav", "clean/REF_sample.wav"),
]
results = engine.predict_batch(pairs)

4. Custom model from HuggingFace

from sqa_eval import Evaluator

e = Evaluator(model="username/my-custom-sqa-model")
result = e.evaluate_file("audio.wav")

5. Weighted metrics (favor intelligibility)

from sqa_eval import Evaluator

e = Evaluator(
    model="22metric",
    weights={"estoi": 3.0, "mos": 0.5},
)
result = e.evaluate_file("audio.wav", ref_path="clean.wav")