Home
Speech Quality Assessment — score your audio with neural MOS metrics and rank enhancement algorithms in minutes, not days.
sqa-eval is a Python library that wraps Uni-VERSA-Ext HuggingFace models into a simple API for scoring audio files, comparing enhancement algorithms, and exporting results as CSV, JSON, and plots.
from sqa_eval import InferenceEngine
The lowest-level entry point. Loads a Uni-VERSA-Ext model and runs inference.
Constructor:
engine = InferenceEngine(model="5metric")
| Parameter | Type | Description |
|---|---|---|
| `model` | `str` | Model alias (`"5metric"`, `"22metric"`) or a full HF repo ID |
The device is auto-detected: CUDA if available, otherwise CPU. To force CPU, set the environment variable CUDA_VISIBLE_DEVICES="" before starting Python.
predict(audio_path, ref_path=None) → dict[str, float]
Scores a single audio file. Returns raw per-metric scores keyed by metric name.
- `audio_path`: path to the degraded/test audio
- `ref_path`: optional clean reference (required for 22metric)
Raises ValueError if the model needs a reference but none is given, and FileNotFoundError if a path does not exist.
predict_batch(pairs) → list[dict[str, float]]
Batch version. pairs is list[tuple[audio_path, ref_path_or_None]]. Each tuple mirrors predict()'s arguments. Returns a list of score dicts, one per pair.
Implementation detail: When a reference is provided, test and ref audio are resampled to 16 kHz, truncated to the same length, concatenated along the channel dimension, and passed jointly to the model. The model itself decides which metrics to output based on the number of input channels.
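The preprocessing described above can be sketched roughly as follows. This is an illustration only, not the library's code: plain Python lists stand in for audio tensors, `resample_linear` is a crude hypothetical stand-in for a real resampler, and the channel order (test first, reference second) is an assumption.

```python
TARGET_SR = 16_000  # sample rate named in the text


def resample_linear(wave, sr, target_sr=TARGET_SR):
    """Crude linear-interpolation resampler (illustrative stand-in only)."""
    if sr == target_sr:
        return list(wave)
    n_out = round(len(wave) * target_sr / sr)
    out = []
    for i in range(n_out):
        pos = i * sr / target_sr          # output position in input-sample units
        lo = int(pos)
        hi = min(lo + 1, len(wave) - 1)   # do not read past the last sample
        frac = pos - lo
        out.append(wave[lo] * (1 - frac) + wave[hi] * frac)
    return out


def stack_test_and_ref(test, test_sr, ref, ref_sr):
    """Return [test_channel, ref_channel], both at 16 kHz and equal length."""
    test_16k = resample_linear(test, test_sr)
    ref_16k = resample_linear(ref, ref_sr)
    n = min(len(test_16k), len(ref_16k))  # truncate both to the shorter signal
    return [test_16k[:n], ref_16k[:n]]    # assumed order: test first, ref second
```

A two-channel result corresponds to the "reference provided" case; a single channel would correspond to reference-free scoring.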
Properties:
- `engine.device` → `"cuda"` or `"cpu"`
- `engine.model_name` → the alias or repo ID used at init
- `engine.loaded_metrics` → list of metric names the loaded model supports
from sqa_eval import Evaluator
Wraps InferenceEngine with score aggregation. Designed for scoring one file or one folder of files.
Constructor:
evaluator = Evaluator(model="5metric", weights=None)
| Parameter | Type | Description |
|---|---|---|
| `model` | `str` | `"5metric"`, `"22metric"`, `"both"`, or a full HF repo ID |
| `weights` | `dict[str, float] \| None` | Per-metric weight overrides (default: 1.0 each) |
When model="both", two engines are loaded: a 22metric model for ref-based evaluation and a 5metric model as fallback when no reference is given.
evaluate_file(audio_path, ref_path=None, system="default") → AggregateResult
Scores a single file and returns normalized scores.
| Parameter | Type | Description |
|---|---|---|
| `audio_path` | `str \| Path` | Path to audio file |
| `ref_path` | `str \| Path \| None` | Clean reference (required for 22metric without "both" mode) |
| `system` | `str` | Label for reports/plots (default: `"default"`) |
evaluate_directory(audio_dir, ref_dir=None, recursive=False) → list[AggregateResult]
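Scores every .wav file in a directory. Each test file is paired with the reference in ref_dir whose name is the test file's stem with a REF_ prefix (sample01.wav pairs with REF_sample01.wav).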
to_csv(results, path) / to_json(results, path)
Exports a list of results to CSV or JSON.
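As a rough idea of what such an export could look like, here is a minimal sketch using only the standard library. `ResultRow`, `write_csv`, and `write_json` are hypothetical names, and flattening `raw_scores` into one CSV column per metric is an assumed layout, not necessarily what `to_csv` produces.

```python
import csv
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class ResultRow:  # hypothetical stand-in mirroring AggregateResult's fields
    file_name: str
    system: str
    model_used: str
    common_score: float
    extended_score: float
    raw_scores: dict = field(default_factory=dict)


FIXED_COLUMNS = ["file_name", "system", "model_used", "common_score", "extended_score"]


def write_json(results, path):
    """Dump the results as a JSON list of objects."""
    Path(path).write_text(json.dumps([asdict(r) for r in results], indent=2))


def write_csv(results, path):
    """Write one row per file, with one extra column per raw metric (assumed layout)."""
    metric_names = sorted({m for r in results for m in r.raw_scores})
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIXED_COLUMNS + metric_names, restval="")
        writer.writeheader()
        for r in results:
            row = {k: getattr(r, k) for k in FIXED_COLUMNS}
            row.update(r.raw_scores)  # metrics missing for a file stay empty
            writer.writerow(row)
```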
from sqa_eval import Experiment
Orchestrates scoring across multiple systems (subdirectories) and produces a full report.
Constructor:
exp = Experiment(
    name="denoiser-shootout",
    base_dir="./recordings",
    systems=["dnn_v1", "dnn_v2"],
    ref_dir="./clean_refs",
    model="both",
    weights=None,
    output_dir=None,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | required | Experiment name (used as output subdirectory) |
| `base_dir` | `str \| Path` | required | Parent dir containing one subdir per system |
| `systems` | `list[str]` | required | System subdirectory names |
| `ref_dir` | `str \| Path \| None` | `None` | Directory with `REF_*` clean reference files |
| `model` | `str` | `"both"` | Model alias or HF repo |
| `weights` | `dict \| None` | `None` | Per-metric weight overrides |
| `output_dir` | `str \| Path \| None` | `None` | Defaults to `<base_dir>/../results/<name>/` |
run() → list[AggregateResult]
Iterates every audio file across all systems, scores each one, and collects results. Prints progress every 10 files.
report()
Generates output files:
- `scores.csv`: per-file scores
- `summary.csv`: per-system statistics (mean, std, min, max)
- `ranking.csv`: systems ordered by mean common score
- `results.json`: full structured data
- `bar_common_score.png`, `box_common_score.png`, `radar.png`, `scatter_*_vs_*.png`
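The per-system statistics and ranking listed above could be computed along these lines. This is a sketch under assumptions: `summarize` and `rank` are hypothetical helpers, the input is a plain list of `(system, common_score)` pairs, and the population standard deviation is used (the library may use the sample version instead).

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize(scores):
    """Group (system, common_score) pairs and compute mean/std/min/max per system."""
    by_system = defaultdict(list)
    for system, score in scores:
        by_system[system].append(score)
    return {
        system: {
            "mean": mean(values),
            "std": pstdev(values),  # population std; an assumption
            "min": min(values),
            "max": max(values),
        }
        for system, values in by_system.items()
    }


def rank(summary):
    """Order systems by mean common score, best first."""
    return sorted(summary, key=lambda system: summary[system]["mean"], reverse=True)
```

For example, `rank(summarize([("a", 0.25), ("a", 0.75), ("b", 0.875)]))` places `"b"` ahead of `"a"`.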
from dataclasses import dataclass

@dataclass
class AggregateResult:
    file_name: str
    system: str
    model_used: str
    common_score: float           # weighted avg of 5 MOS metrics
    extended_score: float         # weighted avg of all metrics
    raw_scores: dict[str, float]  # per-metric raw scores
Normalizes raw scores into the [0, 1] range, flips lower-is-better metrics (MCD, LSD) so that higher is always better, and computes weighted averages. The normalization step clamps SDR to [-30, 30] and MCD/LSD to [0, 20].
- `normalize(raw)`: maps each known metric to `[0, 1]` via min-max scaling
- `compute(raw)`: weighted average of normalized scores (direction-flipped for lower-is-better metrics)
- `compute_common(raw)`: same as `compute` but only uses the 5 common MOS metrics
Generates visualizations using matplotlib:
- Bar chart — mean ± std per system
- Box plot — distribution per system
- Scatter pair — per-file dot plot comparing two systems
- Radar chart — mean scores per metric across all systems
Exports results to DataFrames, CSV, and JSON. Also provides summary_table() and ranking_table() for per-system statistics and merge_reports() to combine multiple experiment results.
Each raw metric is normalized to [0, 1] where 1.0 is always best:
- Bounded metrics (MOS `[1, 5]`): `(value - min) / (max - min)`
- Unbounded metrics with known range:
  - SDR: clamped to `[-30, 30]`, mapped as `(clamped + 30) / 60`
  - MCD/LSD: clamped to `[0, 20]`, mapped as `clamped / 20`
- Fully unbounded: clamped directly to `[0, 1]`
- Lower-is-better (MCD, LSD): the normalized value is flipped → `1.0 - norm_val`
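The rules above can be sketched as a single function. The metric keys (`"mos"`, `"sdr"`, `"mcd"`, `"lsd"`) and the name `normalize_metric` are assumptions for illustration; only the ranges and formulas come from the list. Bounded metrics are also clamped here, which is an extra safety step not stated above.

```python
# Ranges taken from the text; the lowercase key spellings are assumed.
RANGES = {
    "mos": (1.0, 5.0),
    "sdr": (-30.0, 30.0),
    "mcd": (0.0, 20.0),
    "lsd": (0.0, 20.0),
}
LOWER_IS_BETTER = {"mcd", "lsd"}


def normalize_metric(name, value):
    """Map one raw metric value to [0, 1], where 1.0 is always best."""
    # Metrics without a known range are treated as fully unbounded: clamp to [0, 1].
    lo, hi = RANGES.get(name, (0.0, 1.0))
    clamped = min(max(value, lo), hi)
    norm = (clamped - lo) / (hi - lo)
    # Flip direction for metrics where a smaller raw value is better.
    return 1.0 - norm if name in LOWER_IS_BETTER else norm
```

With these ranges, an SDR of 0 dB maps to 0.5 and an MCD of 5 maps to 0.75 after flipping.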
| Score | Metrics Used | Purpose |
|---|---|---|
| `common_score` | MOS, DNSMOS_OVRL, ScoreQ, UTMOS, NISQA_MOS | Reference-free quality estimate |
| `extended_score` | All metrics the model outputs (5 or 22) | Full diagnostic picture |
With "5metric", the two scores are identical. With "22metric", extended_score additionally includes SDR, PESQ, PESQ-C2, MCD, LSD, ESTOI, LPS, speaker similarity, semantic similarity, and SIGMOS subscales.
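To make the relationship between the two scores concrete, here is a minimal sketch of a weighted average over already-normalized metrics. The key spellings in `COMMON_METRICS`, the function name `weighted_average`, and dividing by the sum of weights are all assumptions; the default weight of 1.0 follows the `weights` parameter description.

```python
# Assumed lowercase keys for the five common MOS metrics.
COMMON_METRICS = ("mos", "dnsmos_ovrl", "scoreq", "utmos", "nisqa_mos")


def weighted_average(normalized, weights=None, only=None):
    """Weighted mean of normalized scores; metrics default to weight 1.0."""
    weights = weights or {}
    names = [m for m in normalized if only is None or m in only]
    total_weight = sum(weights.get(m, 1.0) for m in names)
    if total_weight == 0:
        raise ValueError("no metrics with non-zero weight to average")
    return sum(normalized[m] * weights.get(m, 1.0) for m in names) / total_weight


normalized = {"mos": 0.5, "utmos": 1.0, "estoi": 0.0}
common = weighted_average(normalized, only=COMMON_METRICS)  # MOS-type metrics only
extended = weighted_average(normalized)                     # every available metric
```

When the input holds only the five common metrics, both calls average the same values, which matches the statement that the two scores coincide for "5metric".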
The method has four branches:
engine_22 exists + ref provided → predict with ref using 22metric engine
engine_22 exists + no ref → fallback to 5metric engine (no-ref metrics only)
single model + ref provided → predict with ref using loaded model
single model + no ref → predict without ref using loaded model
For the "22metric" model, calling predict() without a reference raises ValueError.
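The four branches can be written as a small decision function. This sketch only models the choice of engine and whether the reference is used; `choose_branch` and its arguments are hypothetical and do not reflect the library's internals.

```python
def choose_branch(has_engine_22, has_ref, single_model="5metric"):
    """Return (engine_label, use_ref) for the four cases described above.

    has_engine_22 is True when a dedicated 22metric engine is loaded next to
    a 5metric fallback (the "both" mode); otherwise single_model is the only
    engine available.
    """
    if has_engine_22:
        # "both" mode: use the 22metric engine with a reference, else fall back.
        return ("22metric", True) if has_ref else ("5metric", False)
    if has_ref:
        return (single_model, True)
    if single_model == "22metric":
        # A lone 22metric model cannot score without a reference.
        raise ValueError("the 22metric model requires a reference file")
    return (single_model, False)
```

Keeping the decision separate from inference makes each branch easy to check in isolation.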
Reference files must follow the naming convention REF_<stem>.wav. For a test file sample01.wav, the library looks for REF_sample01.wav in the reference directory. Matching is case-insensitive.
When using Experiment with ref_dir, files without a matching reference are scored with no-ref metrics only (fallback).
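The naming convention and fallback can be illustrated with a small lookup. `find_reference` is a hypothetical helper written for this example (it is not the library's `match_references`); it compares lowercased file names to get case-insensitive matching and returns `None` when no reference exists, which corresponds to the no-ref fallback.

```python
from pathlib import Path


def find_reference(test_file, ref_dir, prefix="REF_"):
    """Return the reference path for test_file, or None if there is no match."""
    wanted = f"{prefix}{Path(test_file).stem}.wav".lower()
    for candidate in Path(ref_dir).iterdir():
        # Compare lowercased names so REF_Sample01.wav matches sample01.wav.
        if candidate.is_file() and candidate.name.lower() == wanted:
            return candidate
    return None
```

A caller would score the file with the reference when a path is returned and with no-ref metrics only when the result is `None`.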
| Function | Purpose |
|---|---|
| `scan_audio(dir, pattern, recursive)` | Finds all .wav files in a directory |
| `match_references(test_files, ref_dir, prefix)` | Pairs test files with `REF_*` references |
| `resolve_experiment(base_dir, systems)` | Maps system names to their audio files |
| `match_experiment_refs(system_files, ref_dir)` | Multi-system version of `match_references` |
from sqa_eval import Evaluator

e = Evaluator("5metric")
result = e.evaluate_file("call_recording.wav")
print(f"Speech quality: {result.common_score:.3f}")
recordings/
├── algorithm_A/
│   ├── sample01.wav
│   └── sample02.wav
└── algorithm_B/
    ├── sample01.wav
    └── sample02.wav
clean_refs/
├── REF_sample01.wav
└── REF_sample02.wav
from sqa_eval import Experiment

exp = Experiment(
    name="denoiser-comparison",
    base_dir="./recordings",
    systems=["algorithm_A", "algorithm_B"],
    ref_dir="./clean_refs",
    model="both",
)
exp.run()     # score every file in every system
exp.report()  # write CSV, JSON, and plots
from sqa_eval import InferenceEngine

engine = InferenceEngine("22metric")
# Each pair is (test audio, clean reference).
pairs = [
    ("output/dnn_v1/sample.wav", "clean/REF_sample.wav"),
    ("output/dnn_v2/sample.wav", "clean/REF_sample.wav"),
]
results = engine.predict_batch(pairs)
from sqa_eval import Evaluator

e = Evaluator(model="username/my-custom-sqa-model")
result = e.evaluate_file("audio.wav")
from sqa_eval import Evaluator

e = Evaluator(
    model="22metric",
    weights={"estoi": 3.0, "mos": 0.5},  # metrics not listed keep weight 1.0
)
result = e.evaluate_file("audio.wav", ref_path="clean.wav")