Home

sqa-eval Wiki

Speech Quality Assessment — score your audio with neural MOS metrics and rank enhancement algorithms in minutes, not days.

Overview

sqa-eval is a Python library that wraps the Uni-VERSA-Ext models hosted on Hugging Face in a simple API for scoring audio files, comparing enhancement algorithms, and exporting results as CSV, JSON, and plots.

API Reference

`InferenceEngine` — Raw Model Predictions

from sqa_eval import InferenceEngine

The lowest-level entry point. Loads a Uni-VERSA-Ext model and runs inference.

Constructor:

engine = InferenceEngine(model="5metric")

| Parameter | Type | Description |
|---|---|---|
| `model` | `str` | Model alias (`"5metric"`, `"22metric"`) or a full HF repo ID |

Device auto-detection: uses CUDA if available, otherwise CPU. To force CPU, set the environment variable CUDA_VISIBLE_DEVICES="" before the process starts.
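The override only takes effect if the variable is set before the CUDA runtime is initialized, so inside a script it is safest to set it before importing the library. A minimal sketch; `force_cpu` is a hypothetical helper and not part of sqa-eval:

```python
import os


def force_cpu() -> None:
    """Hide all CUDA devices from this process (hypothetical helper).

    Must run before any CUDA-using library initializes the GPU runtime.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ""


force_cpu()
# Import sqa_eval only after this point, for example:
# from sqa_eval import InferenceEngine
# engine = InferenceEngine(model="5metric")  # engine.device should then report "cpu"
```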

predict(audio_path, ref_path=None) → dict[str, float]

Scores a single audio file. Returns raw per-metric scores keyed by metric name.

audio_path — path to degraded/test audio
ref_path — optional clean reference (required for 22metric)

Raises ValueError if the model needs a reference but none was given. Raises FileNotFoundError if a given path does not exist.
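The documented error behavior can be illustrated with a small pre-flight check. This is a sketch, not the library's code: `check_inputs` and its `needs_ref` flag are hypothetical, and the order in which the two checks run is an assumption.

```python
from pathlib import Path
from typing import Optional


def check_inputs(audio_path: str, ref_path: Optional[str], needs_ref: bool) -> None:
    """Hypothetical pre-flight check mirroring the documented errors."""
    # A reference-based model cannot run without a clean reference.
    if needs_ref and ref_path is None:
        raise ValueError("this model requires a clean reference (ref_path)")
    # Every path that was supplied must point at an existing file.
    for path in (audio_path, ref_path):
        if path is not None and not Path(path).is_file():
            raise FileNotFoundError(path)
```

Running such a check before inference lets a batch job fail fast on a bad path instead of partway through scoring.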

predict_batch(pairs) → list[dict[str, float]]

Batch version. pairs is list[tuple[audio_path, ref_path_or_None]]. Each tuple mirrors predict()'s arguments. Returns a list of score dicts, one per pair.

Implementation detail: When a reference is provided, test and ref audio are resampled to 16 kHz, truncated to the same length, concatenated along the channel dimension, and passed jointly to the model. The model itself decides which metrics to output based on the number of input channels.
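The truncate-and-stack step described above can be sketched with plain Python lists. Resampling is omitted (both inputs are assumed to be mono and already at 16 kHz), `stack_test_and_ref` is a hypothetical name, and putting the test signal in the first channel is an assumption:

```python
from typing import List, Sequence


def stack_test_and_ref(test: Sequence[float], ref: Sequence[float]) -> List[List[float]]:
    """Truncate both signals to the shorter length and stack them as 2 channels.

    Assumes both inputs are mono and already resampled to 16 kHz.
    """
    n = min(len(test), len(ref))
    # Channel 0 holds the test audio, channel 1 the reference (assumed order).
    return [list(test[:n]), list(ref[:n])]


stacked = stack_test_and_ref([0.1, 0.2, 0.3, 0.4], [0.1, 0.25, 0.3])
print(len(stacked), len(stacked[0]))  # 2 3
```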

Properties:

engine.device → "cuda" or "cpu"
engine.model_name → the alias or repo ID used at init
engine.loaded_metrics → list of metric names the loaded model supports

`Evaluator` — High-Level File & Directory Scoring

from sqa_eval import Evaluator

Wraps InferenceEngine with score aggregation. Designed for scoring one file or one folder of files.

Constructor:

evaluator = Evaluator(model="5metric", weights=None)

| Parameter | Type | Description |
|---|---|---|
| `model` | `str` | `"5metric"`, `"22metric"`, `"both"`, or a full HF repo ID |
| `weights` | `dict[str, float] \| None` | Per-metric weight overrides (default: 1.0 each) |

When model="both", two engines are loaded: a 22metric model for ref-based evaluation and a 5metric model as fallback when no reference is given.

evaluate_file(audio_path, ref_path=None, system="default") → AggregateResult

Scores a single file and returns an AggregateResult holding the normalized aggregate scores together with the raw per-metric scores.

| Parameter | Type | Description |
|---|---|---|
| `audio_path` | `str \| Path` | Path to audio file |
| `ref_path` | `str \| Path \| None` | Clean reference (required when `model="22metric"`; optional with `"both"`) |
| `system` | `str` | Label for reports/plots (default: `"default"`) |

evaluate_directory(audio_dir, ref_dir=None, recursive=False) → list[AggregateResult]

Scores every .wav file in audio_dir. If ref_dir is given, each file is paired with the reference named REF_<stem>.wav in that directory (see Reference Convention).

to_csv(results, path) / to_json(results, path)

Exports a list of results to CSV or JSON.
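To make the export concrete, here is a rough sketch of flattening results into CSV text. The column layout (fixed fields first, then one column per raw metric) is an assumption, `Result` is a stand-in with the same fields as AggregateResult, and `results_to_csv_text` is a hypothetical helper rather than the library's `to_csv`:

```python
import csv
import io
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Result:  # stand-in with the same fields as AggregateResult
    file_name: str
    system: str
    model_used: str
    common_score: float
    extended_score: float
    raw_scores: Dict[str, float] = field(default_factory=dict)


def results_to_csv_text(results: List[Result]) -> str:
    """Flatten results into CSV text, one column per raw metric (assumed layout)."""
    metric_names = sorted({m for r in results for m in r.raw_scores})
    fixed = ["file_name", "system", "model_used", "common_score", "extended_score"]
    buf = io.StringIO()
    # restval="" leaves a cell empty when a result lacks that metric.
    writer = csv.DictWriter(buf, fieldnames=fixed + metric_names, restval="")
    writer.writeheader()
    for r in results:
        row = asdict(r)
        row.update(row.pop("raw_scores"))  # lift per-metric scores into columns
        writer.writerow(row)
    return buf.getvalue()
```

Leaving missing metrics empty keeps results from a 5-metric and a 22-metric run in one table without inventing values.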

`Experiment` — Multi-System Comparison

from sqa_eval import Experiment

Orchestrates scoring across multiple systems (subdirectories) and produces a full report.

Constructor:

exp = Experiment(
    name="denoiser-shootout",
    base_dir="./recordings",
    systems=["dnn_v1", "dnn_v2"],
    ref_dir="./clean_refs",
    model="both",
    weights=None,
    output_dir=None,
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | (required) | Experiment name (used as output subdirectory) |
| `base_dir` | `str \| Path` | (required) | Parent dir containing one subdir per system |
| `systems` | `list[str]` | (required) | System subdirectory names |
| `ref_dir` | `str \| Path \| None` | `None` | Directory with `REF_*` clean reference files |
| `model` | `str` | `"both"` | Model alias or HF repo |
| `weights` | `dict \| None` | `None` | Per-metric weight overrides |
| `output_dir` | `str \| Path \| None` | `None` | Defaults to `<base_dir>/../results/<name>/` |

run() → list[AggregateResult]

Iterates every audio file across all systems, scores each one, and collects results. Prints progress every 10 files.

report()

Generates output files:

scores.csv — per-file scores
summary.csv — per-system statistics (mean, std, min, max)
ranking.csv — systems ordered by mean common score
results.json — full structured data
bar_common_score.png, box_common_score.png, radar.png, scatter_*_vs_*.png
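The per-system statistics and the ranking can be sketched with the standard library alone. This is an illustration, not the Reporter implementation: `summarize` and `rank` are hypothetical names, the use of the sample standard deviation is an assumption, and so is the best-first ordering.

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarize(scores: List[Tuple[str, float]]) -> Dict[str, Dict[str, float]]:
    """Group (system, common_score) pairs and compute per-system statistics."""
    by_system: Dict[str, List[float]] = defaultdict(list)
    for system, score in scores:
        by_system[system].append(score)
    return {
        name: {
            "mean": mean(vals),
            # Sample standard deviation; undefined for a single file, so use 0.0.
            "std": stdev(vals) if len(vals) > 1 else 0.0,
            "min": min(vals),
            "max": max(vals),
        }
        for name, vals in by_system.items()
    }


def rank(summary: Dict[str, Dict[str, float]]) -> List[str]:
    """Order system names by mean score, best first (assumed direction)."""
    return sorted(summary, key=lambda name: summary[name]["mean"], reverse=True)
```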

`AggregateResult` — Score Output

@dataclass
class AggregateResult:
    file_name: str
    system: str
    model_used: str
    common_score: float      # weighted avg of 5 MOS metrics
    extended_score: float    # weighted avg of all metrics
    raw_scores: dict[str, float]  # per-metric raw scores

Supporting Modules

`ScoreAggregator` (`sqa_eval.aggregator`)

Normalizes raw scores into the [0, 1] range, flips lower-is-better metrics (MCD, LSD) so that higher is always better, and computes weighted averages. Before scaling, the normalize step clamps SDR to [-30, 30] and MCD/LSD to [0, 20].

normalize(raw) — maps each known metric to [0, 1] via min-max scaling
compute(raw) — weighted average of normalized scores (direction-flipped for lower-is-better metrics)
compute_common(raw) — same as compute but only uses the 5 common MOS metrics
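The difference between compute and compute_common comes down to which metrics enter the weighted average. A minimal sketch operating on already-normalized scores; `weighted_average` is a hypothetical helper, dividing by the sum of weights is an assumption, and the lowercase metric keys follow the `weights` example in the use cases rather than a confirmed naming scheme:

```python
from typing import Dict, Iterable, Optional

# Assumed key spelling for the five common MOS metrics.
COMMON = ("mos", "dnsmos_ovrl", "scoreq", "utmos", "nisqa_mos")


def weighted_average(
    norm: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
    only: Optional[Iterable[str]] = None,
) -> float:
    """Weighted mean of normalized scores; metrics without a weight use 1.0."""
    weights = weights or {}
    allowed = None if only is None else set(only)
    names = [m for m in norm if allowed is None or m in allowed]
    total_weight = sum(weights.get(m, 1.0) for m in names)
    if total_weight == 0:
        raise ValueError("no metrics (or only zero weights) to average")
    return sum(norm[m] * weights.get(m, 1.0) for m in names) / total_weight


norm = {"mos": 0.5, "estoi": 0.9}
weights = {"estoi": 3.0, "mos": 0.5}
extended = weighted_average(norm, weights)            # uses every metric
common = weighted_average(norm, weights, only=COMMON)  # uses MOS-type metrics only
```

With these weights the extended value is pulled toward the ESTOI score, while the common value ignores ESTOI entirely.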

`Plotter` (`sqa_eval.plotter`)

Generates visualizations using matplotlib:

Bar chart — mean ± std per system
Box plot — distribution per system
Scatter pair — per-file dot plot comparing two systems
Radar chart — mean scores per metric across all systems

`Reporter` (`sqa_eval.reporter`)

Exports results to DataFrames, CSV, and JSON. Also provides summary_table() and ranking_table() for per-system statistics and merge_reports() to combine multiple experiment results.

Implementation Details

Score Normalization

Each raw metric is normalized to [0, 1] where 1.0 is always best:

Bounded metrics (MOS, range [1, 5]): (value - min) / (max - min), so min = 1 and max = 5 for MOS
Unbounded metrics with a known practical range:
- SDR: clamped to [-30, 30], mapped as (clamped + 30) / 60
- MCD/LSD: clamped to [0, 20], mapped as clamped / 20
Fully unbounded metrics: clamped directly to [0, 1]
Lower-is-better metrics (MCD, LSD): the normalized value is flipped, i.e. 1.0 - norm_val
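The rules above fit in a few lines. The sketch below is illustrative only: `normalize_metric` is a hypothetical function, the range table lists just the metrics whose ranges are stated on this page, and the lowercase keys are an assumption.

```python
# Ranges stated on this page; any metric not listed is treated as fully unbounded.
RANGES = {"mos": (1.0, 5.0), "sdr": (-30.0, 30.0), "mcd": (0.0, 20.0), "lsd": (0.0, 20.0)}
LOWER_IS_BETTER = {"mcd", "lsd"}


def normalize_metric(name: str, value: float) -> float:
    """Clamp to the metric's range, min-max scale to [0, 1], flip if lower is better."""
    lo, hi = RANGES.get(name, (0.0, 1.0))  # unknown metrics: clamp directly to [0, 1]
    clamped = min(max(value, lo), hi)
    norm = (clamped - lo) / (hi - lo)
    return 1.0 - norm if name in LOWER_IS_BETTER else norm


print(normalize_metric("mos", 3.0))   # 0.5
print(normalize_metric("sdr", 45.0))  # 1.0 (clamped to 30 first)
print(normalize_metric("mcd", 5.0))   # 0.75 (0.25 before the flip)
```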

The Two Scores

| Score | Metrics Used | Purpose |
|---|---|---|
| `common_score` | MOS, DNSMOS_OVRL, ScoreQ, UTMOS, NISQA_MOS | Reference-free quality estimate |
| `extended_score` | All metrics the model outputs (5 or 22) | Full diagnostic picture |

With "5metric", the two scores are identical. With "22metric", extended_score additionally includes SDR, PESQ, PESQ-C2, MCD, LSD, ESTOI, LPS, speaker similarity, semantic similarity, and SIGMOS subscales.

Routing Logic in `evaluate_file`

The method has four branches:

engine_22 exists + ref provided  → predict with ref using 22metric engine
engine_22 exists + no ref       → fallback to 5metric engine (no-ref metrics only)
single model + ref provided     → predict with ref using loaded model
single model + no ref           → predict without ref using loaded model

The last branch fails for a single "22metric" model: calling predict() without a reference raises ValueError.
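The four branches can be written as a small decision function. This is a sketch of the routing only, under the assumption that the dual-engine case corresponds to model="both"; `route` and the returned labels are hypothetical and no model is loaded:

```python
from typing import Optional, Tuple


def route(has_engine_22: bool, ref_path: Optional[str]) -> Tuple[str, bool]:
    """Return (engine label, whether the reference is passed to predict)."""
    if has_engine_22:  # dual-engine mode, i.e. model="both"
        if ref_path is not None:
            return ("22metric", True)   # reference-based scoring
        return ("5metric", False)       # fallback: no-ref metrics only
    # Single model: the reference is forwarded only when one was given.
    return ("loaded", ref_path is not None)
```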

Reference Convention

Reference files must follow the naming convention REF_<stem>.wav. For a test file sample01.wav, the library looks for REF_sample01.wav in the reference directory. Matching is case-insensitive.

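When using Experiment with ref_dir, files without a matching reference fall back to no-ref metrics only. This fallback relies on the 5metric engine that model="both" (the Experiment default) loads alongside the 22metric engine.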
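The naming convention and the case-insensitive lookup can be sketched on plain path lists, without touching the filesystem. `match_refs` is a hypothetical stand-in, not the library's `match_references`:

```python
from pathlib import Path
from typing import Dict, Iterable, Optional


def match_refs(
    test_files: Iterable[str], ref_files: Iterable[str], prefix: str = "REF_"
) -> Dict[Path, Optional[Path]]:
    """Map each test file to <prefix><stem>.wav, compared case-insensitively."""
    # Index references by lowercased file name for case-insensitive lookup.
    index = {Path(r).name.lower(): Path(r) for r in ref_files}
    matched: Dict[Path, Optional[Path]] = {}
    for t in test_files:
        wanted = f"{prefix}{Path(t).stem}.wav".lower()
        matched[Path(t)] = index.get(wanted)  # None means: score without a reference
    return matched


pairs = match_refs(
    ["algorithm_A/sample01.wav", "algorithm_A/sample03.wav"],
    ["clean_refs/REF_sample01.wav", "clean_refs/REF_sample02.wav"],
)
for test, ref in pairs.items():
    print(test.name, "->", ref.name if ref else None)
# sample01.wav -> REF_sample01.wav
# sample03.wav -> None
```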

File I/O (`sqa_eval.io`)

| Function | Purpose |
|---|---|
| `scan_audio(dir, pattern, recursive)` | Finds all `.wav` files in a directory |
| `match_references(test_files, ref_dir, prefix)` | Pairs test files with `REF_*` references |
| `resolve_experiment(base_dir, systems)` | Maps system names to their audio files |
| `match_experiment_refs(system_files, ref_dir)` | Multi-system version of `match_references` |
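As a rough picture of what scanning and resolving involve, the sketch below uses only pathlib. `scan_wavs` and `resolve_systems` are hypothetical stand-ins for `scan_audio` and `resolve_experiment`; sorting the results and raising FileNotFoundError for a missing system directory are assumptions.

```python
from pathlib import Path
from typing import Dict, List, Sequence, Union

PathLike = Union[str, Path]


def scan_wavs(directory: PathLike, recursive: bool = False) -> List[Path]:
    """Return the .wav files in a directory, sorted for reproducible ordering."""
    root = Path(directory)
    found = root.rglob("*.wav") if recursive else root.glob("*.wav")
    return sorted(p for p in found if p.is_file())


def resolve_systems(base_dir: PathLike, systems: Sequence[str]) -> Dict[str, List[Path]]:
    """Map each system name to the .wav files in <base_dir>/<system>/."""
    base = Path(base_dir)
    missing = [s for s in systems if not (base / s).is_dir()]
    if missing:
        # Failing early avoids silently ranking a system with zero files.
        raise FileNotFoundError(f"missing system directories: {missing}")
    return {s: scan_wavs(base / s) for s in systems}
```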

Use Cases

1. Quick MOS score for a single file

from sqa_eval import Evaluator

e = Evaluator("5metric")  # reference-free model
result = e.evaluate_file("call_recording.wav")
print(f"Speech quality: {result.common_score:.3f}")

2. Compare two denoising algorithms

recordings/
├── algorithm_A/
│   ├── sample01.wav
│   └── sample02.wav
└── algorithm_B/
    ├── sample01.wav
    └── sample02.wav
clean_refs/
├── REF_sample01.wav
└── REF_sample02.wav

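from sqa_eval import Experiment

exp = Experiment(
    name="denoiser-comparison",
    base_dir="./recordings",
    systems=["algorithm_A", "algorithm_B"],  # subdirectories of base_dir
    ref_dir="./clean_refs",                  # holds the REF_*.wav files
    model="both",
)
exp.run()     # score every file in every system
exp.report()  # write the CSV, JSON, and plot outputs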

3. Batch inference with raw scores

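from sqa_eval import InferenceEngine

engine = InferenceEngine("22metric")
# Each pair is (test audio, clean reference); both systems share one reference.
pairs = [
    ("output/dnn_v1/sample.wav", "clean/REF_sample.wav"),
    ("output/dnn_v2/sample.wav", "clean/REF_sample.wav"),
]
results = engine.predict_batch(pairs)  # one dict of raw scores per pair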

4. Custom model from HuggingFace

from sqa_eval import Evaluator

e = Evaluator(model="username/my-custom-sqa-model")
result = e.evaluate_file("audio.wav")

5. Weighted metrics (favor intelligibility)

from sqa_eval import Evaluator

e = Evaluator(
    model="22metric",
    weights={"estoi": 3.0, "mos": 0.5},  # unlisted metrics keep weight 1.0
)
result = e.evaluate_file("audio.wav", ref_path="clean.wav")
