A standalone tool for evaluating Automatic Speech Recognition (ASR) models, particularly optimized for medical/clinical speech recognition, using the Word Error Rate (WER) metric. The tool supports evaluation via API endpoints or direct HuggingFace model inference.
- **Flexible Evaluation:** Evaluate models via API endpoints or directly using HuggingFace models
- **Comprehensive Metrics:** Calculate WER along with timing statistics, real-time factors, and detailed per-sample results
- **Parallel Processing:** Support for concurrent requests to speed up evaluation
- **Multiple Dataset Formats:** Works with any HuggingFace dataset containing audio and text columns (see the loading sketch after this list)
- **Detailed Reporting:** Export detailed results in JSON format for further analysis
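For example, a minimal sketch (assuming the `datasets` library and the default column names) for checking that a dataset exposes the expected audio and text columns:

```python
from datasets import load_dataset

# Load the validation split of the example dataset used throughout this README
ds = load_dataset("NeurologyAI/neuro-whisper-v1", split="validation")

# Inspect the columns; pass non-default names via --audio-column / --text-column
print(ds.column_names)         # e.g. ['audio', 'transcription']
print(ds[0]["transcription"])  # first reference transcription
```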
We recommend using a conda environment with Python 3.11 for optimal compatibility.
```bash
# Clone the repository
git clone https://github.com/riedemannai/Medical_ASR_Evaluator.git
cd Medical_ASR_Evaluator

# Create and activate conda environment with Python 3.11
conda create -n medical_asr_evaluator python=3.11 -y
conda activate medical_asr_evaluator

# Install dependencies
pip install -r requirements.txt
```

Alternatively, you can use the provided `environment.yml` file:
```bash
# Create environment from file
conda env create -f environment.yml
conda activate medical_asr_evaluator
```

Or, for a plain pip installation:

```bash
# Clone the repository
git clone https://github.com/riedemannai/Medical_ASR_Evaluator.git
cd Medical_ASR_Evaluator

# Install dependencies
pip install -r requirements.txt
```

If you have an ASR server running (e.g., an OpenAI-compatible API):
```bash
python wer_evaluator.py \
    --dataset NeurologyAI/neuro-whisper-v1 \
    --split validation \
    --api-url http://localhost:8002 \
    --output results.json
```

💡 **Suggested ASR Server:** For medical/clinical ASR evaluation, we recommend parakeet-mlx-server - an OpenAI-compatible FastAPI server optimized for German neurology and neuro-oncology audio transcription using Parakeet-MLX on Apple Silicon.
If you want to evaluate a model directly without an API:
```bash
python wer_evaluator.py \
    --dataset NeurologyAI/neuro-whisper-v1 \
    --split validation \
    --model NeurologyAI/neuro-parakeet-mlx \
    --output results.json
```

📝 **Tested Configuration:** This example uses the tested model NeurologyAI/neuro-parakeet-mlx evaluated on the NeurologyAI/neuro-whisper-v1 dataset, achieving a WER of 1.04% on the validation split (5,289 samples).
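Under the hood, direct evaluation amounts to running HuggingFace inference on each sample. A rough sketch of that pattern, using the `transformers` pipeline with a generic Whisper checkpoint as a stand-in (the tool's actual loading code may differ, especially for MLX-based models):

```python
from transformers import pipeline

# Illustrative stand-in checkpoint; MLX models such as
# NeurologyAI/neuro-parakeet-mlx may need a dedicated MLX loader instead
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

result = asr("audio.wav")  # accepts a file path or a raw waveform
print(result["text"])
```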
For quick testing, limit the number of samples:
```bash
python wer_evaluator.py \
    --dataset NeurologyAI/neuro-whisper-v1 \
    --split validation \
    --api-url http://localhost:8002 \
    --limit 100
```

Speed up evaluation by processing multiple samples concurrently (an illustrative sketch of this concurrency pattern follows the options list below):
```bash
python wer_evaluator.py \
    --dataset NeurologyAI/neuro-whisper-v1 \
    --split validation \
    --api-url http://localhost:8002 \
    --batch-size 4
```

Available command-line options:

- `--dataset`: HuggingFace dataset name (required)
- `--split`: Dataset split to evaluate (default: `validation`)
- `--api-url`: ASR API base URL (e.g., `http://localhost:8002`). Required if `--model` is not provided.
- `--model`: HuggingFace model name (e.g., `NeurologyAI/neuro-parakeet`). Required if `--api-url` is not provided.
- `--language`: Language code for transcription (default: `de`)
- `--limit`: Limit the number of samples to evaluate (default: all)
- `--output`: Output file for detailed results (JSON format)
- `--audio-column`: Name of the audio column in the dataset (default: `audio`)
- `--text-column`: Name of the transcription column in the dataset (default: `transcription`)
- `--batch-size`: Number of concurrent requests for parallel processing (default: `1`)
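The `--batch-size` concurrency can be pictured as a thread pool issuing several API requests at once. A minimal sketch of that pattern (the file names are hypothetical and this is not the tool's actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "http://localhost:8002/v1/audio/transcriptions"

def transcribe(path: str) -> str:
    # One request per audio file, matching the API format described below
    with open(path, "rb") as f:
        files = {"file": (path, f.read(), "audio/wav")}
    response = requests.post(API_URL, files=files, data={"language": "de"})
    response.raise_for_status()
    return response.json()["text"]

audio_paths = ["sample1.wav", "sample2.wav", "sample3.wav", "sample4.wav"]

# Four concurrent requests, analogous to --batch-size 4
with ThreadPoolExecutor(max_workers=4) as pool:
    predictions = list(pool.map(transcribe, audio_paths))
```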
The tool provides:
- **Console Output:** Real-time progress and summary statistics, including:
  - Sample statistics (total, valid, failed)
  - Timing statistics (evaluation time, inference time, real-time factor)
  - Word Error Rate (WER)
- **JSON Output** (if `--output` is specified): Detailed results, including:
  - Overall WER and statistics
  - Per-sample predictions and references
  - Timing information for each sample
  - Error counts and failed samples
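The exact JSON schema is easiest to discover from a real run; a minimal sketch (assuming a file produced with `--output results.json`; the key layout is whatever the tool writes, so inspect it rather than relying on specific field names):

```python
import json

# Load a results file produced with --output results.json
with open("results.json") as f:
    results = json.load(f)

# Top-level keys hold the overall metrics; per-sample entries sit below them
print(results.keys())
```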
Example console output:

```
============================================================
Computing Word Error Rate (WER)...
============================================================

============================================================
EVALUATION RESULTS
============================================================

📊 Sample Statistics:
   Total samples: 5289
   Valid samples: 5289
   Failed samples: 0
   Success rate: 100.00%

⏱️ Timing Statistics:
   Total evaluation time: 953.26s (15.89 min)
   Total audio duration: 22786.68s (379.78 min, 6.33 hours)
   Total inference time: 944.00s (15.73 min)
   Average inference time per sample: 0.178s
   Real-time factor (RTF): 0.041x
   Processing rate: 5.55 samples/s

============================================================
🎯 Word Error Rate (WER): 1.04%
============================================================

✅ Detailed results saved to: results.json
```

The real-time factor is total inference time divided by total audio duration (944.00s / 22786.68s ≈ 0.041), so values below 1 mean the model transcribes faster than real time.
The tool expects an OpenAI-compatible transcription API endpoint:
- Endpoint: `POST /v1/audio/transcriptions`
- Request: Multipart form data with `file` (audio file) and optional `model`, `language`, and `response_format` fields
- Response: JSON with a `text` field containing the transcription

**Recommended Server:** parakeet-mlx-server - an OpenAI-compatible FastAPI server for German neurology and neuro-oncology audio transcription, optimized for Apple Silicon M4.
Example API request:
```python
import requests

# Read the audio file to send to the transcription endpoint
with open('audio.wav', 'rb') as f:
    audio_bytes = f.read()

files = {'file': ('audio.wav', audio_bytes, 'audio/wav')}
data = {'model': 'parakeet-tdt-0.6b-v3', 'language': 'de', 'response_format': 'json'}
response = requests.post('http://localhost:8002/v1/audio/transcriptions', files=files, data=data)
print(response.json()['text'])
```

The tool automatically cleans text for WER computation by:
- Converting to lowercase
- Stripping whitespace
- Removing trailing punctuation (`.`, `,`, `!`, `?`)
This normalization ensures fair comparison between predictions and references.
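A minimal sketch of this normalization together with a WER computation, here using the `jiwer` library for illustration (whether the tool itself uses `jiwer`, and the helper name `normalize`, are assumptions):

```python
import jiwer  # pip install jiwer

def normalize(text: str) -> str:
    # Lowercase, strip whitespace, drop trailing punctuation (., ,, !, ?)
    return text.lower().strip().rstrip(".,!?")

reference = "Der Patient klagt über Kopfschmerzen."
prediction = "der Patient klagt über Kopfschmerzen"

# WER = (substitutions + deletions + insertions) / words in reference
print(jiwer.wer(normalize(reference), normalize(prediction)))  # 0.0
```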
- Python 3.8+ (Python 3.11 recommended)
- See `requirements.txt` for package dependencies
- We recommend using a conda environment with Python 3.11 for optimal compatibility
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this tool in your research, please cite:
```bibtex
@software{medical_asr_evaluator,
  title  = {Medical ASR Evaluator},
  author = {Riedemann, Lars},
  year   = {2026},
  url    = {https://github.com/riedemannai/Medical_ASR_Evaluator}
}
```