A standalone tool for evaluating Automatic Speech Recognition (ASR) models, particularly optimized for medical/clinical speech recognition, using the Word Error Rate (WER) metric. The tool supports evaluation via API endpoints or direct HuggingFace model inference.
- **Flexible Evaluation:** Evaluate models via API endpoints or directly using HuggingFace models
- **Comprehensive Metrics:** Calculate WER along with timing statistics, real-time factors, and detailed per-sample results
- **Parallel Processing:** Support for concurrent requests to speed up evaluation
- **Multiple Dataset Formats:** Works with any HuggingFace dataset containing audio and text columns (see the loading sketch after this list)
- **Detailed Reporting:** Export detailed results in JSON format for further analysis
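For example, a minimal sketch (assuming the `datasets` library and the default column names) for checking that a dataset exposes the expected audio and text columns:

```python
from datasets import load_dataset

# Load the validation split of the example dataset used throughout this README
ds = load_dataset("NeurologyAI/neuro-whisper-v1", split="validation")

# Inspect the columns; pass non-default names via --audio-column / --text-column
print(ds.column_names)         # e.g. ['audio', 'transcription']
print(ds[0]["transcription"])  # first reference transcription
```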
We recommend using a conda environment with Python 3.11 for optimal compatibility.
```bash
# Clone the repository
git clone https://github.com/riedemannai/Medical_ASR_Evaluator.git
cd Medical_ASR_Evaluator

# Create and activate conda environment with Python 3.11
conda create -n medical_asr_evaluator python=3.11 -y
conda activate medical_asr_evaluator

# Install dependencies
pip install -r requirements.txt
```

Alternatively, you can use the provided `environment.yml` file:
```bash
# Create environment from file
conda env create -f environment.yml
conda activate medical_asr_evaluator
```

Or, for a plain pip installation:

```bash
# Clone the repository
git clone https://github.com/riedemannai/Medical_ASR_Evaluator.git
cd Medical_ASR_Evaluator

# Install dependencies
pip install -r requirements.txt
```

If you have an ASR server running (e.g., an OpenAI-compatible API):
```bash
python wer_evaluator.py \
    --dataset NeurologyAI/neuro-whisper-v1 \
    --split validation \
    --api-url http://localhost:8002 \
    --output results.json
```

💡 **Suggested ASR Server:** For medical/clinical ASR evaluation, we recommend parakeet-mlx-server - an OpenAI-compatible FastAPI server optimized for German neurology and neuro-oncology audio transcription using Parakeet-MLX on Apple Silicon.
If you want to evaluate a model directly without an API:
```bash
python wer_evaluator.py \
    --dataset NeurologyAI/neuro-whisper-v1 \
    --split validation \
    --model NeurologyAI/neuro-parakeet-mlx \
    --output results.json
```

📝 **Tested Configuration:** This example uses the tested model NeurologyAI/neuro-parakeet-mlx evaluated on the NeurologyAI/neuro-whisper-v1 dataset, achieving a WER of 1.04% on the validation split (5,289 samples).
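Under the hood, direct evaluation amounts to running HuggingFace inference on each sample. A rough sketch of that pattern, using the `transformers` pipeline with a generic Whisper checkpoint as a stand-in (the tool's actual loading code may differ, especially for MLX-based models):

```python
from transformers import pipeline

# Illustrative stand-in checkpoint; MLX models such as
# NeurologyAI/neuro-parakeet-mlx may need a dedicated MLX loader instead
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

result = asr("audio.wav")  # accepts a file path or a raw waveform
print(result["text"])
```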
For quick testing, limit the number of samples:
```bash
python wer_evaluator.py \
    --dataset NeurologyAI/neuro-whisper-v1 \
    --split validation \
    --api-url http://localhost:8002 \
    --limit 100
```

Speed up evaluation by processing multiple samples concurrently (an illustrative sketch of this concurrency pattern follows the options list below):
```bash
python wer_evaluator.py \
    --dataset NeurologyAI/neuro-whisper-v1 \
    --split validation \
    --api-url http://localhost:8002 \
    --batch-size 4
```

Available command-line options:

- `--dataset`: HuggingFace dataset name (required)
- `--split`: Dataset split to evaluate (default: `validation`)
- `--api-url`: ASR API base URL (e.g., `http://localhost:8002`). Required if `--model` is not provided.
- `--model`: HuggingFace model name (e.g., `NeurologyAI/neuro-parakeet`). Required if `--api-url` is not provided.
- `--language`: Language code for transcription (default: `de`)
- `--limit`: Limit the number of samples to evaluate (default: all)
- `--output`: Output file for detailed results (JSON format)
- `--audio-column`: Name of the audio column in the dataset (default: `audio`)
- `--text-column`: Name of the transcription column in the dataset (default: `transcription`)
- `--batch-size`: Number of concurrent requests for parallel processing (default: `1`)
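The `--batch-size` concurrency can be pictured as a thread pool issuing several API requests at once. A minimal sketch of that pattern (the file names are hypothetical and this is not the tool's actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "http://localhost:8002/v1/audio/transcriptions"

def transcribe(path: str) -> str:
    # One request per audio file, matching the API format described below
    with open(path, "rb") as f:
        files = {"file": (path, f.read(), "audio/wav")}
    response = requests.post(API_URL, files=files, data={"language": "de"})
    response.raise_for_status()
    return response.json()["text"]

audio_paths = ["sample1.wav", "sample2.wav", "sample3.wav", "sample4.wav"]

# Four concurrent requests, analogous to --batch-size 4
with ThreadPoolExecutor(max_workers=4) as pool:
    predictions = list(pool.map(transcribe, audio_paths))
```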
The tool provides:
- **Console Output:** Real-time progress and summary statistics, including:
  - Sample statistics (total, valid, failed)
  - Timing statistics (evaluation time, inference time, real-time factor)
  - Word Error Rate (WER)
- **JSON Output** (if `--output` is specified): Detailed results, including:
  - Overall WER and statistics
  - Per-sample predictions and references
  - Timing information for each sample
  - Error counts and failed samples
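The exact JSON schema is easiest to discover from a real run; a minimal sketch (assuming a file produced with `--output results.json`; the key layout is whatever the tool writes, so inspect it rather than relying on specific field names):

```python
import json

# Load a results file produced with --output results.json
with open("results.json") as f:
    results = json.load(f)

# Top-level keys hold the overall metrics; per-sample entries sit below them
print(results.keys())
```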
Example console output:

```
============================================================
Computing Word Error Rate (WER)...
============================================================

============================================================
EVALUATION RESULTS
============================================================

📊 Sample Statistics:
   Total samples: 5289
   Valid samples: 5289
   Failed samples: 0
   Success rate: 100.00%

⏱️ Timing Statistics:
   Total evaluation time: 953.26s (15.89 min)
   Total audio duration: 22786.68s (379.78 min, 6.33 hours)
   Total inference time: 944.00s (15.73 min)
   Average inference time per sample: 0.178s
   Real-time factor (RTF): 0.041x
   Processing rate: 5.55 samples/s

============================================================
🎯 Word Error Rate (WER): 1.04%
============================================================

✅ Detailed results saved to: results.json
```

The real-time factor is total inference time divided by total audio duration (944.00s / 22786.68s ≈ 0.041), so values below 1 mean the model transcribes faster than real time.
The tool expects an OpenAI-compatible transcription API endpoint:
- Endpoint: `POST /v1/audio/transcriptions`
- Request: Multipart form data with `file` (audio file) and optional `model`, `language`, and `response_format` fields
- Response: JSON with a `text` field containing the transcription

**Recommended Server:** parakeet-mlx-server - an OpenAI-compatible FastAPI server for German neurology and neuro-oncology audio transcription, optimized for Apple Silicon M4.
Example API request:
```python
import requests

# Read the audio file to send to the transcription endpoint
with open('audio.wav', 'rb') as f:
    audio_bytes = f.read()

files = {'file': ('audio.wav', audio_bytes, 'audio/wav')}
data = {'model': 'parakeet-tdt-0.6b-v3', 'language': 'de', 'response_format': 'json'}
response = requests.post('http://localhost:8002/v1/audio/transcriptions', files=files, data=data)
print(response.json()['text'])
```

The tool automatically cleans text for WER computation by:
- Converting to lowercase
- Stripping whitespace
- Removing trailing punctuation (`.`, `,`, `!`, `?`)
This normalization ensures fair comparison between predictions and references.
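A minimal sketch of this normalization together with a WER computation, here using the `jiwer` library for illustration (whether the tool itself uses `jiwer`, and the helper name `normalize`, are assumptions):

```python
import jiwer  # pip install jiwer

def normalize(text: str) -> str:
    # Lowercase, strip whitespace, drop trailing punctuation (., ,, !, ?)
    return text.lower().strip().rstrip(".,!?")

reference = "Der Patient klagt über Kopfschmerzen."
prediction = "der Patient klagt über Kopfschmerzen"

# WER = (substitutions + deletions + insertions) / words in reference
print(jiwer.wer(normalize(reference), normalize(prediction)))  # 0.0
```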
- Python 3.8+ (Python 3.11 recommended)
- See `requirements.txt` for package dependencies
- We recommend using a conda environment with Python 3.11 for optimal compatibility
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this tool in your research, please cite:
```bibtex
@software{medical_asr_evaluator,
  title  = {Medical ASR Evaluator},
  author = {Riedemann, Lars},
  year   = {2026},
  url    = {https://github.com/riedemannai/Medical_ASR_Evaluator}
}
```