Summary
Implement a fully self-hosted transcription pipeline that runs on CPU only using:
- faster-whisper for ASR (speech-to-text)
- pyannote.audio for speaker diarization (speaker segmentation / speaker ID labels)
This feature should produce timestamped transcripts with speaker attribution (e.g., SPEAKER_00, SPEAKER_01) and integrate with existing task/job orchestration.
Why
We need an offline, privacy-preserving transcription stack that does not require external APIs or GPU hardware. faster-whisper (CTranslate2 backend) with INT8 quantization makes CPU inference practical, while pyannote provides speaker diarization.
Key requirements
- 100% self-hosted (no cloud transcription dependencies)
- CPU-only support as first-class path
- Pure Python integration in our backend task pipeline
- Deterministic output artifacts (JSON + optional text formats)
- Graceful handling of long audio and failure/retry behavior
Technical direction
1) ASR engine: faster-whisper on CPU
- Use WhisperModel(..., device="cpu", compute_type="int8")
- Make model size configurable (default: small or medium, based on benchmark)
- Expose language mode:
- auto-detect (default)
- forced language override
- Return segment-level output with:
- start, end, text
- confidence/probability when available
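As a sketch of the segment normalization step, the helper below maps a faster-whisper-style segment object (`.start`, `.end`, `.text`, `.avg_logprob`, optional `.words`) onto the proposed output schema. The name `segment_to_dict` and the `exp(avg_logprob)` confidence heuristic are illustrative choices, not a settled API:

```python
import math

def segment_to_dict(seg, seg_id):
    """Normalize an ASR segment (faster-whisper style attributes) into
    the output-schema dict used in this spec."""
    out = {
        "id": seg_id,
        "start": round(seg.start, 2),
        "end": round(seg.end, 2),
        "text": seg.text.strip(),
    }
    # avg_logprob is a log-probability; exp() maps it to a rough 0-1 confidence
    if getattr(seg, "avg_logprob", None) is not None:
        out["confidence"] = round(math.exp(seg.avg_logprob), 2)
    # Word-level timestamps are optional (only present when requested)
    words = getattr(seg, "words", None)
    if words:
        out["words"] = [
            {"word": w.word, "start": w.start, "end": w.end,
             "probability": w.probability}
            for w in words
        ]
    return out
```

Keeping this conversion in one place isolates the rest of the pipeline from the ASR engine's object model, which helps the "extensible for future engines" goal.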
2) Diarization: pyannote
- Run speaker diarization on same audio input
- Produce diarization segments with: start, end, speaker
- Merge ASR segments with diarization timeline to attach speaker labels to transcript chunks
- Use overlap-based matching strategy (highest overlap wins)
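The overlap-based matching strategy can be sketched as below; segment and turn shapes follow the proposed schema, and the `SPEAKER_UNKNOWN` default for non-overlapping segments is an assumption to be confirmed:

```python
def assign_speakers(asr_segments, diarization, default="SPEAKER_UNKNOWN"):
    """Attach a speaker label to each ASR segment by picking the
    diarization turn with the largest time overlap (highest overlap wins)."""
    labeled = []
    for seg in asr_segments:
        best_label, best_overlap = default, 0.0
        for turn in diarization:
            # Overlap of [seg.start, seg.end] with [turn.start, turn.end];
            # negative when the intervals are disjoint
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best_label, best_overlap = turn["speaker"], overlap
        labeled.append({**seg, "speaker": best_label})
    return labeled
```

The O(n·m) scan is fine for meeting-length audio; a sorted two-pointer merge is an easy optimization later if timelines get long.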
3) Optional comparison mode (non-blocking)
- Add optional experimental switch for WhisperX CPU mode:
device="cpu", compute_type="int8"
- This is for A/B testing and future evaluation only, not required for initial GA path.
Deliverables
- Transcription service module
- Encapsulates model loading, inference, diarization, and merge logic
- Task integration
- Wire into existing async/background task flow
- Persist output artifacts in project output directory
- Output schema
- JSON artifact containing metadata, speaker-attributed transcript segments, and optional detailed timelines
- Configuration surface
- Env/config options for model size, device, compute type, diarization toggles
- Error handling
- Clear exceptions and user-facing task status messages
- Tests
- Unit tests for merge/alignment logic
- Integration test on a short fixture audio file
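One possible shape for the configuration surface deliverable, as an env-backed dataclass; the `TRANSCRIBE_*` variable names are placeholders and should follow whatever convention the project's existing config already uses:

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass(frozen=True)
class TranscriptionConfig:
    asr_model: str = "small"
    device: str = "cpu"
    compute_type: str = "int8"
    language: Optional[str] = None      # None = auto-detect
    diarization_enabled: bool = True

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "TranscriptionConfig":
        """Read settings from environment variables, falling back to the
        CPU/INT8 defaults described in this spec."""
        return cls(
            asr_model=env.get("TRANSCRIBE_ASR_MODEL", "small"),
            device=env.get("TRANSCRIBE_DEVICE", "cpu"),
            compute_type=env.get("TRANSCRIBE_COMPUTE_TYPE", "int8"),
            language=env.get("TRANSCRIBE_LANGUAGE") or None,
            diarization_enabled=env.get("TRANSCRIBE_DIARIZATION", "1") != "0",
        )
```

A frozen dataclass keeps the config immutable once the task starts, so a long-running job cannot be reconfigured mid-flight.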
Proposed output format
```json
{
  "metadata": {
    "engine": "faster-whisper+pyannote",
    "device": "cpu",
    "compute_type": "int8",
    "asr_model": "small",
    "language": "en",
    "audio_duration_sec": 312.4,
    "processing_time_sec": 128.7
  },
  "segments": [
    {
      "id": 0,
      "start": 0.52,
      "end": 3.41,
      "speaker": "SPEAKER_00",
      "text": "Hi everyone, thanks for joining.",
      "confidence": 0.93,
      "words": [
        { "word": "Hi", "start": 0.52, "end": 0.71, "probability": 0.98 },
        { "word": "everyone,", "start": 0.72, "end": 1.18, "probability": 0.95 }
      ]
    },
    {
      "id": 1,
      "start": 3.60,
      "end": 6.10,
      "speaker": "SPEAKER_01",
      "text": "Great, let's start with updates.",
      "confidence": 0.90
    }
  ],
  "speakers": [
    { "label": "SPEAKER_00", "total_speaking_time_sec": 182.3 },
    { "label": "SPEAKER_01", "total_speaking_time_sec": 119.4 }
  ],
  "diarization": [
    { "start": 0.40, "end": 3.50, "speaker": "SPEAKER_00" },
    { "start": 3.55, "end": 6.30, "speaker": "SPEAKER_01" }
  ]
}
```
Schema notes
segments is the primary consumable transcript output.
speaker labels are anonymous (SPEAKER_00, SPEAKER_01, ...), not human identity resolution.
words is optional and included when word-level timestamps/confidence are available.
diarization is optional raw timeline retained for debugging, QA, and future reprocessing.
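The `speakers` array can be derived from the attributed segments rather than stored state; a minimal sketch, assuming segments already carry `start`, `end`, and `speaker` as in the schema above:

```python
from collections import defaultdict

def speakers_summary(segments):
    """Build the optional `speakers` array: total speaking time per label,
    summed over the speaker-attributed transcript segments."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    # Sort by label so output order is deterministic across runs
    return [
        {"label": label, "total_speaking_time_sec": round(sec, 1)}
        for label, sec in sorted(totals.items())
    ]
```

Deterministic ordering here matters for the "deterministic output artifacts" requirement, since dict iteration order would otherwise depend on segment order.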
Implementation checklist (Copilot)
- Add dependencies: faster-whisper, pyannote.audio (+ any required audio utils)
- Implement transcribe_audio(...) function for the CPU INT8 path
- Implement diarize_audio(...) function
- Implement merge_transcript_with_speakers(...) with overlap scoring
- Build speakers summary (total_speaking_time_sec)
- Wire into the task pipeline (app/core/tasks.py or equivalent)
Acceptance criteria
- Running transcription on a CPU-only machine completes successfully for a short sample file
- Output includes speaker-attributed transcript segments with timestamps
- JSON matches the agreed schema (metadata, segments, optional speakers and diarization)
- Pipeline works end-to-end through existing task execution path
- If diarization fails, task reports clear error (or configurable fallback to transcript-only)
- Unit and integration tests pass in CI
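The diarization-failure criterion could be handled at the orchestration layer roughly as follows; `run_pipeline` and its callable parameters are hypothetical names standing in for the real service functions:

```python
import logging

log = logging.getLogger(__name__)

def run_pipeline(transcribe, diarize, merge, audio_path,
                 fallback_to_transcript=True):
    """Hypothetical orchestration: if diarization raises and the fallback
    is enabled, emit a transcript-only artifact instead of failing the task."""
    segments = transcribe(audio_path)
    try:
        timeline = diarize(audio_path)
    except Exception as exc:
        if not fallback_to_transcript:
            raise  # surface a clear task failure
        log.warning("diarization failed (%s); falling back to transcript-only", exc)
        return {"segments": segments, "diarization_error": str(exc)}
    return {"segments": merge(segments, timeline), "diarization": timeline}
```

Recording `diarization_error` in the artifact keeps the degraded mode visible to consumers instead of silently dropping speaker labels.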
Non-goals (initial release)
- Real-time streaming transcription
- GPU-specific optimizations
- Perfect speaker naming (human identity resolution); labels like SPEAKER_00 are acceptable
Notes for implementation
- Prefer lazy model initialization and reuse where safe to avoid repeated cold starts
- Guard memory usage for large files (chunking if needed)
- Keep interfaces extensible for future engines/providers
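If chunking turns out to be needed for large files, the boundary math is the easy-to-get-wrong part; a small sketch (chunk and overlap durations are illustrative defaults, not decided values):

```python
def chunk_spans(duration_sec, chunk_sec=600.0, overlap_sec=5.0):
    """Split a long recording into (start, end) spans with a small overlap,
    so each chunk can be transcribed independently and its timestamps
    shifted back by the chunk start afterwards."""
    spans, start = [], 0.0
    while start < duration_sec:
        end = min(start + chunk_sec, duration_sec)
        spans.append((start, end))
        if end >= duration_sec:
            break
        # Overlap the next chunk slightly to avoid cutting words at the seam
        start = end - overlap_sec
    return spans
```

The overlap means duplicate segments near seams must be deduplicated during the merge, which is another reason to keep the merge logic unit-tested in isolation.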