
Feature: transcription + speaker diarization #29

@younes200

Description

Summary

Implement a fully self-hosted transcription pipeline that runs on CPU only using:

  • faster-whisper for ASR (speech-to-text)
  • pyannote.audio for speaker diarization (speaker segmentation / speaker ID labels)

This feature should produce timestamped transcripts with speaker attribution (e.g., SPEAKER_00, SPEAKER_01) and integrate with existing task/job orchestration.

Why

We need an offline, privacy-preserving transcription stack that does not require external APIs or GPU hardware. faster-whisper (CTranslate2 backend) with INT8 quantization makes CPU inference practical, while pyannote provides speaker diarization.

Key requirements

  • 100% self-hosted (no cloud transcription dependencies)
  • CPU-only support as first-class path
  • Pure Python integration in our backend task pipeline
  • Deterministic output artifacts (JSON + optional text formats)
  • Graceful handling of long audio and failure/retry behavior

Technical direction

1) ASR engine: faster-whisper on CPU

  • Use WhisperModel(..., device="cpu", compute_type="int8")
  • Make the model size configurable (default: small or medium, based on benchmarking)
  • Expose language mode:
    • auto-detect (default)
    • forced language override
  • Return segment-level output with:
    • start, end, text
    • confidence/probability when available

2) Diarization: pyannote

  • Run speaker diarization on same audio input
  • Produce diarization segments:
    • start, end, speaker
  • Merge ASR segments with diarization timeline to attach speaker labels to transcript chunks
  • Use overlap-based matching strategy (highest overlap wins)

3) Optional comparison mode (non-blocking)

  • Add optional experimental switch for WhisperX CPU mode:
    • device="cpu", compute_type="int8"
  • This is for A/B testing and future evaluation only, not required for initial GA path.

Deliverables

  1. Transcription service module
    • Encapsulates model loading, inference, diarization, and merge logic
  2. Task integration
    • Wire into existing async/background task flow
    • Persist output artifacts in project output directory
  3. Output schema
    • JSON artifact containing metadata, speaker-attributed transcript segments, and optional detailed timelines
  4. Configuration surface
    • Env/config options for model size, device, compute type, diarization toggles
  5. Error handling
    • Clear exceptions and user-facing task status messages
  6. Tests
    • Unit tests for merge/alignment logic
    • Integration test on a short fixture audio file

Proposed output format

```json
{
  "metadata": {
    "engine": "faster-whisper+pyannote",
    "device": "cpu",
    "compute_type": "int8",
    "asr_model": "small",
    "language": "en",
    "audio_duration_sec": 312.4,
    "processing_time_sec": 128.7
  },
  "segments": [
    {
      "id": 0,
      "start": 0.52,
      "end": 3.41,
      "speaker": "SPEAKER_00",
      "text": "Hi everyone, thanks for joining.",
      "confidence": 0.93,
      "words": [
        { "word": "Hi", "start": 0.52, "end": 0.71, "probability": 0.98 },
        { "word": "everyone,", "start": 0.72, "end": 1.18, "probability": 0.95 }
      ]
    },
    {
      "id": 1,
      "start": 3.60,
      "end": 6.10,
      "speaker": "SPEAKER_01",
      "text": "Great, let's start with updates.",
      "confidence": 0.90
    }
  ],
  "speakers": [
    { "label": "SPEAKER_00", "total_speaking_time_sec": 182.3 },
    { "label": "SPEAKER_01", "total_speaking_time_sec": 119.4 }
  ],
  "diarization": [
    { "start": 0.40, "end": 3.50, "speaker": "SPEAKER_00" },
    { "start": 3.55, "end": 6.30, "speaker": "SPEAKER_01" }
  ]
}
```

Schema notes

  • segments is the primary consumable transcript output.
  • speaker labels are anonymous (SPEAKER_00, SPEAKER_01, ...), not human identity resolution.
  • words is optional and included when word-level timestamps/confidence are available.
  • diarization is optional raw timeline retained for debugging, QA, and future reprocessing.
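The speakers summary in the schema can be derived entirely from the merged segments; a minimal sketch (helper name `summarize_speakers` is illustrative):

```python
def summarize_speakers(segments: list[dict]) -> list[dict]:
    """Compute total_speaking_time_sec per speaker label from merged segments.

    Segments without a speaker label (e.g., diarization gaps) are skipped.
    """
    totals: dict[str, float] = {}
    for seg in segments:
        label = seg.get("speaker")
        if label is None:
            continue
        totals[label] = totals.get(label, 0.0) + (seg["end"] - seg["start"])
    return [
        {"label": label, "total_speaking_time_sec": round(sec, 1)}
        for label, sec in sorted(totals.items())
    ]
```

Deriving the summary from `segments` rather than the raw diarization timeline keeps the two views consistent even when merge drops or reassigns turns.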

Implementation checklist (Copilot)

  • Add dependencies for faster-whisper, pyannote.audio (+ any required audio utils)
  • Add config/env keys for transcription and diarization settings
  • Implement transcribe_audio(...) function for CPU INT8 path
  • Implement diarize_audio(...) function
  • Implement merge_transcript_with_speakers(...) with overlap scoring
  • Add speaker aggregation helper to compute speakers summary (total_speaking_time_sec)
  • Integrate pipeline into existing task orchestration (app/core/tasks.py or equivalent)
  • Persist transcript artifact to outputs directory
  • Add structured logs (timings + model settings)
  • Add tests for alignment edge cases (crossing boundaries, gaps, overlaps)
  • Add docs/README section with usage and performance notes
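For the structured-logs item in the checklist, one lightweight option is a context manager that emits a JSON line per pipeline stage with wall-clock timing and model settings. The logger name and field names are placeholders.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("transcription")


@contextmanager
def log_stage(stage: str, **settings):
    """Log one pipeline stage (asr, diarization, merge) as a JSON line.

    Emits elapsed wall-clock time plus whatever settings are passed in,
    and logs even when the stage raises, so failures still leave timings.
    """
    t0 = time.perf_counter()
    try:
        yield
    finally:
        logger.info(json.dumps({
            "stage": stage,
            "elapsed_sec": round(time.perf_counter() - t0, 3),
            **settings,
        }))
```

Usage would look like `with log_stage("asr", model_size="small", compute_type="int8"): ...` around each stage of the task.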

Acceptance criteria

  • Running transcription on a CPU-only machine completes successfully for a short sample file
  • Output includes speaker-attributed transcript segments with timestamps
  • JSON matches the agreed schema (metadata, segments, optional speakers and diarization)
  • Pipeline works end-to-end through existing task execution path
  • If diarization fails, the task reports a clear error (or, when configured, falls back to a transcript-only result)
  • Unit and integration tests pass in CI

Non-goals (initial release)

  • Real-time streaming transcription
  • GPU-specific optimizations
  • Perfect speaker naming (human identity resolution); labels like SPEAKER_00 are acceptable

Notes for implementation

  • Prefer lazy model initialization and reuse where safe to avoid repeated cold starts
  • Guard memory usage for large files (chunking if needed)
  • Keep interfaces extensible for future engines/providers
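Lazy initialization with reuse can be as simple as memoizing the loader, so the first task pays the cold start and later tasks in the same worker reuse the model. A sketch, assuming single-threaded or lock-guarded access (faster-whisper is assumed installed; the import stays deferred):

```python
from functools import lru_cache


@lru_cache(maxsize=2)
def get_asr_model(model_size: str = "small", compute_type: str = "int8"):
    """Load (once per (size, compute_type)) and cache a CPU WhisperModel."""
    from faster_whisper import WhisperModel  # deferred: heavy import

    return WhisperModel(model_size, device="cpu", compute_type=compute_type)
```

`maxsize=2` bounds memory if the configured model size changes at runtime; call `get_asr_model.cache_clear()` to release models explicitly.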
