Summary
Implement a fully self-hosted transcription pipeline that runs on CPU only using:
- faster-whisper for ASR (speech-to-text)
- pyannote.audio for speaker diarization (speaker segmentation / speaker ID labels)
This feature should produce timestamped transcripts with speaker attribution (e.g., SPEAKER_00, SPEAKER_01) and integrate with existing task/job orchestration.
Why
We need an offline, privacy-preserving transcription stack that does not require external APIs or GPU hardware. faster-whisper (CTranslate2 backend) with INT8 quantization makes CPU inference practical, while pyannote provides speaker diarization.
Key requirements
- 100% self-hosted (no cloud transcription dependencies)
- CPU-only support as first-class path
- Pure Python integration in our backend task pipeline
- Deterministic output artifacts (JSON + optional text formats)
- Graceful handling of long audio and failure/retry behavior
Technical direction
1) ASR engine: faster-whisper on CPU
- Use WhisperModel(..., device="cpu", compute_type="int8")
- Make model size configurable (default: small or medium, based on benchmark)
- Expose language mode:
- auto-detect (default)
- forced language override
- Return segment-level output with:
- start, end, text
- confidence/probability when available
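As a sketch of the segment normalization step, the helper below maps a faster-whisper-style segment object (`.start`, `.end`, `.text`, `.avg_logprob`, optional `.words`) onto the proposed output schema. The name `segment_to_dict` and the `exp(avg_logprob)` confidence heuristic are illustrative choices, not a settled API:

```python
import math

def segment_to_dict(seg, seg_id):
    """Normalize an ASR segment (faster-whisper style attributes) into
    the output-schema dict used in this spec."""
    out = {
        "id": seg_id,
        "start": round(seg.start, 2),
        "end": round(seg.end, 2),
        "text": seg.text.strip(),
    }
    # avg_logprob is a log-probability; exp() maps it to a rough 0-1 confidence
    if getattr(seg, "avg_logprob", None) is not None:
        out["confidence"] = round(math.exp(seg.avg_logprob), 2)
    # Word-level timestamps are optional (only present when requested)
    words = getattr(seg, "words", None)
    if words:
        out["words"] = [
            {"word": w.word, "start": w.start, "end": w.end,
             "probability": w.probability}
            for w in words
        ]
    return out
```

Keeping this conversion in one place isolates the rest of the pipeline from the ASR engine's object model, which helps the "extensible for future engines" goal.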
2) Diarization: pyannote
- Run speaker diarization on same audio input
- Produce diarization segments with: start, end, speaker
- Merge ASR segments with diarization timeline to attach speaker labels to transcript chunks
- Use overlap-based matching strategy (highest overlap wins)
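The overlap-based matching strategy can be sketched as below; segment and turn shapes follow the proposed schema, and the `SPEAKER_UNKNOWN` default for non-overlapping segments is an assumption to be confirmed:

```python
def assign_speakers(asr_segments, diarization, default="SPEAKER_UNKNOWN"):
    """Attach a speaker label to each ASR segment by picking the
    diarization turn with the largest time overlap (highest overlap wins)."""
    labeled = []
    for seg in asr_segments:
        best_label, best_overlap = default, 0.0
        for turn in diarization:
            # Overlap of [seg.start, seg.end] with [turn.start, turn.end];
            # negative when the intervals are disjoint
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best_label, best_overlap = turn["speaker"], overlap
        labeled.append({**seg, "speaker": best_label})
    return labeled
```

The O(n·m) scan is fine for meeting-length audio; a sorted two-pointer merge is an easy optimization later if timelines get long.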
3) Optional comparison mode (non-blocking)
- Add optional experimental switch for WhisperX CPU mode:
device="cpu", compute_type="int8"
- This is for A/B testing and future evaluation only, not required for initial GA path.
Deliverables
- Transcription service module
- Encapsulates model loading, inference, diarization, and merge logic
- Task integration
- Wire into existing async/background task flow
- Persist output artifacts in project output directory
- Output schema
- JSON artifact containing metadata, speaker-attributed transcript segments, and optional detailed timelines
- Configuration surface
- Env/config options for model size, device, compute type, diarization toggles
- Error handling
- Clear exceptions and user-facing task status messages
- Tests
- Unit tests for merge/alignment logic
- Integration test on a short fixture audio file
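One possible shape for the configuration surface deliverable, as an env-backed dataclass; the `TRANSCRIBE_*` variable names are placeholders and should follow whatever convention the project's existing config already uses:

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass(frozen=True)
class TranscriptionConfig:
    asr_model: str = "small"
    device: str = "cpu"
    compute_type: str = "int8"
    language: Optional[str] = None      # None = auto-detect
    diarization_enabled: bool = True

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "TranscriptionConfig":
        """Read settings from environment variables, falling back to the
        CPU/INT8 defaults described in this spec."""
        return cls(
            asr_model=env.get("TRANSCRIBE_ASR_MODEL", "small"),
            device=env.get("TRANSCRIBE_DEVICE", "cpu"),
            compute_type=env.get("TRANSCRIBE_COMPUTE_TYPE", "int8"),
            language=env.get("TRANSCRIBE_LANGUAGE") or None,
            diarization_enabled=env.get("TRANSCRIBE_DIARIZATION", "1") != "0",
        )
```

A frozen dataclass keeps the config immutable once the task starts, so a long-running job cannot be reconfigured mid-flight.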
Proposed output format
```json
{
  "metadata": {
    "engine": "faster-whisper+pyannote",
    "device": "cpu",
    "compute_type": "int8",
    "asr_model": "small",
    "language": "en",
    "audio_duration_sec": 312.4,
    "processing_time_sec": 128.7
  },
  "segments": [
    {
      "id": 0,
      "start": 0.52,
      "end": 3.41,
      "speaker": "SPEAKER_00",
      "text": "Hi everyone, thanks for joining.",
      "confidence": 0.93,
      "words": [
        { "word": "Hi", "start": 0.52, "end": 0.71, "probability": 0.98 },
        { "word": "everyone,", "start": 0.72, "end": 1.18, "probability": 0.95 }
      ]
    },
    {
      "id": 1,
      "start": 3.60,
      "end": 6.10,
      "speaker": "SPEAKER_01",
      "text": "Great, let's start with updates.",
      "confidence": 0.90
    }
  ],
  "speakers": [
    { "label": "SPEAKER_00", "total_speaking_time_sec": 182.3 },
    { "label": "SPEAKER_01", "total_speaking_time_sec": 119.4 }
  ],
  "diarization": [
    { "start": 0.40, "end": 3.50, "speaker": "SPEAKER_00" },
    { "start": 3.55, "end": 6.30, "speaker": "SPEAKER_01" }
  ]
}
```
Schema notes
segments is the primary consumable transcript output.
speaker labels are anonymous (SPEAKER_00, SPEAKER_01, ...), not human identity resolution.
words is optional and included when word-level timestamps/confidence are available.
diarization is optional raw timeline retained for debugging, QA, and future reprocessing.
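The `speakers` array can be derived from the attributed segments rather than stored state; a minimal sketch, assuming segments already carry `start`, `end`, and `speaker` as in the schema above:

```python
from collections import defaultdict

def speakers_summary(segments):
    """Build the optional `speakers` array: total speaking time per label,
    summed over the speaker-attributed transcript segments."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    # Sort by label so output order is deterministic across runs
    return [
        {"label": label, "total_speaking_time_sec": round(sec, 1)}
        for label, sec in sorted(totals.items())
    ]
```

Deterministic ordering here matters for the "deterministic output artifacts" requirement, since dict iteration order would otherwise depend on segment order.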
Implementation checklist (Copilot)
- Add dependencies: faster-whisper, pyannote.audio (+ any required audio utils)
- Implement transcribe_audio(...) function for the CPU INT8 path
- Implement diarize_audio(...) function
- Implement merge_transcript_with_speakers(...) with overlap scoring
- Build speakers summary (total_speaking_time_sec)
- Wire into the task pipeline (app/core/tasks.py or equivalent)
Acceptance criteria
- Running transcription on a CPU-only machine completes successfully for a short sample file
- Output includes speaker-attributed transcript segments with timestamps
- JSON matches the agreed schema (metadata, segments, optional speakers and diarization)
- Pipeline works end-to-end through existing task execution path
- If diarization fails, task reports clear error (or configurable fallback to transcript-only)
- Unit and integration tests pass in CI
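The diarization-failure criterion could be handled at the orchestration layer roughly as follows; `run_pipeline` and its callable parameters are hypothetical names standing in for the real service functions:

```python
import logging

log = logging.getLogger(__name__)

def run_pipeline(transcribe, diarize, merge, audio_path,
                 fallback_to_transcript=True):
    """Hypothetical orchestration: if diarization raises and the fallback
    is enabled, emit a transcript-only artifact instead of failing the task."""
    segments = transcribe(audio_path)
    try:
        timeline = diarize(audio_path)
    except Exception as exc:
        if not fallback_to_transcript:
            raise  # surface a clear task failure
        log.warning("diarization failed (%s); falling back to transcript-only", exc)
        return {"segments": segments, "diarization_error": str(exc)}
    return {"segments": merge(segments, timeline), "diarization": timeline}
```

Recording `diarization_error` in the artifact keeps the degraded mode visible to consumers instead of silently dropping speaker labels.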
Non-goals (initial release)
- Real-time streaming transcription
- GPU-specific optimizations
- Perfect speaker naming (human identity resolution); labels like SPEAKER_00 are acceptable
Notes for implementation
- Prefer lazy model initialization and reuse where safe to avoid repeated cold starts
- Guard memory usage for large files (chunking if needed)
- Keep interfaces extensible for future engines/providers
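If chunking turns out to be needed for large files, the boundary math is the easy-to-get-wrong part; a small sketch (chunk and overlap durations are illustrative defaults, not decided values):

```python
def chunk_spans(duration_sec, chunk_sec=600.0, overlap_sec=5.0):
    """Split a long recording into (start, end) spans with a small overlap,
    so each chunk can be transcribed independently and its timestamps
    shifted back by the chunk start afterwards."""
    spans, start = [], 0.0
    while start < duration_sec:
        end = min(start + chunk_sec, duration_sec)
        spans.append((start, end))
        if end >= duration_sec:
            break
        # Overlap the next chunk slightly to avoid cutting words at the seam
        start = end - overlap_sec
    return spans
```

The overlap means duplicate segments near seams must be deduplicated during the merge, which is another reason to keep the merge logic unit-tested in isolation.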