AI-Dictation

Meeting Transcription with Speaker Diarization

AI-Dictation is a powerful tool that transcribes audio recordings while identifying different speakers in the conversation. The application leverages state-of-the-art speech recognition models to provide accurate transcriptions with speaker labels.

Features

  • Multi-speaker diarization (speaker identification)
  • Automatic speech recognition in 100+ languages
  • Translation capability for cross-language understanding
  • Speaker identification with custom naming
  • Export transcripts as text or SRT format
  • Real-time memory usage monitoring
  • GPU acceleration support

Technical Architecture

Note: a detailed architecture diagram will be added here in a future update.

The application integrates several key components in a modular pipeline:

  1. Audio Processing: Handles various input formats and preprocessing via librosa/torchaudio
  2. Speaker Diarization: Uses PyAnnote's speaker-diarization-3.1 model to segment audio by speaker
  3. Speech Recognition: Routes audio segments through the selected ASR model (Whisper/Wav2Vec2/Seamless)
  4. Translation Layer: For multilingual content, connects to appropriate translation backend
  5. Streamlit UI: Provides the interactive frontend for all operations

The pipeline uses a segment-based approach rather than processing the entire audio at once, enabling efficient memory usage and precise speaker attribution. Temporary files are managed with unique identifiers to prevent collisions during parallel processing.
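For orientation, here is a minimal sketch of this segment-based flow, assuming pyannote.audio and faster-whisper as the diarization and ASR backends. The file name, Hugging Face token, and model size are illustrative; this is not the repository's exact code.

```python
# Minimal sketch of the segment-based pipeline (illustrative, not the repo's exact code).
import librosa
from pyannote.audio import Pipeline
from faster_whisper import WhisperModel

AUDIO_PATH = "meeting.wav"   # hypothetical input file
HF_TOKEN = "hf_..."          # your Hugging Face access token

# 1. Speaker diarization: segment the recording by speaker.
diarization = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=HF_TOKEN
)(AUDIO_PATH)

# 2. Load the audio once, then transcribe each speaker turn separately.
audio, sr = librosa.load(AUDIO_PATH, sr=16000, mono=True)
asr = WhisperModel("large-v3", device="cuda", compute_type="float16")

transcript = []
for turn, _, speaker in diarization.itertracks(yield_label=True):
    chunk = audio[int(turn.start * sr):int(turn.end * sr)]
    segments, _ = asr.transcribe(chunk, beam_size=5)
    text = " ".join(s.text.strip() for s in segments)
    transcript.append((speaker, turn.start, turn.end, text))

for speaker, start, end, text in transcript:
    print(f"[{start:7.2f}-{end:7.2f}] {speaker}: {text}")
```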

Model Comparison

We've experimented with multiple speech recognition models including Wav2Vec2 and Whisper, finding superior results with OpenAI's Whisper. Research published in this paper confirms our experience, showing Whisper consistently outperforming Wav2Vec2 for most speech recognition tasks.

Due to these findings, Wav2Vec2 will be removed in an upcoming release to streamline the application and focus on the best performing models.

Note on Language Support: We are actively working on properly mapping the supported languages for speech-to-text and text-to-text translation in the code; this will be fixed soon. For the languages most commonly encountered in US business, speech-to-text and text-to-text translation already works well.

Model Specifications

| Model                 | Parameters | Disk Size | RAM/VRAM Required | Relative Speed | Supported Languages |
|-----------------------|------------|-----------|-------------------|----------------|---------------------|
| Whisper tiny          | 39M        | ~150MB    | ~1GB              | ~10x           | 96+                 |
| Whisper base          | 74M        | ~290MB    | ~1GB              | ~7x            | 96+                 |
| Whisper small         | 244M       | ~970MB    | ~2GB              | ~4x            | 96+                 |
| Whisper medium        | 769M       | ~3.1GB    | ~5GB              | ~2x            | 96+                 |
| Whisper large-v3      | 1.55B      | ~6.2GB    | ~10GB             | 1x (baseline)  | 96+                 |
| Wav2Vec2 base         | 95M        | ~360MB    | ~1GB              | ~6x            | English-focused     |
| Wav2Vec2 large        | 317M       | ~1.2GB    | ~3GB              | ~3x            | English-focused     |
| Wav2Vec2 XLS-R 300M   | 300M       | ~1.2GB    | ~3GB              | ~3x            | 128+                |
| Seamless M4T v2 large | 1.2B       | ~4.8GB    | ~8GB              | ~2x            | 100+                |

Processing performance (approximate, RTX 4090): (UPDATE WITH REAL METRICS ON VARIOUS HARDWARE)

  • Whisper large-v3: ~12x faster than real-time (1 minute audio processed in ~5 seconds)
  • Seamless M4T: ~6x faster than real-time (1 minute audio processed in ~10 seconds)
  • Speaker diarization: ...

Installation

Easy Installation (Recommended)

Windows

run.bat

macOS/Linux

./run.sh

Manual Installation

  1. Create a virtual environment:

     python -m venv venv

  2. Activate the virtual environment:

     • Windows: venv\Scripts\activate
     • macOS/Linux: source venv/bin/activate

  3. Install requirements:

     pip install -r requirements.txt

  4. Launch the application:

     streamlit run dictate.py

Usage Guide

Transcription Engine Options

The application offers several speech recognition models:

Whisper (Optimized)

  • Recommended option for most use cases
  • Excellent accuracy across many languages
  • Size options from tiny to large-v3, offering speed vs. accuracy trade-offs
  • Handles ambient noise, accents, and technical vocabulary well
  • Model sizes:
    • Tiny: Fastest option, less accurate but good for quick drafts
    • Base: Good balance of speed and accuracy
    • Small: Better quality with reasonable speed
    • Medium: High quality transcription
    • Large-v3: Best quality, but slowest and requires more GPU memory

Whisper's approach: a Transformer sequence-to-sequence model trained on various speech processing tasks.

Wav2Vec2

  • Provides uppercase output without punctuation
  • Options include English-specific models and multilingual models
  • Generally requires more post-processing
  • Note: To be removed in future releases

Seamless M4T

  • Specialized for multilingual environments and cross-language translation
  • End-to-end speech-to-speech and speech-to-text translation
  • Supports 100+ languages with a single model
  • Better handling of code-switching (multiple languages in one conversation)
  • Particularly useful for:
    • International meetings with multiple languages
    • Direct translation without intermediate steps
    • Content requiring high-quality translation between languages
  • Implementation varies by platform:
    • Linux: Uses native seamless_communication package
    • Windows/macOS: Uses Hugging Face Transformers implementation
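On Windows/macOS, the Transformers route looks roughly like the sketch below. It uses the public facebook/seamless-m4t-v2-large checkpoint; the segment file name and target language are illustrative, and the application's actual code may differ.

```python
# Sketch of speech-to-text translation with the Transformers implementation of Seamless M4T v2.
import librosa
import torch
from transformers import AutoProcessor, SeamlessM4Tv2Model

processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

# Seamless expects 16 kHz mono audio.
audio, _ = librosa.load("segment.wav", sr=16000, mono=True)   # hypothetical segment file
inputs = processor(audios=audio, sampling_rate=16000, return_tensors="pt")

# generate_speech=False returns text tokens only (speech-to-text translation).
with torch.no_grad():
    tokens = model.generate(**inputs, tgt_lang="eng", generate_speech=False)
print(processor.decode(tokens[0].tolist()[0], skip_special_tokens=True))
```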

When to Choose Each Model

| Feature                        | Whisper  | Seamless M4T | Wav2Vec2 |
|--------------------------------|----------|--------------|----------|
| Best for general transcription |          |              | ⚠️       |
| Best for multilingual content  |          |              | ⚠️       |
| Best for direct translation    |          |              |          |
| Processing speed               | Fast     | Medium       | Fast     |
| Memory requirements            | Moderate | High         | Low      |
| Punctuation & capitalization   |          |              |          |
| Language identification        |          |              |          |
| Handles technical vocabulary   |          | ⚠️           | ⚠️       |

  • Whisper: Choose for most everyday transcription needs, especially when working in a single language or when you need the best balance of accuracy and speed
  • Seamless M4T: Choose when working with multiple languages in the same recording or when direct translation is the primary goal
  • Wav2Vec2: Legacy option, only use if you specifically need uppercase-only output or have compatibility requirements

Advanced Settings

Minimum Segment Duration

  • Controls the shortest audio segment that will be transcribed (0.5-5.0 seconds)
  • Lower values capture more speech but may introduce more errors
  • Higher values focus on longer, more meaningful utterances
  • Default (1.5s) works well for most conversations

Audio Loading Options

  • Use librosa: More reliable but slightly slower audio processing
  • Clean temporary files: Automatically removes temporary files after processing
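A rough illustration of what the librosa toggle implies; the helper name and flag below are hypothetical, but both loading paths resample to the 16 kHz mono input the ASR models expect.

```python
# Hypothetical helper illustrating the librosa/torchaudio choice; both paths yield 16 kHz mono audio.
import librosa
import torchaudio

def load_audio(path: str, use_librosa: bool = True, target_sr: int = 16000):
    if use_librosa:
        # librosa handles more container/codec quirks, at some speed cost.
        audio, _ = librosa.load(path, sr=target_sr, mono=True)
        return audio
    # torchaudio path: faster, but stricter about input formats.
    waveform, sr = torchaudio.load(path)
    waveform = waveform.mean(dim=0, keepdim=True)                 # downmix to mono
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    return waveform.squeeze(0).numpy()
```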

GPU Acceleration

  • Enables faster processing on CUDA-compatible GPUs
  • Automatically detects available hardware
  • Shows real-time memory usage
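The hardware check and memory readout can be done with PyTorch along these lines; the actual UI wiring in dictate.py may differ.

```python
# Sketch of CUDA detection and a simple VRAM readout with PyTorch.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

if device == "cuda":
    free, total = torch.cuda.mem_get_info()          # bytes free/total on the current GPU
    used_gb = (total - free) / 1024**3
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM in use: {used_gb:.1f} / {total / 1024**3:.1f} GB")
```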

Workflow

  1. Upload an audio file (WAV, MP3, FLAC, OGG)
  2. The system identifies different speakers
  3. Assign names to each speaker
  4. View the complete transcript
  5. Translate segments to your preferred language if needed
  6. Download the transcript as text or SRT format
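The SRT export in step 6 boils down to timestamp formatting. A minimal sketch, assuming segments are available as (speaker, start, end, text) tuples; that tuple layout is an assumption rather than the app's exact data model.

```python
# Minimal SRT writer for (speaker, start, end, text) segments; the tuple layout is an assumption.
def to_srt(segments):
    def ts(seconds: float) -> str:
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    lines = []
    for i, (speaker, start, end, text) in enumerate(segments, start=1):
        lines += [str(i), f"{ts(start)} --> {ts(end)}", f"{speaker}: {text}", ""]
    return "\n".join(lines)

print(to_srt([("Alice", 0.0, 2.5, "Good morning, everyone.")]))
```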

Performance Considerations

  • GPU vs CPU: Using a GPU dramatically speeds up processing, especially for large files
  • Model Size: Larger models provide better accuracy but require more memory and processing time
  • File Length: Longer recordings require more processing time and resources
  • Memory Usage: The performance meter helps monitor resource utilization
  • Platform-specific considerations:
    • Seamless M4T performs best on Linux with the native implementation
    • Windows/macOS users may see better performance with Whisper for large files

Implementation Details

Speaker Diarization

  • Uses PyAnnote's ECAPA-TDNN embedding model with 192 dimensions
  • 3.1 version includes improved handling of overlapping speech
  • Diarization clustering is performed using agglomerative hierarchical clustering (AHC)
  • Post-processing includes a 0.5s minimum segment duration filter
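The post-processing filter above amounts to dropping sub-threshold turns; a minimal sketch with a hypothetical tuple layout:

```python
# Sketch of the minimum-duration filter applied to diarization turns (threshold in seconds).
def filter_short_segments(turns, min_duration=0.5):
    """Keep only (speaker, start, end) turns at least min_duration seconds long."""
    return [(spk, start, end) for spk, start, end in turns if end - start >= min_duration]

turns = [("SPEAKER_00", 0.0, 0.3), ("SPEAKER_01", 0.3, 4.1)]
print(filter_short_segments(turns))   # the 0.3 s fragment is dropped
```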

ASR Implementation

  • Whisper uses CTranslate2 optimized runtime via faster-whisper
  • 8-bit quantization for CPU inference, 16-bit float for GPU inference
  • Custom segment extraction optimized for speaker-diarized input
  • Fallback mechanisms to handle transcription failures
  • Context window of 30 seconds for long-form content
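The quantization choice above maps onto faster-whisper's compute_type argument; a hedged sketch (the model size and file name are illustrative):

```python
# Sketch of loading Whisper via faster-whisper with device-dependent quantization.
import torch
from faster_whisper import WhisperModel

def load_whisper(size: str = "large-v3") -> WhisperModel:
    if torch.cuda.is_available():
        # 16-bit float inference on GPU.
        return WhisperModel(size, device="cuda", compute_type="float16")
    # 8-bit quantized inference on CPU.
    return WhisperModel(size, device="cpu", compute_type="int8")

model = load_whisper("small")
segments, info = model.transcribe("segment.wav", beam_size=5)   # hypothetical file
print(info.language, " ".join(s.text for s in segments))
```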

Translation Pipeline

  • Two-stage pipeline for unsupported language combinations
  • Neural Machine Translation implemented with NLLB-200 distilled models
  • Language detection verification to prevent unnecessary translations
  • Batching system for efficient processing of multiple segments
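The NLLB stage can be exercised directly through the Transformers translation pipeline; in this sketch the language codes follow NLLB's FLORES-200 convention and the segment text is illustrative.

```python
# Sketch of the NLLB-200 text-to-text stage via the Transformers pipeline.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="spa_Latn",   # source language (FLORES-200 code)
    tgt_lang="eng_Latn",   # target language
)

segment = "Buenos días a todos, empecemos la reunión."
print(translator(segment, max_length=200)[0]["translation_text"])
```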

Known Limitations

  • Speaker Diarization: Performance degrades when speakers have similar voices or with significant background noise
  • Overlapping Speech: Accuracy drops significantly when multiple people speak simultaneously
  • Domain-specific Terminology: Technical, medical, or specialized vocabulary may be transcribed incorrectly
  • Heavy Accents: Non-standard accents can reduce transcription accuracy, particularly with smaller models
  • Very Long Files: Files exceeding 2 hours may require significant memory and processing time
  • Low-quality Audio: Recordings with sampling rates below 16kHz or significant compression artifacts show reduced accuracy

Customization Options

Fine-tuning for Domain-specific Vocabulary

While not included in the UI, advanced users can fine-tune the models for domain-specific vocabulary:

  1. For Whisper, see the OpenAI fine-tuning documentation
  2. For Seamless M4T, see Meta's fine-tuning guide

Research Citations

The implementation is based on the following research papers:

Developer Guide

Code Organization

  • dictate.py: Main application file with all functionality
  • Key components:
    • Audio processing (lines 70-180)
    • Model loading functions (lines 180-250)
    • Transcription functions (lines 250-350)
    • Translation implementations (lines 350-450)
    • Streamlit UI components (lines 450+)

Key Algorithms

  1. Two-phase diarization: Speaker segmentation followed by speaker clustering
  2. Optimized segment transcription: Uses a sliding window approach for consistent results
  3. Language detection: Confidence-based language identification with fallback mechanisms
  4. Translation memory: Caches translations to avoid redundant processing
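The translation-memory idea is essentially a cache keyed on (text, source language, target language); a minimal sketch with a stand-in backend:

```python
# Minimal translation-memory sketch: cache results keyed by (text, src, tgt)
# so repeated segments are never translated twice.
from functools import lru_cache

def make_cached_translator(translate_fn, maxsize: int = 4096):
    @lru_cache(maxsize=maxsize)
    def cached(text: str, src: str, tgt: str) -> str:
        # translate_fn stands in for the real NLLB/Seamless call.
        return translate_fn(text, src, tgt)
    return cached

# Usage with a stand-in backend:
translator = make_cached_translator(lambda text, src, tgt: f"[{src}->{tgt}] {text}")
translator("Hola", "spa_Latn", "eng_Latn")   # computed
translator("Hola", "spa_Latn", "eng_Latn")   # served from the cache
```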

Extending with New Models

To add a new ASR model:

  1. Create a new model loading function following the pattern of existing loaders
  2. Implement a segment transcription function specific to the model
  3. Add UI elements to select the new model
  4. Update the pipeline to handle the new model type
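An illustrative skeleton of steps 1-3; the function, model, and parameter names are hypothetical, not existing code.

```python
# Illustrative skeleton for plugging in a new ASR backend (names are hypothetical).
import streamlit as st

@st.cache_resource
def load_my_asr_model(size: str):
    """Step 1: load and cache the new model, mirroring the existing loaders."""
    # e.g. return MyAsrModel.from_pretrained(f"my-org/my-asr-{size}")
    raise NotImplementedError

def transcribe_segment_my_asr(model, audio_chunk, sample_rate: int = 16000) -> str:
    """Step 2: transcribe one diarized segment and return plain text."""
    # e.g. return model.transcribe(audio_chunk, sample_rate=sample_rate).text
    raise NotImplementedError

# Step 3: expose the new backend in the UI, e.g.
# engine = st.selectbox("Transcription engine", ["Whisper", "Seamless M4T", "MyASR"])
```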

History of Speech Recognition Models

Whisper

  • Developed by OpenAI and released in September 2022
  • Trained on 680,000 hours of multilingual and multitask supervised data collected from the web
  • Uses a "weakly supervised" approach that enables robust performance across many languages
  • Distinguished by its ability to handle a variety of acoustic environments and accents
  • The large-v3 model (released in 2023) achieves near human-level accuracy in English speech recognition
  • Open-sourced with an MIT license, making it widely accessible for developers

Seamless M4T

  • Developed by Meta AI (Facebook) and released in August 2023
  • Stands for "Massively Multilingual & Multimodal Machine Translation"
  • Part of Meta's "No Language Left Behind" initiative to create AI systems for all languages
  • First unified model capable of speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation
  • Seamless M4T v2 (released in 2024) significantly improved quality and expanded language support
  • Notable for preserving speakers' voices and emotion in translated speech
  • Released under an open license to encourage research and application development

Wav2Vec2

  • Developed by Facebook AI Research (now Meta AI)
  • Initial Wav2Vec model released in 2019, with Wav2Vec 2.0 following in 2020
  • Pioneered self-supervised learning for speech recognition
  • Breakthrough approach that could be fine-tuned with very small amounts of labeled data
  • Made significant advancements in low-resource speech recognition
  • Enabled speech recognition systems for languages with limited training data
  • Served as the foundation for many subsequent speech recognition models
  • Released with the MIT license as part of the HuggingFace Transformers library

Future Roadmap

  • Automatic speaker name identification using information from introductions in meetings
  • Advanced diarization to identify overlapping speakers
  • Voice cloning for speech-to-speech and text-to-speech translation
  • Transcript-to-speech output for multi-language translation
  • Real-time transcription mode for live meetings
  • Integration with semantic analysis for topic extraction and summarization

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Support

If you encounter any issues or have questions, please open an issue on GitHub.
