Audio Transcription with Gemini 2.5 Pro (With outstanding accented language recognition and accuracy)
This project uses Google's Gemini 2.5 Pro model to transcribe audio files. It includes features for handling multiple speakers, different accents, multiple languages, and formats the transcript with timestamps. It can process individual audio files or all supported audio files within a directory and its subdirectories. Files longer than 59 minutes are automatically chunked, processed, and stitched back together.
- Transcription Engine: Utilizes Google's Gemini 2.5 Pro model for audio-to-text conversion.
- Input Handling:
- Accepts a single audio file path or a directory path as input via command-line argument.
- If a directory is provided, it recursively scans for audio files.
- Supported audio formats:
.m4a,.mp3,.wav,.flac,.ogg,.aac(requiresffmpegorlibavforpydub).
- Long Audio Processing (Chunking):
- Automatically detects audio files longer than 59 minutes.
- Splits long audio files into manageable chunks (currently 59-minute segments).
- Transcribes each chunk individually.
- Normalizes timestamps across chunks to ensure a continuous timeline in the final transcript.
- Stitches chunk transcripts together to form one cohesive transcript.
- Speaker & Language Handling:
- Designed to handle multiple speakers with distinct accents (configurable in the prompt template).
- Supports transcription of conversations involving multiple languages.
- Provides confidence scores for uncertain words (as per Gemini model capability).
- Output Format:
- Generates a Markdown (
.md) file for each processed audio file. - The output file is named after the input audio file (e.g.,
audio.m4a->audio.md) and saved in the same directory as the input audio file. - The
.mdfile includes:## Summary: A bullet-point summary of the entire audio content (generated from the full, potentially stitched, transcript).## Transcript: The full transcript formatted within a Markdown code block for readability and easy copying, withHH:MM:SStimestamps.
- Generates a Markdown (
- Overwrite Control:
- By default, skips processing an audio file if its corresponding
.mdoutput file already exists. - Provides a
--replace(or-r) command-line flag to force overwrite of existing.mdfiles.
- By default, skips processing an audio file if its corresponding
- Prompt Engineering: Uses a Jinja2 template for detailed transcription instructions to the Gemini model.
- Server-Side File Management: Attempts to delete uploaded audio files/chunks from Google's servers after processing to manage cloud storage.
-
Google API Key:
- You need a Google API Key with access to the Gemini API.
- Create a
.envfile in the same directory as theGemini_2_5_pro_audio_transcription.pyscript. - Add your API key to the
.envfile:GOOGLE_AI_STUDIO=your_actual_api_key_here
-
Python Environment & Dependencies:
- Ensure you have Python installed (script is currently set to prefer ~=3.12, but >=3.9 should generally work).
ffmpegorlibavRequirement: The script uses thepydublibrary for audio manipulation (like duration checking and splitting).pydubrelies onffmpegorlibavto handle various audio formats. You must haveffmpeg(recommended) orlibavinstalled on your system and available in your system's PATH.- Download
ffmpegfrom https://ffmpeg.org/download.html
- Download
- You can manage dependencies using
uv(orpip).
Using
uv(Recommended Method 1 - Script-defined dependencies): If youruvversion supports PEP 723 foruv run:# uv will read dependencies from the /// script block at the top of the Python file. # Dependencies include: google-generativeai, jinja2, python-dotenv, pydub # No separate install step needed if you use `uv run` directly.
Using
uv(Method 2 - Manual virtual environment):# Create a virtual environment uv venv # Activate it (Linux/macOS) source .venv/bin/activate # Or (Windows) # .venv\Scripts\activate # Install dependencies uv pip install google-generativeai jinja2 python-dotenv pydub
Using
pip(Traditional method):# Create a virtual environment (optional but recommended) # python -m venv .venv # source .venv/bin/activate # or .venv\Scripts\activate pip install google-generativeai jinja2 python-dotenv pydub
-
Prepare your audio file(s).
-
Open your terminal and navigate to the project directory.
-
Execute the script, providing the path to your audio file or a directory containing audio files.
Basic Usage (single file or directory):
# Using uv run (if dependencies are in the script block) uv run yt_gemini_2_5_pro_audio_transcription.py path/to/your/audio_or_directory # Using python (if you manually created a venv and installed dependencies) python yt_gemini_2_5_pro_audio_transcription.py path/to/your/audio_or_directory
Examples:
python yt_gemini_2_5_pro_audio_transcription.py Multi-Audio/episode1.m4a python yt_gemini_2_5_pro_audio_transcription.py Multi-Audio/
-
To Overwrite Existing Transcriptions: Use the
-ror--replaceflag:python yt_gemini_2_5_pro_audio_transcription.py path/to/your/audio_or_directory --replace uv run yt_gemini_2_5_pro_audio_transcription.py path/to/your/audio_or_directory -r
-
Output:
- The script will print progress messages to the console.
- For each processed audio file, an
.mdfile containing the summary and transcript will be saved in the same directory as the source audio file.
main(input_path_arg, replace_flag): Parses command-line arguments and orchestrates file/directory processing.process_audio_file(audio_file_path, replace_existing): Main logic for handling a single audio file. Checks duration, decides on chunking, calls processing functions, and manages output file creation._get_audio_duration_ms(file_path): Returns the duration of an audio file in milliseconds usingpydub._split_audio_into_chunks(original_file_path, temp_dir_for_chunks): Splits a long audio file into smaller, manageable chunks._transcribe_and_process_segment(segment_path, time_offset_ms): Handles uploading a single audio segment (original file or chunk) to Google, getting the transcription, applying initial formatting, and normalizing timestamps based on the given offset._generate_summary_for_text(full_transcript_text): Sends the complete (potentially stitched) transcript to Gemini to generate a summary.timestamp_to_seconds(ts_str)&seconds_to_timestamp(total_seconds): Utility functions for timestamp conversions.process_transcript(input_text, max_segment_duration): Formats the raw transcript from Gemini (consolidates speaker lines, etc.).get_output_path(input_audio_path): Determines the output.mdfile path based on the input audio file path.
- Prompt Template: The
prompt_template_textvariable in the script contains the detailed instructions given to the Gemini model. You can customize this, especially thespeakers_list, to better suit your audio content. - Chunk Duration:
MAX_CHUNK_DURATION_MINUTES(currently 59 minutes) can be adjusted if needed, though an hour is a general guideline for many audio APIs.
- API Key Management: Change how
GOOGLE_API_KEYis obtained. Environment variables are a common practice. - File Paths: Adjust
file_pathto your local file system. Remove Google Drive mounting (drive.mount('/content/drive')). - Package Installation: The
!pipcommand is for Colab. Ensure packages are installed in your local Python environment as described in the Setup section. - IPython Display:
IPython.display.Audiois for displaying audio in Jupyter/Colab. For a local script, this line might cause an error or simply do nothing if not in an appropriate environment. You can remove it if not needed.