speech_models

This project demonstrates speech-to-text (stt) and text-to-speech (tts). It uses freely available machine learning models, so no API keys are required. Because the models run locally, the speed of the tts and stt conversions depends on your computer, though they should run fine on any modest system.

The code was tested on a Windows system, though it was written to be compatible with Mac or Linux as well. If you encounter any problems, feel free to open an issue.

Text to speech (tts.py)

The tts.py script uses Silero v3 audio models for text-to-speech, and simpleaudio for audio playback.
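As a rough illustration of how these two libraries can fit together, here is a minimal sketch. It is not the code in tts.py. The function names `floats_to_pcm16` and `speak` and the 48000 Hz sample rate are assumptions made for this example. The Silero model is loaded through `torch.hub`, and the resulting float samples are converted to 16-bit PCM before being handed to simpleaudio.

```python
import array

SAMPLE_RATE = 48000  # assumed for this sketch; Silero v3 also offers 8000 and 24000


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to native-endian signed 16-bit PCM bytes."""
    # Clamp each sample first so out-of-range values cannot overflow int16.
    pcm = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    return pcm.tobytes()


def speak(text, speaker="random"):
    """Synthesize `text` with Silero v3 English and play it (hypothetical helper)."""
    # Third-party imports live here so the helper above works without them.
    import simpleaudio
    import torch

    # Note: in the Silero hub entry point, `speaker` names the model package.
    model, _example_text = torch.hub.load(
        repo_or_dir="snakers4/silero-models",
        model="silero_tts",
        language="en",
        speaker="v3_en",
    )
    # apply_tts returns a 1-D tensor of float samples.
    audio = model.apply_tts(text=text, speaker=speaker, sample_rate=SAMPLE_RATE)
    # One channel, two bytes per sample.
    playback = simpleaudio.play_buffer(
        floats_to_pcm16(audio.tolist()), 1, 2, SAMPLE_RATE
    )
    playback.wait_done()
```

Calling `speak("Hello from Silero.")` would download the model on first use and then play the audio. The conversion helper is kept separate so it can be checked without any audio hardware.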

Speech to text (stt.py)

The stt.py script uses pvrecorder for recording audio, and the OpenAI Whisper model for transcription.
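The recording and transcription steps might be combined roughly as follows. This is a sketch under stated assumptions, not the code in stt.py. The function names, the fixed five-second recording length and the 512-sample frame length are all chosen for illustration. pvrecorder delivers 16 kHz mono frames of signed 16-bit integers, and Whisper accepts a float32 array at that rate.

```python
def pcm16_to_floats(frame):
    """Scale signed 16-bit samples to floats in [-1.0, 1.0)."""
    return [sample / 32768.0 for sample in frame]


def record_and_transcribe(seconds=5, model_name="tiny"):
    """Record from the default microphone, then transcribe (hypothetical helper)."""
    # Third-party imports live here so the helper above works without them.
    import numpy as np
    import whisper
    from pvrecorder import PvRecorder

    recorder = PvRecorder(device_index=-1, frame_length=512)
    samples = []
    recorder.start()
    try:
        # Collect frames until the requested duration of 16 kHz audio is reached.
        while len(samples) < seconds * 16000:
            samples.extend(recorder.read())
    finally:
        recorder.stop()
        recorder.delete()

    audio = np.array(pcm16_to_floats(samples), dtype=np.float32)
    result = whisper.load_model(model_name).transcribe(audio, language="en")
    return result["text"]
```

`print(record_and_transcribe())` would record for about five seconds and print the transcription. A real script would more likely stop on a key press or on silence than after a fixed duration.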

Features

tts

  • Run tts.py, and a default string from the code will be read aloud. Optionally pass a command line argument (--text "your text here") to have that text read instead. A random speaker voice is used each time; see the command line options below to choose a specific speaker voice.

stt

  • Run stt.py and, once the message that audio is recording appears, speak to your computer. Your speech will be transcribed to text and printed to the console.

Clone the repository

git clone https://github.com/dmisino/speech_models.git
cd speech_models

Installation and setup

To run this project, you'll need Python installed on your machine. You can install it from python.org.

Whisper

The OpenAI whisper module requires ffmpeg, which must be installed and available on your PATH. You may also need Rust installed. See the Whisper GitHub page for more details.

If you have properly installed whisper and its dependencies, you can test it with the following commands. The wav file referenced is included with this repo, and was taken from the Open Speech Repository. The sentences contained in the file may be viewed here, under "list 3" for the specific audio file included in the example below.

rem Try using whisper to transcribe audio to text 
whisper "sample\OSR_us_000_0012_8k.wav" --model tiny --language en

rem Try using gpu
whisper "sample\OSR_us_000_0012_8k.wav" --model tiny --language en --device cuda

When using OpenAI Whisper for speech-to-text, the provided code will use a GPU if one is available, but this requires a GPU-enabled build of PyTorch. If you already have PyTorch installed, uninstall it and then install the GPU-enabled build if you would like to use your GPU for stt transcription:

pip uninstall torch
pip cache purge 
pip install torch -f https://download.pytorch.org/whl/torch_stable.html
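To confirm whether the installed PyTorch build can actually see a GPU, a check along these lines may help. The function name `pick_device` is an assumption for this sketch, and the repository's own device selection may differ.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to run on a GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

If this prints "cpu" on a machine with an NVIDIA GPU, the installed PyTorch build is probably CPU-only. The returned string can be passed to Whisper, for example `whisper.load_model("tiny", device=pick_device())`.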

Usage

To run speech-to-text or text-to-speech, run the appropriate script:

python stt.py

rem OR

python tts.py

Command line options

Text-to-speech

The text-to-speech script (tts.py) has a few command line options. All are optional.

By default a random English speaker is selected. You may specify a different language, model and speaker. For valid values, see silero models and speakers. Go here to listen to samples of the 118 different available English speakers.

rem tts.py command line options

rem Specify text to be read.
python tts.py --text "<your text here>"

rem Choose language. Default is "en".
python tts.py --language "<language>"

rem Specify model. Default is "v3_en".
python tts.py --model "<model>"

rem Specify speaker (voice) used. Default is "random".
python tts.py --speaker "<speaker>"

rem Example specifying Russian, with speaker "xenia" and Russian text
python tts.py --language "ru" --model "v3_1_ru" --speaker "xenia" --text "Это какой-то русский текст, озвученный моделью машинного обучения. Доступно несколько языков, но русский звучит круто."
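The options listed above could be declared with `argparse` roughly like this. It is a sketch, not the parser in tts.py. `build_tts_parser` and the `DEFAULT_TEXT` placeholder are names invented for this example, while the flag names and defaults are taken from the list above.

```python
import argparse

# Placeholder only; the actual default string lives in tts.py.
DEFAULT_TEXT = "This is a default sentence."


def build_tts_parser():
    """Build a parser with the four optional tts.py flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Text-to-speech demo")
    parser.add_argument("--text", default=DEFAULT_TEXT, help="Text to be read")
    parser.add_argument("--language", default="en", help="Language code")
    parser.add_argument("--model", default="v3_en", help="Silero model id")
    parser.add_argument("--speaker", default="random", help="Speaker (voice)")
    return parser


# Parsing an explicit list instead of sys.argv makes the behavior easy to inspect.
args = build_tts_parser().parse_args(
    ["--language", "ru", "--model", "v3_1_ru", "--speaker", "xenia"]
)
print(args.language, args.model, args.speaker)
```

Any flag that is left out falls back to its default, which is what makes every option optional.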

Speech-to-text

The speech-to-text script (stt.py) has one command line argument, which changes the Whisper model used. There are 5 options, which differ in size, memory requirements and speed. The default model is "tiny", which is the fastest and has the lowest resource requirements.

rem stt.py command line options

rem Specify model to use among options [tiny, base, small, medium, large].
python stt.py --model "<model option>"
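One way to restrict `--model` to the five names above is the `choices` parameter of `argparse`, so that a typo fails before any model is loaded. This is a sketch with hypothetical names (`WHISPER_MODELS`, `build_stt_parser`), and stt.py may validate the value differently.

```python
import argparse

# The five model names listed above, smallest to largest.
WHISPER_MODELS = ["tiny", "base", "small", "medium", "large"]


def build_stt_parser():
    """Build a parser whose --model flag only accepts known model names."""
    parser = argparse.ArgumentParser(description="Speech-to-text demo")
    parser.add_argument(
        "--model",
        default="tiny",
        choices=WHISPER_MODELS,
        help="Whisper model to use",
    )
    return parser


args = build_stt_parser().parse_args(["--model", "base"])
print(args.model)
```

With `choices` in place, an unknown name such as "huge" makes argparse print a usage message and exit instead of passing the value on to Whisper.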
