
alphaaico/Edukut


RASA - Contextual Conversations

Rasa is an open source machine learning framework for automated text and voice-based conversations. Understand messages, hold conversations, and connect to messaging channels and APIs.

In a contextual conversation, something beyond the previous step in the conversation plays a role in what should happen next. For example, if a user asks "How many?", it's not clear from the message alone what the user is asking about. In the context of the assistant saying, "You've got mail!", the response could be "You have five letters in your mailbox". In the context of a conversation about outstanding bills, the response could be, "You have three overdue bills". The assistant needs to know the previous action to choose the next action.

Workflow of our work

(Workflow diagram)

Wav2Lip

This file contains the trained Wav2Lip + GAN model for lip sync.

The trial of this model was performed on Google Colab.

The model gives strong lip-sync results.

RESULTS OF WAV2LIP

https://drive.google.com/drive/folders/1Ww6DISQBdYbs1ojHBYe3mS-066Rdg7vz?usp=sharing

For testing the model, we first created the avatar using a First Order Motion Model.

Then we applied Wav2Lip + GAN to the output of the First Order Motion Model to evaluate the quality of the lip sync.

Text to Speech

Click here to go to my research work.

Click here for Python code using IBM Watson.

using gTTS

There are several APIs available for converting text to speech in Python. One such API is the Google Text-to-Speech API, commonly known as gTTS. gTTS is an easy-to-use tool that converts entered text into audio, which can be saved as an MP3 file. The gTTS API supports several languages, including English, Hindi, Tamil, French, German, and many more. The speech can be delivered at either of the two available speeds, fast or slow. However, as of the latest update, it is not possible to change the voice of the generated audio.

using IBM Watson

IBM Watson Text to Speech gives your brand a voice, enabling you to improve customer experience and engagement by interacting with users in their own languages using any written text. Increase accessibility for users with different abilities, provide audio options to avoid distracted driving, or automate customer service interactions to increase efficiencies.

Powerful real-time speech synthesis

Create Custom Voices

using Google Cloud

Convert text into natural-sounding speech using an API powered by Google’s AI technologies.

• Improve customer interactions with intelligent, lifelike responses

• Engage users with voice user interface in your devices and applications

• Personalize your communication based on user preference of voice and language

BENEFITS

High fidelity speech

• Deploy Google’s groundbreaking technologies to generate speech with humanlike intonation. Built based on DeepMind’s speech synthesis expertise, the API delivers voices that are near human quality.

Widest voice selection

• Choose from a set of 220+ voices across 40+ languages and variants, including Mandarin, Hindi, Spanish, Arabic, Russian, and more. Pick the voice that works best for your user and application.

One-of-a-kind voice

• Create a unique voice to represent your brand across all your customer touchpoints, instead of using a common voice shared with other organizations.

Key Features

Custom Voice (beta)

Train a custom voice model using your own audio recordings to create a unique and more natural sounding voice for your organization. You can define and choose the voice profile that suits your organization and quickly adjust to changes in voice needs without needing to record new phrases.

WaveNet voices

Take advantage of 90+ WaveNet voices built based on DeepMind’s groundbreaking research to generate speech that significantly closes the gap with human performance.

Voice tuning

Personalize the pitch of your selected voice, up to 20 semitones more or less from the default. Adjust your speaking rate to be 4x faster or slower than the normal rate.

Text and SSML support

Customize your speech with SSML tags that allow you to add pauses, numbers, date and time formatting, and other pronunciation instructions.
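As a small illustration of the SSML support described above, a fragment like the following (values are illustrative) adds a pause, date formatting, and a pitch/rate adjustment:

```xml
<speak>
  Your appointment is on
  <say-as interpret-as="date" format="yyyymmdd" detail="1">2021-03-15</say-as>.
  <break time="500ms"/>
  <prosody pitch="+2st" rate="90%">This sentence is spoken slightly higher and slower.</prosody>
</speak>
```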

Custom Background

With the help of the SciPy, NumPy, and Pillow libraries, we demonstrate removal of the image background. The trial of this model was performed on Google Colab. However, the result is not as we expected: the foreground and background are not segmented correctly.
We are looking at different deep learning approaches, such as the MODNet architecture and semantic segmentation, to obtain the required output.
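For illustration, here is a minimal thresholding sketch of the kind of NumPy/Pillow approach described above (the function name and threshold are our own). It only works when the background is a near-uniform bright color, which is one reason such simple methods fail on real scenes and motivate the deep learning approaches mentioned:

```python
import numpy as np
from PIL import Image

def remove_background(img, threshold=240):
    """Naive background removal: pixels brighter than `threshold`
    in all three channels are treated as background and made transparent."""
    rgba = np.array(img.convert("RGBA"))
    rgb = rgba[..., :3]
    background = (rgb > threshold).all(axis=-1)  # True where pixel is near-white
    rgba[background, 3] = 0  # zero alpha for background pixels
    return Image.fromarray(rgba)

# Demo on a synthetic image: white canvas with a dark square "foreground".
canvas = np.full((8, 8, 3), 255, dtype=np.uint8)
canvas[2:6, 2:6] = 50
result = remove_background(Image.fromarray(canvas))
```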

Thank you

Team Edukut
