prateekmtri/full_voice_agent_pro

VoiceAI Pro: Real-time Conversational AI Assistant

VoiceAI Pro is a sophisticated, real-time, voice-driven conversational AI assistant designed to provide a uniquely engaging user experience. Move beyond traditional text-based chats and interact with 'Circuit'—an AI with the personality of a friendly, street-smart "Mumbai Tapori." This project leverages an end-to-end streaming architecture to ensure that the journey from your voice input to the AI's audio response is seamless and instantaneous.


▶️ View Live Demo


🌟 Showcase

Below is a snapshot of the application's live interface. For the best demonstration, consider replacing this image with a short GIF or video that showcases the application in action.

VoiceAI Pro Screenshot

✨ Features - A Detailed Look

Every feature has been thoughtfully implemented to deliver a superior user experience.

  • 🎙️ Real-time Voice Conversation: No typing required. Simply press the record button, speak your query, and receive an audio response from the AI. The experience is designed to be as fast and natural as a phone call.

  • 🚀 End-to-End Streaming Architecture: This is the core of the project. Data doesn't wait; it flows.

    1. Live Speech-to-Text: As you speak, your voice is transcribed into text in real-time (AssemblyAI).
    2. Streaming LLM Responses: The AI's response is generated in chunks, so the text appears on screen almost instantly (Google Gemini).
    3. Streaming Text-to-Speech: The AI's text is converted into audio and streamed back as it is generated, minimizing delay in the audio response (Murf.ai).

  • 😎 Unique AI Persona, 'Circuit': This isn't a typical, robotic AI. 'Circuit' talks to you in a Mumbai-style Hinglish dialect. Its responses are friendly, street-smart, and entertaining, embodying a unique and memorable character.

  • 🎬 Special Skill, Movie Dialogue Finder: Feeling bored? Ask Circuit to recite a famous dialogue from a Bollywood movie. For example: "Circuit, tell me a dialogue from Sholay" and prepare to be entertained!

  • 🧮 Special Skill, Instant Calculator: No need to open a separate calculator app. Just ask Circuit. For example: "What is 25 multiplied by 4?" and you will get an instant answer.

  • 🔐 Secure Local API Key Management: Your API keys are stored exclusively in your browser's localStorage. They are sent to the server only when establishing the WebSocket connection and are never stored there, keeping your credentials under your control.

  • 🎨 Futuristic UI & UX: The application features a modern, visually appealing design. An aurora background, engaging animations, and a clean layout provide a premium and immersive user experience.
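The special-skill routing above can be sketched as a check that runs before the transcript is handed to the LLM. The function name, regexes, and reply strings below are illustrative, not the project's actual code:

```python
import re

def route_skill(transcript: str):
    """Return a skill response for special queries, or None to fall through to the LLM."""
    # Instant calculator: catch simple "X <op> Y" arithmetic in the transcript.
    m = re.search(
        r"(-?\d+(?:\.\d+)?)\s*(plus|minus|multiplied by|times|divided by|[+\-*/x])\s*(-?\d+(?:\.\d+)?)",
        transcript, re.I)
    if m:
        a, op, b = float(m.group(1)), m.group(2).lower(), float(m.group(3))
        ops = {"plus": a + b, "+": a + b,
               "minus": a - b, "-": a - b,
               "multiplied by": a * b, "times": a * b, "*": a * b, "x": a * b,
               "divided by": a / b if b else float("nan"),
               "/": a / b if b else float("nan")}
        return f"Arre boss, answer is {ops[op]:g}!"
    # Movie dialogue finder: trigger on "dialogue from <movie>".
    m = re.search(r"dialogue from ([\w\s]+)", transcript, re.I)
    if m:
        return f"Ek dialogue from {m.group(1).strip()}? Coming right up, boss!"
    return None  # general query -> hand the transcript to Gemini
```

A `None` return is the signal to continue with the normal LLM path.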


🛠️ Core Architecture & Data Flow

This project operates on a seamless, event-driven data flow. Let's trace the journey of a user's voice from input to response:

  1. 🎤 Step 1: Audio Capture (Frontend): The user clicks the record button, and the browser's getUserMedia API captures audio from the microphone.

  2. ⏩ Step 2: Audio Streaming (Frontend → Backend): The captured audio is down-sampled to 16 kHz, 16-bit PCM and streamed in real-time to the FastAPI server over a WebSocket connection.

  3. 📝 Step 3: Live Transcription (Backend → AssemblyAI): The backend immediately forwards the incoming audio stream to AssemblyAI's real-time transcription service through another WebSocket.

  4. 🗣️ Step 4: Final Transcript (AssemblyAI → Backend): When the user stops speaking, AssemblyAI detects the end of the utterance and sends the final, formatted transcript back to the backend.

  5. 🧠 Step 5: The Brain of the Operation (Backend Logic):

    • The backend receives the user's transcript.
    • It first checks whether the query triggers a special skill (Calculator or Movie Dialogue).
    • If it's a general query, the text is sent to the Google Gemini LLM, which is configured with the 'Circuit' persona system prompt.
    • Gemini generates a response in character, streaming it back in chunks.

  6. ✍️ Step 6: Text Streaming (Backend → Frontend): The response from Gemini is streamed back to the frontend in chunks, so the user sees the reply appear on screen in real-time.

  7. 🔊 Step 7: Voice Synthesis (Backend → Murf.ai): In parallel, the text from Gemini is forwarded to Murf.ai for streaming Text-to-Speech synthesis.

  8. 🎧 Step 8: Audio Streaming (Murf.ai → Backend → Frontend): Murf.ai converts the text into audio and streams the chunks back to the backend, which forwards them to the frontend.

  9. ▶️ Step 9: Playback (Frontend): The frontend assembles the incoming audio chunks into a single Audio Blob and plays it back to the user, completing the conversation loop with minimal latency.
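Step 2's conversion happens in the browser in JavaScript, but the transform itself is easy to illustrate. Below is a stdlib-only Python sketch using naive decimation (a real implementation would low-pass filter first to avoid aliasing):

```python
import struct

def downsample_to_pcm16(samples, in_rate=48000, out_rate=16000):
    """Convert float samples in [-1.0, 1.0] to little-endian 16-bit PCM bytes.

    Naive decimation: keep every (in_rate // out_rate)-th sample,
    e.g. 48000 -> 16000 keeps every 3rd sample.
    """
    step = in_rate // out_rate
    decimated = samples[::step]
    # Scale to the int16 range and clamp to avoid overflow on full-scale peaks.
    ints = [max(-32768, min(32767, int(s * 32767))) for s in decimated]
    return struct.pack("<%dh" % len(ints), *ints)
```

The resulting byte string is what gets written to the WebSocket in Step 2.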


💻 Tech Stack & Rationale

| Technology | Rationale (Why it was chosen) |
| --- | --- |
| FastAPI (Backend) | Ideal for real-time applications due to its high performance, native asynchronous support, and excellent WebSocket handling. |
| Vanilla JavaScript (Frontend) | Keeps the project lightweight and fast without the overhead of a heavy framework; provides full control over audio processing and DOM manipulation. |
| AssemblyAI | Its real-time streaming transcription service is recognized for its speed and accuracy. |
| Google Gemini | A powerful, fast LLM that supports streaming responses; its system-prompt feature is crucial for maintaining the AI's persona. |
| Murf.ai | Provides high-quality, natural-sounding voices; its streaming TTS API is key to reducing audio response latency. |
| Uvicorn | A lightning-fast ASGI server for FastAPI, suited to both development and production environments. |
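As the table notes, streaming text into TTS is what keeps audio latency low. One common pattern (a hypothetical sketch, not the project's actual code) is to buffer LLM chunks and flush sentence-sized pieces to the TTS service as soon as each sentence completes:

```python
import re

def sentences_from_chunks(chunks):
    """Yield complete sentences as soon as streamed LLM chunks form one.

    Each yielded sentence can be sent to the TTS service immediately,
    so audio synthesis overlaps with text generation.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Split on whitespace that follows sentence-ending punctuation.
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        for sentence in parts[:-1]:
            yield sentence          # complete sentence -> flush to TTS
        buffer = parts[-1]          # keep the unfinished tail
    if buffer.strip():
        yield buffer.strip()        # flush whatever remains at end of stream
```

This way the first sentence can be speaking while later sentences are still being generated.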

🚀 Getting Started: Local Setup & Installation Guide

Follow these steps to set up and run the project on your local machine.

(Step 0) Prerequisites

Ensure you have the following installed and ready:

  • Python 3 and pip
  • Git
  • A modern web browser with microphone access
  • API keys for AssemblyAI, Google Gemini, Murf.ai, and TMDB

(Step 1) Clone the Repository

git clone https://github.com/[your-github-username]/Murf_ai_project.git
cd Murf_ai_project

(Step 2) Setup the Backend

  • Create a Virtual Environment: This is a best practice to isolate project dependencies.
    python -m venv venv
  • Activate the Environment:
    # On Windows:
    .\venv\Scripts\activate
    
    # On macOS/Linux:
    source venv/bin/activate
  • Install Dependencies:
    pip install -r requirements.txt

(Step 3) Run the Backend Server

  • Execute the following command in your terminal:
    uvicorn main:app --reload
  • Your server will now be running at http://127.0.0.1:8000.

(Step 4) Configure the Frontend

  • Open your web browser and navigate to http://127.0.0.1:8000.
  • Click the settings icon (⚙️) in the top-right corner.
  • Enter your API keys for AssemblyAI, Murf, Gemini, and TMDB.
  • Click "Save Configuration". The application is now ready to use!

⚙️ API Key Configuration Details

This project supports two methods for handling API keys:

  1. Browser localStorage (Recommended Method):

    • How it works: When you enter your keys in the frontend settings modal, they are saved securely in your browser's local storage.
    • Advantage: The keys live only in your browser. They are transmitted to the server once per session, when the WebSocket connection is established, and are never stored server-side.
  2. .env File (Server-side Fallback):

    • What it is: You can store your keys in a .env file in the project's root folder.
    • Purpose: This serves as a fallback. If keys are not provided by the browser, the server will use these keys instead.
    • Setup: Create a file named .env in the root directory and add your keys as follows:
      ASSEMBLYAI_API_KEY="your_assemblyai_api_key"
      MURF_API_KEY="your_murf_api_key"
      GEMINI_API_KEY="your_gemini_api_key"
      TMDB_API_KEY="your_tmdb_api_key"

Important: The .env file is already listed in .gitignore to prevent you from accidentally committing your secret keys to GitHub.
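The fallback order described above can be expressed as a small helper. This is a sketch with an illustrative function name, and it assumes the server loads .env into the environment (e.g. via python-dotenv):

```python
import os
from typing import Optional

def resolve_api_key(name: str, client_supplied: Optional[str]) -> str:
    """Prefer the key the browser sent over the WebSocket; fall back to .env."""
    if client_supplied:             # key from the frontend settings modal
        return client_supplied
    env_value = os.getenv(name)     # key loaded from the .env file, if any
    if env_value:
        return env_value
    raise RuntimeError(f"No API key available for {name}")
```

Raising when neither source has a key makes a misconfiguration fail loudly at connection time rather than mid-conversation.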


📈 Future Improvements

This project has the potential for further expansion:

  • Multiple Personas: Allow users to choose from different AI personas (e.g., a teacher, a comedian, a formal assistant).
  • Session History: Implement a feature to save and load conversation history.
  • More Skills: Integrate new skills like fetching live weather updates, news headlines, or music recommendations.
  • Progressive Web App (PWA): Convert the application into a PWA to make it installable on devices for an app-like experience.

👤 Author

Prateek Mani Tripathi
