🎙️ VisionVoice

AI-Powered Image Captioning with Live Camera & Audio Output

VisionVoice is a deep learning system that captures a live photo from a webcam,
generates a natural-language caption using InceptionV3 + LSTM, and speaks it aloud, all in real time.

📌 Table of Contents

How It Works
Project Structure
Tech Stack
Installation
How to Run
Camera Controls
Pipeline Overview
File Generation Summary

🧠 How It Works

┌──────────────┐    ┌───────────────────┐    ┌──────────────────────┐
│   Webcam     │───▶│  InceptionV3 CNN  │───▶│  Feature Vector      │
│  (OpenCV)    │    │  (ImageNet)       │    │  (2048 dimensions)   │
└──────────────┘    └───────────────────┘    └──────────┬───────────┘
                                                        │
                                                        ▼
                                             ┌──────────────────────┐
                                             │  Encoder Dense (512) │
                                             └──────────┬───────────┘
                                                        │
┌──────────────┐    ┌───────────────────┐               │
│  [startseq]  │───▶│  Embedding (512)  │───────────────┤
│  (seed word) │    │  + LSTM (512)     │               │  (merged via add)
└──────────────┘    └───────────────────┘               │
                                                        ▼
                                             ┌──────────────────────┐
                                             │  Decoder Dense (512) │
                                             │  → softmax → word    │
                                             └──────────┬───────────┘
                                                        │
                                              Repeat until 'endseq'
                                                        │
                                                        ▼
                                             ┌──────────────────────┐
                                             │  pyttsx3  →  Speak   │
                                             └──────────────────────┘
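
The diagram describes a merge-style captioning network: an image branch and a text branch are each projected to 512 units, added together, and decoded into a distribution over the vocabulary. A minimal Keras sketch of that shape follows. It is not the project's actual code (see model_structure.py for that); `build_caption_model`, `vocab_size`, `max_length`, the ReLU activations, and the loss/optimizer choice are all assumptions made for illustration.

```python
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, add
from tensorflow.keras.models import Model


def build_caption_model(vocab_size, max_length):
    """Hypothetical builder mirroring the boxes in the diagram above."""
    # Image branch: 2048-dim InceptionV3 vector -> Encoder Dense (512).
    image_in = Input(shape=(2048,), name="image_features")
    image_vec = Dense(512, activation="relu", name="encoder_dense")(image_in)

    # Text branch: padded word ids -> Embedding (512) -> LSTM (512).
    text_in = Input(shape=(max_length,), name="caption_prefix")
    embedded = Embedding(vocab_size, 512, mask_zero=True, name="embedding")(text_in)
    text_vec = LSTM(512, name="lstm")(embedded)

    # Merge via element-wise add, then Decoder Dense (512) -> softmax.
    merged = add([image_vec, text_vec])
    hidden = Dense(512, activation="relu", name="decoder_dense")(merged)
    next_word = Dense(vocab_size, activation="softmax", name="next_word")(hidden)

    model = Model(inputs=[image_in, text_in], outputs=next_word)
    # Loss and optimizer are assumed, not taken from the project.
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```

Calling `build_caption_model(vocab_size, max_length).summary()` prints the layer stack, which can be compared against the output of model_structure.py.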

📁 Project Structure

VisionVoice/
│
├── 📂 data/
│   ├── 📂 Images/              ← Dataset images + captured.jpg saved here
│   └── 📄 captions.txt         ← Flickr8k-style captions file
│
├── 📄 extract_features.py      ← Step 1 : Extract InceptionV3 features
├── 📄 preprocess.py            ← Step 2 : Tokenize captions, save mapping
├── 📄 model_structure.py       ← Reference: print model summary anytime
├── 📄 train.py                 ← Step 3 : Train the LSTM caption model
├── 📄 camera.py                ← OpenCV webcam module (capture on SPACEBAR)
├── 📄 voice.py                 ← pyttsx3 text-to-speech helper
└── 📄 main.py                  ← Step 4 : Run the live demo end-to-end

🛠️ Tech Stack

| Technology | Role |
| --- | --- |
| Python 3.8+ | Core language |
| TensorFlow / Keras | Model training & inference |
| InceptionV3 | CNN image feature extractor (pretrained, ImageNet) |
| LSTM | Sequence model for caption generation |
| OpenCV | Live webcam capture with countdown |
| pyttsx3 | Offline text-to-speech engine |
| Matplotlib | Display captured image with generated caption |
| tqdm | Progress bar during feature extraction |

⚙️ Installation

1. Clone the repository

git clone https://github.com/your-username/VisionVoice.git
cd VisionVoice

2. Install all dependencies

pip install tensorflow numpy matplotlib pyttsx3 tqdm opencv-python

3. Prepare the dataset

Download Flickr8k from Kaggle
Place all images inside → data/Images/
Place captions.txt inside → data/

🚀 How to Run

Run Steps 1 to 3 once, in order. After that, only main.py (Step 4) is needed for each session.

Step 1 — Extract Features

python extract_features.py

Passes every image in data/Images/ through InceptionV3 and saves one 2048-dimensional feature vector per image.
Output: features.pkl
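
One common way to get a 2048-dimensional vector out of InceptionV3 is to drop the classification head and apply global average pooling. The sketch below shows that approach under stated assumptions: `build_extractor` and `extract_vector` are hypothetical names, the 299x299 input size is InceptionV3's standard size rather than something this README states, and extract_features.py may do this differently.

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input


def build_extractor(weights="imagenet"):
    # include_top=False removes the ImageNet classifier; pooling="avg"
    # averages the final feature map into a single 2048-dim vector.
    return InceptionV3(weights=weights, include_top=False, pooling="avg")


def extract_vector(extractor, image_rgb):
    """image_rgb: array of shape (299, 299, 3) with pixel values in 0-255."""
    batch = np.expand_dims(image_rgb.astype("float32"), axis=0)
    batch = preprocess_input(batch)  # scales pixels to the [-1, 1] range
    return extractor.predict(batch, verbose=0)[0]
```

The vectors could then be collected in a dictionary keyed by image name and pickled to features.pkl; that storage layout is an assumption as well.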

Step 2 — Preprocess Captions

python preprocess.py

Cleans captions, fits a tokenizer, computes max caption length.
Output: tokenizer.pkl, mapping.pkl, max_length.txt
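
The preprocessing step can be illustrated without any deep learning library. The sketch below assumes the Kaggle Flickr8k layout for captions.txt (a header line followed by `image,caption` rows); the cleaning rules and the function names are hypothetical, and `build_word_index` is a plain-Python stand-in for a Keras tokenizer, not the tokenizer preprocess.py actually fits.

```python
import re


def clean_caption(text):
    """Lowercase, keep letters only, drop one-letter tokens, add markers."""
    text = re.sub(r"[^a-z ]", " ", text.lower())
    words = [w for w in text.split() if len(w) > 1]
    return "startseq " + " ".join(words) + " endseq"


def build_mapping(lines):
    """Map image id -> list of cleaned captions from 'image,caption' rows."""
    mapping = {}
    for line in lines:
        image, sep, caption = line.strip().partition(",")
        if not sep or image == "image":  # skip malformed rows and the header
            continue
        image_id = image.rsplit(".", 1)[0]
        mapping.setdefault(image_id, []).append(clean_caption(caption))
    return mapping


def build_word_index(mapping):
    """Assign 1-based ids in order of first appearance (0 is padding)."""
    index = {}
    for captions in mapping.values():
        for caption in captions:
            for word in caption.split():
                index.setdefault(word, len(index) + 1)
    return index


def max_caption_length(mapping):
    """Longest caption in words, including the startseq/endseq markers."""
    return max(len(c.split()) for caps in mapping.values() for c in caps)
```

The `startseq` and `endseq` markers matter later: decoding starts from the first and stops at the second.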

Step 3 — Train the Model

python train.py

Trains for up to 30 epochs with early stopping. Resumes automatically if interrupted.
Output: vision_voice_model.keras
Target loss: below 3.0
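
Early stopping and resume-after-interrupt are typically handled with Keras callbacks plus a check for an existing checkpoint. A minimal sketch follows; `load_or_build`, `make_callbacks`, the monitored quantity (`loss`), and the patience value are assumptions, not details taken from train.py.

```python
import os

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.models import load_model

MODEL_PATH = "vision_voice_model.keras"


def load_or_build(build_fn, path=MODEL_PATH):
    """Resume from a saved checkpoint if one exists, else build a new model."""
    if os.path.exists(path):
        return load_model(path)
    return build_fn()


def make_callbacks(path=MODEL_PATH, patience=3):
    return [
        # Stop once the monitored loss has not improved for `patience` epochs.
        EarlyStopping(monitor="loss", patience=patience, restore_best_weights=True),
        # Save the best model so far, so an interrupted run can pick up from it.
        ModelCheckpoint(path, monitor="loss", save_best_only=True),
    ]
```

A training call would then look roughly like `model.fit(data, epochs=30, callbacks=make_callbacks())`, where `data` is whatever batch source the training script provides.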

Step 4 — Run the Live Demo

python main.py

| Step | What Happens |
| --- | --- |
| 1 | Webcam opens with live 720p preview |
| 2 | 3-second countdown starts on SPACEBAR |
| 3 | Photo captured & saved to `data/Images/captured.jpg` |
| 4 | InceptionV3 extracts features from the photo |
| 5 | LSTM decodes caption word-by-word |
| 6 | Image displayed with caption as title |
| 7 | Caption spoken aloud via pyttsx3 |
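
Word-by-word decoding (step 5) is usually a greedy loop: feed the words generated so far, take the most probable next word, and stop at `endseq`. The sketch below keeps the model behind a hypothetical `predict_next` callable so the loop itself is visible; left-padding with zeros is an assumption that matches the default of Keras `pad_sequences`, and main.py may be structured differently.

```python
def greedy_caption(predict_next, index_to_word, word_to_index, max_length):
    """Build a caption one word at a time, always taking the likeliest word.

    predict_next(padded_ids) must return one probability per vocabulary id.
    With a Keras model it could wrap something like
    model.predict([features, padded_ids], verbose=0)[0]  (assumed shape).
    """
    sequence = [word_to_index["startseq"]]
    words = []
    for _ in range(max_length):
        # Left-pad with zeros to the fixed length the model was trained on.
        padded = [0] * (max_length - len(sequence)) + sequence
        probs = predict_next(padded)
        best = max(range(len(probs)), key=probs.__getitem__)
        word = index_to_word.get(best)
        if word is None or word == "endseq":
            break
        words.append(word)
        sequence.append(best)
    return " ".join(words)
```

Greedy decoding is the simplest choice; beam search generally gives better captions at the cost of more model calls per word.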

📷 Camera Controls

| Key | Action |
| --- | --- |
| `SPACEBAR` | Start 3-second countdown → capture photo |
| `Q` | Quit camera without capturing |
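
The key handling above can be sketched as a small per-frame state function plus an OpenCV loop. This is an illustration under assumptions, not the contents of camera.py: `handle_key`, `capture_photo`, the window title, and the overlay styling are hypothetical, and the 1280x720 request only takes effect if the camera supports it.

```python
import time

COUNTDOWN_SECONDS = 3


def handle_key(key, countdown_started_at, now):
    """Return (action, countdown_started_at) for one preview frame.

    action is "quit", "capture", or "preview".
    """
    if key in (ord("q"), ord("Q")):
        return "quit", None
    if key == ord(" ") and countdown_started_at is None:
        countdown_started_at = now  # SPACEBAR starts the countdown
    if countdown_started_at is not None and now - countdown_started_at >= COUNTDOWN_SECONDS:
        return "capture", None
    return "preview", countdown_started_at


def capture_photo(path="data/Images/captured.jpg"):
    import cv2  # imported here so handle_key can be tested without OpenCV

    cap = cv2.VideoCapture(0)
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)
    started = None
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                return None
            preview = frame.copy()
            if started is not None:
                remaining = COUNTDOWN_SECONDS - int(time.time() - started)
                cv2.putText(preview, str(remaining), (50, 100),
                            cv2.FONT_HERSHEY_SIMPLEX, 3, (0, 255, 0), 5)
            cv2.imshow("VisionVoice", preview)
            key = cv2.waitKey(1) & 0xFF
            action, started = handle_key(key, started, time.time())
            if action == "quit":
                return None
            if action == "capture":
                cv2.imwrite(path, frame)  # save the clean frame, not the overlay
                return path
    finally:
        cap.release()
        cv2.destroyAllWindows()
```

Keeping the key logic in a pure function makes the countdown behaviour checkable without a webcam attached.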

📊 Pipeline Overview

extract_features.py  →  features.pkl
preprocess.py        →  tokenizer.pkl + mapping.pkl + max_length.txt
train.py             →  vision_voice_model.keras
main.py              →  loads all of the above + camera.py + voice.py

🗂️ File Generation Summary

| File | Created By | Used By |
| --- | --- | --- |
| `features.pkl` | `extract_features.py` | `train.py`, `main.py` |
| `tokenizer.pkl` | `preprocess.py` | `train.py`, `main.py` |
| `mapping.pkl` | `preprocess.py` | `train.py` |
| `max_length.txt` | `preprocess.py` | `train.py`, `main.py` |
| `vision_voice_model.keras` | `train.py` | `main.py` |
| `data/Images/captured.jpg` | `camera.py` | `main.py` |
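
Because main.py depends on files produced by the earlier steps, a quick pre-flight check can say which step still needs to be run. The helper below is a hypothetical addition, not part of the repository; the file-to-script pairs are taken from the table above.

```python
from pathlib import Path

# Files main.py needs, mapped to the script that generates each one.
REQUIRED = {
    "features.pkl": "extract_features.py",
    "tokenizer.pkl": "preprocess.py",
    "max_length.txt": "preprocess.py",
    "vision_voice_model.keras": "train.py",
}


def missing_artifacts(root="."):
    """Return (file, generating_script) pairs for files not found in root."""
    root = Path(root)
    return [(name, script) for name, script in REQUIRED.items()
            if not (root / name).exists()]


if __name__ == "__main__":
    for name, script in missing_artifacts():
        print(f"Missing {name}: run `python {script}` first")
```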

⭐ If you found this project helpful, give it a star! ⭐
