🧿 VisionAI — Multimodal Visual Question Answering System

A production-grade, visually stunning AI web application that can see and understand images.

🌟 What Is VisionAI?

VisionAI is a multimodal AI application that combines computer vision and natural language processing to let users ask questions about images in plain English and receive intelligent answers — powered by the BLIP (Bootstrapping Language-Image Pre-training) Vision-Language Model from Salesforce.

The interface is designed to look and feel like an expensive AI SaaS product, with glassmorphism design, smooth animations, and a dark-mode futuristic aesthetic.

✨ Features

🔮 Core AI Features

Feature	Description
Visual Q&A	Ask any natural-language question about an uploaded image
Image Captioning	Automatically generate descriptive captions for images
Object Detection	Detect and list objects present in the image
OCR Extraction	Extract visible text from images using Tesseract
Confidence Scores	Every answer comes with an estimated confidence score
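
The README does not say how the confidence score is estimated. One common approach, shown here as a minimal sketch and not as the app's actual method, is to take the geometric mean of the probabilities of the generated answer tokens. The function name `sequence_confidence` and its input format (a list of per-token log-probabilities) are assumptions made for illustration.

```python
import math


def sequence_confidence(token_logprobs):
    """Geometric mean of token probabilities, as a value in [0, 1].

    `token_logprobs` is a hypothetical input: the natural-log probability
    of each generated answer token, however the app obtains them.
    """
    if not token_logprobs:
        return 0.0
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_logprob)


# Two tokens with probabilities 0.9 and 0.4.
score = sequence_confidence([math.log(0.9), math.log(0.4)])
print(f"confidence: {score:.2f}")
```

Averaging in log space keeps long answers from being penalized simply for having more tokens, which a plain product of probabilities would do.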

🎨 UI/UX Features

Feature	Description
Glassmorphism Design	Premium dark-mode UI with blur effects and gradients
Animated Hero	Gradient text animations, floating chips, smooth entrance
Sidebar Navigation	Clean navigation between all app sections
Floating AI Icon	Pulsing animated assistant icon in the corner
Text-to-Speech	AI reads answers aloud via gTTS
Camera Capture	Use your webcam to capture and analyse images
Session History	Chat-style Q&A history maintained throughout session
Analytics Dashboard	Session statistics, confidence chart, performance metrics
Export Tools	Download session as JSON or formatted text report
Mobile Responsive	Works beautifully on all screen sizes
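
As an illustration of the session history and export features listed above, the sketch below serializes a list of Q&A entries to JSON and to a plain-text report. The entry fields (`question`, `answer`, `confidence`) and the function names are assumptions; the real logic lives in export_utils.py and may differ.

```python
import json
from datetime import datetime, timezone


def export_json(history):
    """Serialize the session history (a list of dicts) to a JSON string."""
    payload = {
        "exported_at": datetime.now(timezone.utc).isoformat(),
        "entries": history,
    }
    return json.dumps(payload, indent=2, ensure_ascii=False)


def export_text(history):
    """Render the session history as a simple numbered text report."""
    lines = ["VisionAI session report", ""]
    for i, entry in enumerate(history, start=1):
        lines.append(f"Q{i}: {entry['question']}")
        lines.append(f"A{i}: {entry['answer']} (confidence {entry['confidence']:.0%})")
        lines.append("")
    return "\n".join(lines)


# Hypothetical session with a single question.
history = [{"question": "Is there a dog?", "answer": "yes", "confidence": 0.91}]
print(export_text(history))
```

In a Streamlit app, either string could be handed to `st.download_button` as the `data` argument.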

🧠 Question Types Supported

✅ Yes/No questions — "Is there a dog in this image?"
🔢 Counting — "How many cars are visible?"
🎨 Color recognition — "What color is the car?"
🏷️ Object identification — "What is the main object?"
🏙️ Scene understanding — "Is this indoors or outdoors?"
🧑 Activity recognition — "What is the person doing?"
📍 Location questions — "Where does this appear to be taken?"

🚀 Quick Start

Prerequisites

Python 3.9 or higher
pip or conda
(Optional) CUDA-capable GPU for faster inference

1. Clone the Repository

git clone https://github.com/your-username/multimodal-vqa.git
cd multimodal-vqa

2. Create a Virtual Environment

# Using venv
python -m venv venv
source venv/bin/activate       # Linux/macOS
venv\Scripts\activate          # Windows

# Or using conda
conda create -n visionai python=3.10
conda activate visionai

3. Install Dependencies

pip install -r requirements.txt

Install Tesseract OCR (for OCR feature)

# Ubuntu / Debian
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Windows
# Download installer from: https://github.com/tesseract-ocr/tesseract
# Add installation folder to PATH
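
To confirm that the Tesseract binary is visible to Python after installation, a small check like the following can help. It is a sketch: `tesseract_path` and `extract_text` are hypothetical helper names, not functions from ocr_utils.py. The only library call used is `pytesseract.image_to_string`, which needs both the pytesseract package and the OS binary.

```python
import shutil


def tesseract_path():
    """Return the full path of the tesseract executable, or None if not on PATH."""
    return shutil.which("tesseract")


def extract_text(image_path, lang="eng"):
    """Run OCR on an image file and return the recognized text."""
    if tesseract_path() is None:
        raise RuntimeError("Tesseract binary not found on PATH; install it first.")
    # Imported lazily so the PATH check above works without these packages.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang).strip()


if __name__ == "__main__":
    print("tesseract found at:", tesseract_path())
```

If `tesseract_path()` returns `None` on Windows, the installation folder is most likely missing from PATH.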

4. Run the Application

streamlit run app.py

Open your browser and navigate to http://localhost:8501 🎉

📁 Project Structure

VisionAI/
│
├── app.py                    ← Main Streamlit application entry point
├── requirements.txt          ← Python dependencies
├── README.md                 ← This file
├── .gitignore                ← Git ignore rules
│
├── model_utils.py            ← BLIP model loading & AI inference
├── image_utils.py            ← Image preprocessing & metadata
├── ocr_utils.py              ← Tesseract OCR integration
├── tts_utils.py              ← Text-to-speech (gTTS + pyttsx3)
├── export_utils.py           ← JSON & HTML report export logic
├── filter_utils.py           ← CV filters & image enhancement algorithms
├── classification_utils.py   ← Quality classification, color extraction, scene classification
├── similarity_utils.py       ← Feature extraction, ORB/SSIM image comparison logic
├── ui_components.py          ← HTML/CSS layout and component generator
├── styles.py                 ← Cyber-Neon custom dark/light theme CSS overrides
│
├── setup_windows.bat         ← Automated setup script for Windows
├── run_visionai.bat          ← App starter script for Windows
│
├── VisionAI_Colab_Run.ipynb  ← Colab Runner notebook
├── VisionAI_Colab.zip        ← Package zip for Colab environment
└── venv/                     ← Virtual environment directory (ignored by git)

🧪 Sample Test Questions

Try these example questions after uploading an image:

Image Type	Question	Expected Answer
Street scene	"How many cars are visible?"	Number
Portrait	"What is the person doing?"	Activity description
Food photo	"What food is shown?"	Food name
Landscape	"Is this taken outdoors?"	"yes"
Text screenshot	"What does the text say?"	OCR result
Animals	"What animal is in the image?"	Animal name
Color test	"What color is the dominant object?"	Color name

⚙️ Configuration

GPU Support

By default the app runs on CPU. To enable GPU (CUDA):

pip uninstall torch torchvision
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

The DEVICE variable in model_utils.py then selects the GPU automatically when one is available:

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
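
A slightly more defensive variant of the same idea is sketched below. It also chooses a precision, on the assumption that half precision is wanted on GPU to reduce memory use; the function name `pick_device_and_dtype` is hypothetical and not part of model_utils.py.

```python
def pick_device_and_dtype():
    """Return (device, dtype_name), falling back to CPU and float32.

    Half precision is only chosen on CUDA, since float16 inference is
    poorly supported on many CPUs.
    """
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; report the CPU defaults.
        return "cpu", "float32"
    if torch.cuda.is_available():
        return "cuda", "float16"
    return "cpu", "float32"


device, dtype_name = pick_device_and_dtype()
print(f"running on {device} with {dtype_name}")
```

The dtype is returned as a string here so the sketch runs without PyTorch; real code would map it with `getattr(torch, dtype_name)`.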

BLIP-2 (Larger, More Powerful Model)

To use BLIP-2 instead of BLIP, edit model_utils.py:

# Requires ~16GB VRAM / 32GB RAM
from transformers import Blip2Processor, Blip2ForConditionalGeneration

VQA_MODEL_ID = "Salesforce/blip2-opt-2.7b"
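
For orientation, a fuller loading and inference path for BLIP-2 might look like the sketch below. It assumes the Hugging Face `Blip2Processor` and `Blip2ForConditionalGeneration` classes imported above and the "Question: ... Answer:" prompt format commonly used with the OPT-based BLIP-2 checkpoints. The helper names are hypothetical, and how this slots into model_utils.py is not specified by the README.

```python
BLIP2_MODEL_ID = "Salesforce/blip2-opt-2.7b"


def build_vqa_prompt(question):
    """Wrap a question in the prompt format BLIP-2 OPT checkpoints expect."""
    return f"Question: {question.strip()} Answer:"


def load_blip2(model_id=BLIP2_MODEL_ID):
    """Load processor and model, using float16 when a CUDA GPU is available."""
    import torch
    from transformers import Blip2Processor, Blip2ForConditionalGeneration

    use_cuda = torch.cuda.is_available()
    dtype = torch.float16 if use_cuda else torch.float32
    processor = Blip2Processor.from_pretrained(model_id)
    model = Blip2ForConditionalGeneration.from_pretrained(model_id, torch_dtype=dtype)
    model.to("cuda" if use_cuda else "cpu")
    return processor, model


def answer_with_blip2(image, question, processor, model, max_new_tokens=20):
    """Answer a question about a PIL image with an already loaded BLIP-2 model."""
    inputs = processor(
        images=image, text=build_vqa_prompt(question), return_tensors="pt"
    ).to(model.device, model.dtype)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
```

Unlike the base BLIP VQA model, BLIP-2 is a generative model prompted with text, which is why the question has to be wrapped before it is passed to the processor.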

🌐 Deployment

Option 1: Streamlit Community Cloud (Free)

1. Push your code to GitHub
2. Go to share.streamlit.io
3. Connect your GitHub repo
4. Set main file as app.py
5. Add any secrets in the Secrets manager
6. Click Deploy ✓

Note: The free tier uses CPU; model loading may take ~60s on first run.

Option 2: Hugging Face Spaces

1. Create a new Space at huggingface.co/spaces
2. Choose Streamlit as the SDK
3. Upload all project files or connect your GitHub repo
4. Hugging Face will automatically install requirements.txt
5. Set hardware to CPU Basic (free) or T4 GPU (paid)

Add a README.md with frontmatter:

---
title: VisionAI VQA
emoji: 🧿
colorFrom: indigo
colorTo: cyan
sdk: streamlit
sdk_version: 1.35.0
app_file: app.py
pinned: false
---

Option 3: Local / Server Deployment

# Production server with Streamlit
streamlit run app.py \
  --server.port 8501 \
  --server.address 0.0.0.0 \
  --server.headless true \
  --browser.gatherUsageStats false

Option 4: Docker

FROM python:3.11-slim

RUN apt-get update && apt-get install -y \
    tesseract-ocr \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.address=0.0.0.0"]

Build and run the image:

docker build -t visionai .
docker run -p 8501:8501 visionai
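
Once the container or server is running, a quick liveness probe can confirm that Streamlit is serving requests. The sketch below polls Streamlit's built-in health endpoint, `/_stcore/health`, which is available in recent Streamlit releases (older versions used a different path). The function name `is_healthy` and the default URL are assumptions for illustration.

```python
import urllib.error
import urllib.request


def is_healthy(base_url="http://localhost:8501", timeout=2.0):
    """Return True if the Streamlit health endpoint answers with HTTP 200."""
    url = base_url.rstrip("/") + "/_stcore/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx response.
        return False


if __name__ == "__main__":
    print("VisionAI is up" if is_healthy() else "VisionAI is not reachable")
```

The same endpoint can back a Docker `HEALTHCHECK` or a load balancer probe.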

🔧 Troubleshooting

Issue	Solution
Model download slow	First run downloads ~1GB; subsequent runs use cache
OCR returns empty	Install the Tesseract OS binary (see "Install Tesseract OCR" under Quick Start)
TTS not working	Install gTTS: `pip install gTTS`; requires internet
CUDA out of memory	Switch to `torch.float16` (half precision) to reduce memory use, or run on CPU
Slow inference	Normal on CPU (~3-8s); GPU reduces to <1s
Port already in use	`streamlit run app.py --server.port 8502`

📊 Model Information

Property	Value
Model	Salesforce/blip-vqa-base
Type	Vision-Language Model (BLIP)
Parameters	~385M
Task	Visual Question Answering
Input	Image + Text Question
Output	Text Answer
License	BSD-3-Clause
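
The input and output rows above correspond to a short inference path. The sketch below uses the Hugging Face `BlipProcessor` and `BlipForQuestionAnswering` classes with the model ID from the table. The helper names `load_blip_vqa` and `answer_question` are hypothetical and are not taken from model_utils.py.

```python
MODEL_ID = "Salesforce/blip-vqa-base"  # model named in the table above


def load_blip_vqa(model_id=MODEL_ID):
    """Download (or load from cache) the BLIP VQA processor and model."""
    # Imported lazily so this file can be imported without transformers.
    from transformers import BlipProcessor, BlipForQuestionAnswering

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForQuestionAnswering.from_pretrained(model_id)
    model.eval()
    return processor, model


def answer_question(image, question, processor, model, max_new_tokens=20):
    """Return the model's short text answer for a PIL image and a question."""
    inputs = processor(image, question, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()
```

A typical call would be:

```python
from PIL import Image

processor, model = load_blip_vqa()
image = Image.open("street.jpg").convert("RGB")  # hypothetical file name
print(answer_question(image, "How many cars are visible?", processor, model))
```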

🛠️ Tech Stack

Layer	Technology
Frontend	Streamlit + Custom CSS (Glassmorphism)
AI Model	BLIP VQA (Salesforce via Hugging Face)
ML Framework	PyTorch + Hugging Face Transformers
Image Processing	Pillow (PIL)
OCR	Tesseract + pytesseract
TTS	gTTS (Google Text-to-Speech)
Export	ReportLab (PDF), JSON
Language	Python 3.9+

📄 License

MIT License — free to use, modify, and distribute.

🙏 Acknowledgements

Salesforce BLIP — Vision-Language Model
Hugging Face Transformers — Model hosting
Streamlit — Web framework
Google Fonts — Inter & Space Grotesk

Built with ❤️ and 🧿 | VisionAI — See Beyond the Pixels
