tahanawab4848/VisionAI

🧿 VisionAI — Multimodal Visual Question Answering System

A production-grade, visually stunning AI web application that can see and understand images.


🌟 What Is VisionAI?

VisionAI is a multimodal AI application that combines computer vision and natural language processing to let users ask questions about images in plain English and receive intelligent answers — powered by the BLIP (Bootstrapping Language-Image Pre-training) Vision-Language Model from Salesforce.

The interface is designed to look and feel like an expensive AI SaaS product, with glassmorphism design, smooth animations, and a dark-mode futuristic aesthetic.


✨ Features

🔮 Core AI Features

| Feature | Description |
| --- | --- |
| Visual Q&A | Ask any natural-language question about an uploaded image |
| Image Captioning | Automatically generate descriptive captions for images |
| Object Detection | Detect and list objects present in the image |
| OCR Extraction | Extract visible text from images using Tesseract |
| Confidence Scores | Every answer comes with an estimated confidence score |
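
This README does not say how the confidence score is estimated. One common approach, shown here only as a sketch, is to take the geometric mean of the per-token probabilities of the generated answer. The function name `sequence_confidence` and the assumption that token probabilities are available from the model are illustrative, not taken from the project.

```python
import math
from typing import Sequence


def sequence_confidence(token_probs: Sequence[float]) -> float:
    """Return the geometric mean of per-token probabilities, in [0, 1].

    Hypothetical helper: assumes the caller already has one probability
    per generated answer token.
    """
    if not token_probs:
        return 0.0
    for p in token_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    if any(p == 0.0 for p in token_probs):
        # A single impossible token makes the whole sequence impossible.
        return 0.0
    # Average in log space to avoid underflow on long answers.
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(log_sum / len(token_probs))


if __name__ == "__main__":
    print(f"{sequence_confidence([0.9, 0.8]):.3f}")
```

The geometric mean keeps long answers from being penalised simply for having more tokens, which a plain product of probabilities would do.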

🎨 UI/UX Features

| Feature | Description |
| --- | --- |
| Glassmorphism Design | Premium dark-mode UI with blur effects and gradients |
| Animated Hero | Gradient text animations, floating chips, smooth entrance |
| Sidebar Navigation | Clean navigation between all app sections |
| Floating AI Icon | Pulsing animated assistant icon in the corner |
| Text-to-Speech | AI reads answers aloud via gTTS |
| Camera Capture | Use your webcam to capture and analyse images |
| Session History | Chat-style Q&A history maintained throughout session |
| Analytics Dashboard | Session statistics, confidence chart, performance metrics |
| Export Tools | Download session as JSON or formatted text report |
| Mobile Responsive | Works beautifully on all screen sizes |
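
As a rough illustration of the session history and JSON export features, the sketch below serialises a list of Q&A entries with the standard library. The `QAEntry` fields and the `export_session_json` name are assumptions made for this example; the project's own `export_utils.py` may be organised differently.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class QAEntry:
    """One question/answer pair from a session (hypothetical schema)."""
    question: str
    answer: str
    confidence: float


def export_session_json(entries: List[QAEntry]) -> str:
    """Serialise the session history to a pretty-printed JSON string."""
    payload = {
        "exported_at": datetime.now(timezone.utc).isoformat(),
        "count": len(entries),
        "entries": [asdict(entry) for entry in entries],
    }
    # ensure_ascii=False keeps non-English answers readable in the file.
    return json.dumps(payload, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    history = [QAEntry("Is this taken outdoors?", "yes", 0.93)]
    print(export_session_json(history))
```

In a Streamlit app, a string like this could be handed to `st.download_button` as the `data` argument to offer it as a file download.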

🧠 Question Types Supported

  • Yes/No questions — "Is there a dog in this image?"
  • 🔢 Counting — "How many cars are visible?"
  • 🎨 Color recognition — "What color is the car?"
  • 🏷️ Object identification — "What is the main object?"
  • 🏙️ Scene understanding — "Is this indoors or outdoors?"
  • 🧑 Activity recognition — "What is the person doing?"
  • 📍 Location questions — "Where does this appear to be taken?"
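
BLIP itself does not need the question type as input, but tagging each question can be handy for the session analytics. The following is a minimal keyword heuristic, written only for illustration; the category names and the `classify_question` function are hypothetical and not part of the project.

```python
# Words that typically open a yes/no question (assumed list, not exhaustive).
YES_NO_STARTERS = {"is", "are", "does", "do", "can", "was", "were", "has", "have"}


def classify_question(question: str) -> str:
    """Map a question to one of the question types listed above."""
    q = question.strip().lower()
    if q.startswith("how many"):
        return "counting"
    if q.startswith(("what color", "what colour")):
        return "color"
    if q.startswith("where"):
        return "location"
    if "doing" in q:
        return "activity"
    first_word = q.split(" ", 1)[0] if q else ""
    if first_word in YES_NO_STARTERS:
        # "Is this indoors or outdoors?" offers alternatives, so treat it
        # as scene understanding rather than a plain yes/no question.
        return "scene" if " or " in q else "yes_no"
    return "object"


if __name__ == "__main__":
    for text in ("How many cars are visible?", "Is this indoors or outdoors?"):
        print(text, "->", classify_question(text))
```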

🚀 Quick Start

Prerequisites

  • Python 3.9 or higher
  • pip or conda
  • (Optional) CUDA-capable GPU for faster inference

1. Clone the Repository

git clone https://github.com/tahanawab4848/VisionAI.git
cd VisionAI

2. Create a Virtual Environment

# Using venv
python -m venv venv
source venv/bin/activate       # Linux/macOS
venv\Scripts\activate          # Windows

# Or using conda
conda create -n visionai python=3.10
conda activate visionai

3. Install Dependencies

pip install -r requirements.txt

Install Tesseract OCR (for OCR feature)

# Ubuntu / Debian
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Windows
# Download installer from: https://github.com/tesseract-ocr/tesseract
# Add installation folder to PATH
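
After installing, it can help to confirm that the `tesseract` binary is actually reachable on PATH before relying on the OCR feature. The check below is a small sketch using only the standard library; the helper names are made up for this example.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract binary, or None if not on PATH."""
    return shutil.which("tesseract")


def ocr_status_message() -> str:
    """Build a human-readable status line for the OCR dependency."""
    path = find_tesseract()
    if path is None:
        return "Tesseract not found on PATH; OCR extraction will be unavailable."
    return f"Tesseract found at {path}"


if __name__ == "__main__":
    print(ocr_status_message())
```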

4. Run the Application

streamlit run app.py

Open your browser and navigate to http://localhost:8501 🎉


📁 Project Structure

VisionAI/
│
├── app.py                    ← Main Streamlit application entry point
├── requirements.txt          ← Python dependencies
├── README.md                 ← This file
├── .gitignore                ← Git ignore rules
│
├── model_utils.py            ← BLIP model loading & AI inference
├── image_utils.py            ← Image preprocessing & metadata
├── ocr_utils.py              ← Tesseract OCR integration
├── tts_utils.py              ← Text-to-speech (gTTS + pyttsx3)
├── export_utils.py           ← JSON & HTML report export logic
├── filter_utils.py           ← CV filters & image enhancement algorithms
├── classification_utils.py   ← Quality classification, color extraction, scene classification
├── similarity_utils.py       ← Feature extraction, ORB/SSIM image comparison logic
├── ui_components.py          ← HTML/CSS layout and component generator
├── styles.py                 ← Cyber-Neon custom dark/light theme CSS overrides
│
├── setup_windows.bat         ← Automated setup script for Windows
├── run_visionai.bat          ← App starter script for Windows
│
├── VisionAI_Colab_Run.ipynb  ← Colab Runner notebook
├── VisionAI_Colab.zip        ← Package zip for Colab environment
└── venv/                     ← Virtual environment directory (ignored by git)

🧪 Sample Test Questions

Try these example questions after uploading an image:

| Image Type | Question | Expected Answer |
| --- | --- | --- |
| Street scene | "How many cars are visible?" | Number |
| Portrait | "What is the person doing?" | Activity description |
| Food photo | "What food is shown?" | Food name |
| Landscape | "Is this taken outdoors?" | "yes" |
| Text screenshot | "What does the text say?" | OCR result |
| Animals | "What animal is in the image?" | Animal name |
| Colour test | "What color is the dominant object?" | Color name |

⚙️ Configuration

GPU Support

By default the app runs on CPU. To enable GPU (CUDA), replace the installed PyTorch packages with a CUDA build:

pip uninstall torch torchvision
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

The DEVICE variable in model_utils.py is then auto-detected:

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
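
A slightly more defensive version of the same detection is sketched below. It falls back to CPU when PyTorch is not importable and also picks a precision to match the device. The function names are illustrative assumptions, not the project's actual code.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch  # imported lazily so this helper works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def pick_dtype_name(device: str) -> str:
    """Suggest a precision: half on GPU to save memory, full on CPU."""
    return "float16" if device == "cuda" else "float32"


if __name__ == "__main__":
    device = pick_device()
    print(device, pick_dtype_name(device))
```

Half precision roughly halves the memory needed for the model weights on GPU, while float32 remains the safer choice on CPU.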

BLIP-2 (Larger, More Powerful Model)

To use BLIP-2 instead of BLIP, edit model_utils.py:

# Requires ~16GB VRAM / 32GB RAM
from transformers import Blip2Processor, Blip2ForConditionalGeneration

VQA_MODEL_ID = "Salesforce/blip2-opt-2.7b"
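
For orientation, here is a hedged sketch of how BLIP-2 could be loaded and queried through Hugging Face Transformers. The "Question: ... Answer:" prompt follows the format used on the model card for the OPT-based checkpoints; the function names and the `max_new_tokens` value are assumptions for this example, and the project's own integration may differ.

```python
BLIP2_MODEL_ID = "Salesforce/blip2-opt-2.7b"


def build_blip2_prompt(question: str) -> str:
    """Wrap a question in the prompt format BLIP-2 OPT expects for VQA."""
    return f"Question: {question.strip()} Answer:"


def load_blip2(device: str = "cpu"):
    """Load processor and model; heavy imports are kept inside the function."""
    import torch
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    dtype = torch.float16 if device == "cuda" else torch.float32
    processor = Blip2Processor.from_pretrained(BLIP2_MODEL_ID)
    model = Blip2ForConditionalGeneration.from_pretrained(
        BLIP2_MODEL_ID, torch_dtype=dtype
    )
    return processor, model.to(device)


def answer_with_blip2(processor, model, image, question: str, device: str = "cpu") -> str:
    """Generate an answer for a PIL image and a plain-text question."""
    prompt = build_blip2_prompt(question)
    # Cast floating-point inputs to the model's dtype to avoid mismatches.
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(
        device, model.dtype
    )
    generated = model.generate(**inputs, max_new_tokens=20)
    text = processor.batch_decode(generated, skip_special_tokens=True)[0].strip()
    # Some library versions echo the prompt in the output; drop it if present.
    if text.startswith(prompt):
        text = text[len(prompt):].strip()
    return text
```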

🌐 Deployment

Option 1: Streamlit Community Cloud (Free)

  1. Push your code to GitHub
  2. Go to share.streamlit.io
  3. Connect your GitHub repo
  4. Set main file as app.py
  5. Add any secrets in the Secrets manager
  6. Click Deploy

Note: The free tier uses CPU; model loading may take ~60s on first run.

Option 2: Hugging Face Spaces

  1. Create a new Space at huggingface.co/spaces
  2. Choose Streamlit as the SDK
  3. Upload all project files or connect your GitHub repo
  4. Hugging Face will automatically install requirements.txt
  5. Set hardware to CPU Basic (free) or T4 GPU (paid)

Add a README.md with frontmatter:

---
title: VisionAI VQA
emoji: 🧿
colorFrom: indigo
colorTo: cyan
sdk: streamlit
sdk_version: 1.35.0
app_file: app.py
pinned: false
---

Option 3: Local / Server Deployment

# Production server with Streamlit
streamlit run app.py \
  --server.port 8501 \
  --server.address 0.0.0.0 \
  --server.headless true \
  --browser.gatherUsageStats false

Option 4: Docker

FROM python:3.11-slim

RUN apt-get update && apt-get install -y \
    tesseract-ocr \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.address=0.0.0.0"]

Build and run the image:

docker build -t visionai .
docker run -p 8501:8501 visionai

🔧 Troubleshooting

| Issue | Solution |
| --- | --- |
| Model download slow | First run downloads ~1GB; subsequent runs use cache |
| OCR returns empty | Install the Tesseract OS binary (see Install Dependencies) |
| TTS not working | Install gTTS: pip install gTTS; requires internet |
| CUDA out of memory | Load the model in torch.float16 or fall back to CPU |
| Slow inference | Normal on CPU (~3-8s); GPU reduces to <1s |
| Port already in use | streamlit run app.py --server.port 8502 |
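
For the "port already in use" case, a free port can also be found programmatically before launching. This is a standard-library sketch; the helper names and the 20-port search window are arbitrary choices for illustration.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def first_free_port(start: int = 8501, attempts: int = 20) -> int:
    """Scan upward from Streamlit's default port and return the first free one."""
    for port in range(start, start + attempts):
        if port_is_free(port):
            return port
    raise RuntimeError(f"no free port in {start}-{start + attempts - 1}")


if __name__ == "__main__":
    print(f"streamlit run app.py --server.port {first_free_port()}")
```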

📊 Model Information

| Property | Value |
| --- | --- |
| Model | Salesforce/blip-vqa-base |
| Type | Vision-Language Model (BLIP) |
| Parameters | ~385M |
| Task | Visual Question Answering |
| Input | Image + Text Question |
| Output | Text Answer |
| License | BSD-3-Clause |
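
To make the input/output contract above concrete, the sketch below runs a single question through `Salesforce/blip-vqa-base` with Hugging Face Transformers. It is a minimal example, not the code in `model_utils.py`: `normalize_question` and `answer_question` are hypothetical names, and reloading the model on every call is done only for brevity (a real app would cache it).

```python
VQA_MODEL_ID = "Salesforce/blip-vqa-base"


def normalize_question(question: str) -> str:
    """Collapse whitespace and make sure the question ends with '?'."""
    cleaned = " ".join(question.split())
    if cleaned and not cleaned.endswith("?"):
        cleaned += "?"
    return cleaned


def answer_question(image_path: str, question: str, device: str = "cpu") -> str:
    """Answer a question about the image at image_path using BLIP VQA."""
    # Heavy dependencies are imported lazily so the helper above stays light.
    from PIL import Image
    from transformers import BlipForQuestionAnswering, BlipProcessor

    processor = BlipProcessor.from_pretrained(VQA_MODEL_ID)
    model = BlipForQuestionAnswering.from_pretrained(VQA_MODEL_ID).to(device)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        image, normalize_question(question), return_tensors="pt"
    ).to(device)
    output_ids = model.generate(**inputs)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

A call such as `answer_question("street.jpg", "How many cars are visible")` would return a short text answer; the file name here is a placeholder.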

🛠️ Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend | Streamlit + Custom CSS (Glassmorphism) |
| AI Model | BLIP VQA (Salesforce via Hugging Face) |
| ML Framework | PyTorch + Hugging Face Transformers |
| Image Processing | Pillow (PIL) |
| OCR | Tesseract + pytesseract |
| TTS | gTTS (Google Text-to-Speech) |
| Export | ReportLab (PDF), JSON |
| Language | Python 3.9+ |

📄 License

MIT License — free to use, modify, and distribute.


🙏 Acknowledgements


Built with ❤️ and 🧿 | VisionAI — See Beyond the Pixels

About

VisionAI turns raw images into actionable insights with a sleek, professional UI—perfect for demos, research prototypes, or as the foundation for a commercial computer‑vision product. Feel free to fork, extend, and star the repo! 🌟
