VisionAI is a multimodal AI application that combines computer vision and natural language processing: users ask questions about an image in plain English and receive answers generated by BLIP (Bootstrapping Language-Image Pre-training), a vision-language model from Salesforce.
The interface is styled like a premium AI SaaS product, with glassmorphism design, smooth animations, and a dark-mode futuristic aesthetic.
| Feature | Description |
|---|---|
| Visual Q&A | Ask any natural-language question about an uploaded image |
| Image Captioning | Automatically generate descriptive captions for images |
| Object Detection | Detect and list objects present in the image |
| OCR Extraction | Extract visible text from images using Tesseract |
| Confidence Scores | Every answer comes with an estimated confidence score |
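
The table does not say how the confidence score is estimated. One common approach, shown here as a sketch rather than the app's actual method, is to take the probability the model assigned to each generated token and report the geometric mean. The function names and the toy logits below are illustrative assumptions.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float]) -> list[float]:
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def answer_confidence(step_logits: Sequence[Sequence[float]],
                      chosen_ids: Sequence[int]) -> float:
    """Geometric mean of the probabilities of the chosen tokens.

    step_logits[i] holds the vocabulary scores at generation step i;
    chosen_ids[i] is the index of the token emitted at that step.
    """
    if not chosen_ids or len(step_logits) != len(chosen_ids):
        raise ValueError("need one chosen token per generation step")
    log_sum = 0.0
    for logits, token_id in zip(step_logits, chosen_ids):
        log_sum += math.log(softmax(logits)[token_id])
    return math.exp(log_sum / len(chosen_ids))


# Toy example: two steps over a three-token vocabulary (made-up numbers).
toy_logits = [[2.0, 0.5, 0.1], [0.2, 3.0, 0.3]]
print(f"{answer_confidence(toy_logits, [0, 1]):.2f}")
```

The geometric mean keeps the score in the 0 to 1 range and penalises any single low-probability token, which makes it a reasonable fit for short VQA answers.
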

| Feature | Description |
|---|---|
| Glassmorphism Design | Premium dark-mode UI with blur effects and gradients |
| Animated Hero | Gradient text animations, floating chips, smooth entrance |
| Sidebar Navigation | Clean navigation between all app sections |
| Floating AI Icon | Pulsing animated assistant icon in the corner |
| Text-to-Speech | AI reads answers aloud via gTTS |
| Camera Capture | Use your webcam to capture and analyse images |
| Session History | Chat-style Q&A history maintained throughout session |
| Analytics Dashboard | Session statistics, confidence chart, performance metrics |
| Export Tools | Download session as JSON or formatted text report |
| Mobile Responsive | Responsive layout that adapts to all screen sizes |
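
As a rough illustration of the session history and export features, the sketch below keeps Q&A entries in a list and serialises them to JSON or a plain-text report. The `QAEntry` fields and the function names are assumptions made for illustration; the project's `export_utils.py` may be organised differently.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class QAEntry:
    """One question/answer pair from a session (hypothetical schema)."""
    question: str
    answer: str
    confidence: float  # 0.0 to 1.0


def export_json(history: list[QAEntry]) -> str:
    """Serialise the session history to a JSON string."""
    return json.dumps([asdict(e) for e in history], indent=2, ensure_ascii=False)


def export_text(history: list[QAEntry]) -> str:
    """Render the session history as a plain-text report."""
    lines = ["VisionAI session report", "=" * 23]
    for i, entry in enumerate(history, start=1):
        lines.append(f"{i}. Q: {entry.question}")
        lines.append(f"   A: {entry.answer} (confidence {entry.confidence:.0%})")
    return "\n".join(lines)


history = [QAEntry("How many cars are visible?", "2", 0.87)]
print(export_text(history))
```

In a Streamlit app, strings like these could be handed to `st.download_button` to offer the file to the user.
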
- ✅ Yes/No questions — "Is there a dog in this image?"
- 🔢 Counting — "How many cars are visible?"
- 🎨 Color recognition — "What color is the car?"
- 🏷️ Object identification — "What is the main object?"
- 🏙️ Scene understanding — "Is this indoors or outdoors?"
- 🧑 Activity recognition — "What is the person doing?"
- 📍 Location questions — "Where does this appear to be taken?"
- Python 3.9 or higher
- pip or conda
- (Optional) CUDA-capable GPU for faster inference
git clone https://github.com/your-username/multimodal-vqa.git
cd multimodal-vqa

# Using venv
python -m venv venv
source venv/bin/activate # Linux/macOS
venv\Scripts\activate # Windows
# Or using conda
conda create -n visionai python=3.10
conda activate visionai

pip install -r requirements.txt

# Ubuntu / Debian
sudo apt-get install tesseract-ocr
# macOS
brew install tesseract
# Windows
# Download installer from: https://github.com/tesseract-ocr/tesseract
# Add installation folder to PATH

streamlit run app.py

Open your browser and navigate to http://localhost:8501 🎉

VisionAI/
│
├── app.py ← Main Streamlit application entry point
├── requirements.txt ← Python dependencies
├── README.md ← This file
├── .gitignore ← Git ignore rules
│
├── model_utils.py ← BLIP model loading & AI inference
├── image_utils.py ← Image preprocessing & metadata
├── ocr_utils.py ← Tesseract OCR integration
├── tts_utils.py ← Text-to-speech (gTTS + pyttsx3)
├── export_utils.py ← JSON & HTML report export logic
├── filter_utils.py ← CV filters & image enhancement algorithms
├── classification_utils.py ← Quality classification, color extraction, scene classification
├── similarity_utils.py ← Feature extraction, ORB/SSIM image comparison logic
├── ui_components.py ← HTML/CSS layout and component generator
├── styles.py ← Cyber-Neon custom dark/light theme CSS overrides
│
├── setup_windows.bat ← Automated setup script for Windows
├── run_visionai.bat ← App starter script for Windows
│
├── VisionAI_Colab_Run.ipynb ← Colab Runner notebook
├── VisionAI_Colab.zip ← Package zip for Colab environment
└── venv/ ← Virtual environment directory (ignored by git)
Try these example questions after uploading an image:
| Image Type | Question | Expected Answer |
|---|---|---|
| Street scene | "How many cars are visible?" | Number |
| Portrait | "What is the person doing?" | Activity description |
| Food photo | "What food is shown?" | Food name |
| Landscape | "Is this taken outdoors?" | "yes" |
| Text screenshot | "What does the text say?" | OCR result |
| Animals | "What animal is in the image?" | Animal name |
| Color test | "What color is the dominant object?" | Color name |
By default the app runs on CPU. To enable GPU (CUDA):
pip uninstall torch torchvision
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

No code change is needed after that: in utils/model_utils.py the DEVICE variable is auto-detected:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

To use BLIP-2 instead of BLIP, edit utils/model_utils.py:
# Requires ~16GB VRAM / 32GB RAM
from transformers import Blip2Processor, Blip2ForConditionalGeneration
VQA_MODEL_ID = "Salesforce/blip2-opt-2.7b"

- Push your code to GitHub
- Go to share.streamlit.io
- Connect your GitHub repo
- Set main file as app.py
- Add any secrets in the Secrets manager
- Click Deploy ✓
Note: The free tier uses CPU; model loading may take ~60s on first run.
- Create a new Space at huggingface.co/spaces
- Choose Streamlit as the SDK
- Upload all project files or connect your GitHub repo
- Hugging Face will automatically install the dependencies listed in requirements.txt
- Set hardware to CPU Basic (free) or T4 GPU (paid)
Add a README.md with frontmatter:
---
title: VisionAI VQA
emoji: 🧿
colorFrom: indigo
colorTo: cyan
sdk: streamlit
sdk_version: 1.35.0
app_file: app.py
pinned: false
---

# Production server with Streamlit
streamlit run app.py \
--server.port 8501 \
--server.address 0.0.0.0 \
--server.headless true \
--browser.gatherUsageStats false

FROM python:3.11-slim
RUN apt-get update && apt-get install -y \
tesseract-ocr \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.address=0.0.0.0"]

docker build -t visionai .
docker run -p 8501:8501 visionai

| Issue | Solution |
|---|---|
| Model download slow | First run downloads ~1GB; subsequent runs use cache |
| OCR returns empty | Install Tesseract OS binary (see Prerequisites) |
| TTS not working | Install gTTS: pip install gTTS; requires internet |
| CUDA out of memory | Switch to torch.float16 (half precision) or use CPU |
| Slow inference | Normal on CPU (~3-8s); GPU reduces to <1s |
| Port already in use | streamlit run app.py --server.port 8502 |
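
Two of the issues above (a missing Tesseract binary and a port that is already in use) can be checked before launching. The following is a small preflight sketch that uses only the standard library; the function names are illustrative and not part of the project.

```python
import shutil
import socket


def tesseract_available() -> bool:
    """Return True if a `tesseract` executable is found on PATH."""
    return shutil.which("tesseract") is not None


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def first_free_port(start: int = 8501, attempts: int = 20) -> int:
    """Scan upward from `start` and return the first free port."""
    for port in range(start, start + attempts):
        if port_is_free(port):
            return port
    raise RuntimeError(f"no free port in {start}-{start + attempts - 1}")


if __name__ == "__main__":
    if not tesseract_available():
        print("Tesseract not found on PATH; OCR will return empty results.")
    print(f"streamlit run app.py --server.port {first_free_port()}")
```

Run as a script, it prints a `streamlit run` command with the first free port at or above 8501.
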

| Property | Value |
|---|---|
| Model | Salesforce/blip-vqa-base |
| Type | Vision-Language Model (BLIP) |
| Parameters | ~385M |
| Task | Visual Question Answering |
| Input | Image + Text Question |
| Output | Text Answer |
| License | BSD-3-Clause |
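
For reference, here is a minimal sketch of querying this checkpoint through Hugging Face Transformers, using the same device auto-detection the app describes. It assumes `torch`, `transformers`, and `Pillow` are installed. `answer_question` and the script name are illustrative, not the contents of `model_utils.py`. The heavy imports sit inside the function so that importing the module stays cheap.

```python
VQA_MODEL_ID = "Salesforce/blip-vqa-base"


def answer_question(image_path: str, question: str) -> str:
    """Run BLIP VQA on one image and return the decoded answer text."""
    import torch
    from PIL import Image
    from transformers import BlipForQuestionAnswering, BlipProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = BlipProcessor.from_pretrained(VQA_MODEL_ID)
    model = BlipForQuestionAnswering.from_pretrained(VQA_MODEL_ID).to(device)
    model.eval()

    # The processor resizes/normalises the image and tokenises the question.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=question, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs)
    return processor.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 3:
        print(answer_question(sys.argv[1], sys.argv[2]))
    else:
        print("usage: python vqa_sketch.py IMAGE_PATH QUESTION")
```

This sketch reloads the model on every call for simplicity. A long-running app would normally load the processor and model once and reuse them, for example through Streamlit's `st.cache_resource`.
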

| Layer | Technology |
|---|---|
| Frontend | Streamlit + Custom CSS (Glassmorphism) |
| AI Model | BLIP VQA (Salesforce via Hugging Face) |
| ML Framework | PyTorch + Hugging Face Transformers |
| Image Processing | Pillow (PIL) |
| OCR | Tesseract + pytesseract |
| TTS | gTTS (Google Text-to-Speech) |
| Export | ReportLab (PDF), JSON |
| Language | Python 3.9+ |
MIT License — free to use, modify, and distribute.
- Salesforce BLIP — Vision-Language Model
- Hugging Face Transformers — Model hosting
- Streamlit — Web framework
- Google Fonts — Inter & Space Grotesk
