DermGemma: Three-Phase Dermatology Classification Pipeline

A GPU-accelerated skin lesion classification system using MedGemma 4B, Google's medical vision-language model. The pipeline runs three sequential inference phases (specialist observation, clinical review, and classification) to produce a diagnostic code for dermoscopic images.

Built on quantized GGUF models served through llama-cpp-python with CUDA, designed to run on a single GPU inside Docker.

Architecture Overview

                         ┌──────────────────┐
                         │  Dermoscopic     │
                         │  Image (224x224) │
                         └────────┬─────────┘
                                  │
                    ┌─────────────▼──────────────┐
                    │  PHASE 1: Specialist       │
                    │  (Visual MedGemma + CLIP)  │
                    │                            │
                    │  4 targeted queries:       │
                    │  • Shape / Architecture    │
                    │  • Pigment network         │
                    │  • Structures              │
                    │  • Colors                  │
                    └─────────────┬──────────────┘
                                  │ raw observations
                         ┌────────▼────────┐
                         │  KV cache reset │
                         └────────┬────────┘
                    ┌─────────────▼────────────┐
                    │  PHASE 2: Comparison     │
                    │  (Visual MedGemma + CLIP)│
                    │                          │
                    │  Image + Phase 1 context │
                    │  → Verify observations   │
                    │  → Find additional signs │
                    └─────────────┬────────────┘
                                  │ clinical description
                         ┌────────▼────────┐
                         │  KV cache reset │
                         │  + sentence     │
                         │    cleaning     │
                         └────────┬────────┘
                    ┌─────────────▼────────────┐
                    │  PHASE 3: Classification │
                    │  (Text-only MedGemma)    │
                    │                          │
                    │  Context + cleaned desc  │
                    │  → MEL / NV / BCC / BKL  │
                    └─────────────┬────────────┘
                                  │
                         ┌────────▼────────┐
                         │  Diagnostic     │
                         │  Report         │
                         └─────────────────┘

The Three Phases

Phase 1: Specialist Observations

The visual model (MedGemma multimodal with CLIP projector) examines the dermoscopic image through four targeted queries, each designed to extract a specific clinical feature:

Query                  Expected Output
Architecture (shape)   round, oval, or irregular
Pigment network        yes or no
Structures             dots, globules, streaks, or none
Colors                 comma-separated color names

Each response is cleaned: markdown artifacts are stripped, known keywords extracted, and whitespace normalized. The queries use constrained prompts (one-word or brief answers) with max_tokens=32 to keep outputs focused.

Why separate queries instead of one prompt? Single-question prompts with tight token limits produce more reliable structured output from the 4B model than asking for everything at once.
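
A minimal sketch of one such query plus its cleaning step, assuming llama-cpp-python's multimodal chat API (the prompt wording and keyword list are illustrative, not the exact ones in pipeline.py):

import re

STRUCTURE_KEYWORDS = ("dots", "globules", "streaks", "none")

def ask_structures(vlm, image_uri):
    # One constrained query; the tight token limit keeps the answer focused
    out = vlm.create_chat_completion(
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_uri}},
                {"type": "text", "text": "Which structures are present: "
                 "dots, globules, streaks, or none? Answer in one word."},
            ],
        }],
        max_tokens=32,
    )
    text = out["choices"][0]["message"]["content"]
    text = re.sub(r"[*_`#]", "", text)                # strip markdown artifacts
    text = re.sub(r"\s+", " ", text).strip().lower()  # normalize whitespace
    hits = [k for k in STRUCTURE_KEYWORDS if k in text]
    return hits[0] if hits else text                  # keyword extraction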

Phase 2: Comparison Review

After a KV cache reset, the visual model receives the dermoscopic image along with Phase 1's specialist observations and a comparison prompt. The model is asked to:

  1. Compare each Phase 1 observation against what it sees in the image (consistent / inconsistent)
  2. Identify additional dermoscopic features not mentioned by Phase 1 (asymmetry, border irregularity, regression structures, vascular patterns)

The output is a multi-paragraph clinical description (~200-500 chars). This phase catches errors from Phase 1 and adds clinical context that a single-pass system would miss.
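
Illustratively, the comparison prompt might be assembled from Phase 1's structured observations like this (the wording is an assumption, not the exact prompt in pipeline.py):

def build_comparison_prompt(obs):
    # obs is the dict of Phase 1 answers, e.g. {"shape": "irregular", ...}
    context = (
        f"A specialist reported: shape={obs['shape']}, "
        f"pigment network={obs['network']}, "
        f"structures={obs['structures']}, colors={obs['colors']}."
    )
    task = (
        "Look at the image. For each reported finding, state whether it is "
        "consistent or inconsistent with what you see. Then describe any "
        "additional dermoscopic features: asymmetry, border irregularity, "
        "regression structures, vascular patterns."
    )
    return context + "\n" + task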

Cleaning: Before being passed to Phase 3, the description goes through sentence-level filtering (sketched below) that removes:

  • Refusal phrases ("I cannot", "I'm sorry", "as an AI")
  • Special token leakage (<start_of_turn>, <eos>, etc.)
  • Gibberish (repeated 3+ char patterns, excessive non-ASCII, low alphabetic ratio)

Phase 3: Classification

A text-only MedGemma instance (no CLIP projector, so a lighter memory footprint) receives the Phase 1 specialist context and the cleaned Phase 2 description, then classifies the lesion into one of four codes:

Code   Diagnosis
MEL    Melanoma
NV     Nevus (benign mole)
BCC    Basal Cell Carcinoma
BKL    Benign Keratosis

Classification uses max_tokens=16 and temperature=0.3 (low creativity: this is a categorical decision, not open-ended generation). The pipeline falls back to NV if the model doesn't produce a valid code.
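
A sketch of the classification call and its fallback, assuming the same create_chat_completion API (the prompt wording is illustrative):

VALID_CODES = ("MEL", "NV", "BCC", "BKL")

def classify(llm, context, description):
    out = llm.create_chat_completion(
        messages=[{
            "role": "user",
            "content": f"{context}\n{description}\n"
                       "Classify the lesion as exactly one of: MEL, NV, BCC, BKL.",
        }],
        max_tokens=16,
        temperature=0.3,   # categorical decision, not open-ended generation
    )
    raw = out["choices"][0]["message"]["content"].upper()
    for code in VALID_CODES:
        if code in raw:
            return code    # first valid code found
    return "NV"            # documented fallback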

Why text-only? Phase 3 doesn't need to see the image. The clinical evidence has already been extracted and verified in Phases 1-2. Using text-only inference avoids loading the 812 MB CLIP projector and eliminates visual noise from the classification decision.

Singleton Architecture & Memory Optimization

The model is loaded once and reused across all phases and images, with no unload/reload overhead.

Model Lifecycle

Server Start
    ↓
DermPipeline.__init__()  →  Load MedGemma once into GPU VRAM
    ↓
_vlm instance created (singleton)
    ↓
[Image 1]  Phase 1 → Phase 2 → Phase 3   (reuse _vlm, reset KV cache between phases)
[Image 2]  Phase 1 → Phase 2 → Phase 3   (reuse _vlm, reset KV cache between phases)
[Image N]  Phase 1 → Phase 2 → Phase 3   (reuse _vlm, reset KV cache between phases)
    ↓
Server Shutdown
    ↓
Model unloaded
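
Condensed into code, the lifecycle looks roughly like this (load_medgemma() and the run_phase* helpers are hypothetical stand-ins for the pipeline internals):

class DermPipeline:
    def __init__(self):
        self._vlm = load_medgemma()   # loaded once, lives for the server lifetime

    def analyze(self, image_uri):
        obs = run_phase1(self._vlm, image_uri)
        self._vlm.reset()             # flush KV cache before Phase 2
        desc = run_phase2(self._vlm, image_uri, obs)
        self._vlm.reset()             # flush KV cache before Phase 3
        return run_phase3(self._vlm, obs, clean_description(desc))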

Why Singleton?

  1. Model Loading Cost: MedGemma 4B + CLIP projector = ~4.7 GB GPU memory. Loading takes 8-15 seconds, which a reload-per-image approach pays for every image.
  2. Current Optimization: the model is loaded once in DermPipeline.__init__() and reused for all subsequent images.
  3. Server Integration: server.py maintains a global _pipeline singleton guarded by a thread-safe _pipeline_lock, ensuring only one GPU analysis runs at a time.
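
A minimal sketch of that server-side arrangement (the route payloads are assumptions):

import threading
from flask import Flask, jsonify, request

app = Flask(__name__)
_pipeline = DermPipeline()            # one model load at server start
_pipeline_lock = threading.Lock()

@app.route("/api/analyze", methods=["POST"])
def analyze():
    with _pipeline_lock:              # only one GPU analysis at a time
        code = _pipeline.analyze(request.json["image"])
    return jsonify({"classification": code})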

KV Cache Reset Strategy

Between phases, the pipeline calls vlm.reset() (3 times per image) to flush the KV cache:

# Phase 1 → Phase 2 transition
vlm.reset()  # Clear stale context from Phase 1 queries

# Phase 2 → Phase 3 transition
vlm.reset()  # Clear stale context from Phase 2 description

Why? llama-cpp-python's KV cache can retain previous context tokens, causing hallucination bleed (e.g., Phase 1 answers contaminating Phase 2 reasoning). Reset clears this without reloading the model (~50ms, no sleep overhead needed).

Performance Impact

Approach                      Model Load          Per-Image Overhead   Total (100 images)
Reload per phase              3 × 8 s per image   24 s per image       40 minutes
Singleton + Reset (Current)   1 × 8 s total       ~0.1 s per image     ~10 seconds

Savings: 99.6% reduction in model management overhead.

Model Details

MedGemma 1.5 4B IT (google/medgemma-1.5-4b-it), quantized to GGUF format:

  • medgemma.gguf β€” 3.9 GB text model (Phases 1, 2, and 3)
  • medgemma-mmproj.gguf β€” 812 MB CLIP vision projector (Phases 1 and 2 only)

Served via llama-cpp-python 0.3.16 with CUDA acceleration (n_gpu_layers=-1, i.e. all layers on GPU).
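
A loading sketch under that setup (the file paths, the handler class, and the context size are assumptions):

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

vlm = Llama(
    model_path="models/medgemma.gguf",
    chat_handler=Llava15ChatHandler(              # wires in the CLIP projector
        clip_model_path="models/medgemma-mmproj.gguf"),
    n_gpu_layers=-1,                              # all layers on GPU
    n_ctx=4096,                                   # assumed context window
)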

Custom Chat Format

MedGemma defaults to a "thinking mode" that spends tokens on internal reasoning before answering. The medgemma-direct chat format bypasses this by pre-seeding the model turn so the response begins immediately after an Answer: prefix:

<start_of_turn>model
Answer:

This is registered as a custom chat format handler in pipeline.py and used for all three phases.
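
A sketch of how such a format can be registered with llama-cpp-python (the template details are illustrative, and the text-only message path is assumed):

from llama_cpp.llama_chat_format import ChatFormatterResponse, register_chat_format

@register_chat_format("medgemma-direct")
def format_medgemma_direct(messages, **kwargs):
    prompt = ""
    for msg in messages:
        role = "model" if msg["role"] == "assistant" else "user"
        prompt += f"<start_of_turn>{role}\n{msg['content']}<end_of_turn>\n"
    # Open the model turn pre-seeded with "Answer:" to skip thinking mode
    prompt += "<start_of_turn>model\nAnswer:"
    return ChatFormatterResponse(prompt=prompt, stop=["<end_of_turn>"])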

Project Structure

├── code/
│   ├── pipeline.py          # Core three-phase pipeline
│   ├── server.py            # Flask API server (port 6565)
│   ├── index.html           # Web UI
│   ├── DermsGemms.csv       # Ground truth (99 images, ISIC dataset)
│   ├── test_phase1.py       # Phase 1 validation (specialist queries)
│   ├── test_phase2.py       # Phase 2 validation (comparison review)
│   └── test_phase3.py       # Phase 3 validation (end-to-end classification)
├── docker/
│   ├── Dockerfile           # NVIDIA CUDA 12.2, llama-cpp-python 0.3.16
│   └── docker-compose.yml   # GPU-enabled service
├── models/                  # GGUF model files (not committed)
└── data/
    └── images/              # ISIC dermoscopic images

Dataset

99 dermoscopic images from the ISIC (International Skin Imaging Collaboration) archive with ground-truth labels:

Diagnosis                   Count
NV (nevus)                     67
MEL (melanoma)                 12
BKL (benign keratosis)         11
BCC (basal cell carcinoma)      5
Other (akiec, df)               4

Each image includes a lesion attribute descriptor (e.g., "atypical pigment network", "homogenous", "gyri/ridges") used for ground-truth comparison.

Running

Docker Setup

cd docker
docker compose build
docker compose run --rm derm-mcp bash

Web Application

cd /home/project/code
python3 server.py
# β†’ http://localhost:6565

The API exposes:

  • GET /api/list-images β€” available ISIC images
  • POST /api/analyze β€” runs full 3-phase pipeline, returns AI classification + ground truth

Test Scripts

Each phase has a standalone test that validates output quality across all 99 images:

# Phase 1: specialist observation quality
python3 /home/project/code/test_phase1.py

# Phase 2: comparison review quality (97% clean rate target)
python3 /home/project/code/test_phase2.py

# Phase 3: end-to-end classification
python3 /home/project/code/test_phase3.py

The test scripts print per-image diagnostics, summary tables, flag counts, and clean rates, and save results to CSV in /home/project/data/.

Validation Flags

The test scripts check for common failure modes:

Flag                 Meaning
guardrail            Model refused to answer ("I cannot...")
gibberish_repeat     Repeated token pattern (degenerate output)
special_token_leak   Raw tokens like <eos> in output
thinking_leak        <think> tags leaked into the response
verbose_output       Classification output too long (>50 chars)
multiple_codes       Model output contained more than one code
cleaning_emptied     Sentence filter removed all content
invalid_code         Output wasn't MEL/NV/BCC/BKL
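
For illustration, the Phase 3 flags could be computed along these lines (the valid code set comes from this README; the length threshold matches the table above):

VALID_CODES = ("MEL", "NV", "BCC", "BKL")

def flag_classification(raw):
    flags = []
    codes = [c for c in VALID_CODES if c in raw.upper()]
    if len(raw) > 50:
        flags.append("verbose_output")   # classification output too long
    if len(codes) > 1:
        flags.append("multiple_codes")   # more than one code in output
    if not codes:
        flags.append("invalid_code")     # no MEL/NV/BCC/BKL found
    return flags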

Development History

Phase 1: Specialist Queries

Established the base visual model integration with MedGemma multimodal. Iterated on prompt design to get single-word structured responses. Added response cleaning (keyword extraction for structures/colors, markdown stripping). Validated across 99 images with stickiness checks (ensuring the model doesn't always answer the same thing).

Phase 2: Comparison Review

Added a second visual pass that cross-references Phase 1 observations against the image. The key challenges were guardrail refusals (~3% of images) and gibberish from token repetition, solved by simplifying the prompt framing (removing clinical jargon that triggered safety filters) and adding sentence-level output cleaning. Achieved a 97% clean rate.

Phase 3: Classification

Decoupled classification from the visual model. Phase 3 uses text-only MedGemma (no CLIP projector) to classify based on the accumulated clinical evidence. The two-pass test design (visual model for all images first, then text model for all classifications) avoids repeated model load/unload cycles during batch testing. Sentence-level cleaning sanitizes Phase 2 output before it reaches the classifier.
