A GPU-accelerated skin lesion classification system using MedGemma 4B, Google's medical vision-language model. The pipeline runs three sequential inference phases (specialist observation, clinical review, and classification) to produce a diagnostic code for dermoscopic images.
Built on quantized GGUF models served through llama-cpp-python with CUDA, designed to run on a single GPU inside Docker.

         ┌─────────────────┐
         │   Dermoscopic   │
         │ Image (224x224) │
         └────────┬────────┘
                  │
    ┌─────────────▼─────────────┐
    │    PHASE 1: Specialist    │
    │ (Visual MedGemma + CLIP)  │
    │                           │
    │  4 targeted queries:      │
    │   • Shape / Architecture  │
    │   • Pigment network       │
    │   • Structures            │
    │   • Colors                │
    └─────────────┬─────────────┘
                  │ raw observations
           ┌──────▼───────┐
           │ Model unload │
           │ + GC + sleep │
           └──────┬───────┘
     ┌────────────▼────────────┐
     │   PHASE 2: Comparison   │
     │ (Visual MedGemma + CLIP)│
     │                         │
     │ Image + Phase 1 context │
     │ → Verify observations   │
     │ → Find additional signs │
     └────────────┬────────────┘
                  │ clinical description
           ┌──────▼───────┐
           │ Model unload │
           │ + Sentence   │
           │   cleaning   │
           └──────┬───────┘
     ┌────────────▼────────────┐
     │ PHASE 3: Classification │
     │  (Text-only MedGemma)   │
     │                         │
     │ Context + cleaned desc  │
     │ → MEL / NV / BCC / BKL  │
     └────────────┬────────────┘
                  │
          ┌───────▼───────┐
          │  Diagnostic   │
          │    Report     │
          └───────────────┘

The visual model (MedGemma multimodal with CLIP projector) examines the dermoscopic image through four targeted queries, each designed to extract a specific clinical feature:
| Query | Expected Output |
|---|---|
| Architecture (shape) | round, oval, or irregular |
| Pigment network | yes or no |
| Structures | dots, globules, streaks, or none |
| Colors | comma-separated color names |
Each response is cleaned: markdown artifacts are stripped, known keywords extracted, and whitespace normalized. The queries use constrained prompts (one-word or brief answers) with `max_tokens=32` to keep outputs focused.
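The cleaning step can be sketched roughly as follows. This is a minimal illustration, not the code in `pipeline.py`: the function names and both keyword vocabularies are assumptions made for the example.

```python
import re

# Hypothetical vocabularies; the real keyword lists may differ.
STRUCTURE_KEYWORDS = ("dots", "globules", "streaks", "none")
COLOR_KEYWORDS = ("black", "brown", "blue", "gray", "red", "white", "tan")


def strip_markdown(text: str) -> str:
    """Remove common markdown artifacts and collapse runs of whitespace."""
    text = re.sub(r"[*_`#>]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def extract_keywords(text: str, vocabulary: tuple) -> list:
    """Return the vocabulary words found in the text, in vocabulary order."""
    lowered = strip_markdown(text).lower()
    return [
        word for word in vocabulary
        if re.search(rf"\b{re.escape(word)}\b", lowered)
    ]


print(strip_markdown("**Irregular**\n"))                        # Irregular
print(extract_keywords("- *Dots* and  streaks", STRUCTURE_KEYWORDS))  # ['dots', 'streaks']
```

Matching against a fixed vocabulary means anything the model adds around the keywords (hedging, punctuation, stray formatting) is discarded rather than passed downstream.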
Why separate queries instead of one prompt? Single-question prompts with tight token limits produce more reliable structured output from the 4B model than asking for everything at once.
A fresh instance of the visual model receives the dermoscopic image along with Phase 1's specialist observations and a comparison prompt. The model is asked to:
- Compare each Phase 1 observation against what it sees in the image (consistent / inconsistent)
- Identify additional dermoscopic features not mentioned by Phase 1 (asymmetry, border irregularity, regression structures, vascular patterns)
The output is a multi-paragraph clinical description (~200-500 chars). This phase catches errors from Phase 1 and adds clinical context that a single-pass system would miss.
Cleaning: Before being passed to Phase 3, the description goes through sentence-level filtering that removes:
- Refusal phrases ("I cannot", "I'm sorry", "as an AI")
- Special token leakage (`<start_of_turn>`, `<eos>`, etc.)
- Gibberish (repeated 3+ char patterns, excessive non-ASCII, low alphabetic ratio)
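A sentence-level filter along these lines might look like the sketch below. The three rules mirror the list above, but the regular expressions, the thresholds (20% non-ASCII, 50% alphabetic), and the function names are illustrative assumptions rather than the values used in the project.

```python
import re

REFUSAL_PHRASES = ("i cannot", "i'm sorry", "as an ai")
SPECIAL_TOKEN = re.compile(r"<[a-z_]+>")           # e.g. <start_of_turn>, <eos>
REPEATED_PATTERN = re.compile(r"(.{3,}?)\1{2,}")   # a 3+ char chunk repeated 3+ times


def is_gibberish(sentence: str) -> bool:
    """Heuristic check; the thresholds here are assumptions for illustration."""
    if not sentence:
        return True
    if REPEATED_PATTERN.search(sentence):
        return True
    non_ascii = sum(ord(ch) > 127 for ch in sentence) / len(sentence)
    alphabetic = sum(ch.isalpha() for ch in sentence) / len(sentence)
    return non_ascii > 0.2 or alphabetic < 0.5


def clean_description(text: str) -> str:
    """Keep only the sentences that pass all three filters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for sentence in sentences:
        lowered = sentence.lower()
        if any(phrase in lowered for phrase in REFUSAL_PHRASES):
            continue  # refusal phrase
        if SPECIAL_TOKEN.search(lowered):
            continue  # special token leakage
        if is_gibberish(sentence):
            continue  # degenerate output
        kept.append(sentence)
    return " ".join(kept)
```

Filtering per sentence rather than per response lets one bad sentence be dropped without discarding the usable clinical observations around it.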
A text-only MedGemma instance (no CLIP projector, so a lighter memory footprint) receives the Phase 1 specialist context and the cleaned Phase 2 description, then classifies the lesion into one of four codes:
| Code | Diagnosis |
|---|---|
| MEL | Melanoma |
| NV | Nevus (benign mole) |
| BCC | Basal Cell Carcinoma |
| BKL | Benign Keratosis |
Classification uses `max_tokens=16` and `temperature=0.3` (low creativity, since this is a categorical decision rather than open-ended generation). The pipeline falls back to NV if the model doesn't produce a valid code.
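The code extraction and NV fallback could be implemented along these lines. The function name and the returned `used_fallback` flag are assumptions added for illustration; the source only states that NV is the fallback.

```python
import re

VALID_CODES = ("MEL", "NV", "BCC", "BKL")
FALLBACK_CODE = "NV"  # fallback named in the text above
CODE_PATTERN = re.compile(r"\b(" + "|".join(VALID_CODES) + r")\b")


def parse_diagnosis(raw_output: str) -> tuple:
    """Return (code, used_fallback) for a raw Phase 3 response."""
    match = CODE_PATTERN.search(raw_output.upper())
    if match:
        return match.group(1), False
    return FALLBACK_CODE, True


print(parse_diagnosis("Answer: mel"))   # ('MEL', False)
print(parse_diagnosis("unsure"))        # ('NV', True)
```

Returning the flag alongside the code keeps a genuine NV prediction distinguishable from a fallback when results are reviewed later.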
Why text-only? Phase 3 doesn't need to see the image. The clinical evidence has already been extracted and verified in Phases 1-2. Using text-only inference avoids loading the 812 MB CLIP projector and eliminates visual noise from the classification decision.
The model is loaded once and reused across all phases and images, with no unload/reload overhead.

    Server Start
        ↓
    DermPipeline.__init__() → Load MedGemma once into GPU VRAM
        ↓
    _vlm instance created (singleton)
        ↓
    [Image 1] Phase 1 → Phase 2 → Phase 3 (reuse _vlm, reset KV cache between phases)
    [Image 2] Phase 1 → Phase 2 → Phase 3 (reuse _vlm, reset KV cache between phases)
    [Image N] Phase 1 → Phase 2 → Phase 3 (reuse _vlm, reset KV cache between phases)
        ↓
    Server Shutdown
        ↓
    Model unloaded

- Model Loading Cost: MedGemma 4B + CLIP projector = ~4.7 GB GPU memory. Each load takes 8-15 seconds, a cost the traditional approach incurs for every image.
- Current Optimization: Model loaded once in `DermPipeline.__init__()`, reused for all subsequent images.
- Server Integration: `server.py` maintains a global `_pipeline` singleton with a thread-safe `_pipeline_lock` ensuring only one GPU analysis runs at a time.
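The singleton-plus-lock arrangement can be sketched as below. The names `_pipeline` and `_pipeline_lock` come from the description above; the stand-in `DermPipeline` class, the lazy initialization, and the `analyze_image` helper are assumptions for the example and may not match how `server.py` is actually organized.

```python
import threading


class DermPipeline:
    """Stand-in for the real pipeline; model loading is simulated with a counter."""

    load_count = 0

    def __init__(self) -> None:
        # The real class would load the GGUF model into GPU memory here.
        DermPipeline.load_count += 1

    def analyze(self, image_path: str) -> dict:
        return {"image": image_path, "code": "NV"}  # placeholder result


_pipeline = None
_pipeline_lock = threading.Lock()


def analyze_image(image_path: str) -> dict:
    """Create the pipeline on first use and serialize all GPU work."""
    global _pipeline
    with _pipeline_lock:
        if _pipeline is None:
            _pipeline = DermPipeline()
        return _pipeline.analyze(image_path)
```

Holding the lock for the whole analysis, not just for construction, is what guarantees that concurrent requests never share the GPU at the same time.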
Between phases, the pipeline calls `vlm.reset()` (3 times per image) to flush the KV cache:

    # Phase 1 → Phase 2 transition
    vlm.reset()  # Clear stale context from Phase 1 queries
    # Phase 2 → Phase 3 transition
    vlm.reset()  # Clear stale context from Phase 2 description

Why? llama-cpp-python's KV cache can retain previous context tokens, causing hallucination bleed (e.g., Phase 1 answers contaminating Phase 2 reasoning). A reset clears this without reloading the model (~50ms, no sleep overhead needed).
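As a rough illustration of the reset pattern, the sketch below uses a stand-in model object that only counts `reset()` calls. It assumes one reset before each phase, which is one way to arrive at three resets per image; `FakeVLM`, `ask`, and `run_phases` are hypothetical names, not part of the project or of llama-cpp-python.

```python
class FakeVLM:
    """Stand-in for the loaded model; it only counts reset() calls."""

    def __init__(self) -> None:
        self.reset_calls = 0

    def reset(self) -> None:
        self.reset_calls += 1

    def ask(self, prompt: str) -> str:
        return f"response to: {prompt}"  # placeholder for real inference


def run_phases(vlm: FakeVLM, prompts: list) -> list:
    """Run each phase prompt on the shared model, resetting before each one."""
    outputs = []
    for prompt in prompts:
        vlm.reset()  # start every phase from an empty KV cache
        outputs.append(vlm.ask(prompt))
    return outputs


vlm = FakeVLM()
run_phases(vlm, ["phase 1", "phase 2", "phase 3"])
print(vlm.reset_calls)  # 3
```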
| Approach | Model Load | Per-Image Overhead | Total (100 images) |
|---|---|---|---|
| Reload per phase | 3 × 8s per image | 24s per image | 40 minutes |
| Singleton + Reset (Current) | 1 × 8s total | ~0.1s per image | ~10 seconds |
Savings: 99.6% reduction in model management overhead.
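The figures in the table follow from simple arithmetic, reproduced here as a check. The 8-second load time is the lower bound quoted earlier, and the singleton total counts only the per-image reset overhead, not the one-time load.

```python
IMAGES = 100
LOAD_SECONDS = 8              # lower bound of the 8-15 s load time
RELOADS_PER_IMAGE = 3         # one load per phase
RESET_OVERHEAD_SECONDS = 0.1  # approximate per-image cost of the resets

reload_total = IMAGES * RELOADS_PER_IMAGE * LOAD_SECONDS  # seconds
singleton_total = IMAGES * RESET_OVERHEAD_SECONDS         # seconds, excluding the one-time load
savings = 1 - singleton_total / reload_total

print(f"reload per phase: {reload_total / 60:.0f} min")   # 40 min
print(f"singleton + reset: {singleton_total:.0f} s")      # 10 s
print(f"savings: {savings:.1%}")                          # 99.6%
```

Adding the single 8-second load to the singleton total brings it to about 18 seconds, which still leaves the reduction above 99%.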
MedGemma 1.5 4B IT (google/medgemma-1.5-4b-it), quantized to GGUF format:
- `medgemma.gguf`: 3.9 GB text model (Phases 1, 2, and 3)
- `medgemma-mmproj.gguf`: 812 MB CLIP vision projector (Phases 1 and 2 only)

Served via llama-cpp-python 0.3.16 with CUDA acceleration (`n_gpu_layers=-1`, which places all layers on the GPU).
MedGemma defaults to a "thinking mode" that consumes tokens on internal reasoning before answering. The `medgemma-direct` chat format bypasses this by forcing the model to start its response with `Answer:`:

    <start_of_turn>model
    Answer:

This is registered as a custom chat format handler in pipeline.py and used for all three phases.
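The prefill idea can be shown with a plain string builder. This is a sketch under assumptions: the user-turn layout follows Gemma's usual turn syntax, `build_direct_prompt` is a hypothetical name, and the actual handler in `pipeline.py` (including how it treats images and the BOS token) is not shown in this document.

```python
def build_direct_prompt(user_text: str) -> str:
    """Format one user turn and pre-fill the start of the model turn."""
    return (
        "<start_of_turn>user\n"
        f"{user_text.strip()}<end_of_turn>\n"
        "<start_of_turn>model\n"
        "Answer:"  # the model continues from here instead of opening with reasoning
    )


print(build_direct_prompt("Is a pigment network present? Answer yes or no."))
```

Because generation resumes after `Answer:`, the small token budgets used in each phase are spent on the answer itself.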

    ├── code/
    │   ├── pipeline.py          # Core three-phase pipeline
    │   ├── server.py            # Flask API server (port 6565)
    │   ├── index.html           # Web UI
    │   ├── DermsGemms.csv       # Ground truth (99 images, ISIC dataset)
    │   ├── test_phase1.py       # Phase 1 validation (specialist queries)
    │   ├── test_phase2.py       # Phase 2 validation (comparison review)
    │   └── test_phase3.py       # Phase 3 validation (end-to-end classification)
    ├── docker/
    │   ├── Dockerfile           # NVIDIA CUDA 12.2, llama-cpp-python 0.3.16
    │   └── docker-compose.yml   # GPU-enabled service
    ├── models/                  # GGUF model files (not committed)
    └── data/
        └── images/              # ISIC dermoscopic images

99 dermoscopic images from the ISIC (International Skin Imaging Collaboration) archive with ground-truth labels:
| Diagnosis | Count |
|---|---|
| NV (nevus) | 67 |
| MEL (melanoma) | 12 |
| BKL (benign keratosis) | 11 |
| BCC (basal cell carcinoma) | 5 |
| Other (akiec, df) | 4 |
Each image includes a lesion attribute descriptor (e.g., "atypical pigment network", "homogenous", "gyri/ridges") used for ground-truth comparison.
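Because the label distribution is heavily skewed toward NV, two reference numbers are useful when reading accuracy results. The calculation below uses only the counts in the table; it assumes the four "Other" images count as misclassifications, since Phase 3 cannot emit their labels. How the test scripts actually score those images is not stated here.

```python
counts = {"NV": 67, "MEL": 12, "BKL": 11, "BCC": 5, "Other": 4}
total = sum(counts.values())

majority_baseline = counts["NV"] / total          # accuracy of always answering NV
classifiable = (total - counts["Other"]) / total  # share with a label Phase 3 can emit

print(f"total images: {total}")                               # 99
print(f"always-NV baseline: {majority_baseline:.1%}")         # 67.7%
print(f"upper bound with four codes: {classifiable:.1%}")     # 96.0%
```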

    cd docker
    docker compose build
    docker compose run --rm derm-mcp bash

    # Inside the container shell:
    cd /home/project/code
    python3 server.py
    # → http://localhost:6565

The API exposes:
- `GET /api/list-images`: available ISIC images
- `POST /api/analyze`: runs the full 3-phase pipeline, returns AI classification + ground truth

Each phase has a standalone test that validates output quality across all 99 images:

    # Phase 1: specialist observation quality
    python3 /home/project/code/test_phase1.py
    # Phase 2: comparison review quality (97% clean rate target)
    python3 /home/project/code/test_phase2.py
    # Phase 3: end-to-end classification
    python3 /home/project/code/test_phase3.py

Test scripts output per-image diagnostics, summary tables, flag counts, and clean rates, and save results to CSV in `/home/project/data/`.
The test scripts check for common failure modes:
| Flag | Meaning |
|---|---|
| `guardrail` | Model refused to answer ("I cannot...") |
| `gibberish_repeat` | Repeated token pattern (degenerate output) |
| `special_token_leak` | Raw tokens like `<eos>` in output |
| `thinking_leak` | `<think>` tags leaked into response |
| `verbose_output` | Classification output too long (>50 chars) |
| `multiple_codes` | Model output contained more than one code |
| `cleaning_emptied` | Sentence filter removed all content |
| `invalid_code` | Output wasn't MEL/NV/BCC/BKL |
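A flagging routine for Phase 3 outputs might look like the sketch below. The flag names and the 50-character limit come from the table; the detection rules themselves and the function name are assumptions, and the two flags that depend on Phase 2 output (`gibberish_repeat`, `cleaning_emptied`) are left out for brevity.

```python
import re

VALID_CODES = ("MEL", "NV", "BCC", "BKL")
CODE_PATTERN = re.compile(r"\b(" + "|".join(VALID_CODES) + r")\b")
SPECIAL_TOKENS = re.compile(r"<(eos|start_of_turn|end_of_turn)>")


def flag_classification(raw_output: str) -> list:
    """Return the failure flags that apply to one raw Phase 3 response."""
    flags = []
    lowered = raw_output.lower()
    if "i cannot" in lowered or "i'm sorry" in lowered:
        flags.append("guardrail")
    if SPECIAL_TOKENS.search(lowered):
        flags.append("special_token_leak")
    if "<think>" in lowered:
        flags.append("thinking_leak")
    if len(raw_output) > 50:
        flags.append("verbose_output")
    codes = set(CODE_PATTERN.findall(raw_output.upper()))
    if len(codes) > 1:
        flags.append("multiple_codes")
    elif not codes:
        flags.append("invalid_code")
    return flags


print(flag_classification("MEL"))         # []
print(flag_classification("MEL or NV"))   # ['multiple_codes']
```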
Established the base visual model integration with MedGemma multimodal. Iterated on prompt design to get single-word structured responses. Added response cleaning (keyword extraction for structures/colors, markdown stripping). Validated across 99 images with stickiness checks (ensuring the model doesn't always answer the same thing).
Added a second visual pass that cross-references Phase 1 observations against the image. The key challenges were guardrail refusals (~3% of images) and gibberish from token repetition. Both were addressed by simplifying the prompt framing (removing clinical jargon that triggered safety filters) and adding sentence-level output cleaning. Achieved a 97% clean rate.
Decoupled classification from the visual model. Phase 3 uses text-only MedGemma (no CLIP projector) to classify based on the accumulated clinical evidence. The two-pass test design (visual model for all images first, then text model for all classifications) avoids repeated model load/unload cycles during batch testing. Sentence-level cleaning sanitizes Phase 2 output before it reaches the classifier.