Add local vision/multimodal model support via mistralrs #11

@monneyboi

Description

Summary

Add support for local vision/multimodal models to enable image understanding capabilities. This is a foundational feature that enables OCR (#TBD) and direct image analysis.

Motivation

Journalists often work with:

  • Scanned documents (PDFs that are images, not searchable text)
  • Photographs of documents
  • Screenshots
  • Charts, graphs, and infographics

A local vision model allows extracting information from these sources without sending data to external services.

Proposed Approach

Model Architecture

Add a new VisionModelInfo struct alongside existing LanguageModelInfo and EmbeddingModelInfo:

pub struct VisionModelInfo {
    pub id: String,
    pub name: String,
    pub description: String,
    pub size_gb: f32,
    pub hf_repo_id: String,
    // Vision-specific fields
    pub supports_ocr: bool,
    pub max_image_size: (u32, u32),
}
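For illustration, here is how a registry entry for one of the candidate models could be populated. This is a sketch only: the `vision_models` function name, the description text, and the `max_image_size` value are invented for the example, not settled choices (the repo id shown is the public Qwen2-VL instruct repo).

```rust
// Hypothetical sketch: populating a vision model registry.
// `VisionModelInfo` mirrors the struct proposed above; the function name,
// description, and max_image_size are illustrative placeholders.
pub struct VisionModelInfo {
    pub id: String,
    pub name: String,
    pub description: String,
    pub size_gb: f32,
    pub hf_repo_id: String,
    pub supports_ocr: bool,
    pub max_image_size: (u32, u32),
}

pub fn vision_models() -> Vec<VisionModelInfo> {
    vec![VisionModelInfo {
        id: "qwen2-vl-7b".into(),
        name: "Qwen2-VL 7B".into(),
        description: "Good balance of quality and size; strong multilingual".into(),
        size_gb: 8.0,
        hf_repo_id: "Qwen/Qwen2-VL-7B-Instruct".into(),
        supports_ocr: true,
        max_image_size: (3584, 3584), // placeholder limit, not verified
    }]
}

fn main() {
    let models = vision_models();
    println!("{} model(s); first id: {}", models.len(), models[0].id);
}
```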

Model Candidates

Model           Size     Notes
Qwen2-VL 7B     ~8GB     Good balance, strong multilingual
Qwen2.5-VL 3B   ~3.5GB   Lighter option
LLaVA 7B        ~7GB     Well-tested, good quality

Integration with mistralrs

mistralrs supports vision models through its VisionLoaderType. Key integration points:

  1. Model loading with image processor
  2. Handling image inputs alongside text
  3. Managing multimodal context windows
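To make points 2 and 3 concrete, here is a stdlib-only sketch of what packing image and text inputs together and budgeting the context window might look like. Nothing here is mistralrs API: `MultimodalRequest`, the 4-chars-per-token heuristic, and the flat per-image token cost are all invented for illustration.

```rust
// Hypothetical sketch of combining image + text inputs and enforcing a
// multimodal context budget. All names and numbers are illustrative.
struct MultimodalRequest {
    prompt: String,
    images: Vec<Vec<u8>>, // raw encoded image bytes (JPEG/PNG)
}

/// Rough token estimate: ~4 chars per text token, plus a flat cost per
/// image, since vision models spend context on image patch tokens.
fn estimated_tokens(req: &MultimodalRequest, tokens_per_image: usize) -> usize {
    req.prompt.len() / 4 + req.images.len() * tokens_per_image
}

fn fits_context(
    req: &MultimodalRequest,
    context_window: usize,
    tokens_per_image: usize,
) -> bool {
    estimated_tokens(req, tokens_per_image) <= context_window
}

fn main() {
    let req = MultimodalRequest {
        prompt: "Transcribe the text in this scanned page.".into(),
        images: vec![vec![0u8; 1024]], // stand-in for real image bytes
    };
    // e.g. a 32k window and ~1500 tokens per image
    println!("fits: {}", fits_context(&req, 32_768, 1_500));
}
```

The real per-image cost depends on the model's image processor and resolution, which is why point 1 (loading the model with its image processor) has to come first.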

API Surface

// New provider capability
pub trait VisionProvider {
    async fn analyze_image(&self, image: &[u8], prompt: &str) -> Result<String>;
}

// Extend existing provider enum or add new vision provider
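On the "extend the provider enum" option, a minimal sketch of what a vision-aware variant could look like (the `LoadedModel` enum and its variants are invented names, not existing code):

```rust
// Hypothetical sketch of extending a provider/model enum with a vision
// variant. Variant and field names are illustrative placeholders.
enum LoadedModel {
    Language { id: String },
    Vision { id: String, supports_ocr: bool },
}

impl LoadedModel {
    /// Only the vision variant can service analyze_image requests.
    fn can_analyze_images(&self) -> bool {
        matches!(self, LoadedModel::Vision { .. })
    }
}

fn main() {
    let m = LoadedModel::Vision {
        id: "qwen2-vl-7b".into(),
        supports_ocr: true,
    };
    println!("vision capable: {}", m.can_analyze_images());
}
```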

Tasks

  • Add VisionModelInfo struct to models.rs
  • Define initial vision model registry (Qwen2-VL recommended)
  • Implement vision model loading in provider
  • Add analyze_image capability
  • Wire up to Tauri commands
  • Add UI for vision model selection in settings
  • Update model download flow to handle vision model files

Open Questions

  1. Should vision and language models be loaded simultaneously, or swap as needed?
  2. Memory constraints - vision models are typically larger. Should we unload language model when using vision?
  3. Which model should be the default? Qwen2-VL 7B offers good quality but is large.
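On question 2, the "swap as needed" option could look like the toy sketch below: at most one large model stays resident, and loading one kind evicts the other first. `ModelSlot` and the model ids are invented for illustration; this is a possible policy, not a commitment.

```rust
// Hypothetical sketch of a swap-on-demand policy: keep at most one large
// model resident, unloading the other kind before loading. Names are
// illustrative placeholders, not existing code.
struct ModelSlot {
    language_loaded: Option<String>,
    vision_loaded: Option<String>,
}

impl ModelSlot {
    fn new() -> Self {
        Self { language_loaded: None, vision_loaded: None }
    }

    /// Loading a vision model unloads the language model first, so only
    /// one large model holds memory at a time.
    fn load_vision(&mut self, id: &str) {
        self.language_loaded = None; // free memory before loading
        self.vision_loaded = Some(id.to_string());
    }

    fn load_language(&mut self, id: &str) {
        self.vision_loaded = None;
        self.language_loaded = Some(id.to_string());
    }
}

fn main() {
    let mut slot = ModelSlot::new();
    slot.load_language("some-llm");
    slot.load_vision("qwen2-vl-7b");
    println!(
        "language: {:?}, vision: {:?}",
        slot.language_loaded, slot.vision_loaded
    );
}
```

The trade-off is swap latency on every modality change; loading both simultaneously avoids that but roughly doubles peak memory.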

Related

  • Blocked by: mistralrs vision model support (verify current status)
  • Enables: OCR pipeline, image evidence support
