Summary
Add support for local vision/multimodal models to enable image understanding capabilities. This is a foundational feature that enables OCR (#TBD) and direct image analysis.
Motivation
Journalists often work with:
- Scanned documents (PDFs that are images, not searchable text)
- Photographs of documents
- Screenshots
- Charts, graphs, and infographics
A local vision model allows extracting information from these sources without sending data to external services.
Proposed Approach
Model Architecture
Add a new `VisionModelInfo` struct alongside the existing `LanguageModelInfo` and `EmbeddingModelInfo`:

```rust
pub struct VisionModelInfo {
    pub id: String,
    pub name: String,
    pub description: String,
    pub size_gb: f32,
    pub hf_repo_id: String,
    // Vision-specific fields
    pub supports_ocr: bool,
    pub max_image_size: (u32, u32),
}
```
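For illustration, a catalog entry for one of the candidate models below might look like the following. The struct is repeated so the sketch compiles standalone; the HF repo id is the public `Qwen/Qwen2-VL-7B-Instruct` repository, while the size and maximum image dimensions are illustrative assumptions, not measured values:

```rust
// VisionModelInfo as proposed above, repeated so this sketch is self-contained.
pub struct VisionModelInfo {
    pub id: String,
    pub name: String,
    pub description: String,
    pub size_gb: f32,
    pub hf_repo_id: String,
    pub supports_ocr: bool,
    pub max_image_size: (u32, u32),
}

// Hypothetical catalog entry for Qwen2-VL 7B. size_gb and max_image_size
// are placeholder assumptions for the example.
fn qwen2_vl_7b() -> VisionModelInfo {
    VisionModelInfo {
        id: "qwen2-vl-7b".to_string(),
        name: "Qwen2-VL 7B".to_string(),
        description: "Good balance, strong multilingual".to_string(),
        size_gb: 8.0,
        hf_repo_id: "Qwen/Qwen2-VL-7B-Instruct".to_string(),
        supports_ocr: true,
        max_image_size: (3584, 3584),
    }
}
```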
Model Candidates
| Model | Size | Notes |
|---|---|---|
| Qwen2-VL 7B | ~8GB | Good balance, strong multilingual |
| Qwen2.5-VL 3B | ~3.5GB | Lighter option |
| LLaVA 7B | ~7GB | Well-tested, good quality |
Integration with mistralrs
mistralrs supports vision models through its `VisionLoaderType`. Key integration points:
- Model loading with image processor
- Handling image inputs alongside text
- Managing multimodal context windows
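To make the second and third points concrete, the integration layer needs a representation for mixed text/image inputs and a way to budget them against the context window. The sketch below is illustrative only: none of these type or method names come from mistralrs, and the per-image token budget is an assumed placeholder (vision models typically expand an image into a roughly fixed number of visual tokens):

```rust
// Illustrative only: a message type carrying image parts alongside text,
// plus rough context-window accounting. Names are assumptions, not
// mistralrs API.
enum ContentPart {
    Text(String),
    // Raw image bytes; the model's image processor resizes/encodes these.
    Image(Vec<u8>),
}

struct MultimodalMessage {
    parts: Vec<ContentPart>,
}

impl MultimodalMessage {
    // Estimate context usage: text at ~4 bytes per token, and a flat
    // assumed budget of 1024 visual tokens per image.
    fn estimated_tokens(&self) -> usize {
        self.parts
            .iter()
            .map(|p| match p {
                ContentPart::Text(t) => t.len().div_ceil(4),
                ContentPart::Image(_) => 1024,
            })
            .sum()
    }
}
```

An accounting like this is what would decide when an image request no longer fits the loaded model's context window.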
API Surface
```rust
// New provider capability
pub trait VisionProvider {
    async fn analyze_image(&self, image: &[u8], prompt: &str) -> Result<String>;
}

// Extend the existing provider enum or add a new vision provider
```
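A minimal sketch of how a provider might implement and consume this trait, assuming a `Result` alias with a `String` error (the real crate likely uses its own error type). The trait is repeated so the sketch is self-contained, and the stub provider stands in for the eventual mistralrs-backed one; the tiny no-op executor exists only so the example runs without pulling in an async runtime:

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Assumed error alias for the example; not the crate's real error type.
type Result<T> = std::result::Result<T, String>;

// As proposed above (async fn in traits is stable since Rust 1.75).
pub trait VisionProvider {
    async fn analyze_image(&self, image: &[u8], prompt: &str) -> Result<String>;
}

// Stub provider standing in for the eventual mistralrs-backed implementation.
struct StubVision;

impl VisionProvider for StubVision {
    async fn analyze_image(&self, image: &[u8], prompt: &str) -> Result<String> {
        if image.is_empty() {
            return Err("empty image".to_string());
        }
        // A real implementation would run the vision model here.
        Ok(format!("[{} bytes] {}", image.len(), prompt))
    }
}

// Minimal executor for the example: the stub future is immediately ready,
// so a no-op waker suffices. Real callers would use tokio or similar.
fn noop_raw_waker() -> RawWaker {
    fn clone(_: *const ()) -> RawWaker { noop_raw_waker() }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    RawWaker::new(std::ptr::null(), &VTABLE)
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = unsafe { Waker::from_raw(noop_raw_waker()) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}
```

Whether this lands as a new trait or as a variant on the existing provider enum, callers only see `analyze_image(image_bytes, prompt)`.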
Tasks
- [ ] Add `VisionModelInfo` struct to `models.rs`
- [ ] Add `analyze_image` capability
Open Questions
- Should vision and language models be loaded simultaneously, or swap as needed?
- Memory constraints: vision models are typically larger. Should we unload the language model while the vision model is in use?
- Which model should be the default? Qwen2-VL 7B offers good quality but is large.
Related
- Blocked by: mistralrs vision model support (verify current status)
- Enables: OCR pipeline, image evidence support