Add local vision/multimodal model support via mistralrs #11

@monneyboi

Description

Summary

Add support for local vision/multimodal models to enable image understanding capabilities. This is a foundational feature that enables OCR (#TBD) and direct image analysis.

Motivation

Journalists often work with:

  • Scanned documents (PDFs that are images, not searchable text)
  • Photographs of documents
  • Screenshots
  • Charts, graphs, and infographics

A local vision model allows extracting information from these sources without sending data to external services.

Proposed Approach

Model Architecture

Add a new VisionModelInfo struct alongside existing LanguageModelInfo and EmbeddingModelInfo:

pub struct VisionModelInfo {
    pub id: String,
    pub name: String,
    pub description: String,
    pub size_gb: f32,
    pub hf_repo_id: String,
    // Vision-specific fields
    pub supports_ocr: bool,
    pub max_image_size: (u32, u32),
}
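For illustration, here is how a registry entry for one of the candidate models could be populated. This is a sketch only: the `vision_models` function name, the description text, and the `max_image_size` value are invented for the example, not settled choices (the repo id shown is the public Qwen2-VL instruct repo).

```rust
// Hypothetical sketch: populating a vision model registry.
// `VisionModelInfo` mirrors the struct proposed above; the function name,
// description, and max_image_size are illustrative placeholders.
pub struct VisionModelInfo {
    pub id: String,
    pub name: String,
    pub description: String,
    pub size_gb: f32,
    pub hf_repo_id: String,
    pub supports_ocr: bool,
    pub max_image_size: (u32, u32),
}

pub fn vision_models() -> Vec<VisionModelInfo> {
    vec![VisionModelInfo {
        id: "qwen2-vl-7b".into(),
        name: "Qwen2-VL 7B".into(),
        description: "Good balance of quality and size; strong multilingual".into(),
        size_gb: 8.0,
        hf_repo_id: "Qwen/Qwen2-VL-7B-Instruct".into(),
        supports_ocr: true,
        max_image_size: (3584, 3584), // placeholder limit, not verified
    }]
}

fn main() {
    let models = vision_models();
    println!("{} model(s); first id: {}", models.len(), models[0].id);
}
```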

Model Candidates

Model           Size     Notes
Qwen2-VL 7B     ~8GB     Good balance, strong multilingual
Qwen2.5-VL 3B   ~3.5GB   Lighter option
LLaVA 7B        ~7GB     Well-tested, good quality

Integration with mistralrs

mistralrs supports vision models through its VisionLoaderType. Key integration points:

  1. Model loading with image processor
  2. Handling image inputs alongside text
  3. Managing multimodal context windows
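To make points 2 and 3 concrete, here is a stdlib-only sketch of what packing image and text inputs together and budgeting the context window might look like. Nothing here is mistralrs API: `MultimodalRequest`, the 4-chars-per-token heuristic, and the flat per-image token cost are all invented for illustration.

```rust
// Hypothetical sketch of combining image + text inputs and enforcing a
// multimodal context budget. All names and numbers are illustrative.
struct MultimodalRequest {
    prompt: String,
    images: Vec<Vec<u8>>, // raw encoded image bytes (JPEG/PNG)
}

/// Rough token estimate: ~4 chars per text token, plus a flat cost per
/// image, since vision models spend context on image patch tokens.
fn estimated_tokens(req: &MultimodalRequest, tokens_per_image: usize) -> usize {
    req.prompt.len() / 4 + req.images.len() * tokens_per_image
}

fn fits_context(
    req: &MultimodalRequest,
    context_window: usize,
    tokens_per_image: usize,
) -> bool {
    estimated_tokens(req, tokens_per_image) <= context_window
}

fn main() {
    let req = MultimodalRequest {
        prompt: "Transcribe the text in this scanned page.".into(),
        images: vec![vec![0u8; 1024]], // stand-in for real image bytes
    };
    // e.g. a 32k window and ~1500 tokens per image
    println!("fits: {}", fits_context(&req, 32_768, 1_500));
}
```

The real per-image cost depends on the model's image processor and resolution, which is why point 1 (loading the model with its image processor) has to come first.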

API Surface

// New provider capability
pub trait VisionProvider {
    async fn analyze_image(&self, image: &[u8], prompt: &str) -> Result<String>;
}

// Extend existing provider enum or add new vision provider
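On the "extend the provider enum" option, a minimal sketch of what a vision-aware variant could look like (the `LoadedModel` enum and its variants are invented names, not existing code):

```rust
// Hypothetical sketch of extending a provider/model enum with a vision
// variant. Variant and field names are illustrative placeholders.
enum LoadedModel {
    Language { id: String },
    Vision { id: String, supports_ocr: bool },
}

impl LoadedModel {
    /// Only the vision variant can service analyze_image requests.
    fn can_analyze_images(&self) -> bool {
        matches!(self, LoadedModel::Vision { .. })
    }
}

fn main() {
    let m = LoadedModel::Vision {
        id: "qwen2-vl-7b".into(),
        supports_ocr: true,
    };
    println!("vision capable: {}", m.can_analyze_images());
}
```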

Tasks

  • Add VisionModelInfo struct to models.rs
  • Define initial vision model registry (Qwen2-VL recommended)
  • Implement vision model loading in provider
  • Add analyze_image capability
  • Wire up to Tauri commands
  • Add UI for vision model selection in settings
  • Update model download flow to handle vision model files

Open Questions

  1. Should vision and language models be loaded simultaneously, or swap as needed?
  2. Memory constraints - vision models are typically larger. Should we unload language model when using vision?
  3. Which model should be the default? Qwen2-VL 7B offers good quality but is large.
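On question 2, the "swap as needed" option could look like the toy sketch below: at most one large model stays resident, and loading one kind evicts the other first. `ModelSlot` and the model ids are invented for illustration; this is a possible policy, not a commitment.

```rust
// Hypothetical sketch of a swap-on-demand policy: keep at most one large
// model resident, unloading the other kind before loading. Names are
// illustrative placeholders, not existing code.
struct ModelSlot {
    language_loaded: Option<String>,
    vision_loaded: Option<String>,
}

impl ModelSlot {
    fn new() -> Self {
        Self { language_loaded: None, vision_loaded: None }
    }

    /// Loading a vision model unloads the language model first, so only
    /// one large model holds memory at a time.
    fn load_vision(&mut self, id: &str) {
        self.language_loaded = None; // free memory before loading
        self.vision_loaded = Some(id.to_string());
    }

    fn load_language(&mut self, id: &str) {
        self.vision_loaded = None;
        self.language_loaded = Some(id.to_string());
    }
}

fn main() {
    let mut slot = ModelSlot::new();
    slot.load_language("some-llm");
    slot.load_vision("qwen2-vl-7b");
    println!(
        "language: {:?}, vision: {:?}",
        slot.language_loaded, slot.vision_loaded
    );
}
```

The trade-off is swap latency on every modality change; loading both simultaneously avoids that but roughly doubles peak memory.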

Related

  • Blocked by: mistralrs vision model support (verify current status)
  • Enables: OCR pipeline, image evidence support
