Support image/video content analysis (multimodal input) #3

@l2dnjsrud

Description

Currently PhantomCrowd analyzes only text content, but many marketing campaigns are image- or video-based. Adding multimodal input would let the analysis take a campaign's visual content into account, which should improve accuracy for those campaigns.

From the roadmap

This is listed in the README roadmap as a planned feature.

Proposed approach

  • Accept image uploads via the campaign creation form
  • Use a vision-capable LLM (e.g., gemma4, llava) to describe the image
  • Feed the description into the existing pipeline as additional context
  • Display uploaded images in the campaign detail view
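The describe-then-feed steps above could be sketched roughly as follows. This is a minimal illustration, not PhantomCrowd's actual code: it assumes a local Ollama server on its default port and uses Ollama's `/api/generate` endpoint with base64-encoded images. The function names (`build_vision_request`, `describe_image`, `add_image_context`), the prompt wording, and the `[Image description]` marker are all hypothetical.

```python
import base64
import json
import urllib.request

# Ollama's default local endpoint; adjust if the server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Hypothetical prompt; the wording would need tuning for real campaigns.
DESCRIBE_PROMPT = (
    "Describe this marketing image: main subject, visible text, "
    "mood, and likely target audience."
)


def build_vision_request(image_bytes: bytes, model: str = "llava") -> dict:
    """Build the JSON payload for a non-streaming Ollama generate call."""
    return {
        "model": model,
        "prompt": DESCRIBE_PROMPT,
        # Ollama expects images as a list of base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def describe_image(image_bytes: bytes, model: str = "llava") -> str:
    """Ask the vision model for a text description of the image."""
    payload = json.dumps(build_vision_request(image_bytes, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"].strip()


def add_image_context(campaign_text: str, description: str) -> str:
    """Append the image description to the text the existing pipeline receives."""
    return f"{campaign_text}\n\n[Image description]\n{description}"
```

With this shape, the existing text pipeline would stay unchanged: it would simply receive `add_image_context(text, describe_image(image_bytes))` instead of the raw campaign text.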

Technical notes

  • Backend: add image upload endpoint, store in data/uploads/
  • LLM: use Ollama vision model to generate description
  • Frontend: add image preview in campaign form and detail view

Difficulty

Intermediate. Requires backend + frontend + Ollama vision model integration.

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request), help wanted (Extra attention is needed)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
