[Feature Request] Integrate Google Magika for AI-powered file type detection & malware defense #890

@mrgoonie

Feature Request: Integrate Magika for AI-powered file type detection & malware defense

Problem

GoClaw agents frequently handle file uploads, read/write operations, and process user-provided files across skills (docx, pdf, xlsx, pptx, etc.). Currently, file type detection relies on file extensions or basic magic bytes, which is vulnerable to:

  • Extension spoofing: A .pdf file that's actually an executable
  • Polyglot files: Files valid in multiple formats, potentially hiding malicious payloads
  • MIME type confusion: Incorrect content-type leading to wrong processing pipeline
  • Malware injection: Malicious files disguised as benign documents passing through skill scripts

This is especially critical given GoClaw's multi-tenant architecture where agents process files from untrusted sources.

Proposed Solution

Integrate Google Magika — an AI-powered file content type detection tool — into GoClaw's file handling pipeline.

What is Magika?

  • AI-powered: Deep learning model trained on ~100M samples across 200+ content types
  • ~99% accuracy: Outperforms traditional file command and magic-byte detection, especially on textual content
  • Fast: ~5ms inference time per file (near-constant, independent of file size)
  • Lightweight: Model weighs only a few MBs
  • Production-proven: Used at scale by Google (Gmail, Drive, Safe Browsing), VirusTotal, and abuse.ch — processing hundreds of billions of samples weekly
  • Apache 2.0 license: Permissive, suitable for integration
  • Multiple interfaces: CLI (Rust), Python API, JS/TS, Go bindings (WIP)

Integration Points in GoClaw

1. File Upload / Ingestion Gate

User uploads file → Magika scan → Verify actual type matches expected type → Accept or reject
  • Validate files before they enter any skill processing pipeline
  • Block mismatches (e.g., PE binary uploaded as .docx)
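
A minimal sketch of this gate in Go, assuming the `magika` binary is on PATH and that `magika --json` prints an array whose elements carry `result.status`, `result.value.output` and `result.value.score`. That JSON layout is an assumption and should be checked against the output of the installed version. The names `Detection`, `parseMagikaJSON`, `detectFile` and `acceptUpload` are hypothetical, not existing GoClaw code.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os/exec"
)

// Detection holds the fields of a Magika result that the gate needs.
type Detection struct {
	Label    string  `json:"label"`
	Group    string  `json:"group"`
	MIMEType string  `json:"mime_type"`
	Score    float64 `json:"-"` // filled in from the enclosing record
}

// magikaRecord mirrors one element of the array printed by `magika --json`.
// ASSUMPTION: this field layout is not guaranteed across Magika versions.
type magikaRecord struct {
	Result struct {
		Status string `json:"status"`
		Value  struct {
			Output Detection `json:"output"`
			Score  float64   `json:"score"`
		} `json:"value"`
	} `json:"result"`
}

// parseMagikaJSON decodes the output for a single scanned file.
func parseMagikaJSON(raw []byte) (Detection, error) {
	var recs []magikaRecord
	if err := json.Unmarshal(raw, &recs); err != nil {
		return Detection{}, fmt.Errorf("decode magika output: %w", err)
	}
	if len(recs) != 1 {
		return Detection{}, fmt.Errorf("expected 1 result, got %d", len(recs))
	}
	if recs[0].Result.Status != "ok" {
		return Detection{}, errors.New("magika status: " + recs[0].Result.Status)
	}
	d := recs[0].Result.Value.Output
	d.Score = recs[0].Result.Value.Score
	return d, nil
}

// detectFile runs the magika CLI on one path and parses its JSON output.
func detectFile(path string) (Detection, error) {
	out, err := exec.Command("magika", "--json", path).Output()
	if err != nil {
		return Detection{}, fmt.Errorf("run magika: %w", err)
	}
	return parseMagikaJSON(out)
}

// acceptUpload reports whether the detected MIME type matches the expected one.
func acceptUpload(d Detection, expectedMIME string) bool {
	return d.MIMEType == expectedMIME
}
```

An upload handler would call `detectFile` on the stored temporary file and reject the upload when `acceptUpload` returns false. Keeping the parsing separate from the subprocess call lets the gate logic be tested without the binary installed.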

2. Skill Pre-flight Check

Before skill scripts execute, verify input files are the expected type:

{
  "path": "uploads/user_file.docx",
  "expected": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  "actual": "application/x-dosexec",
  "action": "block"
}
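
The report above could be produced by a small Go type such as the following sketch. `PreflightReport` and `newPreflightReport` are hypothetical names, and the `"allow"` action is an assumption, since only `"block"` appears in this proposal.

```go
package main

import "encoding/json"

// PreflightReport mirrors the JSON report shown above.
type PreflightReport struct {
	Path     string `json:"path"`
	Expected string `json:"expected"`
	Actual   string `json:"actual"`
	Action   string `json:"action"`
}

// newPreflightReport compares the expected and detected MIME types and
// records the resulting action.
func newPreflightReport(path, expected, actual string) PreflightReport {
	action := "allow" // ASSUMPTION: name of the non-blocking action
	if expected != actual {
		action = "block"
	}
	return PreflightReport{Path: path, Expected: expected, Actual: actual, Action: action}
}

// JSON renders the report in the indented form used in this issue.
func (r PreflightReport) JSON() (string, error) {
	b, err := json.MarshalIndent(r, "", "  ")
	return string(b), err
}
```

A skill runner would build the report before executing the skill script and skip execution whenever `Action` is `"block"`.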

3. Security Layer (5-Layer Security Model)

Add Magika as an additional layer in GoClaw's security architecture:

  • Layer 0: Input validation
  • Layer 1: File type verification (Magika) ← NEW
  • Layer 2: Content sanitization
  • Layer 3: Execution sandboxing
  • Layer 4: Output validation
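
One way to picture the layer ordering is a list of checks that runs in sequence and stops at the first failure, so a file that fails type verification never reaches sanitization or execution. This is an illustrative sketch only: `Layer`, `runLayers` and `errTypeMismatch` are hypothetical and do not describe GoClaw's actual security code.

```go
package main

import (
	"errors"
	"fmt"
)

// errTypeMismatch is what a Magika-backed layer might return.
var errTypeMismatch = errors.New("detected type does not match expected type")

// Layer is one named check in the security pipeline.
type Layer struct {
	Name  string
	Check func(path string) error
}

// runLayers runs the checks in order and returns the first failure,
// annotated with the index and name of the layer that rejected the file.
func runLayers(layers []Layer, path string) error {
	for i, l := range layers {
		if err := l.Check(path); err != nil {
			return fmt.Errorf("layer %d (%s): %w", i, l.Name, err)
		}
	}
	return nil
}
```

Because the error is wrapped with `%w`, callers can still test for `errTypeMismatch` with `errors.Is` when deciding how to log or report the rejection.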

4. magika Binary as System Dependency

Add magika to the package installer (dep_installer.go) as a recognized system binary, similar to ffmpeg, tesseract, pandoc:

apk add magika  # or install with pipx: pipx install magika
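
Whichever install route is used, the availability check could be as small as the sketch below. `requireBinary` is a hypothetical helper; how dep_installer.go actually registers binaries is not shown in this issue.

```go
package main

import (
	"fmt"
	"os/exec"
)

// requireBinary returns the resolved path of a binary on PATH, or an error
// that the installer can surface to the operator.
func requireBinary(name string) (string, error) {
	path, err := exec.LookPath(name)
	if err != nil {
		return "", fmt.Errorf("%s not found on PATH: %w", name, err)
	}
	return path, nil
}
```

Calling `requireBinary("magika")` at startup would let GoClaw fail fast, or disable the Magika layer with a warning, instead of discovering the missing binary on the first upload.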

Implementation Options

| Option | Pros | Cons |
| --- | --- | --- |
| CLI integration (call magika binary from Go) | Simple, no Go dependency, works immediately | Process spawn overhead |
| Go bindings (when available) | Native, fastest, no subprocess | Go bindings still WIP |
| Python API (via existing Python skill runtime) | Available now, well-documented | Requires Python runtime |
| HTTP microservice (sidecar) | Language-agnostic, scalable | Adds infrastructure complexity |

Recommended: Start with CLI integration (option 1) for immediate value, migrate to Go bindings when stable.
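
To keep the later migration cheap, callers could depend on a small interface rather than on the CLI directly. The sketch below is an assumption about how that seam might look; `Detector`, `DetectorFunc` and `verify` are hypothetical names.

```go
package main

import "fmt"

// Detector is a hypothetical seam between GoClaw and whichever Magika
// backend is in use (CLI today, Go bindings later).
type Detector interface {
	DetectMIME(path string) (string, error)
}

// DetectorFunc adapts a plain function to Detector, in the style of
// http.HandlerFunc.
type DetectorFunc func(path string) (string, error)

// DetectMIME calls the wrapped function.
func (f DetectorFunc) DetectMIME(path string) (string, error) {
	return f(path)
}

// verify returns an error when detection fails or when the detected MIME
// type differs from the expected one.
func verify(d Detector, path, expectedMIME string) error {
	actual, err := d.DetectMIME(path)
	if err != nil {
		return fmt.Errorf("detect %s: %w", path, err)
	}
	if actual != expectedMIME {
		return fmt.Errorf("%s: expected %s, detected %s", path, expectedMIME, actual)
	}
	return nil
}
```

The CLI backend would be a `DetectorFunc` that shells out to `magika`. Swapping in Go bindings later would mean providing another `Detector` without touching `verify` or its callers, and tests can pass a stub function instead of spawning a process.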

Configuration

{
  "security": {
    "magika": {
      "enabled": true,
      "mode": "high-confidence",
      "block_on_mismatch": true,
      "allowed_types": ["document", "code", "text", "image"],
      "blocked_types": ["executable", "archive", "inode"],
      "max_file_size_mb": 50
    }
  }
}
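
A sketch of how this configuration might be loaded and applied in Go. The evaluation order (size limit first, then `blocked_types`, then `allowed_types`) is an assumption, as is treating an empty `allowed_types` list as "allow anything not blocked". `MagikaConfig`, `loadMagikaConfig` and `CheckGroup` are hypothetical names; `mode` and `block_on_mismatch` are parsed but not used in this fragment.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MagikaConfig mirrors the "security.magika" object shown above.
type MagikaConfig struct {
	Enabled         bool     `json:"enabled"`
	Mode            string   `json:"mode"`
	BlockOnMismatch bool     `json:"block_on_mismatch"`
	AllowedTypes    []string `json:"allowed_types"`
	BlockedTypes    []string `json:"blocked_types"`
	MaxFileSizeMB   int64    `json:"max_file_size_mb"`
}

// rootConfig captures only the part of the config file this sketch needs.
type rootConfig struct {
	Security struct {
		Magika MagikaConfig `json:"magika"`
	} `json:"security"`
}

// loadMagikaConfig extracts the Magika section from a full config document.
func loadMagikaConfig(raw []byte) (MagikaConfig, error) {
	var root rootConfig
	if err := json.Unmarshal(raw, &root); err != nil {
		return MagikaConfig{}, fmt.Errorf("parse config: %w", err)
	}
	return root.Security.Magika, nil
}

// contains reports whether s is present in list.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// CheckGroup returns nil when a file of the given Magika group and size is
// permitted by the policy, and a descriptive error otherwise.
func (c MagikaConfig) CheckGroup(group string, sizeBytes int64) error {
	if !c.Enabled {
		return nil
	}
	if c.MaxFileSizeMB > 0 && sizeBytes > c.MaxFileSizeMB*1024*1024 {
		return fmt.Errorf("file is %d bytes, limit is %d MB", sizeBytes, c.MaxFileSizeMB)
	}
	if contains(c.BlockedTypes, group) {
		return fmt.Errorf("group %q is blocked", group)
	}
	if len(c.AllowedTypes) > 0 && !contains(c.AllowedTypes, group) {
		return fmt.Errorf("group %q is not in allowed_types", group)
	}
	return nil
}
```

With the example configuration, a `document` file under 50 MB passes, while an `executable` file, or a file in a group that is not listed such as `audio`, is rejected.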

Use Cases

  • Skill security: Ensure PDF/docx/xlsx skills only receive valid files of the expected type
  • Upload validation: Reject spoofed files at the gateway before they reach agents
  • Audit logging: Log file type detection results for compliance and forensics
  • Malware prevention: Catch disguised executables, scripts, or polyglot files
  • Multi-tenant isolation: Prevent cross-tenant file type attacks in shared environments

Labels: enhancement, security, malware-protection, file-handling
