Security Policy

Supported Versions

Version	Supported
3.x	Yes
< 3.0	No

Reporting a Vulnerability

If you discover a security vulnerability, please report it responsibly:

Do not open a public GitHub issue for security vulnerabilities.
Email the maintainers or use GitHub Security Advisories to report privately.
Include steps to reproduce, impact assessment, and any suggested fixes.
You will receive an acknowledgment within 48 hours and a detailed response within 7 days.

Threat Model

Trust Boundaries

Boundary	Examples
Attacker-controlled inputs	Document files in `input/` or passed programmatically; document URLs for QnA; content returned by Mistral OCR/QnA; streams passed to `convert_stream_with_markitdown`.
Operator-controlled inputs	`.env` configuration, API keys, feature flags (plugins, structured output, image extraction), external binary paths (Poppler/Ghostscript), batch/job IDs, and file selection.
Developer-controlled inputs	Source code, tests, build scripts, dependency versions.

Assumptions

Typical deployment is a single-user CLI on a workstation, not a multi-tenant service.
OS file permissions protect input/, output_*/, cache/, and .env.
Mistral APIs are trusted infrastructure, but data returned by OCR/QnA must be treated as untrusted.
The tool does not sandbox untrusted file parsing. If used as a backend service, it should run in a container or restricted-user environment.

Untrusted documents and parsers

Document conversion depends on MarkItDown, pdfplumber, Poppler (pdf2image), and optionally third-party MarkItDown plugins. Hostile files are untrusted inputs to that entire parsing stack.

Prefer running conversions in a sandbox or dedicated OS user when the input is not fully trusted.
Keep MARKITDOWN_ENABLE_PLUGINS=false unless you trust every installed plugin (see config.validate_configuration() and the SECURITY: warnings it emits).
Keep MARKITDOWN_KEEP_DATA_URIS=false when generated Markdown may be rendered in a browser without sanitization; embedded data URIs widen the XSS review surface.
Review validate_configuration() output in CI or at startup for plugin, data-URI, and signed-URL expiry guidance.

Security Controls

API Key Handling

Never commit API keys to version control.
Store your MISTRAL_API_KEY in a .env file (already in .gitignore).
Use environment variables or a secrets manager in production/CI environments.
The application loads keys via python-dotenv and never logs or prints them.

File Input Validation

File paths are validated before processing (utils.validate_file): must exist, be a file, be non-empty, have an extension in the relevant allowlist, and meet a mode-specific size cap (e.g. MarkItDown vs Mistral OCR vs QnA vs smart-mode union cap).
MarkItDown path: Files exceeding MARKITDOWN_MAX_FILE_SIZE_MB (default: 100 MB) are rejected.
Mistral OCR path: Files exceeding MISTRAL_OCR_MAX_FILE_SIZE_MB (default: 200 MB) are rejected before upload.
Document QnA: Files exceeding MISTRAL_QNA_MAX_FILE_SIZE_MB (default: 50 MB) are rejected.
PDF table extraction and PDF-to-images: Skipped when the PDF exceeds max(MARKITDOWN_MAX_FILE_SIZE_MB, MISTRAL_OCR_MAX_FILE_SIZE_MB) (see config.pdf_heavy_work_max_file_size_mb() / utils.pdf_exceeds_heavy_work_limit) to avoid expensive local work on files that would fail size checks on the conversion path.

File Upload Security

Documents uploaded for OCR are sent to Mistral's servers via their Files API.
Uploaded files are subject to Mistral's data retention policy.
The application auto-deletes uploaded files after a configurable retention period (UPLOAD_RETENTION_DAYS, default: 7 days).
Review Mistral's terms of service before processing sensitive or regulated documents.

URL Validation (SSRF Protection)

All document URLs (for QnA and streaming) are validated before use:

Only HTTPS URLs are accepted.
URLs with embedded credentials are rejected.
Private/internal network addresses are blocked (RFC 1918, link-local, loopback, cloud metadata endpoints including 169.254.169.254).
IPv6-mapped IPv4 addresses are checked for private ranges.
DNS resolution is verified with a 5-second timeout to prevent DNS rebinding stalling.
MISTRAL_DOCUMENT_URL_STRICT_DNS: When set to true, URLs that fail local DNS resolution or time out are rejected (fail closed). Default is false, which allows URLs through if resolution fails locally (same tradeoff as documented below).

Known limitation (TOCTOU): The local DNS resolution check cannot fully prevent DNS rebinding attacks because Mistral's servers independently resolve the hostname when fetching the document. The local check remains valuable as a first-pass filter against obvious internal targets. For high-security deployments, restrict QnA to pre-uploaded files (via query_document_file) rather than arbitrary URLs.

Batch Job ID Validation

Batch job IDs entered interactively are validated against ^[a-zA-Z0-9_\-]{1,128}$ to prevent injection.

Output Filename Safety

utils.safe_output_stem derives output filenames to prevent path traversal and collisions. Files from outside the standard input directory receive a SHA-256-based hash suffix.

Repository / IDE configuration

If .cursor/ (or similar IDE automation) is committed for team sharing, review changes like any other config: avoid machine-specific paths, and keep secrets in .env or other ignored files (for example .env.mcp.local).

Resource Limits and Cost Guardrails

The following limits prevent runaway API spend and resource exhaustion:

Setting	Default	Enforcement
`MARKITDOWN_MAX_FILE_SIZE_MB`	100	Hard reject before local conversion
`MISTRAL_OCR_MAX_FILE_SIZE_MB`	200	Hard reject before Mistral upload
`MISTRAL_QNA_MAX_FILE_SIZE_MB`	50	Hard reject before Document QnA upload
`MAX_BATCH_FILES`	100	Hard reject in smart, OCR-only, and batch modes
`MAX_PAGES_PER_SESSION`	1000	Hard reject (refuses further OCR after limit)
`MAX_CONCURRENT_FILES`	5	Thread pool cap for parallel processing
`MISTRAL_BATCH_TIMEOUT_HOURS`	24	Batch job auto-cancellation
`UPLOAD_RETENTION_DAYS`	7	Auto-cleanup of uploaded files on Mistral

Output Safety

Generated Markdown May Contain Untrusted Content

OCR output and QnA answers are derived from document content that may include:

Arbitrary HTML tags or fragments
Data URIs (data:image/...) when MARKITDOWN_KEEP_DATA_URIS=true
JavaScript or event handlers embedded in HTML-like content

If you render output Markdown in a web browser, you must sanitize it first (e.g., using a library like DOMPurify or bleach). Failing to do so may result in XSS vulnerabilities.

YAML Frontmatter

Metadata strings in YAML frontmatter are escaped via json.dumps to prevent injection of arbitrary YAML.

Terminal Output

QnA answers and other untrusted text are sanitized to strip ANSI escape sequences and non-printable control characters before display in the terminal (utils.sanitize_for_terminal).

Cache Security

Cache entries are keyed by SHA-256 hash of file contents, making collisions impractical.
Cache writes are atomic (write to temp file, then os.replace) to prevent partial/corrupt entries under concurrency.
Cache reads validate the JSON schema: required keys (timestamp, type, data) must be present and the type must match. Corrupt or tampered entries are automatically removed.
The in-memory hash memo is bounded (1000 entries) to prevent memory exhaustion in long-running processes.

Recommendation: Protect the cache/ directory with filesystem permissions (mode 0o700 on POSIX, which is set automatically). A local attacker with write access to cache files could inject tampered OCR results, causing integrity failures in downstream processing.

Signed URL Security

Signed URLs are generated with a configurable TTL (MISTRAL_SIGNED_URL_EXPIRY, default: 1 hour).
Anyone with a signed URL can access the corresponding document until the URL expires.
Batch JSONL files contain signed URLs and are written with restrictive permissions (0o600 on POSIX).
Do not share output directories or batch JSONL files with untrusted parties.
For batch jobs, the signed URL expiry is automatically extended to exceed the batch timeout to prevent mid-job expiration.

LLM and AI-Specific Risks

Prompt Injection

Documents processed via QnA may contain text that attempts to manipulate the LLM's behavior (prompt injection). Mitigations:

A default system prompt instructs the model to answer only from document content and to ignore embedded instructions.
Operators can override this via MISTRAL_QNA_SYSTEM_PROMPT, but should preserve the anti-injection guidance.
Structured outputs use strict Pydantic/JSON schemas (schemas.py) to constrain the response format.

QnA answers should not be trusted as authoritative. Always verify critical information from LLM responses.

Cost Abuse

Crafted inputs could trigger excessive API calls (e.g., documents with many weak pages triggering re-OCR). The session page limit (MAX_PAGES_PER_SESSION) and batch file limit (MAX_BATCH_FILES) mitigate this.

Deployment Guidance

Single-User CLI (Default)

No additional hardening is required beyond standard workstation hygiene:

Keep .env readable only by your user.
Do not place untrusted files in input/ without review.
Run pip-audit periodically to check for vulnerable dependencies.

Backend Service / Multi-Tenant Use

If you wrap this tool in a web service or process untrusted uploads:

Run in a container (Docker, Podman) or as a restricted OS user with no network access beyond Mistral's API.
Apply resource limits (CPU time, memory, disk quota) at the OS or container level to guard against decompression bombs and parser exploits.
Add authentication and authorization -- the tool itself has none.
Sanitize all output before serving to browsers.
Disable plugins (MARKITDOWN_ENABLE_PLUGINS=false) and data URI preservation (MARKITDOWN_KEEP_DATA_URIS=false) to minimize attack surface.
Set tight signed URL expiry (1 hour or less) and enable upload cleanup.
Restrict QnA to file uploads (query_document_file) rather than arbitrary URLs to eliminate SSRF risk entirely.

Filesystem Permissions

On POSIX systems, the application creates directories with mode 0o700 (owner-only). On Windows, administrators should configure NTFS ACLs to restrict access to:

.env -- API keys
cache/ -- OCR results
output_md/, output_txt/, output_images/ -- converted documents
logs/ -- operational metadata and file names

Dependency Security

The CI pipeline runs pip-audit on every push and weekly. Bandit (static analysis for Python security issues) runs as a blocking CI check and as a pre-commit hook.

Run periodic dependency audits locally:

# Using pip-audit
pip install pip-audit
pip-audit

# Using safety
pip install safety
safety check

Parser Libraries

The following libraries process untrusted file content and are potential vectors for exploitation:

MarkItDown (Office, PDF, HTML, archives, audio)
pdfplumber (PDF table extraction)
pdf2image + Poppler (PDF rendering)
Pillow (image processing)

Keep these libraries up to date. In containerized deployments, use read-only filesystem mounts for binaries and restrict system calls with seccomp or AppArmor.

Best Practices

Keep dependencies up to date (pip install --upgrade -r requirements.txt).
Run make check before deploying changes (includes linting and tests).
Use the principle of least privilege for API keys -- only grant OCR access if that is all you need.
Monitor the GitHub Security Advisories page for updates.
Review the validate_configuration() output at startup for security warnings (plugins enabled, data URIs preserved, long signed URL expiry).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Security

SECURITY.md

Security Policy

Supported Versions

Reporting a Vulnerability

Threat Model

Trust Boundaries

Assumptions

Untrusted documents and parsers

Security Controls

API Key Handling

File Input Validation

File Upload Security

URL Validation (SSRF Protection)

Batch Job ID Validation

Output Filename Safety

Repository / IDE configuration

Resource Limits and Cost Guardrails

Output Safety

Generated Markdown May Contain Untrusted Content

YAML Frontmatter

Terminal Output

Cache Security

Signed URL Security

LLM and AI-Specific Risks

Prompt Injection

Cost Abuse

Deployment Guidance

Single-User CLI (Default)

Backend Service / Multi-Tenant Use

Filesystem Permissions

Dependency Security

Parser Libraries

Best Practices

There aren't any published security advisories

Security: Balaxxe/Mistral_Markitdown

Security

SECURITY.md

Security Policy

Supported Versions

Reporting a Vulnerability

Threat Model

Trust Boundaries

Assumptions

Untrusted documents and parsers

Security Controls

API Key Handling

File Input Validation

File Upload Security

URL Validation (SSRF Protection)

Batch Job ID Validation

Output Filename Safety

Repository / IDE configuration

Resource Limits and Cost Guardrails

Output Safety

Generated Markdown May Contain Untrusted Content

YAML Frontmatter

Terminal Output

Cache Security

Signed URL Security

LLM and AI-Specific Risks

Prompt Injection

Cost Abuse

Deployment Guidance

Single-User CLI (Default)

Backend Service / Multi-Tenant Use

Filesystem Permissions

Dependency Security

Parser Libraries

Best Practices

There aren't any published security advisories