A document conversion system combining Microsoft MarkItDown (local) with Mistral AI OCR (cloud) for optimal document processing. Features smart auto-routing, advanced PDF table extraction, intelligent caching, and concurrent batch processing.
- Python 3.10+
- Mistral API key (optional — only needed for cloud OCR/QnA/Batch features): https://console.mistral.ai/api-keys/
- Without a key, local MarkItDown conversion (mode 2) and PDF-to-images (mode 4) work fully.
- A valid API key is enough for single-file OCR and Document QnA.
- Batch OCR additionally requires Mistral AI Studio Scale / paid access. A valid key alone is not enough.
- If batch submit still returns free-trial / 402 messaging after a plan change, confirm the workspace is on Scale and create a fresh API key.
python3 -m venv env
source env/bin/activate # Windows: env\Scripts\activate
pip install -r requirements.txtOr install as a package:
pip install .Or use the platform scripts:
# macOS/Linux
chmod +x scripts/quick_start.sh && ./scripts/quick_start.sh
# Windows
scripts\run_converter.batOptional extras (audio transcription, YouTube, markitdown-ocr; see requirements-optional.txt):
pip install -r requirements-optional.txtCreate a .env file in the project root (see .env.example for all options):
MISTRAL_API_KEY="your_api_key_here"Place files in input/, then run:
python3 main.py # Interactive menu
python3 main.py --mode smart # CLI: auto-route by file type
python3 main.py --test # Verify setup| # | Mode | API? | Description |
|---|---|---|---|
| 1 | Convert (Smart) | If key set | Auto-picks MarkItDown or Mistral OCR per file type. PDFs also get table extraction. |
| 2 | Convert (MarkItDown) | No | Force local conversion. Fast, free, supports 30+ formats. |
| 3 | Convert (Mistral OCR) | Yes | Force cloud OCR. Best for scanned docs, complex layouts, equations. |
| 4 | PDF to Images | No | Render each PDF page to PNG/JPEG at configurable DPI. |
| 5 | Document QnA | Yes | Ask questions about a document in natural language (advisory for exact values). |
| 6 | Batch OCR | Yes | Submit to Mistral Batch API at reduced cost (requires AI Studio Scale). |
| 7 | System Status | No | Cache stats, config info, optional feature readiness, diagnostics. |
| 8 | Maintenance | No | Clear expired cache, clean up old Mistral uploads. |
Smart mode prints its routing decisions before processing:
Routing plan:
scan.pdf -> Mistral OCR (scanned + table extraction)
report.docx -> MarkItDown (local)
notes.txt -> MarkItDown (local)
All conversion modes process multiple files concurrently when more than one is selected.
python3 main.py --mode smart # Smart auto-routing (recommended)
python3 main.py --mode markitdown # Force MarkItDown
python3 main.py --mode mistral_ocr # Force Mistral OCR
python3 main.py --mode pdf_to_images
python3 main.py --mode qna
python3 main.py --mode batch_ocr
python3 main.py --mode status
python3 main.py --mode maintenance # Clear cache and old uploads
python3 main.py --no-interactive # Process all files in input/ without prompts| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, RTF, MSG |
| Web | HTML, HTM, XML, RSS |
| Data | CSV, JSON, TXT |
| Images | PNG, JPG, JPEG, GIF, BMP, TIFF, WEBP, AVIF |
| Books | EPUB |
| Notebooks | IPYNB (Jupyter) |
| Archives | ZIP (recursive extraction) |
| Audio | MP3, WAV, M4A, FLAC (requires plugins + ffmpeg) |
- MarkItDown -- fast, local, free. Handles standard document formats natively.
- Mistral OCR -- AI-powered cloud OCR via the Files API with signed URLs. Understands complex layouts, tables, equations, multi-column text with ~95% accuracy.
- Route each file to MarkItDown or Mistral OCR based on content analysis (text layer detection for PDFs, extension for other types)
- For PDFs: run pdfplumber-based multi-strategy table extraction with automatic post-processing
- OCR quality assessment (0-100 scoring) with automatic weak page re-processing
- Results cached by SHA-256 content hash (24-hour TTL, second run = $0)
- Caching: SHA-256 file hashing with 24-hour persistence. Reprocessing the same files costs nothing.
- Batch OCR: Significant cost reduction for 10+ documents via Mistral Batch API. Requires AI Studio Scale / paid access.
- Auto-cleanup: Old uploaded files removed from Mistral after 7 days (configurable).
Advanced multi-strategy extraction for any tabular data:
- pdfplumber line-based (grid tables) + pdfplumber text-based (borderless tables)
- Automatic post-processing: split-header repair, merged cell detection, page artifact removal, cross-page table coalescing
- Financial extras: merged currency cell splitting (
"$ 1,234.56 $ 5,678.90"→ two cells), month header normalization - Deduplication and cross-page table coalescing
Automated 0-100 scoring evaluates every OCR result:
- Text density, token uniqueness, repeated phrase detection, average line length (aggregate stats still report digit counts)
- Pages scoring below threshold are automatically re-processed
- Quality score included in output for transparency
- Thread-safe Mistral client singleton ensures safe concurrent usage
Configure via .env:
ENABLE_OCR_QUALITY_ASSESSMENT=true
ENABLE_OCR_WEAK_PAGE_IMPROVEMENT=trueExtract structured JSON from documents using predefined schemas:
MISTRAL_ENABLE_STRUCTURED_OUTPUT=true
MISTRAL_DOCUMENT_SCHEMA_TYPE=auto # invoice, financial_statement, contract, form, generic
MISTRAL_ENABLE_BBOX_ANNOTATION=false
MISTRAL_ENABLE_DOCUMENT_ANNOTATION=falseBuilt-in schemas for invoices, financial statements, contracts, forms, and generic documents. Custom schemas can be added in schemas.py.
Interactive natural language queries against document content:
Important caveat: QnA works well for summaries and exploratory questions, but do not trust it blindly for exact-value extraction. For dates, amounts, invoice numbers, IDs, or compliance-sensitive fields, use OCR markdown/metadata as the source of truth and treat QnA as advisory only.
python3 main.py --mode qna
# Select a file, then ask questions interactivelyUses Mistral chat completion with document_url content type. Configurable model, system prompt, and page/image limits.
When using public URL mode, the app performs client-side HTTPS/DNS validation as a best-effort guard only (it cannot prevent DNS rebinding or all SSRF cases). Prefer uploading local files for QnA, or restrict network egress, in high-assurance environments.
For real-time token-by-token output, use the streaming variant:
from mistral_converter import query_document_stream
success, stream, error = query_document_stream(
"https://arxiv.org/pdf/1805.04770",
"What is the main contribution of this paper?"
)
if success:
for chunk in stream:
if chunk.data.choices and chunk.data.choices[0].delta.content:
print(chunk.data.choices[0].delta.content, end="", flush=True)
print()The interactive QnA mode (mode 5) uses streaming by default.
Submit 10+ documents to the Mistral Batch API at reduced cost. After submission, the CLI emits a machine-readable BATCH_JOB_ID=<id> line for easy integration with automation scripts, in addition to the human-readable confirmation.
python3 main.py --mode batch_ocr --batch-action submit --no-interactive
# Output includes: BATCH_JOB_ID=<your-job-id>System Status is available as interactive menu option 7, or from the CLI via python3 main.py --mode status (alias: --test). It reports optional feature readiness alongside configuration and cache stats:
Optional Features:
* ffmpeg: Available
* pydub: Available
* youtube_transcript_api: Not installed (needed for YouTube transcripts)
* olefile: Available
MarkItDown can use Mistral's vision models for AI-powered image descriptions within documents:
MARKITDOWN_ENABLE_LLM_DESCRIPTIONS=false
MARKITDOWN_LLM_MODEL=pixtral-large-latest| File | Purpose |
|---|---|
requirements.txt |
Core: MarkItDown, Mistral SDK, Pydantic, pdfplumber, pdf2image, Pillow |
requirements-dev.txt |
Dev: pytest, flake8, black, isort, pip-audit |
requirements-optional.txt |
Optional: audio transcription, YouTube, OpenAI client, markitdown-ocr |
| Binary | Required For | Install |
|---|---|---|
| Poppler | PDF to images | brew install poppler / apt install poppler-utils / Windows binary |
| ffmpeg | Audio transcription | brew install ffmpeg / apt install ffmpeg |
| ExifTool | EXIF metadata extraction | Optional, set MARKITDOWN_EXIFTOOL_PATH |
On Windows, set the Poppler path in .env:
POPPLER_PATH="C:/path/to/poppler/bin"output_md/ # Markdown files (.md)
output_txt/ # Plain text exports (.txt)
output_images/ # Extracted images and PDF page renders
logs/ # Processing logs and batch metadata
cache/ # OCR result cache (SHA-256 indexed)
All settings are in .env. See .env.example for the complete reference with 90+ documented options.
For the full configuration guide: CONFIGURATION.md
| Guide | Description |
|---|---|
| CONFIGURATION.md | Complete configuration reference |
| ARCHITECTURE.md | Architecture and design details |
| KNOWN_ISSUES.md | Known issues, limitations, troubleshooting |
| CONTRIBUTING.md | Development setup and contribution guidelines |
| CHANGELOG.md | Release history and version changes |
| SECURITY.md | Security policy and vulnerability reporting |
- MarkItDown:
>=0.1.5(https://github.com/microsoft/markitdown) - Mistral Python SDK:
==2.1.3(https://github.com/mistralai/client-python) - Mistral OCR docs: https://docs.mistral.ai/capabilities/document_ai/basic_ocr/
- Mistral Batch API: https://docs.mistral.ai/capabilities/batch/
See LICENSE.