Use thread-local MarkItDown; add doc classifier by Balaxxe · Pull Request #6 · Balaxxe/Mistral_Markitdown

Balaxxe · 2026-05-19T19:24:19Z

Switch MarkItDown caching to thread-local instances (threading.local) so concurrent workers can convert without serializing on a single global instance; initialization errors are remembered per-thread and reset_markitdown_instance clears the thread-local cache. Remove the global convert lock and allow concurrent convert/convert_stream calls on thread-local instances. Change PDF-to-image conversion to use convert_from_path with paths_only (write files directly to disk) and rename temporary outputs to expected page_###.<ext> names to avoid large in-memory image objects. Add classify_document_type() in mistral_converter with filename/text heuristics and optional LLM fallback, and route dynamic document schema selection through _ocr_shared_optional_params/build_ocr_process_kwargs/process_with_ocr/_prepare_batch_entries by accepting file_path and doc_type. Update tests to reflect thread-local behavior, on-disk PDF conversion (expect paths and paths_only flag), and document classification routing.

Note

Medium Risk
Medium risk due to concurrency changes (removing global conversion lock) and new dynamic document-type classification that affects OCR structured output selection and can introduce extra I/O/LLM calls.

Overview
MarkItDown concurrency: switches from a single global cached MarkItDown instance + global convert lock to thread-local instances with a generation-based reset, enabling parallel convert()/convert_stream() calls.

PDF rendering: updates convert_pdf_to_images to use pdf2image.convert_from_path(..., paths_only=True) and renames on-disk outputs to the expected page_###.<ext> format to avoid large in-memory image objects.

Mistral OCR routing: adds classify_document_type() (filename + first-page text heuristics with optional LLM fallback) and threads file_path through OCR param builders so document annotation schema type can be auto-selected when structured document annotation is enabled; tests are updated accordingly.
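
The classification cascade described above (filename hints, then first-page text hints, then an optional LLM fallback) might look roughly like the sketch below. This is not the code from the PR: the function name `classify_document_type_sketch`, the keyword tables, and the `llm_fallback` callable are all hypothetical, and only the labels "financial_statement", "form", and "generic" are taken from this page.

```python
from pathlib import Path
from typing import Callable, Optional

# Labels that appear in this PR's review comments.
KNOWN_TYPES = {"financial_statement", "form", "generic"}

# Hypothetical keyword tables; the PR's real lists are not fully shown.
FILENAME_HINTS = {"balance_sheet": "financial_statement", "w-9": "form", "w-2": "form"}
TEXT_HINTS = {"balance sheet": "financial_statement", "tax return": "form"}


def classify_document_type_sketch(
    file_path: Path,
    first_page_text: str = "",
    llm_fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> str:
    """Cheapest signal first: filename, then page text, then the optional LLM."""
    name = file_path.name.lower()
    for needle, doc_type in FILENAME_HINTS.items():
        if needle in name:
            return doc_type

    text = first_page_text.lower()
    for needle, doc_type in TEXT_HINTS.items():
        if needle in text:
            return doc_type

    # Only pay for an LLM call when the heuristics found nothing.
    if llm_fallback is not None and text.strip():
        guess = llm_fallback(text)
        if guess in KNOWN_TYPES:
            return guess
    return "generic"
```

Passing the fallback in as a callable keeps the extra I/O explicit and makes the heuristics testable without network access.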

Reviewed by Cursor Bugbot for commit 4293a6e. Bugbot is set up for automated code reviews on this repo.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 63bd356145

chatgpt-codex-connector · 2026-05-19T19:28:18Z

            "poppler_path": poppler_path,
            "thread_count": max(1, thread_count),
            "use_pdftocairo": config.PDF_IMAGE_USE_PDFTOCAIRO,
+            "output_file": "page",


Use per-run prefix for PDF page temp outputs

Setting output_file to the constant "page" makes convert_from_path(..., paths_only=True) collect any pre-existing files in output_folder that start with that prefix, not just files produced by the current conversion. When output_dir is reused (for example, rerunning conversion on the same document path), stale page_* files can be included, then renamed and returned as if they were current pages, producing incorrect page counts/content. Use a unique per-run prefix (or clear the directory first) before collecting paths.
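
One way to act on this suggestion is to generate a fresh prefix for every conversion run. The sketch below is an illustration rather than the PR's actual fix: `build_render_kwargs` and `render_pages` are hypothetical helper names, while `output_folder`, `fmt`, `paths_only`, and `output_file` are existing `pdf2image.convert_from_path` parameters.

```python
import uuid
from pathlib import Path
from typing import Any, Dict, List


def build_render_kwargs(output_dir: Path, fmt: str = "png") -> Dict[str, Any]:
    """Build convert_from_path kwargs with a prefix unique to this run."""
    return {
        "output_folder": str(output_dir),
        "fmt": fmt,
        "paths_only": True,
        # Fresh prefix per run: stale page_* files do not share it,
        # so prefix-based collection cannot pick them up.
        "output_file": f"run_{uuid.uuid4().hex}",
    }


def render_pages(pdf_path: Path, output_dir: Path) -> List[Path]:
    """Render a PDF to image files on disk and return their paths."""
    # Imported lazily so the kwargs helper works without pdf2image/poppler.
    from pdf2image import convert_from_path

    output_dir.mkdir(parents=True, exist_ok=True)
    paths = convert_from_path(str(pdf_path), **build_render_kwargs(output_dir))
    return [Path(p) for p in paths]
```

The returned paths can then be renamed to the expected page_###.<ext> names; clearing the directory first, as the comment also suggests, is an alternative when old outputs are not needed.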

chatgpt-codex-connector · 2026-05-19T19:28:18Z

+        if hasattr(_markitdown_instances, "instance"):
+            delattr(_markitdown_instances, "instance")


Reset thread-local MarkItDown cache across all threads

reset_markitdown_instance() only deletes instance from the current thread’s threading.local() storage, so worker threads in a long-lived thread pool retain stale cached instances (or cached None after an init failure). After calling reset, those threads will continue using old state and may never retry initialization, which breaks the reset API’s expected behavior in multithreaded runs.
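
A generation counter is one common way to make a reset visible to every thread. The sketch below assumes nothing about the repository beyond what this comment says: `_create_instance`, `get_instance`, and `reset_instances` are hypothetical names, and a plain `object()` stands in for a MarkItDown instance.

```python
import threading

_local = threading.local()
_generation = 0
_generation_lock = threading.Lock()


def _create_instance():
    # Stand-in for constructing MarkItDown; keeps the sketch dependency-free.
    return object()


def get_instance():
    """Return this thread's cached instance, rebuilding it after any reset."""
    current = _generation
    if getattr(_local, "generation", None) != current:
        try:
            _local.instance = _create_instance()
        except Exception:
            # Remember the failure, but only until the next reset.
            _local.instance = None
        _local.generation = current
    return _local.instance


def reset_instances():
    """Invalidate the cached instance in every thread, not just the caller's."""
    global _generation
    with _generation_lock:
        _generation += 1
```

Each thread compares its stored generation with the global one on its next call, so long-lived pool workers drop stale instances (or a cached failure) lazily, without the resetting thread having to reach into their thread-local storage.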

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

cursor · 2026-05-19T20:29:09Z

+        ):
+            logger.debug("Classified %s as 'financial_statement' via page 1 text", file_path.name)
+            return "financial_statement"
+        if any(w in first_text_lower for w in ["form ", "w-9", "w-2", "tax return", "filer"]):


Text heuristic "form " matches inside common words

Medium Severity

The substring check "form " in first_text_lower matches inside common English words like "platform ", "perform ", "transform ", "reform ", "conform ", and "inform " because "form " is a substring of each. Any document whose first-page text contains these everyday words gets misclassified as "form" instead of falling through to "generic" or the LLM fallback. This causes the wrong document annotation schema to be selected during OCR processing.
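
A word-boundary regular expression is one possible fix. The sketch below reuses the keyword list from the diff; the names `FORM_PATTERN` and `looks_like_form` are hypothetical. Matching is slightly different from the original substring check: "form" followed by punctuation now matches, while "filers" no longer does.

```python
import re

# \b requires a word boundary on both sides, so "platform " and
# "perform " no longer count as hits for "form".
FORM_PATTERN = re.compile(r"\b(?:form|w-9|w-2|tax return|filer)\b")


def looks_like_form(first_text_lower: str) -> bool:
    """Return True if the lowercased first-page text contains a form keyword."""
    return FORM_PATTERN.search(first_text_lower) is not None
```

Compiling the pattern once at import time avoids rebuilding it for every document.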

cursor · 2026-05-19T20:29:09Z

+            target_path = output_dir / f"page_{i:03d}.{file_extension}"
+            if temp_path.exists() and temp_path != target_path:
+                temp_path.replace(target_path)
+            image_paths.append(target_path)


PDF page path appended even when file missing

Low Severity

When temp_path.exists() is False, no rename occurs but target_path is still unconditionally appended to image_paths. The function then returns a list containing paths to files that don't exist on disk. Downstream consumers iterating over these paths (e.g., to open images for OCR) would encounter FileNotFoundError.
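
A minimal sketch of a guarded version of this loop follows. It assumes pages are numbered from 1, which the diff does not show, and `finalize_page_paths` is a hypothetical helper name, not a function from the PR.

```python
import logging
from pathlib import Path
from typing import Iterable, List

logger = logging.getLogger(__name__)


def finalize_page_paths(
    temp_paths: Iterable[Path], output_dir: Path, file_extension: str
) -> List[Path]:
    """Rename rendered pages to page_###.<ext>, returning only files that exist."""
    image_paths: List[Path] = []
    for i, temp_path in enumerate(temp_paths, start=1):  # start=1 is an assumption
        target_path = output_dir / f"page_{i:03d}.{file_extension}"
        if not temp_path.exists():
            # Skip instead of returning a path that would fail when opened.
            logger.warning("Rendered page missing on disk: %s", temp_path)
            continue
        if temp_path != target_path:
            temp_path.replace(target_path)
        image_paths.append(target_path)
    return image_paths
```

Skipping keeps the original page number in the filename, so a gap in the returned list signals a missing page; raising an exception instead would be the stricter choice.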

chatgpt-codex-connector Bot reviewed May 19, 2026

cursor Bot reviewed May 19, 2026

Comment thread mistral_converter.py

Comment thread local_converter.py

Fix document classification cache invalidation bugs (f6e00c3)

cursor Bot reviewed May 19, 2026

Comment thread local_converter.py (outdated)

Remove thread-local MarkItDown double-check (8fea2dd)

cursor Bot reviewed May 19, 2026

Comment thread mistral_converter.py (outdated)

Avoid unused document type classification (4293a6e)

cursor Bot reviewed May 19, 2026

Balaxxe merged commit bc6d760 into main May 19, 2026
12 checks passed

Balaxxe deleted the Dev branch May 19, 2026 20:30

Balaxxe merged 4 commits into main from Dev

2 participants
