Skip to content

Use thread-local MarkItDown; add doc classifier#6

Merged
Balaxxe merged 4 commits into
mainfrom
Dev
May 19, 2026
Merged

Use thread-local MarkItDown; add doc classifier#6
Balaxxe merged 4 commits into
mainfrom
Dev

Conversation

@Balaxxe
Copy link
Copy Markdown
Owner

@Balaxxe Balaxxe commented May 19, 2026

Switch MarkItDown caching to thread-local instances (threading.local) so concurrent workers can convert without serializing on a single global instance; initialization errors are remembered per-thread and reset_markitdown_instance clears the thread-local cache. Remove the global convert lock and allow concurrent convert/convert_stream calls on thread-local instances. Change PDF-to-image conversion to use convert_from_path with paths_only (write files directly to disk) and rename temporary outputs to expected page_###. names to avoid large in-memory image objects. Add classify_document_type() in mistral_converter with filename/text heuristics and optional LLM fallback, and route dynamic document schema selection through _ocr_shared_optional_params/build_ocr_process_kwargs/process_with_ocr/_prepare_batch_entries by accepting file_path and doc_type. Update tests to reflect thread-local behavior, on-disk PDF conversion (expect paths and paths_only flag), and document classification routing.


Note

Medium Risk
Medium risk due to concurrency changes (removing global conversion lock) and new dynamic document-type classification that affects OCR structured output selection and can introduce extra I/O/LLM calls.

Overview
MarkItDown concurrency: switches from a single global cached MarkItDown instance + global convert lock to thread-local instances with a generation-based reset, enabling parallel convert()/convert_stream() calls.

PDF rendering: updates convert_pdf_to_images to use pdf2image.convert_from_path(..., paths_only=True) and renames on-disk outputs to the expected page_###.<ext> format to avoid large in-memory image objects.

Mistral OCR routing: adds classify_document_type() (filename + first-page text heuristics with optional LLM fallback) and threads file_path through OCR param builders so document annotation schema type can be auto-selected when structured document annotation is enabled; tests are updated accordingly.

Reviewed by Cursor Bugbot for commit 4293a6e. Bugbot is set up for automated code reviews on this repo. Configure here.

Switch MarkItDown caching to thread-local instances (threading.local) so concurrent workers can convert without serializing on a single global instance; initialization errors are remembered per-thread and reset_markitdown_instance clears the thread-local cache. Remove the global convert lock and allow concurrent convert/convert_stream calls on thread-local instances. Change PDF-to-image conversion to use convert_from_path with paths_only (write files directly to disk) and rename temporary outputs to expected page_###.<ext> names to avoid large in-memory image objects. Add classify_document_type() in mistral_converter with filename/text heuristics and optional LLM fallback, and route dynamic document schema selection through _ocr_shared_optional_params/build_ocr_process_kwargs/process_with_ocr/_prepare_batch_entries by accepting file_path and doc_type. Update tests to reflect thread-local behavior, on-disk PDF conversion (expect paths and paths_only flag), and document classification routing.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 63bd356145

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread local_converter.py
"poppler_path": poppler_path,
"thread_count": max(1, thread_count),
"use_pdftocairo": config.PDF_IMAGE_USE_PDFTOCAIRO,
"output_file": "page",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Use per-run prefix for PDF page temp outputs

Setting output_file to the constant "page" makes convert_from_path(..., paths_only=True) collect any pre-existing files in output_folder that start with that prefix, not just files produced by the current conversion. When output_dir is reused (for example, rerunning conversion on the same document path), stale page_* files can be included, then renamed and returned as if they were current pages, producing incorrect page counts/content. Use a unique per-run prefix (or clear the directory first) before collecting paths.

Useful? React with 👍 / 👎.

Comment thread local_converter.py
Comment on lines +144 to +145
if hasattr(_markitdown_instances, "instance"):
delattr(_markitdown_instances, "instance")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reset thread-local MarkItDown cache across all threads

reset_markitdown_instance() only deletes instance from the current thread’s threading.local() storage, so worker threads in a long-lived thread pool retain stale cached instances (or cached None after an init failure). After calling reset, those threads will continue using old state and may never retry initialization, which breaks the reset API’s expected behavior in multithreaded runs.

Useful? React with 👍 / 👎.

Comment thread mistral_converter.py
Comment thread local_converter.py
Comment thread local_converter.py Outdated
Comment thread mistral_converter.py Outdated
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 4293a6e. Configure here.

Comment thread mistral_converter.py
):
logger.debug("Classified %s as 'financial_statement' via page 1 text", file_path.name)
return "financial_statement"
if any(w in first_text_lower for w in ["form ", "w-9", "w-2", "tax return", "filer"]):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Text heuristic "form " matches inside common words

Medium Severity

The substring check "form " in first_text_lower matches inside common English words like "platform ", "perform ", "transform ", "reform ", "conform ", and "inform " because "form " is a substring of each. Any document whose first-page text contains these everyday words gets misclassified as "form" instead of falling through to "generic" or the LLM fallback. This causes the wrong document annotation schema to be selected during OCR processing.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4293a6e. Configure here.

Comment thread local_converter.py
target_path = output_dir / f"page_{i:03d}.{file_extension}"
if temp_path.exists() and temp_path != target_path:
temp_path.replace(target_path)
image_paths.append(target_path)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PDF page path appended even when file missing

Low Severity

When temp_path.exists() is False, no rename occurs but target_path is still unconditionally appended to image_paths. The function then returns a list containing paths to files that don't exist on disk. Downstream consumers iterating over these paths (e.g., to open images for OCR) would encounter FileNotFoundError.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4293a6e. Configure here.

@Balaxxe Balaxxe merged commit bc6d760 into main May 19, 2026
12 checks passed
@Balaxxe Balaxxe deleted the Dev branch May 19, 2026 20:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants