Conversation
Switch MarkItDown caching to thread-local instances (threading.local) so concurrent workers can convert without serializing on a single global instance; initialization errors are remembered per-thread and reset_markitdown_instance clears the thread-local cache. Remove the global convert lock and allow concurrent convert/convert_stream calls on thread-local instances. Change PDF-to-image conversion to use convert_from_path with paths_only (write files directly to disk) and rename temporary outputs to expected page_###.<ext> names to avoid large in-memory image objects. Add classify_document_type() in mistral_converter with filename/text heuristics and optional LLM fallback, and route dynamic document schema selection through _ocr_shared_optional_params/build_ocr_process_kwargs/process_with_ocr/_prepare_batch_entries by accepting file_path and doc_type. Update tests to reflect thread-local behavior, on-disk PDF conversion (expect paths and paths_only flag), and document classification routing.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 63bd356145
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| "poppler_path": poppler_path, | ||
| "thread_count": max(1, thread_count), | ||
| "use_pdftocairo": config.PDF_IMAGE_USE_PDFTOCAIRO, | ||
| "output_file": "page", |
There was a problem hiding this comment.
Use per-run prefix for PDF page temp outputs
Setting output_file to the constant "page" makes convert_from_path(..., paths_only=True) collect any pre-existing files in output_folder that start with that prefix, not just files produced by the current conversion. When output_dir is reused (for example, rerunning conversion on the same document path), stale page_* files can be included, then renamed and returned as if they were current pages, producing incorrect page counts/content. Use a unique per-run prefix (or clear the directory first) before collecting paths.
Useful? React with 👍 / 👎.
| if hasattr(_markitdown_instances, "instance"): | ||
| delattr(_markitdown_instances, "instance") |
There was a problem hiding this comment.
Reset thread-local MarkItDown cache across all threads
reset_markitdown_instance() only deletes instance from the current thread’s threading.local() storage, so worker threads in a long-lived thread pool retain stale cached instances (or cached None after an init failure). After calling reset, those threads will continue using old state and may never retry initialization, which breaks the reset API’s expected behavior in multithreaded runs.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 4293a6e. Configure here.
| ): | ||
| logger.debug("Classified %s as 'financial_statement' via page 1 text", file_path.name) | ||
| return "financial_statement" | ||
| if any(w in first_text_lower for w in ["form ", "w-9", "w-2", "tax return", "filer"]): |
There was a problem hiding this comment.
Text heuristic "form " matches inside common words
Medium Severity
The substring check "form " in first_text_lower matches inside common English words like "platform ", "perform ", "transform ", "reform ", "conform ", and "inform " because "form " is a substring of each. Any document whose first-page text contains these everyday words gets misclassified as "form" instead of falling through to "generic" or the LLM fallback. This causes the wrong document annotation schema to be selected during OCR processing.
Reviewed by Cursor Bugbot for commit 4293a6e. Configure here.
| target_path = output_dir / f"page_{i:03d}.{file_extension}" | ||
| if temp_path.exists() and temp_path != target_path: | ||
| temp_path.replace(target_path) | ||
| image_paths.append(target_path) |
There was a problem hiding this comment.
PDF page path appended even when file missing
Low Severity
When temp_path.exists() is False, no rename occurs but target_path is still unconditionally appended to image_paths. The function then returns a list containing paths to files that don't exist on disk. Downstream consumers iterating over these paths (e.g., to open images for OCR) would encounter FileNotFoundError.
Reviewed by Cursor Bugbot for commit 4293a6e. Configure here.


Switch MarkItDown caching to thread-local instances (threading.local) so concurrent workers can convert without serializing on a single global instance; initialization errors are remembered per-thread and reset_markitdown_instance clears the thread-local cache. Remove the global convert lock and allow concurrent convert/convert_stream calls on thread-local instances. Change PDF-to-image conversion to use convert_from_path with paths_only (write files directly to disk) and rename temporary outputs to expected page_###. names to avoid large in-memory image objects. Add classify_document_type() in mistral_converter with filename/text heuristics and optional LLM fallback, and route dynamic document schema selection through _ocr_shared_optional_params/build_ocr_process_kwargs/process_with_ocr/_prepare_batch_entries by accepting file_path and doc_type. Update tests to reflect thread-local behavior, on-disk PDF conversion (expect paths and paths_only flag), and document classification routing.
Note
Medium Risk
Medium risk due to concurrency changes (removing global conversion lock) and new dynamic document-type classification that affects OCR structured output selection and can introduce extra I/O/LLM calls.
Overview
MarkItDown concurrency: switches from a single global cached
MarkItDowninstance + global convert lock to thread-local instances with a generation-based reset, enabling parallelconvert()/convert_stream()calls.PDF rendering: updates
convert_pdf_to_imagesto usepdf2image.convert_from_path(..., paths_only=True)and renames on-disk outputs to the expectedpage_###.<ext>format to avoid large in-memory image objects.Mistral OCR routing: adds
classify_document_type()(filename + first-page text heuristics with optional LLM fallback) and threadsfile_paththrough OCR param builders so document annotation schema type can be auto-selected when structured document annotation is enabled; tests are updated accordingly.Reviewed by Cursor Bugbot for commit 4293a6e. Bugbot is set up for automated code reviews on this repo. Configure here.