
Replace Extractous with Kreuzberg for text extraction #63

Closed
probird5 wants to merge 3 commits into blacklanternsecurity:master from probird5:fix-kreuzberg

Conversation

@probird5 (Contributor) commented Jan 28, 2026

Summary

  • Replace Extractous with Kreuzberg for text extraction

Fixes #56

Changes

  • man_spider/lib/parser/parser.py: Switched from extractous.Extractor to kreuzberg.extract_file_sync
  • pyproject.toml: Updated dependency from extractous to kreuzberg (>=0.9.0)
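
For readers who have not seen the diff, here is a minimal sketch of what the parser-side change might look like. It is not the code from this PR. `kreuzberg.extract_file_sync` is the call named above; the wrapper name `extract_text`, the injectable `extractor` parameter, the `.content` attribute on the result, and the broad exception handling are assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, Optional, Union


def extract_text(path: Union[str, Path], extractor: Optional[Callable] = None) -> str:
    """Return the text extracted from `path`, or "" if extraction fails.

    `extractor` is a hypothetical hook so the wrapper can be exercised
    without Kreuzberg installed; by default it uses the function named
    in this PR.
    """
    if extractor is None:
        # Imported lazily so this module still loads when Kreuzberg is absent.
        from kreuzberg import extract_file_sync
        extractor = extract_file_sync

    try:
        result = extractor(str(path))
    except Exception:
        # Assumption: one unreadable file should not abort a whole crawl,
        # so it is treated as having no text.
        return ""

    # Assumption: the result object exposes the extracted text as `.content`.
    return getattr(result, "content", "") or ""
```

Returning an empty string on failure is only one possible choice; the project may prefer to log the error or re-raise it.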

Test Plan

  • Tested against SMB share with various file types (CSV, DOC, DOCX, PDF, XLSX)
  • Files successfully downloaded and content matched
[Screenshot: swappy-20260128_144811]

EDIT: Forgot to mention that antiword is no longer needed, and that Kreuzberg uses LibreOffice.

@TheTechromancer (Collaborator)

Let's gooo

This was referenced Feb 1, 2026
@TheTechromancer (Collaborator)

@probird5 thanks for the PR; it's been merged indirectly into master here:

Development

Successfully merging this pull request may close these issues.

MANSPIDER throws ParseError error when extracting content from files
