Replace Extractous with Kreuzberg for text extraction by probird5 · Pull Request #63 · blacklanternsecurity/MANSPIDER

probird5 · 2026-01-28T19:57:50Z

Summary

Replace Extractous with Kreuzberg for text extraction

Fixes #56

Changes

man_spider/lib/parser/parser.py: Switched from extractous.Extractor to kreuzberg.extract_file_sync
pyproject.toml: Updated dependency from extractous to kreuzberg (>=0.9.0)
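
A minimal sketch of what the parser-side call might look like after the switch. It is not the code from this PR. The helper name `extract_text`, the injectable `extract` parameter, and the blanket error handling are hypothetical. It also assumes that `kreuzberg.extract_file_sync` returns an object exposing the extracted text as `.content`.

```python
def extract_text(path, extract=None):
    """Return the text extracted from `path`, or "" if extraction fails.

    `extract` is a hypothetical injection point so the helper can be
    exercised without kreuzberg installed; by default it is
    kreuzberg.extract_file_sync.
    """
    if extract is None:
        # Imported lazily so importing this module does not require kreuzberg.
        from kreuzberg import extract_file_sync as extract
    try:
        result = extract(str(path))
    except Exception:
        # Unsupported or corrupt files are treated as having no text.
        return ""
    # Assumption: the result object carries the text in a `.content` attribute.
    return getattr(result, "content", "") or ""
```

Passing the extractor in as a parameter keeps the helper independent of the library, so a later change of extraction backend would only touch the default.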

Test Plan

Tested against SMB share with various file types (CSV, DOC, DOCX, PDF, XLSX)
Files successfully downloaded and content matched

EDIT: Forgot to mention that antiword is no longer needed, and that Kreuzberg uses LibreOffice instead.
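
Since LibreOffice replaces antiword as the external dependency, a quick pre-flight check could look like the sketch below. The helper name `find_libreoffice` and the candidate launcher names are assumptions for illustration. Neither this PR nor Kreuzberg is known to perform this check in this way.

```python
import shutil


def find_libreoffice(which=shutil.which):
    """Return the path of the first LibreOffice launcher found on PATH, else None."""
    # Assumed launcher names; which one exists varies by platform and package.
    for name in ("soffice", "libreoffice"):
        path = which(name)
        if path:
            return path
    return None
```

A caller could log a warning when this returns `None`, since legacy `.doc` files might then fail to extract.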


TheTechromancer · 2026-01-29T13:36:41Z

Let's gooo

TheTechromancer · 2026-02-05T16:25:53Z

@probird5 thanks for the PR; it's been merged indirectly into master here:

Manspider 2.0 Update #65

probird5 added 3 commits January 28, 2026 13:40

Replace Extractous with Kreuzberg for text extraction Fixes blacklant…

5da575d

…ernsecurity#56

fix python version

0b6f6db

reverted python version

a34ad2a

This was referenced Feb 1, 2026

Python 3.14 support #64

Open

Manspider 2.0 Update #65

Merged

TheTechromancer closed this Feb 5, 2026

probird5 wants to merge 3 commits into blacklanternsecurity:master from probird5:fix-kreuzberg
