Add `defuddle` extractor plugin as an alternative to Readability/Mercury by Copilot · Pull Request #4 · ArchiveBox/abx-plugins

Copilot · 2026-03-03T19:46:18Z

Review new PR feedback and determine required test changes
Inspect CI workflow status/logs and current defuddle/readability test patterns
Replace mocked defuddle extraction test with real binary + pytest-httpserver integration test
Ensure install hook behavior is exercised in tests using real binary install path
Run targeted lint/tests for modified defuddle tests
Run code review and security checks
Reply to PR comment with commit hash

Original prompt

This section details on the original issue you should resolve

<issue_title>Add defuddle as an alternative to readabiltiy / mercury</issue_title>
<issue_description>Add a plugin similar to the readability / mercury aka postlight-parser ones, but using defuddle instead:

https://github.com/kepano/defuddle</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes Add defuddle as an alternative to readabiltiy / mercury #3

💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Summary by cubic

Adds a Defuddle-based content extractor plugin as an alternative to Readability/Mercury using the existing parser interface. The snapshot hook parses local SingleFile/DOM/wget HTML instead of fetching, with real-binary integration tests added.

New Features
- Parses local HTML from SingleFile/DOM/wget; fails fast if none found.
- Writes content.html, content.txt, and article.json; auto-builds missing text/HTML.
- Config via env: DEFUDDLE_ENABLED, DEFUDDLE_BINARY, DEFUDDLE_TIMEOUT (fallback to TIMEOUT), DEFUDDLE_ARGS, DEFUDDLE_ARGS_EXTRA; crawl hook emits a Binary record with npm/env providers.
- Improved errors (exit code + stderr, timeouts, disabled skip). Tests now use the real defuddle binary via npm/env and pytest-httpserver, covering install record, missing-binary, JSON extraction, and local-HTML input.

^{Written for commit 708f8ca. Summary will update on new commits.}

Co-authored-by: pirate <511499+pirate@users.noreply.github.com>

pirate · 2026-03-03T20:14:18Z

@copilot It should parse from the existing best HTML output like readability/mercury do instead of re-downloading the page from scratch using the URL. Please fix.

Co-authored-by: pirate <511499+pirate@users.noreply.github.com>

Copilot · 2026-03-03T20:18:29Z

@copilot It should parse from the existing best HTML output like readability/mercury do instead of re-downloading the page from scratch using the URL. Please fix.

Implemented in 7a96232. The defuddle snapshot hook now mirrors readability behavior by discovering existing local HTML outputs (singlefile/dom/wget) and running defuddle parse <html_source> --json on that file instead of fetching via URL. I also updated tests to create local HTML input and validate this path.

pirate · 2026-03-03T20:18:36Z

@copilot your tests are completely bullshit fake tests. They need to actually install the real binary and use the real binary against a pytest-httpserver test site and assert the extracted output is correctly saved. Look at how we do pytest-httpserver based tests in the other plugins.

Hit the real code paths that users hit, do not mock or monkeypatch ANYTHING, you must test the real code paths just as a user would run them. You are not allowed to skip install / running the binary in tests, part of the functionality these plugins provide is the install process, so the tests must verify that our on_Crawl install hook works correctly just like we do in other plugins.

Co-authored-by: pirate <511499+pirate@users.noreply.github.com>

Copilot · 2026-03-03T20:29:09Z

@copilot your tests are completely bullshit fake tests. They need to actually install the real binary and use the real binary against a pytest-httpserver test site and assert the extract...

Addressed in 708f8ca. I replaced the mocked extraction test with real integration coverage: tests now install the real defuddle binary via the crawl/npm install hooks, use pytest-httpserver to serve deterministic HTML, run the actual snapshot hook end-to-end, and assert content.html, content.txt, and article.json outputs from the real binary path. I also removed the PYTHONPATH override and switched the install-hook invocation to uv run.

cubic-dev-ai

No issues found across 5 files

devin-ai-integration

Devin Review found 2 potential issues.

View 5 additional findings in Devin Review.

devin-ai-integration · 2026-03-05T05:49:36Z

abx_plugins/plugins/defuddle/tests/test_defuddle.py

@@ -0,0 +1,277 @@
+import json


🟡 Missing tests/__init__.py breaks convention required for test discovery

The defuddle/tests/ directory is missing an __init__.py file, while every other plugin's test directory in the repository (37+ plugins) includes one. Without this file, the tests directory is not a proper Python package, which can cause import resolution failures depending on how pytest discovers and imports test modules. For example, if two different plugins have a test_foo.py without __init__.py, pytest can fail with duplicate module name collisions.

Prompt for agents

Create an empty file at abx_plugins/plugins/defuddle/tests/__init__.py to match the convention used by all other plugin test directories in the repository. Every other plugin under abx_plugins/plugins/*/tests/ has an __init__.py file.

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-03-05T05:49:37Z

abx_plugins/plugins/defuddle/on_Snapshot__57_defuddle.py

+# dependencies = [
+#   "click",
+# ]


🟡 Script metadata declares unused click dependency while code uses argparse

The PEP 723 inline script metadata at lines 4-6 declares click as a dependency (# dependencies = ["click"]), but the actual code imports and uses argparse (line 11) — a stdlib module that requires no dependency declaration. This is inconsistent with sibling plugins like readability (on_Snapshot__56_readability.py) and mercury (on_Snapshot__57_mercury.py) which declare click/rich-click and actually use rich_click as click with @click.command() decorators. When run via uv run --script, this causes unnecessary installation of click. More importantly, this suggests the code was meant to use click (matching the established pattern) but was accidentally written with argparse instead.

Prompt for agents

In abx_plugins/plugins/defuddle/on_Snapshot__57_defuddle.py, either: Option A (preferred - match sibling plugins): Replace the argparse usage with rich_click to match the pattern in readability and mercury plugins. Change the dependency from "click" to "rich-click", replace "import argparse" with "import rich_click as click", and convert the main() function to use @click.command() and @click.option() decorators instead of argparse.ArgumentParser. Option B: If argparse is intentionally used, remove the unused "click" dependency from the inline script metadata (lines 4-6), changing it to just: # dependencies = [] or removing the dependencies key entirely.

Was this helpful? React with 👍 or 👎 to provide feedback.

Initial plan

4807e84

Copilot AI assigned Copilot and pirate Mar 3, 2026

Copilot started work on behalf of pirate March 3, 2026 19:46 View session

Copilot AI and others added 2 commits March 3, 2026 19:54

Add defuddle plugin with install/snapshot hooks and tests

5591c4e

Co-authored-by: pirate <511499+pirate@users.noreply.github.com>

Address review feedback for defuddle error and text handling

bce896b

Co-authored-by: pirate <511499+pirate@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add defuddle as an alternative to readability~~ Add defuddle extractor plugin as an alternative to Readability/Mercury Mar 3, 2026

Copilot finished work on behalf of pirate March 3, 2026 19:57

Copilot started work on behalf of pirate March 3, 2026 20:14 View session

Make defuddle parse existing local HTML source files

7a96232

Co-authored-by: pirate <511499+pirate@users.noreply.github.com>

Copilot finished work on behalf of pirate March 3, 2026 20:19

Copilot started work on behalf of pirate March 3, 2026 20:19 View session

Replace defuddle mocks with real binary integration tests

708f8ca

Co-authored-by: pirate <511499+pirate@users.noreply.github.com>

Copilot finished work on behalf of pirate March 3, 2026 20:29

pirate marked this pull request as ready for review March 5, 2026 05:46

pirate merged commit 0d25938 into main Mar 5, 2026
54 of 55 checks passed

cubic-dev-ai bot reviewed Mar 5, 2026

View reviewed changes

devin-ai-integration bot reviewed Mar 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `defuddle` extractor plugin as an alternative to Readability/Mercury#4

Add `defuddle` extractor plugin as an alternative to Readability/Mercury#4
pirate merged 5 commits intomainfrom
copilot/add-defuddle-plugin

Copilot AI commented Mar 3, 2026 •

edited by devin-ai-integration bot

Loading

Uh oh!

pirate commented Mar 3, 2026

Uh oh!

Copilot AI commented Mar 3, 2026

Uh oh!

pirate commented Mar 3, 2026

Uh oh!

Copilot AI commented Mar 3, 2026

Uh oh!

Uh oh!

cubic-dev-ai bot left a comment

Uh oh!

devin-ai-integration bot left a comment

Uh oh!

devin-ai-integration bot Mar 5, 2026

Uh oh!

devin-ai-integration bot Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Copilot AI commented Mar 3, 2026 • edited by devin-ai-integration bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comments on the Issue (you are @copilot in this section)

Summary by cubic

Uh oh!

pirate commented Mar 3, 2026

Uh oh!

Copilot AI commented Mar 3, 2026

Uh oh!

pirate commented Mar 3, 2026

Uh oh!

Copilot AI commented Mar 3, 2026

Uh oh!

Uh oh!

cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Mar 3, 2026 •

edited by devin-ai-integration bot

Loading