Skip to content

Add defuddle extractor plugin as an alternative to Readability/Mercury#4

Merged
pirate merged 5 commits intomainfrom
copilot/add-defuddle-plugin
Mar 5, 2026
Merged

Add defuddle extractor plugin as an alternative to Readability/Mercury#4
pirate merged 5 commits intomainfrom
copilot/add-defuddle-plugin

Conversation

Copy link
Contributor

Copilot AI commented Mar 3, 2026

  • Review new PR feedback and determine required test changes
  • Inspect CI workflow status/logs and current defuddle/readability test patterns
  • Replace mocked defuddle extraction test with real binary + pytest-httpserver integration test
  • Ensure install hook behavior is exercised in tests using real binary install path
  • Run targeted lint/tests for modified defuddle tests
  • Run code review and security checks
  • Reply to PR comment with commit hash
Original prompt

This section details on the original issue you should resolve

<issue_title>Add defuddle as an alternative to readabiltiy / mercury</issue_title>
<issue_description>Add a plugin similar to the readability / mercury aka postlight-parser ones, but using defuddle instead:

https://github.com/kepano/defuddle</issue_description>

Comments on the Issue (you are @copilot in this section)


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.


Summary by cubic

Adds a Defuddle-based content extractor plugin as an alternative to Readability/Mercury using the existing parser interface. The snapshot hook parses local SingleFile/DOM/wget HTML instead of fetching, with real-binary integration tests added.

  • New Features
    • Parses local HTML from SingleFile/DOM/wget; fails fast if none found.
    • Writes content.html, content.txt, and article.json; auto-builds missing text/HTML.
    • Config via env: DEFUDDLE_ENABLED, DEFUDDLE_BINARY, DEFUDDLE_TIMEOUT (fallback to TIMEOUT), DEFUDDLE_ARGS, DEFUDDLE_ARGS_EXTRA; crawl hook emits a Binary record with npm/env providers.
    • Improved errors (exit code + stderr, timeouts, disabled skip). Tests now use the real defuddle binary via npm/env and pytest-httpserver, covering install record, missing-binary, JSON extraction, and local-HTML input.

Written for commit 708f8ca. Summary will update on new commits.


Open with Devin

Copilot AI and others added 2 commits March 3, 2026 19:54
Co-authored-by: pirate <511499+pirate@users.noreply.github.com>
Co-authored-by: pirate <511499+pirate@users.noreply.github.com>
Copilot AI changed the title [WIP] Add defuddle as an alternative to readability Add defuddle extractor plugin as an alternative to Readability/Mercury Mar 3, 2026
@pirate
Copy link
Member

pirate commented Mar 3, 2026

@copilot It should parse from the existing best HTML output like readability/mercury do instead of re-downloading the page from scratch using the URL. Please fix.

Co-authored-by: pirate <511499+pirate@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Mar 3, 2026

@copilot It should parse from the existing best HTML output like readability/mercury do instead of re-downloading the page from scratch using the URL. Please fix.

Implemented in 7a96232. The defuddle snapshot hook now mirrors readability behavior by discovering existing local HTML outputs (singlefile/dom/wget) and running defuddle parse <html_source> --json on that file instead of fetching via URL. I also updated tests to create local HTML input and validate this path.

@pirate
Copy link
Member

pirate commented Mar 3, 2026

@copilot your tests are completely bullshit fake tests. They need to actually install the real binary and use the real binary against a pytest-httpserver test site and assert the extracted output is correctly saved. Look at how we do pytest-httpserver based tests in the other plugins.

Hit the real code paths that users hit, do not mock or monkeypatch ANYTHING, you must test the real code paths just as a user would run them. You are not allowed to skip install / running the binary in tests, part of the functionality these plugins provide is the install process, so the tests must verify that our on_Crawl install hook works correctly just like we do in other plugins.

Co-authored-by: pirate <511499+pirate@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Mar 3, 2026

@copilot your tests are completely bullshit fake tests. They need to actually install the real binary and use the real binary against a pytest-httpserver test site and assert the extract...

Addressed in 708f8ca. I replaced the mocked extraction test with real integration coverage: tests now install the real defuddle binary via the crawl/npm install hooks, use pytest-httpserver to serve deterministic HTML, run the actual snapshot hook end-to-end, and assert content.html, content.txt, and article.json outputs from the real binary path. I also removed the PYTHONPATH override and switched the install-hook invocation to uv run.

@pirate pirate marked this pull request as ready for review March 5, 2026 05:46
@pirate pirate merged commit 0d25938 into main Mar 5, 2026
54 of 55 checks passed
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 5 files

Copy link
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 potential issues.

View 5 additional findings in Devin Review.

Open in Devin Review

@@ -0,0 +1,277 @@
import json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Missing tests/__init__.py breaks convention required for test discovery

The defuddle/tests/ directory is missing an __init__.py file, while every other plugin's test directory in the repository (37+ plugins) includes one. Without this file, the tests directory is not a proper Python package, which can cause import resolution failures depending on how pytest discovers and imports test modules. For example, if two different plugins have a test_foo.py without __init__.py, pytest can fail with duplicate module name collisions.

Prompt for agents
Create an empty file at abx_plugins/plugins/defuddle/tests/__init__.py to match the convention used by all other plugin test directories in the repository. Every other plugin under abx_plugins/plugins/*/tests/ has an __init__.py file.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +4 to +6
# dependencies = [
# "click",
# ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Script metadata declares unused click dependency while code uses argparse

The PEP 723 inline script metadata at lines 4-6 declares click as a dependency (# dependencies = ["click"]), but the actual code imports and uses argparse (line 11) — a stdlib module that requires no dependency declaration. This is inconsistent with sibling plugins like readability (on_Snapshot__56_readability.py) and mercury (on_Snapshot__57_mercury.py) which declare click/rich-click and actually use rich_click as click with @click.command() decorators. When run via uv run --script, this causes unnecessary installation of click. More importantly, this suggests the code was meant to use click (matching the established pattern) but was accidentally written with argparse instead.

Prompt for agents
In abx_plugins/plugins/defuddle/on_Snapshot__57_defuddle.py, either:

Option A (preferred - match sibling plugins): Replace the argparse usage with rich_click to match the pattern in readability and mercury plugins. Change the dependency from "click" to "rich-click", replace "import argparse" with "import rich_click as click", and convert the main() function to use @click.command() and @click.option() decorators instead of argparse.ArgumentParser.

Option B: If argparse is intentionally used, remove the unused "click" dependency from the inline script metadata (lines 4-6), changing it to just:
# dependencies = []
or removing the dependencies key entirely.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add defuddle as an alternative to readabiltiy / mercury

2 participants