Add defuddle extractor plugin as an alternative to Readability/Mercury #4
Co-authored-by: pirate <511499+pirate@users.noreply.github.com>
@copilot It should parse from the existing best HTML output like readability/mercury do instead of re-downloading the page from scratch using the URL. Please fix.
Implemented in 7a96232. The defuddle snapshot hook now mirrors readability behavior by discovering existing local HTML outputs (singlefile/dom/wget) and running |
@copilot your tests are completely bullshit fake tests. They need to actually install the real binary and use the real binary against a pytest-httpserver test site and assert the extracted output is correctly saved. Look at how we do pytest-httpserver based tests in the other plugins. Hit the real code paths that users hit, do not mock or monkeypatch ANYTHING, you must test the real code paths just as a user would run them. You are not allowed to skip install / running the binary in tests, part of the functionality these plugins provide is the install process, so the tests must verify that our on_Crawl install hook works correctly just like we do in other plugins.
Addressed in 708f8ca. I replaced the mocked extraction test with real integration coverage: tests now install the real |
@@ -0,0 +1,277 @@
import json
🟡 Missing tests/__init__.py breaks convention required for test discovery
The defuddle/tests/ directory is missing an __init__.py file, while every other plugin's test directory in the repository (37+ plugins) includes one. Without this file, the tests directory is not a proper Python package, which can cause import resolution failures depending on how pytest discovers and imports test modules. For example, if two different plugins have a test_foo.py without __init__.py, pytest can fail with duplicate module name collisions.
Prompt for agents
Create an empty file at abx_plugins/plugins/defuddle/tests/__init__.py to match the convention used by all other plugin test directories in the repository. Every other plugin under abx_plugins/plugins/*/tests/ has an __init__.py file.
# dependencies = [
#     "click",
# ]
🟡 Script metadata declares unused click dependency while code uses argparse
The PEP 723 inline script metadata at lines 4-6 declares click as a dependency (# dependencies = ["click"]), but the actual code imports and uses argparse (line 11) — a stdlib module that requires no dependency declaration. This is inconsistent with sibling plugins like readability (on_Snapshot__56_readability.py) and mercury (on_Snapshot__57_mercury.py) which declare click/rich-click and actually use rich_click as click with @click.command() decorators. When run via uv run --script, this causes unnecessary installation of click. More importantly, this suggests the code was meant to use click (matching the established pattern) but was accidentally written with argparse instead.
Prompt for agents
In abx_plugins/plugins/defuddle/on_Snapshot__57_defuddle.py, either:
Option A (preferred - match sibling plugins): Replace the argparse usage with rich_click to match the pattern in readability and mercury plugins. Change the dependency from "click" to "rich-click", replace "import argparse" with "import rich_click as click", and convert the main() function to use @click.command() and @click.option() decorators instead of argparse.ArgumentParser.
Option B: If argparse is intentionally used, remove the unused "click" dependency from the inline script metadata (lines 4-6), changing it to just:
# dependencies = []
or removing the dependencies key entirely.
Summary by cubic
Adds a Defuddle-based content extractor plugin as an alternative to Readability/Mercury using the existing parser interface. The snapshot hook parses local SingleFile/DOM/wget HTML instead of fetching, with real-binary integration tests added.
Written for commit 708f8ca. Summary will update on new commits.