Merged
50 commits
e743369
lots of fixes
pirate Feb 26, 2026
9c4caf5
cleanup readme
pirate Feb 26, 2026
f2a5e1e
more chrome util deduping
pirate Feb 26, 2026
007c5ac
fix papersdl assertions
pirate Feb 26, 2026
532baa2
cleanup model_rebuilds
pirate Feb 26, 2026
fe96c9a
cleanup model_rebuilds
pirate Feb 26, 2026
9fdfc71
more test fixes
pirate Feb 26, 2026
57b4c74
more chrome utils and test improvements
pirate Feb 26, 2026
35e552d
more chrome utils and test improvements
pirate Feb 26, 2026
5cb0866
cleanup fixtures for pytest
pirate Feb 26, 2026
94b748d
explicitly add fixtures to tests that need them
pirate Feb 26, 2026
b0a99f2
use real urls for dns test
pirate Feb 26, 2026
2f09cbf
captcha test tweaks
pirate Feb 26, 2026
54f3b11
test fixes
pirate Feb 28, 2026
2167523
format
pirate Feb 28, 2026
75218bc
Update abx_plugins/plugins/gallerydl/tests/test_gallerydl.py
pirate Feb 28, 2026
170a39f
Update abx_plugins/plugins/singlefile/on_Snapshot__50_singlefile.py
pirate Feb 28, 2026
45cb68b
Update abx_plugins/plugins/singlefile/singlefile_extension_save.js
pirate Feb 28, 2026
1048604
Merge branch 'main' into refactors
pirate Feb 28, 2026
b38fefc
cubic fixes
pirate Feb 28, 2026
617333b
fix parallel tests
pirate Feb 28, 2026
80bebe0
fix missing dir and replace requests with stdlib
pirate Feb 28, 2026
bf20563
fix hooks and abx-pkg version
pirate Feb 28, 2026
7c32880
fix python version
pirate Feb 28, 2026
2a335cd
bump python version
pirate Feb 28, 2026
55415ca
bump plugins version
pirate Feb 28, 2026
59758bc
env fixes for tests
pirate Feb 28, 2026
16154c0
more test fixes
pirate Feb 28, 2026
399ab47
test fixes
pirate Feb 28, 2026
d1f3f29
env var fixes
pirate Feb 28, 2026
80bacc4
make more tests static
pirate Feb 28, 2026
843ae52
more fixes
pirate Feb 28, 2026
558fc30
mercury improvement
pirate Feb 28, 2026
1baa20b
formatting
pirate Feb 28, 2026
729c0a5
fix wget and headers
pirate Feb 28, 2026
8596571
fix seo test determinism
pirate Feb 28, 2026
092fbc6
fix tests
pirate Feb 28, 2026
a5c0360
more consolidation of plugin chrome utils
pirate Feb 28, 2026
b6e1fbf
test fixes
pirate Feb 28, 2026
eab1f72
more consolidation of plugin chrome utils
pirate Mar 1, 2026
78f0285
fix timeout
pirate Mar 1, 2026
b1538c1
more extension fixes
pirate Mar 1, 2026
4566301
fix timeout probe
pirate Mar 1, 2026
ac85528
make ytdlp test deterministic
pirate Mar 1, 2026
d69d969
Update abx_plugins/plugins/favicon/on_Snapshot__11_favicon.bg.py
pirate Mar 1, 2026
ccdbe3f
cubic comments
pirate Mar 1, 2026
91548aa
lint fixes
pirate Mar 1, 2026
f47ab41
fix missing import
pirate Mar 1, 2026
95839b3
fix race on chrome tab setup
pirate Mar 1, 2026
0cff700
allow env provider for wget test
pirate Mar 1, 2026
13 changes: 10 additions & 3 deletions .github/workflows/test-parallel.yml
@@ -40,7 +40,7 @@ jobs:

plugin=$(echo $test_file | sed 's|abx_plugins/plugins/\([^/]*\)/.*|\1|')
test_name=$(basename $test_file .py | sed 's/^test_//')
name="plugin/$plugin/$test_name"
name="$test_name"

json_array+="{\"path\":\"$test_file\",\"name\":\"$name\"}"
done
@@ -93,13 +93,20 @@ jobs:

- uses: awalsh128/cache-apt-pkgs-action@latest
with:
packages: git ripgrep build-essential python3-dev python3-setuptools libssl-dev libldap2-dev libsasl2-dev zlib1g-dev libatomic1 python3-minimal gnupg2 curl wget python3-ldap python3-msgpack python3-mutagen python3-regex python3-pycryptodome procps
packages: git wget ripgrep build-essential python3-dev python3-setuptools libssl-dev libldap2-dev libsasl2-dev zlib1g-dev libatomic1 python3-minimal gnupg2 curl python3-ldap python3-msgpack python3-mutagen python3-regex python3-pycryptodome procps
version: 1.1

- name: Install dependencies with uv
run: |
uv venv
uv sync --dev --all-extras
uv pip install -e ".[dev]"

- name: Run test - ${{ matrix.test.name }}
run: |
uv run pytest -xvs "${{ matrix.test.path }}" --basetemp=tests/out
uv run pytest -xvs "${{ matrix.test.path }}" --basetemp="$RUNNER_TEMP/pytest-out"
env:
TWOCAPTCHA_API_KEY: ${{ secrets.TWOCAPTCHA_API_KEY }}
CHROME_ARGS_EXTRA: '["--no-sandbox"]'
CHROME_HEADLESS: "True"
CHROME_BINARY: "/usr/bin/chromium"
81 changes: 77 additions & 4 deletions README.md
@@ -1,6 +1,6 @@
# abx-plugins

ArchiveBox-compatible plugin suite (hooks, config schemas, binaries manifests).
ArchiveBox-compatible plugin suite (hooks and config schemas).

This package contains only plugin assets and a tiny helper to locate them.
It does **not** depend on Django or ArchiveBox.
@@ -11,7 +11,7 @@ It does **not** depend on Django or ArchiveBox.
from abx_plugins import get_plugins_dir

plugins_dir = get_plugins_dir()
# scan plugins_dir for plugins/*/config.json, binaries.jsonl, on_* hooks
# scan plugins_dir for plugins/*/config.json and on_* hooks
```

Tools like `abx-dl` and ArchiveBox can discover plugins from this package
@@ -23,8 +23,9 @@ without symlinks or environment-variable tricks.

Each plugin lives under `plugins/<name>/` and may include:

- `config.json` (optional) - config schema
- `on_*` hook scripts (required to do work)
- `config.json` config schema
- `on_Crawl__...` per-crawl hook scripts (optional) - install dependencies / set up shared resources
- `on_Snapshot__...` per-snapshot hooks - run one extraction step for each URL

Hooks run with:

@@ -42,6 +43,78 @@ Hooks run with:
- `PERSONAS_DIR` - persona profiles root (default: `~/.config/abx/personas`)
- `ACTIVE_PERSONA` - persona name (default: `Default`)
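
A Python hook might read these variables with small helpers mirroring the `getEnv`/`getEnvBool` utilities the JS hooks use (a sketch; the defaults shown are the ones documented above):

```python
import os
from pathlib import Path


def get_env(name: str, default: str = "") -> str:
    """Read an env var, trimming whitespace; fall back to default when unset or blank."""
    return os.environ.get(name, "").strip() or default


def get_env_bool(name: str, default: bool = False) -> bool:
    """Parse common truthy/falsy spellings; return default for anything else."""
    val = get_env(name).lower()
    if val in ("true", "1", "yes", "on"):
        return True
    if val in ("false", "0", "no", "off"):
        return False
    return default


def active_persona_dir() -> Path:
    # resolve the active persona profile directory from the two persona vars
    personas_dir = Path(get_env("PERSONAS_DIR", str(Path.home() / ".config" / "abx" / "personas")))
    return personas_dir / get_env("ACTIVE_PERSONA", "Default")
```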

### Install hook contract (concise)

Lifecycle:

1. `on_Crawl__*install*` declares crawl dependencies.
2. `on_Binary__*install*` resolves/installs one binary with one provider.

`on_Crawl` output (dependency declaration):

```json
{"type":"Binary","name":"yt-dlp","binproviders":"pip,brew,apt,env","overrides":{"pip":{"packages":["yt-dlp[default]"]}},"machine_id":"<optional>"}
```

`on_Binary` input/output:

- CLI input should accept `--binary-id`, `--machine-id`, `--name` (plus optional provider args).
- Output should emit installed facts like:

```json
{"type":"Binary","name":"yt-dlp","abspath":"/abs/path","version":"2025.01.01","sha256":"<optional>","binprovider":"pip","machine_id":"<recommended>","binary_id":"<recommended>"}
```

Optional machine patch record:

```json
{"type":"Machine","config":{"PATH":"...","NODE_MODULES_DIR":"...","CHROME_BINARY":"..."}}
```

Semantics:

- `stdout`: JSONL records only
- `stderr`: human logs/debug
- exit `0`: success or intentional skip
- exit non-zero: hard failure
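
Put together, a minimal `on_Crawl` install hook obeying these semantics might look like this sketch (the yt-dlp record mirrors the dependency-declaration example above; only the stdout/stderr/exit-code split is the point here):

```python
import json
import sys


def emit(record: dict) -> None:
    # stdout is reserved for JSONL records: one JSON object per line
    print(json.dumps(record), flush=True)


def log(msg: str) -> None:
    # human-readable diagnostics go to stderr, never stdout
    print(msg, file=sys.stderr)


def declare_dependencies() -> int:
    log("[*] declaring yt-dlp as a crawl dependency")
    emit({
        "type": "Binary",
        "name": "yt-dlp",
        "binproviders": "pip,brew,apt,env",
        "overrides": {"pip": {"packages": ["yt-dlp[default]"]}},
    })
    return 0  # exit 0 == success or intentional skip; non-zero == hard failure
```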

State/OS:

- working dir: `CRAWL_DIR/<plugin>/`
- durable install root: `LIB_DIR` (e.g. npm prefix, pip venv, puppeteer cache)
- providers: `apt` (Debian/Ubuntu), `brew` (macOS/Linux); many hooks currently assume POSIX paths

### Snapshot hook contract (concise)

Lifecycle:

- runs once per snapshot, typically after crawl setup
- common Chrome flow: crawl browser/session -> `chrome_tab` -> `chrome_navigate` -> downstream extractors

State:

- output cwd is usually `SNAP_DIR/<plugin>/`
- hooks may read sibling outputs via `../<plugin>/...`

Output records:

- terminal record is usually:

```json
{"type":"ArchiveResult","status":"succeeded|skipped|failed","output_str":"path-or-message"}
```

- discovery hooks may also emit `Snapshot` and `Tag` records before `ArchiveResult`
- search indexing hooks are a known exception and may use exit code + stderr without `ArchiveResult`

Semantics:

- `stdout`: JSONL records
- `stderr`: diagnostics/logging
- exit `0`: succeeded or skipped
- exit non-zero: failed
- current nuance: some skip/transient paths emit no JSONL and rely only on exit code
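
The status-to-exit-code mapping above can be captured in a small helper that a snapshot hook calls as its last step (a sketch; `emit_archive_result` is a hypothetical name, not an existing utility):

```python
import json


def emit_archive_result(status: str, output_str: str) -> int:
    """Write the terminal ArchiveResult record and return the matching exit code."""
    assert status in ("succeeded", "skipped", "failed")
    print(json.dumps({
        "type": "ArchiveResult",
        "status": status,
        "output_str": output_str,
    }), flush=True)
    # exit 0 covers both succeeded and skipped; only a failure exits non-zero
    return 0 if status in ("succeeded", "skipped") else 1
```

A hook would end with `sys.exit(emit_archive_result(status, output))`, keeping the JSONL record and the exit code consistent by construction.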

### Event JSONL interface (bbus-style, no dependency)

Hooks emit JSONL events to stdout. They do **not** need to import `bbus`.
3 changes: 1 addition & 2 deletions abx_plugins/__init__.py
@@ -3,12 +3,11 @@
from __future__ import annotations

from pathlib import Path
from importlib import resources


def get_plugins_dir() -> Path:
"""Return the filesystem path to the bundled plugins directory."""
return Path(resources.files(__name__) / "plugins")
return Path(__file__).resolve().parent / "plugins"


__all__ = ["get_plugins_dir"]
113 changes: 21 additions & 92 deletions abx_plugins/plugins/accessibility/on_Snapshot__39_accessibility.js
@@ -20,6 +20,14 @@ const path = require('path');
// Add NODE_MODULES_DIR to module resolution paths if set
if (process.env.NODE_MODULES_DIR) module.paths.unshift(process.env.NODE_MODULES_DIR);
const puppeteer = require('puppeteer-core');
const {
getEnvBool,
getEnvInt,
parseArgs,
readCdpUrl,
connectToPage,
waitForPageLoaded,
} = require('../chrome/chrome_utils.js');

// Extractor metadata
const PLUGIN_NAME = 'accessibility';
@@ -32,100 +40,27 @@ if (!fs.existsSync(OUTPUT_DIR)) {
process.chdir(OUTPUT_DIR);
const OUTPUT_FILE = 'accessibility.json';
const CHROME_SESSION_DIR = '../chrome';
const CHROME_SESSION_REQUIRED_ERROR = 'No Chrome session found (chrome plugin must run first)';

// Parse command line arguments
function parseArgs() {
const args = {};
process.argv.slice(2).forEach(arg => {
if (arg.startsWith('--')) {
const [key, ...valueParts] = arg.slice(2).split('=');
args[key.replace(/-/g, '_')] = valueParts.join('=') || true;
}
});
return args;
}

// Get environment variable with default
function getEnv(name, defaultValue = '') {
return (process.env[name] || defaultValue).trim();
}

function getEnvBool(name, defaultValue = false) {
const val = getEnv(name, '').toLowerCase();
if (['true', '1', 'yes', 'on'].includes(val)) return true;
if (['false', '0', 'no', 'off'].includes(val)) return false;
return defaultValue;
}

// Wait for chrome tab to be fully loaded
async function waitForChromeTabLoaded(timeoutMs = 60000) {
const navigationFile = path.join(CHROME_SESSION_DIR, 'navigation.json');
const startTime = Date.now();

while (Date.now() - startTime < timeoutMs) {
if (fs.existsSync(navigationFile)) {
return true;
}
// Wait 100ms before checking again
await new Promise(resolve => setTimeout(resolve, 100));
}

return false;
}

// Get CDP URL from chrome plugin
function getCdpUrl() {
const cdpFile = path.join(CHROME_SESSION_DIR, 'cdp_url.txt');
if (fs.existsSync(cdpFile)) {
return fs.readFileSync(cdpFile, 'utf8').trim();
}
return null;
}

function assertChromeSession() {
const cdpFile = path.join(CHROME_SESSION_DIR, 'cdp_url.txt');
const targetIdFile = path.join(CHROME_SESSION_DIR, 'target_id.txt');
const pidFile = path.join(CHROME_SESSION_DIR, 'chrome.pid');
if (!fs.existsSync(cdpFile) || !fs.existsSync(targetIdFile) || !fs.existsSync(pidFile)) {
throw new Error(CHROME_SESSION_REQUIRED_ERROR);
}
try {
const pid = parseInt(fs.readFileSync(pidFile, 'utf8').trim(), 10);
if (!pid || Number.isNaN(pid)) throw new Error('Invalid pid');
process.kill(pid, 0);
} catch (e) {
throw new Error(CHROME_SESSION_REQUIRED_ERROR);
}
const cdpUrl = getCdpUrl();
if (!cdpUrl) {
throw new Error(CHROME_SESSION_REQUIRED_ERROR);
}
return cdpUrl;
}

// Extract accessibility info
async function extractAccessibility(url) {
async function extractAccessibility(url, timeoutMs) {
// Output directory is current directory (hook already runs in output dir)
const outputPath = path.join(OUTPUT_DIR, OUTPUT_FILE);

let browser = null;

try {
// Connect to existing Chrome session
const cdpUrl = assertChromeSession();
if (!readCdpUrl(CHROME_SESSION_DIR)) {
return { success: false, error: 'No Chrome session found (chrome plugin must run first)' };
}

browser = await puppeteer.connect({
browserWSEndpoint: cdpUrl,
const connection = await connectToPage({
chromeSessionDir: CHROME_SESSION_DIR,
timeoutMs,
puppeteer,
});

// Get the page
const pages = await browser.pages();
const page = pages.find(p => p.url().startsWith('http')) || pages[0];

if (!page) {
return { success: false, error: 'No page found in Chrome session' };
}
browser = connection.browser;
const page = connection.page;
await waitForPageLoaded(CHROME_SESSION_DIR, timeoutMs * 4, 200);

// Get accessibility snapshot
const accessibilityTree = await page.accessibility.snapshot({ interestingOnly: true });
@@ -250,14 +185,8 @@ async function main() {
process.exit(0);
}

// Check if Chrome session exists, then wait for page load
assertChromeSession();
const pageLoaded = await waitForChromeTabLoaded(60000);
if (!pageLoaded) {
throw new Error('Page not loaded after 60s (chrome_navigate must complete first)');
}

const result = await extractAccessibility(url);
const timeoutMs = getEnvInt('ACCESSIBILITY_TIMEOUT', getEnvInt('TIMEOUT', 30)) * 1000;
const result = await extractAccessibility(url, timeoutMs);

if (result.success) {
status = 'succeeded';