## Summary
Create a mechanism to fetch and update schemas from a remote registry, allowing schema fixes without library releases.
## Design Overview

### Remote Registry

Host community-maintained schemas at:

- GitHub repo: `MALathon/fetcharoo-schemas`
- Or a JSON endpoint: `https://fetcharoo.io/schemas/v1/registry.json`
### Registry Format

```json
{
  "version": "1.0.0",
  "updated": "2025-01-15T00:00:00Z",
  "schemas": {
    "springer_book": {
      "version": "1.2.0",
      "url_pattern": "https?://link\\.springer\\.com/book/.*",
      "description": "Springer book with chapters",
      "include_patterns": ["*.pdf"],
      "exclude_patterns": ["*bbm*", "*bfm*"],
      "sort_by": "numeric",
      "recommended_depth": 1,
      "request_delay": 1.0,
      "test_url": "https://link.springer.com/book/10.1007/978-3-031-41026-0",
      "expected_min_pdfs": 5
    }
  }
}
```
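Entries should be checked for shape before being registered. As a minimal sketch (the `validate_schema_entry` helper and the exact set of required fields are assumptions, not part of the current design):

```python
# Hypothetical validator for a single registry entry. Field names follow
# the registry format above; the real SiteSchema fields may differ.
REQUIRED_FIELDS = {"version": str, "url_pattern": str, "description": str}

def validate_schema_entry(name: str, entry: dict) -> list:
    """Return a list of problems; an empty list means the entry looks valid."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"{name}: missing required field '{field}'")
        elif not isinstance(entry[field], expected_type):
            problems.append(f"{name}: '{field}' should be {expected_type.__name__}")
    return problems

entry = {
    "version": "1.2.0",
    "url_pattern": r"https?://link\.springer\.com/book/.*",
    "description": "Springer book with chapters",
}
print(validate_schema_entry("springer_book", entry))  # []
```

Rejecting malformed entries early keeps one bad remote schema from breaking the rest of the registry load.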
### Local Cache

```
~/.cache/fetcharoo/
├── schemas.json       # Cached remote schemas
└── schemas.meta.json  # Cache metadata (last updated, etag)
```
### Update Flow

```python
# fetcharoo/schemas/remote.py
import json
from datetime import datetime, timedelta
from pathlib import Path

import requests

# get_schema, register_schema, SiteSchema, and UpdateResult come from
# fetcharoo's schema registry module.

REMOTE_REGISTRY_URL = "https://raw.githubusercontent.com/MALathon/fetcharoo-schemas/main/registry.json"
CACHE_DIR = Path.home() / '.cache' / 'fetcharoo'
CACHE_MAX_AGE = timedelta(days=7)


def _version_key(version: str) -> tuple:
    """Turn '1.10.0' into (1, 10, 0) so versions compare numerically;
    plain string comparison would rank '1.10.0' below '1.2.0'."""
    return tuple(int(part) for part in version.split('.') if part.isdigit())


def update_schemas(force: bool = False) -> UpdateResult:
    """
    Fetch the latest schemas from the remote registry.

    Args:
        force: Update even if the cache is fresh.

    Returns:
        UpdateResult with counts of new/updated schemas.
    """
    cache_file = CACHE_DIR / 'schemas.json'
    meta_file = CACHE_DIR / 'schemas.meta.json'

    # Check cache freshness; both the cache and its metadata must exist
    if not force and cache_file.exists() and meta_file.exists():
        meta = json.loads(meta_file.read_text())
        cached_at = datetime.fromisoformat(meta['cached_at'])
        if datetime.now() - cached_at < CACHE_MAX_AGE:
            return UpdateResult(from_cache=True)

    # Fetch the remote registry
    response = requests.get(REMOTE_REGISTRY_URL, timeout=30)
    response.raise_for_status()
    registry = response.json()

    # Save the cache
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(registry))
    meta_file.write_text(json.dumps({
        'cached_at': datetime.now().isoformat(),
        'etag': response.headers.get('etag'),
    }))

    # Register schemas
    new_count = 0
    updated_count = 0
    for name, schema_data in registry['schemas'].items():
        existing = get_schema(name)
        if existing:
            # Update (re-register) only if the remote version is newer
            if _version_key(schema_data.get('version', '0')) > _version_key(existing.version):
                register_schema(SiteSchema(**schema_data), overwrite=True)
                updated_count += 1
        else:
            register_schema(SiteSchema(**schema_data))
            new_count += 1
    return UpdateResult(new=new_count, updated=updated_count)


def load_cached_schemas() -> int:
    """Load schemas from the local cache without a network request."""
    cache_file = CACHE_DIR / 'schemas.json'
    if not cache_file.exists():
        return 0
    registry = json.loads(cache_file.read_text())
    count = 0
    for name, schema_data in registry['schemas'].items():
        if not get_schema(name):  # Don't overwrite built-in schemas
            register_schema(SiteSchema(**schema_data))
            count += 1
    return count
```
## CLI Commands

```shell
# Update schemas from the remote registry
$ fetcharoo --update-schemas
Fetching schemas from remote registry...
Updated 2 schemas, added 1 new schema.

# Force an update (ignore the cache)
$ fetcharoo --update-schemas --force

# Use remote schemas (auto-loads from cache)
$ fetcharoo --use-remote-schemas https://example.com --schema auto
```
## Opt-in Behavior

Remote schemas should be opt-in:

```python
# Explicit in code
from fetcharoo.schemas import load_remote_schemas
load_remote_schemas()  # Loads from cache, updates if stale
```

```shell
# Or via an environment variable
FETCHAROO_USE_REMOTE_SCHEMAS=1 fetcharoo https://...

# Or a CLI flag
fetcharoo --use-remote-schemas https://...
```
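The three opt-in paths can collapse into one check. A minimal sketch, assuming a hypothetical `remote_schemas_enabled` helper (the accepted env-var values are an assumption):

```python
import os

def remote_schemas_enabled(cli_flag: bool = False) -> bool:
    """Remote schemas stay off unless explicitly requested via the CLI
    flag or the FETCHAROO_USE_REMOTE_SCHEMAS environment variable."""
    env = os.environ.get("FETCHAROO_USE_REMOTE_SCHEMAS", "")
    return cli_flag or env.lower() in ("1", "true", "yes")

os.environ.pop("FETCHAROO_USE_REMOTE_SCHEMAS", None)
print(remote_schemas_enabled())               # False: opt-in, never the default
print(remote_schemas_enabled(cli_flag=True))  # True
```

Library callers who use `load_remote_schemas()` directly bypass this check, since an explicit call is itself the opt-in.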
## Community Contribution

Separate repo for schema contributions:

```
fetcharoo-schemas/
├── registry.json        # Combined registry
├── schemas/
│   ├── springer.yaml
│   ├── arxiv.yaml
│   ├── ieee.yaml
│   └── ...
├── tests/               # Schema validation tests
└── .github/
    └── workflows/
        └── validate.yml # Auto-validate on PR
```
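One check `validate.yml` could run on each PR: every schema's `url_pattern` must compile, and its own `test_url` must match it. A sketch under those assumptions (the `check_registry` helper is hypothetical):

```python
import json
import re

def check_registry(registry: dict) -> list:
    """Return a list of errors found in a parsed registry.json."""
    errors = []
    for name, entry in registry.get("schemas", {}).items():
        try:
            compiled = re.compile(entry.get("url_pattern", ""))
        except re.error as exc:
            errors.append(f"{name}: url_pattern does not compile ({exc})")
            continue
        test_url = entry.get("test_url")
        # Every schema should ship a test_url that its own pattern matches
        if test_url and not compiled.match(test_url):
            errors.append(f"{name}: test_url does not match url_pattern")
    return errors

registry = json.loads("""{
  "schemas": {
    "springer_book": {
      "url_pattern": "https?://link\\\\.springer\\\\.com/book/.*",
      "test_url": "https://link.springer.com/book/10.1007/978-3-031-41026-0"
    }
  }
}""")
print(check_registry(registry))  # []
```

Failing the PR on a non-empty error list keeps broken patterns out of the combined registry without any network access in CI.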
## Tasks

- [ ] Create the `fetcharoo-schemas` repository
- [ ] Implement `update_schemas()` with caching
- [ ] Implement `load_cached_schemas()`
- [ ] Add the `--update-schemas` CLI command
- [ ] Add the `--use-remote-schemas` flag
- [ ] Support the `FETCHAROO_USE_REMOTE_SCHEMAS` env var
## Acceptance Criteria

- Can fetch schemas from a remote URL
- Caches schemas locally with a TTL
- `--update-schemas` refreshes the cache
- Built-in schemas take precedence over remote ones
- Works offline with cached schemas
- Clear opt-in mechanism (not auto-enabled)
## Security Considerations
- Only fetch from trusted URLs (configurable)
- Validate schema format before registering
- Don't execute arbitrary code from remote schemas
- Consider signing/checksums for integrity
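The checksum idea can be sketched with `hashlib`; how the expected digest is published (e.g. alongside the registry, or as a signed release asset) is an open design question, not decided here:

```python
import hashlib

def verify_checksum(payload: bytes, expected_sha256: str) -> bool:
    """Compare a downloaded registry payload against a published SHA-256
    digest before parsing or registering anything from it."""
    return hashlib.sha256(payload).hexdigest() == expected_sha256

payload = b'{"version": "1.0.0"}'
digest = hashlib.sha256(payload).hexdigest()
print(verify_checksum(payload, digest))      # True
print(verify_checksum(b"tampered", digest))  # False
```

A checksum only guards integrity, not authenticity; if the digest ships over the same channel as the registry, an attacker who controls that channel can replace both, which is the argument for signing.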
## Dependencies

Part of parent issue: #10