News monitoring pipeline — collect RSS feeds, extract full articles, search by meaning, and track page changes. Built entirely from QuartzUnit libraries.
```mermaid
flowchart LR
    A["🔗 feedkit\n444 RSS feeds"] -->|"article URLs"| B["📄 markgrab\nHTML → markdown"]
    B -->|"markdown files"| C["🔍 embgrep\nsemantic index"]
    C -->|"tracked pages"| D["📊 diffgrab\nchange detection"]
```
```shell
pip install newswatch

# Subscribe to tech feeds from the built-in catalog
newswatch setup -c technology

# Run the full pipeline: collect → extract → index
newswatch run

# Search collected articles by meaning
newswatch search "kubernetes scaling strategies"
```

- Collect — Subscribes to RSS/Atom feeds via feedkit (444 curated feeds built-in)
- Extract — Fetches full article content via markgrab (HTML → clean markdown)
- Index — Builds a local semantic search index via embgrep (embedding-powered, no API keys)
- Track — Monitors pages for changes via diffgrab (structured diffs)
No cloud services, no API keys. Everything runs locally.
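The four stages compose as a straightforward data pipeline: feeds yield article URLs, URLs yield markdown, markdown gets indexed. A toy sketch with stand-in functions (not the actual feedkit/markgrab/embgrep APIs) shows the shape of that data flow:

```python
# Toy pipeline: each function stands in for the real library call.
def collect(feeds):
    # feedkit stage: RSS feeds -> article URLs
    return [url for feed in feeds for url in feed["articles"]]

def extract(urls):
    # markgrab stage: article URL -> markdown text (mocked here)
    return {url: f"# Article at {url}" for url in urls}

def index(docs):
    # embgrep stage: markdown -> searchable index
    # (naive keyword index here; the real one is embedding-based)
    inverted = {}
    for url, text in docs.items():
        for word in text.lower().split():
            inverted.setdefault(word, set()).add(url)
    return inverted

feeds = [{"articles": ["https://a.example/1", "https://b.example/2"]}]
idx = index(extract(collect(feeds)))
print(sorted(idx["article"]))
```

Each stage consumes the previous stage's output, which is why `newswatch run` can execute them as one command.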
Subscribe to feeds.

```shell
newswatch setup -c technology               # all 68 tech feeds
newswatch setup -c science -c finance       # multiple categories
newswatch setup -f https://example.com/rss  # individual URL
```

Run the full pipeline.

```shell
newswatch run                         # collect → extract → index
newswatch run -n 100                  # extract up to 100 articles
newswatch run -t https://example.com  # also track this page for changes
```

Output:
```
Running newswatch pipeline...

Pipeline Results
┌─────────────────────┬────────┐
│ Step                │ Result │
├─────────────────────┼────────┤
│ Feeds collected     │     62 │
│ New articles        │    418 │
│ Articles extracted  │     50 │
│ Articles indexed    │     50 │
└─────────────────────┴────────┘
```
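The "New articles" count covers only URLs not seen in earlier runs. A minimal illustration of that dedup step (not newswatch's actual implementation):

```python
# Track previously seen article URLs so repeat runs only report new ones.
seen = set()

def new_articles(urls):
    fresh = [u for u in urls if u not in seen]
    seen.update(fresh)
    return fresh

print(len(new_articles(["u1", "u2", "u3"])))  # 3 on the first run
print(len(new_articles(["u2", "u3", "u4"])))  # 1 on the second run
```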
Semantic search across collected articles.

```shell
newswatch search "AI regulation in Europe"
newswatch search "supply chain attacks" -n 10
```

```python
import asyncio

from newswatch import NewsPipeline

async def main():
    pipeline = NewsPipeline()

    # Subscribe to feeds
    await pipeline.setup(categories=["technology", "science"])

    # Run full pipeline
    result = await pipeline.run(extract_limit=100)
    print(f"{result.articles_new} new, {result.articles_indexed} indexed")

    # Semantic search
    results = pipeline.search("quantum computing breakthroughs")
    for r in results:
        print(f"  [{r['score']}] {r['text'][:80]}")

    pipeline.close()

asyncio.run(main())
```

```mermaid
flowchart TD
    A["🔗 feedkit\n444 curated feeds"] -->|"article URLs"| B["📄 markgrab\nHTML → clean markdown\nhttpx → Playwright fallback"]
    B -->|"markdown files"| C["🔍 embgrep\nEmbed chunks → SQLite vector index\nSmart chunking · heading-level"]
    C -->|"indexed articles"| D["📊 diffgrab\nTrack pages for changes\nStructured diffs + section analysis"]
    style A fill:#1a1a2e,stroke:#e94560,color:#fff
    style B fill:#1a1a2e,stroke:#0f3460,color:#fff
    style C fill:#1a1a2e,stroke:#533483,color:#fff
    style D fill:#1a1a2e,stroke:#e94560,color:#fff
```
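Searching "by meaning" works by ranking stored chunks against the query in embedding space. A self-contained sketch using toy vectors and cosine similarity (embgrep's actual model and SQLite storage differ; the vectors here are hand-picked for illustration):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product normalized by vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy "embeddings": in embgrep these come from a local fastembed model.
chunks = {
    "kubernetes autoscaling": [0.9, 0.1, 0.0],
    "sourdough starter tips": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.1]

# Rank chunks by similarity to the query, best match first
ranked = sorted(chunks, key=lambda c: cosine(query, chunks[c]), reverse=True)
print(ranked[0])  # -> kubernetes autoscaling
```

This is why no API keys are needed: both embedding and ranking happen locally.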
Data is stored in `~/.newswatch/` by default:

```
~/.newswatch/
├── feeds.db      # feedkit subscriptions + articles
├── index.db      # embgrep semantic index
├── tracker.db    # diffgrab snapshots
└── extracted/    # markgrab markdown output
```
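The default-vs-custom behavior amounts to a simple path fallback. A hypothetical helper (`resolve_db_dir` is illustrative, not part of newswatch) sketches it:

```python
from pathlib import Path

def resolve_db_dir(db_dir=None):
    # Hypothetical helper: fall back to ~/.newswatch when no path is given
    return Path(db_dir) if db_dir is not None else Path.home() / ".newswatch"

print(resolve_db_dir("/tmp/newsdata"))
print(resolve_db_dir())
```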
Custom location:

```python
pipeline = NewsPipeline(db_dir="/path/to/data")
```

| Library | Role in newswatch | PyPI |
|---|---|---|
| feedkit | RSS/Atom feed collection (444 curated feeds) | `pip install feedkit` |
| markgrab | URL → LLM-ready markdown extraction | `pip install markgrab` |
| embgrep | Local semantic search (fastembed + SQLite) | `pip install embgrep` |
| diffgrab | Web page change tracking + structured diffs | `pip install diffgrab` |