News monitoring pipeline — collect RSS feeds, extract full articles, search by meaning, and track page changes. Built entirely from QuartzUnit libraries.
```mermaid
flowchart LR
    A["🔗 feedkit\n444 RSS feeds"] -->|"article URLs"| B["📄 markgrab\nHTML → markdown"]
    B -->|"markdown files"| C["🔍 embgrep\nsemantic index"]
    C -->|"tracked pages"| D["📊 diffgrab\nchange detection"]
```
```shell
pip install newswatch

# Subscribe to tech feeds from the built-in catalog
newswatch setup -c technology

# Run the full pipeline: collect → extract → index
newswatch run

# Search collected articles by meaning
newswatch search "kubernetes scaling strategies"
```

- Collect — Subscribes to RSS/Atom feeds via feedkit (444 curated feeds built-in)
- Extract — Fetches full article content via markgrab (HTML → clean markdown)
- Index — Builds a local semantic search index via embgrep (embedding-powered, no API keys)
- Track — Monitors pages for changes via diffgrab (structured diffs)
No cloud services, no API keys. Everything runs locally.
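The four stages compose as a straightforward data pipeline: feeds yield article URLs, URLs yield markdown, markdown gets indexed. A toy sketch with stand-in functions (not the actual feedkit/markgrab/embgrep APIs) shows the shape of that data flow:

```python
# Toy pipeline: each function stands in for the real library call.
def collect(feeds):
    # feedkit stage: RSS feeds -> article URLs
    return [url for feed in feeds for url in feed["articles"]]

def extract(urls):
    # markgrab stage: article URL -> markdown text (mocked here)
    return {url: f"# Article at {url}" for url in urls}

def index(docs):
    # embgrep stage: markdown -> searchable index
    # (naive keyword index here; the real one is embedding-based)
    inverted = {}
    for url, text in docs.items():
        for word in text.lower().split():
            inverted.setdefault(word, set()).add(url)
    return inverted

feeds = [{"articles": ["https://a.example/1", "https://b.example/2"]}]
idx = index(extract(collect(feeds)))
print(sorted(idx["article"]))
```

Each stage consumes the previous stage's output, which is why `newswatch run` can execute them as one command.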
Subscribe to feeds.

```shell
newswatch setup -c technology               # all 68 tech feeds
newswatch setup -c science -c finance       # multiple categories
newswatch setup -f https://example.com/rss  # individual URL
```

Run the full pipeline.

```shell
newswatch run                         # collect → extract → index
newswatch run -n 100                  # extract up to 100 articles
newswatch run -t https://example.com  # also track this page for changes
```

Output:
```
Running newswatch pipeline...

Pipeline Results
┌─────────────────────┬────────┐
│ Step                │ Result │
├─────────────────────┼────────┤
│ Feeds collected     │     62 │
│ New articles        │    418 │
│ Articles extracted  │     50 │
│ Articles indexed    │     50 │
└─────────────────────┴────────┘
```
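The "New articles" count covers only URLs not seen in earlier runs. A minimal illustration of that dedup step (not newswatch's actual implementation):

```python
# Track previously seen article URLs so repeat runs only report new ones.
seen = set()

def new_articles(urls):
    fresh = [u for u in urls if u not in seen]
    seen.update(fresh)
    return fresh

print(len(new_articles(["u1", "u2", "u3"])))  # 3 on the first run
print(len(new_articles(["u2", "u3", "u4"])))  # 1 on the second run
```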
Semantic search across collected articles.

```shell
newswatch search "AI regulation in Europe"
newswatch search "supply chain attacks" -n 10
```

```python
import asyncio

from newswatch import NewsPipeline

async def main():
    pipeline = NewsPipeline()

    # Subscribe to feeds
    await pipeline.setup(categories=["technology", "science"])

    # Run full pipeline
    result = await pipeline.run(extract_limit=100)
    print(f"{result.articles_new} new, {result.articles_indexed} indexed")

    # Semantic search
    results = pipeline.search("quantum computing breakthroughs")
    for r in results:
        print(f"  [{r['score']}] {r['text'][:80]}")

    pipeline.close()

asyncio.run(main())
```

```mermaid
flowchart TD
    A["🔗 feedkit\n444 curated feeds"] -->|"article URLs"| B["📄 markgrab\nHTML → clean markdown\nhttpx → Playwright fallback"]
    B -->|"markdown files"| C["🔍 embgrep\nEmbed chunks → SQLite vector index\nSmart chunking · heading-level"]
    C -->|"indexed articles"| D["📊 diffgrab\nTrack pages for changes\nStructured diffs + section analysis"]
    style A fill:#1a1a2e,stroke:#e94560,color:#fff
    style B fill:#1a1a2e,stroke:#0f3460,color:#fff
    style C fill:#1a1a2e,stroke:#533483,color:#fff
    style D fill:#1a1a2e,stroke:#e94560,color:#fff
```
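Searching "by meaning" works by ranking stored chunks against the query in embedding space. A self-contained sketch using toy vectors and cosine similarity (embgrep's actual model and SQLite storage differ; the vectors here are hand-picked for illustration):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product normalized by vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy "embeddings": in embgrep these come from a local fastembed model.
chunks = {
    "kubernetes autoscaling": [0.9, 0.1, 0.0],
    "sourdough starter tips": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.1]

# Rank chunks by similarity to the query, best match first
ranked = sorted(chunks, key=lambda c: cosine(query, chunks[c]), reverse=True)
print(ranked[0])  # -> kubernetes autoscaling
```

This is why no API keys are needed: both embedding and ranking happen locally.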
Data is stored in `~/.newswatch/` by default:

```
~/.newswatch/
├── feeds.db      # feedkit subscriptions + articles
├── index.db      # embgrep semantic index
├── tracker.db    # diffgrab snapshots
└── extracted/    # markgrab markdown output
```
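The default-vs-custom behavior amounts to a simple path fallback. A hypothetical helper (`resolve_db_dir` is illustrative, not part of newswatch) sketches it:

```python
from pathlib import Path

def resolve_db_dir(db_dir=None):
    # Hypothetical helper: fall back to ~/.newswatch when no path is given
    return Path(db_dir) if db_dir is not None else Path.home() / ".newswatch"

print(resolve_db_dir("/tmp/newsdata"))
print(resolve_db_dir())
```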
Custom location:

```python
pipeline = NewsPipeline(db_dir="/path/to/data")
```

| Library | Role in newswatch | PyPI |
|---|---|---|
| feedkit | RSS/Atom feed collection (444 curated feeds) | `pip install feedkit` |
| markgrab | URL → LLM-ready markdown extraction | `pip install markgrab` |
| embgrep | Local semantic search (fastembed + SQLite) | `pip install embgrep` |
| diffgrab | Web page change tracking + structured diffs | `pip install diffgrab` |