מדד האכיפה - Enforcement Index
denbust is evolving from a single-purpose news scanner into a small multi-dataset platform for TFHT
data jobs. Phase A introduced the shared platform spine. Phase B turns the first real dataset,
news_items, into an end-to-end operational flow with:
- normalized metadata records
- Supabase operational persistence
- privacy/review/suppression gating
- weekly public release bundle generation
- publication hooks for Kaggle and Hugging Face
- latest-backup upload hooks for Google Drive and S3-compatible object storage
Today, the implemented dataset/jobs are:
- `news_items` / `ingest`
- `news_items` / `release`
- `news_items` / `backup`
Planned future datasets:
- `docs_metadata`
- `open_docs_fulltext`
- `events`
- Scans Israeli news sources for enforcement activity: raids, arrests, closures, trafficking cases
- Uses RSS and browser-backed scrapers
- Classifies relevance with an LLM
- Deduplicates the same story across multiple sources
- Emits unified items via CLI or SMTP email for the ingest workflow
- Persists normalized `news_items` operational rows
- Builds metadata-only weekly release bundles
- Publishes release bundles to Kaggle and Hugging Face when configured
- Uploads the latest release bundle to Google Drive and S3-compatible object storage when configured
- Persists dataset/job-scoped seen state and per-run JSON snapshots
```bash
pip install -e ".[dev]"
python -m playwright install chromium
```

```bash
denbust scan --config agents/news/local.yaml
denbust release --config agents/release/news_items.yaml
denbust backup --config agents/backup/news_items.yaml
```

To send reports by email, set `output.format: email` in your config and provide SMTP env vars from `.env.example`.
Mako scraping uses a headless Chromium browser. After installing dependencies on a new machine, run `python -m playwright install chromium` once before your first live scan.
Phase A introduces explicit dataset and job identity in config and run snapshots:
- `dataset_name`
- `job_name`
Current defaults remain:
- `dataset_name: news_items`
- `job_name: ingest`
`denbust scan` is preserved as a compatibility alias for `news_items` / `ingest`.
Future-facing commands now exist as real news_items jobs:
```bash
denbust run --dataset news_items --job ingest --config agents/news/local.yaml
denbust release --dataset news_items --config agents/release/news_items.yaml
denbust backup --dataset news_items --config agents/backup/news_items.yaml
```

Local runs now use dataset/job-namespaced defaults under the repo-local state root:
- seen store: `data/news_items/ingest/seen.json`
- run snapshots: `data/news_items/ingest/runs/`
- publication scaffold dir: `data/news_items/ingest/publication/`
Example:
```bash
denbust scan --config agents/news/local.yaml
```

You can override the persistence layout without changing YAML by setting:
- `DENBUST_STATE_ROOT`
- `DENBUST_STORE_PATH`
- `DENBUST_RUNS_DIR`
Precedence rules:
1. `DENBUST_STORE_PATH` / `DENBUST_RUNS_DIR`
2. `DENBUST_STATE_ROOT`
3. explicit YAML store paths / `store.state_root`
4. local default root `data/`
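The precedence above can be sketched as a small resolver. The function and parameter names here are illustrative, not the actual `src/denbust/store/state_paths.py` API; the same pattern would apply to the runs directory via `DENBUST_RUNS_DIR`.

```python
from pathlib import Path


def resolve_store_path(env: dict, yaml_store_path=None) -> Path:
    """Illustrative resolver for the seen-store path, mirroring the
    documented precedence (highest first). Names are hypothetical."""
    # 1. Direct file-level override wins outright
    if "DENBUST_STORE_PATH" in env:
        return Path(env["DENBUST_STORE_PATH"])
    # 2. Root-level override: namespaced layout under the given root
    if "DENBUST_STATE_ROOT" in env:
        return Path(env["DENBUST_STATE_ROOT"]) / "news_items" / "ingest" / "seen.json"
    # 3. Explicit store path from the job's YAML config
    if yaml_store_path:
        return Path(yaml_store_path)
    # 4. Repo-local default root
    return Path("data") / "news_items" / "ingest" / "seen.json"
```

Note how `DENBUST_STATE_ROOT` only changes the root while keeping the dataset/job-namespaced layout, which is what lets the GitHub workflow point the same code at the checked-out state repo.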
Scheduled GitHub Actions runs use this repo as the code runner and a separate repo,
tfht_enforce_idx_state, as the canonical mutable state store.
The workflow:
- checks out this repo
- checks out the state repo into `state_repo/`
- sets dataset/job env such as `DATASET_NAME=news_items` and `JOB_NAME=ingest`
- runs `denbust scan --config agents/news/github.yaml`
- points persistence at the checked-out state repo via `DENBUST_STATE_ROOT=state_repo`
- commits and pushes the updated namespaced state files only if files changed
Required secrets for GitHub-run mode:
- `ANTHROPIC_API_KEY`
- `STATE_REPO_PAT`
- `DENBUST_EMAIL_SMTP_HOST`
- `DENBUST_EMAIL_SMTP_PORT`
- `DENBUST_EMAIL_SMTP_USERNAME`
- `DENBUST_EMAIL_SMTP_PASSWORD`
- `DENBUST_EMAIL_FROM`
- `DENBUST_EMAIL_TO`
- `DENBUST_EMAIL_USE_TLS`
- `DENBUST_EMAIL_SUBJECT`
Expected tfht_enforce_idx_state structure:
tfht_enforce_idx_state/
└── news_items/
├── ingest/
│ ├── seen.json
│ ├── runs/
│ └── publication/
├── release/
│ ├── runs/
│ └── publication/
└── backup/
├── runs/
└── publication/
Bootstrap notes:
- `seen.json` may be absent initially; it is created once a run marks at least one URL as seen
- `runs/` and `publication/` directories are created automatically by the workflows when needed
- a small `README.md` in the state repo is fine but optional
Phase A introduces shared platform primitives so future dataset jobs can reuse them:
- `src/denbust/models/` - dataset/job identity, run snapshot model, and policy enums for rights, privacy, review, and publication status
- `src/denbust/store/state_paths.py` - centralized dataset/job state path resolution
- `src/denbust/datasets/` - explicit dataset/job registry
- `src/denbust/ops/storage.py` - operational-store abstraction with a null implementation
- `src/denbust/publish/` - release/export and backup abstractions
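To make the registry idea concrete, here is one plausible shape for an explicit dataset/job registry. This is a sketch under assumed names, not the actual contents of `src/denbust/datasets/`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobIdentity:
    """Hypothetical identity record carried in config and run snapshots."""
    dataset_name: str
    job_name: str


# Illustrative registry of implemented dataset/job pairs.
REGISTRY = {
    ("news_items", "ingest"),
    ("news_items", "release"),
    ("news_items", "backup"),
}


def validate(identity: JobIdentity) -> None:
    """Fail fast on dataset/job pairs that are not registered."""
    if (identity.dataset_name, identity.job_name) not in REGISTRY:
        raise ValueError(
            f"unknown dataset/job: {identity.dataset_name}/{identity.job_name}"
        )
```

An explicit registry like this is what lets `denbust run --dataset ... --job ...` reject typos instead of silently creating a new state namespace.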
What is implemented now:
`news_items` / `ingest`:
- live source ingestion
- canonical URL normalization
- LLM relevance classification
- one-sentence summary generation
- privacy/review/publication/takedown status assignment
- operational persistence through the configured store

`news_items` / `release`:
- reads operational rows
- filters to publicly releasable metadata-only rows
- writes `news_items.parquet`, `news_items.csv`, `MANIFEST.json`, `SCHEMA.json`, `SCHEMA.md`, `README.md`, and `checksums.txt`
- publishes to Kaggle and Hugging Face when configured

`news_items` / `backup`:
- finds the latest release bundle
- uploads it to Google Drive and S3-compatible object storage when configured
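The ingest capabilities above include canonical URL normalization, which underpins cross-source deduplication. A minimal sketch of one common approach (lowercased scheme and host, fragment dropped, known tracking parameters removed) is shown below; denbust's exact rules may differ.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking-parameter lists for illustration only.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}


def canonicalize_url(url: str) -> str:
    """Normalize a URL so the same story fetched from different links
    compares equal: lowercase scheme/host, drop the fragment, and drop
    known tracking query parameters."""
    parts = urlsplit(url)
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.startswith(TRACKING_PREFIXES) and k not in TRACKING_KEYS
    ]
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path,
        urlencode(query),
        "",  # fragment removed
    ))
```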
Still intentionally deferred:
- additional dataset implementations beyond `news_items`
- richer human review tooling / admin UI
- more advanced privacy policies beyond the current pragmatic gate
Preferred checked-in config layout:
agents/
news/
local.yaml
github.yaml
release/
news_items.yaml
backup/
news_items.yaml
Backward-compatible shims are still present:
- `agents/news.yaml`
- `agents/news-github.yaml`
Current intent:
- `agents/news/...` drives ingest jobs
- `agents/release/...` drives release jobs
- `agents/backup/...` drives backup jobs
The release and backup commands still accept any compatible config path you pass explicitly, but
their default paths now point at dedicated config files instead of reusing the ingest config.
The current GitHub Actions layer is still news-items-first, but it is now parameterized around shared dataset/job env variables:
- `DATASET_NAME`
- `JOB_NAME`
- `JOB_CONFIG_PATH`
- `STATE_JOB_DIR`
This keeps the current scheduled news ingest behavior unchanged while making the workflow files easier to extend for future dataset/job combinations.
The public news_items dataset is metadata-only. Each public row contains:
- deterministic row id
- source name and source domain
- original and canonical URL
- publication and retrieval timestamps
- title
- category and sub-category
- one-sentence factual summary
- geographic fields when available
- organizations and topic tags
- rights / privacy / review / publication / takedown status
- release version
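The deterministic row id in the field list above means the same story always maps to the same public row across runs and releases. One plausible scheme is a hash of the canonical URL; the actual generator in denbust may differ.

```python
import hashlib


def row_id(canonical_url: str, namespace: str = "news_items") -> str:
    """Illustrative deterministic row id: hash the canonical URL within a
    dataset namespace so re-ingesting the same story never mints a new id.
    The hash scheme and truncation length are assumptions."""
    payload = f"{namespace}:{canonical_url}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]
```

Deriving the id from the canonical URL (rather than a run counter) is what keeps releases stable when rows are re-exported week after week.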
The public dataset intentionally excludes:
- article full text
- cached HTML
- page screenshots or snapshots
- private ingestion diagnostics
Rows are excluded from public release when they are:
- suppressed by a takedown/suppression rule
- marked `internal_only`
- still pending privacy review
- otherwise non-public under the shared policy enums
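The exclusion rules above amount to a small release-time gate. A sketch follows; the field names and status strings are assumptions, not the actual policy enums in `src/denbust/models/`.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """Minimal stand-in for an operational row; field names are illustrative."""
    suppressed: bool            # hit by a takedown/suppression rule
    publication_status: str     # e.g. "public", "internal_only"
    privacy_review: str         # e.g. "approved", "pending"


def is_publicly_releasable(row: Row) -> bool:
    """Mirror the documented exclusion rules: any one failing check
    keeps the row out of the public bundle."""
    if row.suppressed:
        return False
    if row.publication_status != "public":
        return False  # covers internal_only and other non-public statuses
    if row.privacy_review == "pending":
        return False
    return True
```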
Each news_items release currently writes:
- `news_items.parquet` as the canonical export
- `news_items.csv`
- `MANIFEST.json`
- `SCHEMA.json`
- `SCHEMA.md`
- `README.md`
- `checksums.txt`
Release versions use a UTC date string such as 2026-03-22.
| Mode | Reads from | Writes to | External integrations |
|---|---|---|---|
| Local ingest | live sources + local seen store | local namespaced state + local JSON operational store (from `agents/news/local.yaml`) | Anthropic, optional SMTP |
| GitHub ingest | live sources + shared state repo seen store | shared state repo + Supabase | Anthropic, Supabase, optional SMTP |
| Weekly release | Supabase `news_items` rows | release bundle under `news_items/release/publication` + release run snapshot | optional Kaggle, optional Hugging Face |
| Weekly backup | latest built release bundle under `news_items/release/publication` | backup run snapshot under `news_items/backup/runs` | optional Google Drive, optional S3-compatible object storage |
In other words:
- local ingest uses local state plus the local JSON operational store by default
- GitHub ingest uses the shared state repo plus Supabase
- release reads releasable rows from Supabase, builds the bundle, and only publishes to public targets when they are configured
- backup does not rebuild the release; it uploads the latest already-built bundle when backup targets are configured
- `ANTHROPIC_API_KEY`
- `DENBUST_SUPABASE_URL` for GitHub/Supabase-backed ingest
- `DENBUST_SUPABASE_SERVICE_ROLE_KEY` for GitHub/Supabase-backed ingest
- SMTP variables when email output is enabled
- `DENBUST_SUPABASE_URL`
- `DENBUST_SUPABASE_SERVICE_ROLE_KEY`
- `DENBUST_KAGGLE_DATASET` to enable Kaggle publishing
- `KAGGLE_USERNAME`
- `KAGGLE_KEY`
- `DENBUST_HUGGINGFACE_REPO_ID` to enable Hugging Face publishing
- `HF_TOKEN`
- `DENBUST_DRIVE_SERVICE_ACCOUNT_JSON`
- `DENBUST_DRIVE_FOLDER_ID`
- `DENBUST_OBJECT_STORE_BUCKET`
- `DENBUST_OBJECT_STORE_PREFIX` (optional; defaults to `news_items/latest`)
- `DENBUST_OBJECT_STORE_ENDPOINT_URL`
- `DENBUST_OBJECT_STORE_ACCESS_KEY_ID`
- `DENBUST_OBJECT_STORE_SECRET_ACCESS_KEY`
The Phase B integrations are intentionally split into:
- required backends for a given job
- optional targets that are only activated when explicitly configured
- `DENBUST_SUPABASE_URL` and `DENBUST_SUPABASE_SERVICE_ROLE_KEY` are required for the checked-in GitHub ingest config and for the checked-in release config
- the service-role key is used because ingest, release, and suppression-aware export assembly all need privileged operational access
- if the selected config uses `operational.provider: supabase` and these variables are missing, the job fails
- Kaggle publishing is activated only when `DENBUST_KAGGLE_DATASET` is set
- if `DENBUST_KAGGLE_DATASET` is not set, the release job still builds the bundle and skips Kaggle publication
- if `DENBUST_KAGGLE_DATASET` is set but `KAGGLE_USERNAME` or `KAGGLE_KEY` is missing, the release job fails
- Hugging Face publication is activated only when `DENBUST_HUGGINGFACE_REPO_ID` is set
- if `DENBUST_HUGGINGFACE_REPO_ID` is not set, the release job still builds the bundle and skips Hugging Face publication
- if `DENBUST_HUGGINGFACE_REPO_ID` is set but `HF_TOKEN` is missing, the release job fails
- Google Drive backup is activated when the backup config enables the target or when `DENBUST_DRIVE_FOLDER_ID` is present
- the checked-in backup config keeps the target disabled for local safety; in GitHub Actions the folder-id secret can activate it implicitly
- if the target is inactive, backup skips Google Drive cleanly
- if the target is active but `DENBUST_DRIVE_SERVICE_ACCOUNT_JSON` is missing, the backup job fails
- object-storage backup is activated when the backup config enables the target or when `DENBUST_OBJECT_STORE_BUCKET` is present
- `DENBUST_OBJECT_STORE_PREFIX` is optional and defaults to `news_items/latest`
- if the target is inactive, backup skips object storage cleanly
- if the target is active but `DENBUST_OBJECT_STORE_ACCESS_KEY_ID` or `DENBUST_OBJECT_STORE_SECRET_ACCESS_KEY` is missing, the backup job fails
- release is considered successful if the bundle is built; skipped public targets are surfaced as warnings in logs/run snapshots
- backup is considered successful if the command completes; zero configured targets is treated as a warning, not a failure
- if a configured publication or backup target is missing required credentials, that target currently fails the job rather than silently skipping
| Workflow | Required secrets | Optional secrets |
|---|---|---|
| `daily-state-run.yml` / `weekly-state-run.yml` | `STATE_REPO_PAT`, `ANTHROPIC_API_KEY`, `DENBUST_SUPABASE_URL`, `DENBUST_SUPABASE_SERVICE_ROLE_KEY` | SMTP/email secrets if email output is enabled |
| `news-items-release.yml` | `STATE_REPO_PAT`, `DENBUST_SUPABASE_URL`, `DENBUST_SUPABASE_SERVICE_ROLE_KEY` | `DENBUST_KAGGLE_DATASET`, `KAGGLE_USERNAME`, `KAGGLE_KEY`, `DENBUST_HUGGINGFACE_REPO_ID`, `HF_TOKEN` |
| `news-items-backup.yml` | `STATE_REPO_PAT` | `DENBUST_DRIVE_FOLDER_ID`, `DENBUST_DRIVE_SERVICE_ACCOUNT_JSON`, `DENBUST_OBJECT_STORE_BUCKET`, `DENBUST_OBJECT_STORE_PREFIX`, `DENBUST_OBJECT_STORE_ENDPOINT_URL`, `DENBUST_OBJECT_STORE_ACCESS_KEY_ID`, `DENBUST_OBJECT_STORE_SECRET_ACCESS_KEY` |
The release and backup workflows both support `workflow_dispatch` for manual runs and weekly schedules for automated runs.
Recommended GitHub Environment mapping:
- `news-items-ingest` for `daily-state-run.yml` and `weekly-state-run.yml`
- `news-items-release` for `news-items-release.yml`
- `news-items-backup` for `news-items-backup.yml`
The code reads generic env vars at runtime, so the same variable names can safely have different values per GitHub Environment.
Phase B adds SQL migrations under:
supabase/migrations/
Apply the news_items migration before running Supabase-backed jobs. The schema includes:
- `news_items`
- `ingestion_runs`
- `release_runs`
- `backup_runs`
- `suppression_rules`
`suppression_rules` is the minimal takedown/suppression path. Add rows there by canonical URL or row id to block future public releases.
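To illustrate how export assembly might apply those rules, here is a sketch of rule matching by canonical URL or row id. The column names (`canonical_url`, `row_id`, `id`) are assumptions about the schema, not the actual migration.

```python
def is_suppressed(row: dict, rules: list) -> bool:
    """A row is suppressed if any rule matches it, either by canonical URL
    or by row id. Rules with a missing/empty field simply don't match on
    that field."""
    for rule in rules:
        if rule.get("canonical_url") and rule["canonical_url"] == row["canonical_url"]:
            return True
        if rule.get("row_id") and rule["row_id"] == row["id"]:
            return True
    return False
```

Suppressed rows stay in the operational store; they are only filtered out when the release job assembles the public bundle.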
The checked-in local ingest config uses the local JSON operational store:
```bash
denbust scan --config agents/news/local.yaml
denbust release --config agents/release/news_items.yaml
denbust backup --config agents/backup/news_items.yaml
```

To run release locally without Supabase, either:
- switch `operational.provider` to `local_json` in the release config, or
- provide a custom config path with that override
The checked-in GitHub ingest config uses the Supabase operational store:
agents/news/github.yaml
Release and backup jobs rely on dedicated configs:
- `agents/release/news_items.yaml`
- `agents/backup/news_items.yaml`
For backup specifically:
- the checked-in YAML keeps both targets disabled for local safety
- `DENBUST_DRIVE_FOLDER_ID` auto-enables Google Drive backup at config-load time
- `DENBUST_OBJECT_STORE_BUCKET` auto-enables object-storage backup at config-load time
- because the backup config no longer hardcodes `store.publication_dir`, it reads the latest release bundle from the current state root under `news_items/release/publication`
- privacy/risk gating is intentionally lightweight and conservative, not a substitute for legal review
- publication and backup integrations require external credentials and cannot be fully exercised in CI
- only `news_items` is implemented end to end in this phase
📍 Raid on a brothel in Ramat Gan
Date: 2026-02-15
Category: brothel
Summary: Police raided an apartment in Ramat Gan...
Sources:
• Ynet: https://ynet.co.il/...
• Mako: https://mako.co.il/...
- Product Definition - Full project background (Hebrew)
- MVP Spec - Phase 1 technical scope
- Implementation Plan - Task breakdown
- Phase A: multi-dataset platform spine + working `news_items` / `ingest`
- Phase B (current): `news_items` dataset evolution and release/backup implementation
- Later: docs metadata, open-docs fulltext, events, and downstream analytics