website-refresh-leads

website-refresh-leads is a local MVP for discovering and evaluating small-business websites that may be strong candidates for website refresh or redesign services.

V1 Goal

Build a lightweight local pipeline that surfaces potential leads, gathers site evidence, scores each lead with either the preserved deterministic path or an OpenAI-powered model-judge path, and exports a compact review package for manual outreach review.

Current Status

The repo is now a script-driven MVP for repeated weekly lead runs. Discovery, prefiltering, crawl, browser checks, final judgment, and review-package export are all wired together for local use.

The architecture is now a preservation-first hybrid:

deterministic prefiltering remains the lightweight admission gate
deterministic rubric scoring is preserved and still runnable
model_judge is the new default scoring mode and intended primary direction
model_judge now uses the OpenAI Responses API with compact multimodal evidence and strict structured output
compare mode preserves deterministic scoring while exporting model judgment as the primary review output

Canonical website memory is now durable across runs. By default, if a canonical website was surfaced in any prior run, future runs skip it even when the prior lead was weak or only partially evidenced.
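
The skip-on-repeat rule above can be pictured with a minimal sketch. It assumes the canonical key is simply the lowercased host with any leading "www." removed; the repo's actual normalization may differ, and canonical_website_key and is_new_website are hypothetical names used only for illustration.

```python
from urllib.parse import urlsplit


def canonical_website_key(url: str) -> str:
    """Hypothetical normalization: host only, lowercased, without a leading 'www.'."""
    candidate = url.strip()
    if "://" not in candidate:
        # urlsplit only recognizes the host when a scheme is present.
        candidate = "http://" + candidate
    host = urlsplit(candidate).hostname or ""  # hostname is already lowercased
    return host[4:] if host.startswith("www.") else host


def is_new_website(url: str, seen_keys: set[str]) -> bool:
    """Return True and remember the key only the first time a website is surfaced."""
    key = canonical_website_key(url)
    if not key or key in seen_keys:
        return False
    seen_keys.add(key)
    return True
```

In this sketch the quality of the earlier lead plays no part in the decision, which mirrors the default described above: once a key is stored, later runs skip it.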

Workflow

Discovery: find candidate businesses and websites.
Prefilter: assign each candidate a strong, maybe, or skip admission outcome.
Crawl: fetch core site pages and save raw HTML.
Screenshots / Checks: capture homepage screenshots and browser signals.
Final Judgment: run the selected scoring mode and store notes.
Export / Review: create a compact shortlist package for current-run manual review.

The deterministic rubric still uses the evidence the repo already collects. If crawl coverage is partial but browser validation still confirms the homepage is reachable, the lead can still be scored with lower confidence instead of automatically collapsing to zero.
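
One way to read the partial-coverage rule is as a confidence floor. The sketch below is an assumption, not the repo's rubric: the function name, the 0.4 floor, and the coverage ratio are all hypothetical and only illustrate "lower confidence instead of zero".

```python
def confidence_for_evidence(pages_crawled: int, pages_expected: int, homepage_reachable: bool) -> float:
    """Hypothetical confidence in [0, 1]; the real rubric may weigh evidence differently."""
    if pages_expected <= 0:
        coverage = 0.0
    else:
        coverage = min(pages_crawled / pages_expected, 1.0)
    if coverage == 0.0 and not homepage_reachable:
        return 0.0  # no crawl evidence and no browser confirmation
    # A browser-confirmed homepage keeps a floor under the confidence
    # even when the crawl only captured part of the site.
    floor = 0.4 if homepage_reachable else 0.0
    return max(coverage, floor)
```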

Setup

Create and activate a virtual environment.
Install dependencies.
Install Playwright browsers.
Copy .env.example to .env and fill in your Places API key and local OpenAI key.

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
playwright install
cp .env.example .env

Database Init

Create the local SQLite database and tables:

python3 -m app.init_db

If you already have an older local database from before the canonical-memory schema change, the code will try to backfill it automatically. For the cleanest path after this upgrade, a one-time reset is acceptable:

python3 -m app.init_db --reset

The project reads configuration from .env via python-dotenv. The main database setting is:

DATABASE_URL=sqlite:///data/leads.db

OpenAI model judging is configured the same way:

OPENAI_API_KEY=
OPENAI_MODEL=gpt-5.4-mini
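
As a rough illustration of how these settings might be consumed, the sketch below reads the three variables from the process environment. It assumes python-dotenv has already loaded .env, and the Settings class, load_settings function, and fallback defaults are hypothetical rather than the repo's actual config code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    database_url: str
    openai_api_key: str
    openai_model: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping (the process environment by default)."""
    env = os.environ if env is None else env
    return Settings(
        # Fallback values mirror the examples above; treating them as defaults is an assumption.
        database_url=env.get("DATABASE_URL", "sqlite:///data/leads.db"),
        openai_api_key=env.get("OPENAI_API_KEY", ""),
        openai_model=env.get("OPENAI_MODEL", "gpt-5.4-mini"),
    )
```

Passing an explicit mapping instead of relying on os.environ keeps the loader easy to exercise in tests.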

Single-Query Usage

Run the full pipeline for one query:

python3 -m app.main --query "painters lowell ma" --niche painters

The main runner now supports:

--scoring-mode model_judge (default)
--scoring-mode deterministic
--scoring-mode compare

Mode behavior:

model_judge: uses GPT-5.4 mini through the Responses API and exports ModelJudgment rows as the primary review package
deterministic: preserves the old deterministic scoring and export path
compare: runs deterministic scoring plus model judging, then exports model judgment with deterministic comparison fields
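
The three modes can be sketched with argparse as follows. Only the flag names and mode values come from this README; build_parser and scorers_for_mode are hypothetical helpers, and the real runner may wire the modes differently.

```python
import argparse

SCORING_MODES = ("model_judge", "deterministic", "compare")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="app.main")
    parser.add_argument("--query")
    parser.add_argument("--niche")
    # model_judge is the documented default; argparse rejects any other value.
    parser.add_argument("--scoring-mode", choices=SCORING_MODES, default="model_judge")
    return parser


def scorers_for_mode(mode: str) -> list[str]:
    """Hypothetical mapping from a mode to the scoring steps it runs."""
    if mode == "deterministic":
        return ["deterministic"]
    if mode == "compare":
        return ["deterministic", "model_judge"]
    return ["model_judge"]
```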

Optional discovery controls:

python3 -m app.main \
  --query "painters lowell ma" \
  --niche painters \
  --scoring-mode model_judge \
  --page-size 10 \
  --max-pages 2

Default duplicate handling is strict across runs at the canonical website level. A later revisit path is plumbed through --allow-revisit, but revisits still require the stored business row to be explicitly marked eligible_for_revisit first.
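
The two-part revisit gate can be sketched like this. The function name and the dictionary shape are assumptions; only --allow-revisit and eligible_for_revisit come from the description above.

```python
def admit_candidates(candidate_keys: list[str], known: dict[str, bool], allow_revisit: bool = False) -> list[str]:
    """Filter canonical keys for one run.

    known maps a previously surfaced canonical key to its eligible_for_revisit flag.
    """
    admitted = []
    for key in candidate_keys:
        if key not in known:
            admitted.append(key)  # never surfaced before
        elif allow_revisit and known[key]:
            admitted.append(key)  # revisit requested AND the stored row is marked eligible
    return admitted
```

Passing allow_revisit=True alone is not enough here: a known website comes back only when its stored flag is also set.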

Default exports are also strict: each PipelineRun writes its own review package under data/exports/runs/run_<run_id>/, and those exports only include businesses first admitted in that run. Older leads do not resurface in the default review package unless a dedicated export override is added later.
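
A small sketch of the per-run export scoping, under stated assumptions: the directory layout follows the path above, while first_admitted_run_id is a hypothetical field name standing in for however the repo records the run that first admitted a business.

```python
from pathlib import Path


def run_export_dir(run_id: int, exports_root: Path = Path("data/exports")) -> Path:
    """Build the per-run review package folder, e.g. data/exports/runs/run_7."""
    return exports_root / "runs" / f"run_{run_id}"


def rows_for_run(businesses: list[dict], run_id: int) -> list[dict]:
    """Keep only businesses first admitted in the given run (hypothetical field name)."""
    return [b for b in businesses if b.get("first_admitted_run_id") == run_id]
```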

Multi-Query Usage

Run multiple queries from a plain text file:

python3 -m app.main --query-file prompts/queries.txt --niche painters

The query file supports:

One query per line, using the CLI --niche as the shared niche
Or query | niche per line when different niches are needed

When you use --query-file, the pipeline now also builds one batch-level review package under data/exports/batches/batch_<timestamp>_<query_file_stem>/. That batch export:

combines only the non-empty per-run exports from that invocation by default
writes combined_review_package.json, combined_review_package.csv, batch_summary.csv, batch_metadata.json, and review_screenshots/
preserves which run each exported lead came from
deletes only that batch invocation's per-run export folders under data/exports/runs/ after the batch export is written successfully

If batch export creation fails, the per-run export folders are left in place.
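
The cleanup ordering described above (delete per-run folders only after the batch export succeeds) might look roughly like this. finalize_batch and its parameters are hypothetical; the repo's exporter may structure this differently.

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable, Optional


def finalize_batch(
    run_dirs: Iterable[Path],
    build_batch_export: Callable[[list], Path],
    remove_dir: Callable[[Path], None] = shutil.rmtree,
) -> Optional[Path]:
    """Build the batch export, then remove this batch's per-run folders on success."""
    dirs = list(run_dirs)
    try:
        batch_dir = build_batch_export(dirs)
    except Exception:
        # Batch export failed: leave every per-run folder in place.
        return None
    for run_dir in dirs:
        remove_dir(run_dir)
    return batch_dir
```

Only the folders passed in are ever removed, so exports from earlier invocations under data/exports/runs/ are untouched in this sketch.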

Example query file:

painters lowell ma
painters chelmsford ma
pressure washing nashua nh | pressure_washing
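
The two line formats in the example above could be parsed along these lines. This is a sketch only: parse_query_lines is a hypothetical helper, and skipping blank lines is an assumption about how the real parser behaves.

```python
def parse_query_lines(lines, default_niche):
    """Turn query-file lines into (query, niche) pairs."""
    pairs = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if "|" in line:
            # "query | niche" overrides the shared CLI niche for this line.
            query, niche = (part.strip() for part in line.split("|", 1))
        else:
            query, niche = line, default_niche
        pairs.append((query, niche))
    return pairs
```

With --niche painters, the example file would yield two painters queries and one pressure_washing query.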

Discovery-Only Usage

If you want to run discovery by itself:

python3 -m app.discovery.run_places \
  --query "painters lowell ma" \
  --niche painters \
  --page-size 10 \
  --max-pages 2
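
A plausible reading of --page-size and --max-pages is a capped pagination loop, sketched below. The fetch_page callable and its (results, next_page_token) return shape are assumptions for illustration; they are not the Places client used by the repo.

```python
from typing import Callable, Optional, Tuple

Page = Tuple[list, Optional[str]]  # (results, next_page_token)


def collect_results(
    fetch_page: Callable[[Optional[str], int], Page],
    page_size: int,
    max_pages: int,
) -> list:
    """Fetch up to max_pages pages of page_size results each."""
    results = []
    token = None
    for _ in range(max_pages):
        page, token = fetch_page(token, page_size)
        results.extend(page)
        if token is None:
            break  # the source has no further pages
    return results
```

Under this reading, --page-size 10 --max-pages 2 would cap a query at 20 candidates.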

Output Files

The pipeline writes local artifacts to:

data/leads.db: SQLite database
data/raw/: raw HTML captured during crawl
data/screenshots/: desktop and mobile homepage screenshots
data/browser_checks/: JSON browser-check reports
data/exports/runs/run_<run_id>/review_package.csv: flat shortlist export for one run
data/exports/runs/run_<run_id>/review_package.json: structured shortlist export for one run
data/exports/runs/run_<run_id>/review_screenshots/: copied screenshots bundled with that run's review package
data/exports/batches/batch_<timestamp>_<query_file_stem>/combined_review_package.json: combined structured shortlist for one --query-file batch
data/exports/batches/batch_<timestamp>_<query_file_stem>/combined_review_package.csv: combined flat shortlist for one --query-file batch
data/exports/batches/batch_<timestamp>_<query_file_stem>/batch_summary.csv: run-level inclusion and count summary for the batch
data/exports/batches/batch_<timestamp>_<query_file_stem>/review_screenshots/: screenshots copied for the included runs in that batch
data/exports/batches/batch_<timestamp>_<query_file_stem>/batch_metadata.json: lightweight batch metadata and cleanup results

The review package covers only candidates newly admitted in the current run. For each candidate it includes:

business info and review counts
run/debug fields such as query_used, canonical URL/key, and discovery_run_id
final fit_status, confidence, evidence quality, and recommended action
model-judgment fields such as website weakness, outreach-story strength, positive signals, and evidence warnings
deterministic comparison fields when the run used --scoring-mode compare
per-dimension score breakdown when the run used deterministic scoring
compact review context for manual ranking: why it qualified, evidence strength, and outreach-story strength
selected page URLs
screenshot paths
top issues, quick summary, teardown angle, and skip reason

If a run produces zero strong or maybe leads, the exporter automatically falls back to the top-scored skip leads from that same run only, so prior-run leads still do not reappear by default.
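
The fallback rule can be sketched as a two-step selection. fit_status is a documented export field; the score key, the fallback_limit parameter, and its value of 5 are hypothetical.

```python
def select_for_export(leads: list[dict], fallback_limit: int = 5) -> list[dict]:
    """Pick this run's leads for the review package."""
    shortlisted = [lead for lead in leads if lead["fit_status"] in ("strong", "maybe")]
    if shortlisted:
        return shortlisted
    # No strong or maybe leads: fall back to the best-scored skips from the same run.
    skips = [lead for lead in leads if lead["fit_status"] == "skip"]
    return sorted(skips, key=lambda lead: lead["score"], reverse=True)[:fallback_limit]
```

Because the input is already limited to the current run's leads, the fallback cannot pull in leads from earlier runs.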

For a single-query run, check the run_<run_id> folder printed at the end of the export step. For a --query-file batch, the pipeline still creates the per-run folders first for traceability, then combines them into one batch export folder. Once the combined batch export succeeds, it removes only that batch's now-redundant per-run export folders.
