scraper

Scraper that resolves a handler by URL domain and fetches scraped items (single detail page or index with optional full-detail population). Real estate listings are one supported type; the pipeline is generic and can be extended to other result types (e.g. job ads, products).

Entrypoints: cmd/cli (one-off runs, JSON to stdout or file) and cmd/server (Asynq worker: index → page → detail → exit, file or webhook output).

Run

The CLI takes a URL (as -url or as a positional argument) and prints JSON to stdout, or writes to -output <path>.

Proxy pool + cache

This project includes an HTTP proxy pool that fetches free proxies from ProxyScrape, validates them, and then serves them round-robin.

- Cache file: validated proxies are persisted to `proxies.json` in the project root (current working directory).
- Fast subsequent runs: if `proxies.json` exists, the CLI loads it and skips revalidating hundreds of proxies.
- Clear cache: pass `-clear-proxy-cache` (or `--clear-proxy-cache`) to delete `proxies.json` and force a refetch + revalidation.
- Why an HTTP check? Free HTTP proxies frequently don't support HTTPS CONNECT, so validation uses a plain-HTTP IP endpoint (http://api.ipify.org) by default.
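The cache-plus-round-robin behaviour can be sketched like this. This is a minimal illustration: `Pool`, `Load`, `Save`, and `Next` are names invented for the sketch, not the project's actual API, and the cache format is assumed to be a flat JSON array of proxy URLs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"sync/atomic"
)

// Pool hands out validated proxies in round-robin order.
type Pool struct {
	proxies []string
	next    atomic.Uint64
}

// Load reads the cache file; a missing file just means "no cache yet".
func Load(path string) (*Pool, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		if os.IsNotExist(err) {
			return &Pool{}, nil
		}
		return nil, err
	}
	var proxies []string
	if err := json.Unmarshal(data, &proxies); err != nil {
		return nil, err
	}
	return &Pool{proxies: proxies}, nil
}

// Save persists validated proxies so later runs can skip revalidation.
func (p *Pool) Save(path string) error {
	data, err := json.Marshal(p.proxies)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}

// Next returns the next proxy, wrapping around the slice.
func (p *Pool) Next() string {
	if len(p.proxies) == 0 {
		return ""
	}
	n := p.next.Add(1) - 1
	return p.proxies[n%uint64(len(p.proxies))]
}

func main() {
	p := &Pool{proxies: []string{"http://a:8080", "http://b:8080"}}
	fmt.Println(p.Next(), p.Next(), p.Next())
	// http://a:8080 http://b:8080 http://a:8080
}
```

Deleting the cache file (what `-clear-proxy-cache` does) simply forces the slow fetch-and-validate path on the next run.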

Flags

| Flag | Description |
| --- | --- |
| `-url` | URL to scrape: a list page (with `-index`) or a single detail page |
| `-index` | Scrape the listing index from the given URL; otherwise scrape a single detail page |
| `-page` | Page number when using `-index` (1-based). Default: 1 |
| `-populate` | With `-index`, fetch full details for each item (slower; uses the concurrency pool) |
| `-output` | Write JSON to this path instead of stdout |
| `-debug` | Enable debug logging |
| `-clear-proxy-cache` | Delete `proxies.json` and force refetch + revalidation of proxies |

Environment variables

All variables are optional. godotenv is loaded via its autoload package, so a `.env` file in the working directory is applied automatically.

| Variable | Purpose | Default |
| --- | --- | --- |
| `SCRAPE_URL` | Default URL when `-url` is not set | (none) |
| `SCRAPE_INDEX` | Default for `-index` (`true`/`false`) | `false` |
| `SCRAPE_PAGE` | Default page when using `-index` | `1` |
| `SCRAPE_POPULATE` | Default for `-populate` | `false` |
| `SCRAPE_OUTPUT` | Default output path | (stdout) |
| `SCRAPE_DEBUG` | Enable debug logging | `false` |
| `SCRAPE_CONCURRENCY` | Max parallel detail fetches when using `-index -populate`. Integer or a percentage, e.g. `50%` | `1` |
| `SCRAPE_ALL_PAGES` | CLI only. When `true`, scrape from the given page through the last page in one run. Leave unset in server mode; the server schedules one task per page | unset (single page) |
| `PROXY_POOL_ENABLED` | Use the proxy pool for requests | `false` |
| `PROXY_CACHE_PATH` | Path to the proxy cache file | `proxies.json` |
| `CLEAR_PROXY_CACHE` | If `true`, clear the cache on startup | `false` |
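`SCRAPE_CONCURRENCY` accepts either a plain integer or a percentage. One way such a value could be interpreted is sketched below; the helper name and the reading of a percentage as a share of CPU cores are assumptions for this sketch, not documented behaviour of the project.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseConcurrency turns "4" into 4 and "50%" into half of the given
// core count, with a floor of 1 so the pool always makes progress.
func parseConcurrency(val string, cores int) (int, error) {
	if pct, ok := strings.CutSuffix(val, "%"); ok {
		p, err := strconv.Atoi(pct)
		if err != nil {
			return 0, err
		}
		n := cores * p / 100
		if n < 1 {
			n = 1
		}
		return n, nil
	}
	n, err := strconv.Atoi(val)
	if err != nil {
		return 0, err
	}
	if n < 1 {
		n = 1
	}
	return n, nil
}

func main() {
	n, _ := parseConcurrency("50%", 8)
	fmt.Println(n) // 4
}
```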

Copy `.env.example` to `.env` and adjust as needed.

`SCRAPE_ALL_PAGES` is only relevant when running the CLI. Set `SCRAPE_ALL_PAGES=true` if you want one CLI run to fetch every listing page from the given `-page` through the last. In server mode you do not need it: the server enqueues one task per page.

Sample commands

Single detail page — one item, JSON to stdout:

```sh
go run ./cmd/cli "https://www.engelvoelkers.com/ch/en/exposes/5fd84e3f-7c44-5cb2-b31c-c093348b3432"
```

Single detail page to file:

```sh
go run ./cmd/cli -output listing.json "https://www.engelvoelkers.com/ch/en/exposes/5fd84e3f-7c44-5cb2-b31c-c093348b3432"
```

Index (list) page — one page of links and basic fields:

```sh
go run ./cmd/cli -index "https://www.engelvoelkers.com/ch/en/properties/res/sale/real-estate" -page 1
go run ./cmd/cli -index -url "https://www.engelvoelkers.com/ch/en/properties/res/sale/real-estate" -output index.json
```

Index with full details — fetch list then populate each item (uses SCRAPE_CONCURRENCY):

```sh
go run ./cmd/cli -index -populate "https://www.engelvoelkers.com/ch/en/properties/res/sale/real-estate" -page 1 -output full.json
```

Force rebuild proxy cache (slow):

```sh
go run ./cmd/cli -clear-proxy-cache
```

Server

Build and run the Asynq worker + HTTP API:

```sh
make build
./bin/scraper
```

The server reads config (e.g. config.yml), runs the worker, and exposes HTTP endpoints to enqueue scrape runs. Output can be a file path or a webhook URL.

Adding a new scraper

1. Define your item type (if not reusing an existing one): implement `scraped.ScrapedItem` (`GetSourceURL() string`, `ResultKind() string`) and call `scraped.RegisterKind(kind, unmarshaler)` in `init()` so storage can round-trip your type.

2. Implement the scraper under `internal/infrastructure/scrapers/` (e.g. an `example.com/` package): implement `scraper.IScraper`, i.e. `GetDetailsPage(ctx, url) (scraped.ScrapedItem, error)`, `GetIndexPage(ctx, page) ([]scraped.ScrapedItem, error)`, and `GetTotalPages(ctx) (int, error)`.

3. Register the scraper in `internal/infrastructure/scrapers/all` so it is included in the default registry (e.g. `registry.Register("example.com", New())`).

Use error-returning rod APIs (`Navigate`, `WaitLoad`, `Elements`, etc.) instead of the `Must*` variants so failures return errors rather than panicking under the worker.

Tests

Use a Registry in tests: create one with `scrapers.NewRegistry()`, register only the scrapers you need (or mocks), then call `Get(url)`.
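A sketch of that test pattern. The `Registry` below is a minimal stand-in that resolves handlers by URL hostname, which is how the README describes resolution; the registered value is simplified to a string where the real registry would hold a scraper.

```go
package main

import (
	"fmt"
	"net/url"
)

// Registry maps domains to handlers (simplified to strings for the sketch).
type Registry struct{ byDomain map[string]string }

func NewRegistry() *Registry {
	return &Registry{byDomain: map[string]string{}}
}

func (r *Registry) Register(domain, scraper string) {
	r.byDomain[domain] = scraper
}

// Get resolves the handler for a URL by its hostname.
func (r *Registry) Get(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	s, ok := r.byDomain[u.Hostname()]
	if !ok {
		return "", fmt.Errorf("no scraper registered for %q", u.Hostname())
	}
	return s, nil
}

func main() {
	r := NewRegistry()
	r.Register("example.com", "mock-scraper")
	s, _ := r.Get("https://example.com/listing/1")
	fmt.Println(s) // mock-scraper
}
```

Registering only a mock keeps tests hermetic: no browser, no network, and `Get` fails loudly for any domain the test did not set up.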
