Scraper that resolves a handler by URL domain and fetches scraped items (single detail page or index with optional full-detail population). Real estate listings are one supported type; the pipeline is generic and can be extended to other result types (e.g. job ads, products).
Entrypoints: `cmd/cli` (one-off runs, JSON to stdout or a file) and `cmd/server` (Asynq worker: index → page → detail → exit; file or webhook output).
The CLI takes a URL (as `-url` or as a positional argument) and prints JSON to stdout, or writes to `-output <path>`.
This project includes an HTTP proxy pool that fetches free proxies from ProxyScrape, validates them, and then serves them round-robin.
- Cache file: validated proxies are persisted to `proxies.json` in the project root (current working directory).
- Fast subsequent runs: if `proxies.json` exists, the CLI loads it and skips revalidating hundreds of proxies.
- Clear cache: pass `-clear-proxy-cache` (or `--clear-proxy-cache`) to delete `proxies.json` and force a refetch + revalidation.
- Why an HTTP check? Free HTTP proxies frequently don't support HTTPS CONNECT, so validation uses a plain-HTTP IP endpoint (http://api.ipify.org) by default.
| Flag | Description |
|---|---|
| `-url` | URL to scrape: list page (with `-index`) or a single detail page |
| `-index` | Scrape the listing index from the given URL; otherwise scrape a single detail page |
| `-page` | Page number when using `-index` (1-based). Default: 1 |
| `-populate` | With `-index`, fetch full details for each item (slower, uses the concurrency pool) |
| `-output` | Write JSON to this path instead of stdout |
| `-debug` | Enable debug logging |
| `-clear-proxy-cache` | Delete `proxies.json` and force a refetch + revalidation of proxies |
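A sketch of how the flags in the table could be wired with the standard library `flag` package. The variable names and the `resolveTarget` helper are illustrative assumptions, not the project's actual code; the "flag or positional argument" behavior mirrors the CLI description above.

```go
package main

import (
	"flag"
	"fmt"
)

// resolveTarget picks the scrape URL: -url wins, otherwise the first
// positional argument, mirroring the CLI's documented behavior.
func resolveTarget(urlFlag string, args []string) string {
	if urlFlag != "" {
		return urlFlag
	}
	if len(args) > 0 {
		return args[0]
	}
	return ""
}

func main() {
	urlFlag := flag.String("url", "", "URL to scrape: list page (with -index) or a single detail page")
	index := flag.Bool("index", false, "scrape the listing index from the given URL")
	page := flag.Int("page", 1, "page number when using -index (1-based)")
	populate := flag.Bool("populate", false, "with -index, fetch full details for each item")
	output := flag.String("output", "", "write JSON to this path instead of stdout")
	flag.Parse()

	target := resolveTarget(*urlFlag, flag.Args())
	fmt.Println(target, *index, *page, *populate, *output)
}
```

Go's `flag` package accepts both `-flag` and `--flag` spellings, which is why `-clear-proxy-cache` and `--clear-proxy-cache` behave identically.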
Optional; godotenv is loaded via autoload, so a `.env` file in the working directory is applied automatically.
| Variable | Purpose | Default |
|---|---|---|
| `SCRAPE_URL` | Default URL when `-url` is not set | (none) |
| `SCRAPE_INDEX` | Default for `-index` (true/false) | `false` |
| `SCRAPE_PAGE` | Default page when using `-index` | `1` |
| `SCRAPE_POPULATE` | Default for `-populate` | `false` |
| `SCRAPE_OUTPUT` | Default output path | (stdout) |
| `SCRAPE_DEBUG` | Enable debug logging | `false` |
| `SCRAPE_CONCURRENCY` | Max parallel detail fetches when using `-index -populate`. Integer or a percentage, e.g. `50%`. | `1` |
| `SCRAPE_ALL_PAGES` | CLI only. When `true`, scrape from the given page through the last page in one run. Leave unset in server mode; the server schedules one task per page. | unset (single page) |
| `PROXY_POOL_ENABLED` | Use the proxy pool for requests | `false` |
| `PROXY_CACHE_PATH` | Path to the proxy cache file | `proxies.json` |
| `CLEAR_PROXY_CACHE` | If `true`, clear the cache on startup | `false` |
Copy `.env.example` to `.env` and adjust as needed.
`SCRAPE_ALL_PAGES` is only relevant when running the CLI. Set `SCRAPE_ALL_PAGES=true` if you want one CLI run to fetch every listing page from the given `-page` through the last. In server mode you do not need it: the server enqueues one task per page.
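Since `SCRAPE_CONCURRENCY` accepts either an integer or a percentage of CPU cores, resolving it could look like the sketch below. The `parseConcurrency` helper is hypothetical; the real project may resolve the value differently.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"strings"
)

// parseConcurrency interprets a SCRAPE_CONCURRENCY-style value: a plain
// integer ("8") or a percentage of the CPU count ("50%"). An empty value
// falls back to the documented default of 1.
func parseConcurrency(s string, numCPU int) (int, error) {
	s = strings.TrimSpace(s)
	if s == "" {
		return 1, nil // documented default
	}
	if strings.HasSuffix(s, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(s, "%"))
		if err != nil || pct <= 0 {
			return 0, fmt.Errorf("invalid percentage %q", s)
		}
		n := numCPU * pct / 100
		if n < 1 {
			n = 1 // never drop below one worker
		}
		return n, nil
	}
	n, err := strconv.Atoi(s)
	if err != nil || n < 1 {
		return 0, fmt.Errorf("invalid concurrency %q", s)
	}
	return n, nil
}

func main() {
	n, _ := parseConcurrency("50%", runtime.NumCPU())
	fmt.Println(n)
}
```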
Single detail page (one item, JSON to stdout):

```sh
go run ./cmd/cli "https://www.engelvoelkers.com/ch/en/exposes/5fd84e3f-7c44-5cb2-b31c-c093348b3432"
```

Single detail page to a file:

```sh
go run ./cmd/cli -output listing.json "https://www.engelvoelkers.com/ch/en/exposes/5fd84e3f-7c44-5cb2-b31c-c093348b3432"
```

Index (list) page, one page of links and basic fields:

```sh
go run ./cmd/cli -index "https://www.engelvoelkers.com/ch/en/properties/res/sale/real-estate" -page 1
go run ./cmd/cli -index -url "https://www.engelvoelkers.com/ch/en/properties/res/sale/real-estate" -output index.json
```

Index with full details, fetching the list and then populating each item (uses `SCRAPE_CONCURRENCY`):

```sh
go run ./cmd/cli -index -populate "https://www.engelvoelkers.com/ch/en/properties/res/sale/real-estate" -page 1 -output full.json
```

Force a rebuild of the proxy cache (slow):

```sh
go run ./cmd/cli -clear-proxy-cache
```

Build and run the Asynq worker + HTTP API:

```sh
make build
./bin/scraper
```

The server reads its config (e.g. `config.yml`), runs the worker, and exposes HTTP endpoints to enqueue scrape runs. Output can be a file path or a webhook URL.
- Define your item type (if not reusing an existing one): implement `scraped.ScrapedItem` (`GetSourceURL() string`, `ResultKind() string`) and call `scraped.RegisterKind(kind, unmarshaler)` in `init()` so storage can round-trip your type.
- Implement the scraper under `internal/infrastructure/scrapers/` (e.g. an `example.com` package): implement `scraper.IScraper`, i.e. `GetDetailsPage(ctx, url) (scraped.ScrapedItem, error)` and `GetIndexPage(ctx, page) ([]scraped.ScrapedItem, error)`, plus `GetTotalPages(ctx) (int, error)`.
- Register the scraper in `internal/infrastructure/scrapers/all` so it is included in the default registry (e.g. `registry.Register("example.com", New())`).
Use the error-returning rod APIs (`Navigate`, `WaitLoad`, `Elements`, etc.) instead of the `Must*` variants, so failures return errors rather than panicking under the worker.
Use a `Registry` in tests: create one with `scrapers.NewRegistry()`, register only the scrapers you need (or mocks), then call `Get(url)`.
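The registry-in-tests pattern can be illustrated with a minimal domain-keyed registry. This is a sketch mirroring the described `NewRegistry` / `Register` / `Get` flow, not the project's actual implementation; the `Scraper` interface and `fakeScraper` are stand-ins.

```go
package main

import (
	"fmt"
	"net/url"
)

// Minimal stand-in for a registered scraper.
type Scraper interface{ Name() string }

// Registry maps a domain to its scraper, mirroring the described
// resolve-handler-by-URL-domain behavior.
type Registry struct{ byDomain map[string]Scraper }

func NewRegistry() *Registry { return &Registry{byDomain: map[string]Scraper{}} }

func (r *Registry) Register(domain string, s Scraper) { r.byDomain[domain] = s }

// Get resolves a scraper by the URL's hostname.
func (r *Registry) Get(rawURL string) (Scraper, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}
	s, ok := r.byDomain[u.Hostname()]
	if !ok {
		return nil, fmt.Errorf("no scraper registered for %q", u.Hostname())
	}
	return s, nil
}

// A mock scraper, as a test would register.
type fakeScraper struct{ name string }

func (f fakeScraper) Name() string { return f.name }

func main() {
	reg := NewRegistry()
	reg.Register("www.example.com", fakeScraper{name: "example"})
	s, err := reg.Get("https://www.example.com/listing/1")
	fmt.Println(err == nil, s.Name())
}
```

Registering only mocks keeps tests hermetic: a lookup for any unregistered domain fails loudly instead of hitting a real site.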