scraper

Scraper that resolves a handler by URL domain and fetches scraped items (single detail page or index with optional full-detail population). Real estate listings are one supported type; the pipeline is generic and can be extended to other result types (e.g. job ads, products).

Entrypoints: cmd/cli (one-off runs, JSON to stdout or file) and cmd/server (Asynq worker: index → page → detail → exit, file or webhook output).

Run

The CLI takes a URL (as -url or as a positional argument) and prints JSON to stdout, or writes to -output <path>.

Proxy pool + cache

This project includes an HTTP proxy pool that fetches free proxies from ProxyScrape, validates them, and then serves them round-robin.

- Cache file: validated proxies are persisted to `proxies.json` in the project root (current working directory).
- Fast subsequent runs: if `proxies.json` exists, the CLI loads it and skips revalidating hundreds of proxies.
- Clear cache: pass `-clear-proxy-cache` (or `--clear-proxy-cache`) to delete `proxies.json` and force a refetch + revalidation.
- Why an HTTP check? Free HTTP proxies frequently don't support HTTPS CONNECT, so validation uses a plain-HTTP IP endpoint (http://api.ipify.org) by default.
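The cache-plus-round-robin behaviour can be sketched like this. This is a minimal illustration: `Pool`, `Load`, `Save`, and `Next` are names invented for the sketch, not the project's actual API, and the cache format is assumed to be a flat JSON array of proxy URLs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"sync/atomic"
)

// Pool hands out validated proxies in round-robin order.
type Pool struct {
	proxies []string
	next    atomic.Uint64
}

// Load reads the cache file; a missing file just means "no cache yet".
func Load(path string) (*Pool, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		if os.IsNotExist(err) {
			return &Pool{}, nil
		}
		return nil, err
	}
	var proxies []string
	if err := json.Unmarshal(data, &proxies); err != nil {
		return nil, err
	}
	return &Pool{proxies: proxies}, nil
}

// Save persists validated proxies so later runs can skip revalidation.
func (p *Pool) Save(path string) error {
	data, err := json.Marshal(p.proxies)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}

// Next returns the next proxy, wrapping around the slice.
func (p *Pool) Next() string {
	if len(p.proxies) == 0 {
		return ""
	}
	n := p.next.Add(1) - 1
	return p.proxies[n%uint64(len(p.proxies))]
}

func main() {
	p := &Pool{proxies: []string{"http://a:8080", "http://b:8080"}}
	fmt.Println(p.Next(), p.Next(), p.Next())
	// http://a:8080 http://b:8080 http://a:8080
}
```

Deleting the cache file (what `-clear-proxy-cache` does) simply forces the slow fetch-and-validate path on the next run.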

Flags

| Flag | Description |
| --- | --- |
| `-url` | URL to scrape: a list page (with `-index`) or a single detail page |
| `-index` | Scrape the listing index from the given URL; otherwise scrape a single detail page |
| `-page` | Page number when using `-index` (1-based). Default: 1 |
| `-populate` | With `-index`, fetch full details for each item (slower; uses the concurrency pool) |
| `-output` | Write JSON to this path instead of stdout |
| `-debug` | Enable debug logging |
| `-clear-proxy-cache` | Delete `proxies.json` and force refetch + revalidation of proxies |

Environment variables

All variables are optional. godotenv is loaded via its autoload package, so a `.env` file in the working directory is applied automatically.

| Variable | Purpose | Default |
| --- | --- | --- |
| `SCRAPE_URL` | Default URL when `-url` is not set | (none) |
| `SCRAPE_INDEX` | Default for `-index` (`true`/`false`) | `false` |
| `SCRAPE_PAGE` | Default page when using `-index` | `1` |
| `SCRAPE_POPULATE` | Default for `-populate` | `false` |
| `SCRAPE_OUTPUT` | Default output path | (stdout) |
| `SCRAPE_DEBUG` | Enable debug logging | `false` |
| `SCRAPE_CONCURRENCY` | Max parallel detail fetches when using `-index -populate`. Integer or a percentage, e.g. `50%` | `1` |
| `SCRAPE_ALL_PAGES` | CLI only. When `true`, scrape from the given page through the last page in one run. Leave unset in server mode; the server schedules one task per page | unset (single page) |
| `PROXY_POOL_ENABLED` | Use the proxy pool for requests | `false` |
| `PROXY_CACHE_PATH` | Path to the proxy cache file | `proxies.json` |
| `CLEAR_PROXY_CACHE` | If `true`, clear the cache on startup | `false` |
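`SCRAPE_CONCURRENCY` accepts either a plain integer or a percentage. One way such a value could be interpreted is sketched below; the helper name and the reading of a percentage as a share of CPU cores are assumptions for this sketch, not documented behaviour of the project.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseConcurrency turns "4" into 4 and "50%" into half of the given
// core count, with a floor of 1 so the pool always makes progress.
func parseConcurrency(val string, cores int) (int, error) {
	if pct, ok := strings.CutSuffix(val, "%"); ok {
		p, err := strconv.Atoi(pct)
		if err != nil {
			return 0, err
		}
		n := cores * p / 100
		if n < 1 {
			n = 1
		}
		return n, nil
	}
	n, err := strconv.Atoi(val)
	if err != nil {
		return 0, err
	}
	if n < 1 {
		n = 1
	}
	return n, nil
}

func main() {
	n, _ := parseConcurrency("50%", 8)
	fmt.Println(n) // 4
}
```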

Copy `.env.example` to `.env` and adjust as needed.

`SCRAPE_ALL_PAGES` is only relevant when running the CLI. Set `SCRAPE_ALL_PAGES=true` if you want one CLI run to fetch every listing page from the given `-page` through the last. In server mode you do not need it: the server enqueues one task per page.

Sample commands

Single detail page — one item, JSON to stdout:

```sh
go run ./cmd/cli "https://www.engelvoelkers.com/ch/en/exposes/5fd84e3f-7c44-5cb2-b31c-c093348b3432"
```

Single detail page to file:

```sh
go run ./cmd/cli -output listing.json "https://www.engelvoelkers.com/ch/en/exposes/5fd84e3f-7c44-5cb2-b31c-c093348b3432"
```

Index (list) page — one page of links and basic fields:

```sh
go run ./cmd/cli -index "https://www.engelvoelkers.com/ch/en/properties/res/sale/real-estate" -page 1
go run ./cmd/cli -index -url "https://www.engelvoelkers.com/ch/en/properties/res/sale/real-estate" -output index.json
```

Index with full details — fetch list then populate each item (uses SCRAPE_CONCURRENCY):

```sh
go run ./cmd/cli -index -populate "https://www.engelvoelkers.com/ch/en/properties/res/sale/real-estate" -page 1 -output full.json
```

Force rebuild proxy cache (slow):

```sh
go run ./cmd/cli -clear-proxy-cache
```

Server

Build and run the Asynq worker + HTTP API:

```sh
make build
./bin/scraper
```

The server reads config (e.g. config.yml), runs the worker, and exposes HTTP endpoints to enqueue scrape runs. Output can be a file path or a webhook URL.

Adding a new scraper

1. Define your item type (if not reusing an existing one): implement `scraped.ScrapedItem` (`GetSourceURL() string`, `ResultKind() string`) and call `scraped.RegisterKind(kind, unmarshaler)` in `init()` so storage can round-trip your type.

2. Implement the scraper under `internal/infrastructure/scrapers/` (e.g. an `example.com/` package): implement `scraper.IScraper`, i.e. `GetDetailsPage(ctx, url) (scraped.ScrapedItem, error)`, `GetIndexPage(ctx, page) ([]scraped.ScrapedItem, error)`, and `GetTotalPages(ctx) (int, error)`.

3. Register the scraper in `internal/infrastructure/scrapers/all` so it is included in the default registry (e.g. `registry.Register("example.com", New())`).

Use error-returning rod APIs (`Navigate`, `WaitLoad`, `Elements`, etc.) instead of the `Must*` variants so failures return errors rather than panicking under the worker.

Tests

Use a Registry in tests: create one with `scrapers.NewRegistry()`, register only the scrapers you need (or mocks), then call `Get(url)`.
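A sketch of that test pattern. The `Registry` below is a minimal stand-in that resolves handlers by URL hostname, which is how the README describes resolution; the registered value is simplified to a string where the real registry would hold a scraper.

```go
package main

import (
	"fmt"
	"net/url"
)

// Registry maps domains to handlers (simplified to strings for the sketch).
type Registry struct{ byDomain map[string]string }

func NewRegistry() *Registry {
	return &Registry{byDomain: map[string]string{}}
}

func (r *Registry) Register(domain, scraper string) {
	r.byDomain[domain] = scraper
}

// Get resolves the handler for a URL by its hostname.
func (r *Registry) Get(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	s, ok := r.byDomain[u.Hostname()]
	if !ok {
		return "", fmt.Errorf("no scraper registered for %q", u.Hostname())
	}
	return s, nil
}

func main() {
	r := NewRegistry()
	r.Register("example.com", "mock-scraper")
	s, _ := r.Get("https://example.com/listing/1")
	fmt.Println(s) // mock-scraper
}
```

Registering only a mock keeps tests hermetic: no browser, no network, and `Get` fails loudly for any domain the test did not set up.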
