@@ -13,6 +13,7 @@ What this project currently does:
 - Normalizes output records into a common ingestion envelope.
 - Publishes events to Kafka (production path) or JSONL (local development path).
 - Applies local deduplication using stable IDs with fallback fingerprint dedup when IDs are missing.
+- Uses conservative AutoThrottle and retry defaults so crawls stay polite on shared/public sources.
 - Persists source crawl state in SQLite so repeated runs are efficient and idempotent.

 What this repository does not include yet:
@@ -57,6 +58,7 @@ Completed:
 - [x] Structured item-ingestion logs (JSON log lines)
 - [x] Backfill execution mode in startup script (`--backfill`)
 - [x] Pydantic schema validation in ingestion pipeline (configurable mode)
+- [x] AutoThrottle and retry defaults for polite crawling
 - [x] Automated startup script with `--all`, `--spider`, `--reset-state`, `--skip-install`
 - [x] Docker and GitHub Actions scaffolding

@@ -175,6 +177,15 @@ The script uses the local JSONL sink by default. To route output to Kafka, set `
 - LOCAL_DEDUP_STATE_PATH: sidecar file that stores seen dedup keys
 - INCREMENTAL_ENABLED: turn source-level incremental crawling on/off
 - STATE_STORE_PATH: SQLite state file used for source watermarks and HTTP cache headers
+- AUTOTHROTTLE_ENABLED: enable Scrapy AutoThrottle (default: true)
+- AUTOTHROTTLE_START_DELAY: initial delay in seconds for AutoThrottle (default: 1)
+- AUTOTHROTTLE_MAX_DELAY: max delay in seconds for AutoThrottle (default: 10)
+- AUTOTHROTTLE_TARGET_CONCURRENCY: target concurrent requests per server (default: 1)
+- DOWNLOAD_DELAY: base download delay in seconds; AutoThrottle never drops below it (default: 1)
+- CONCURRENT_REQUESTS: total concurrent requests across the crawler (default: 8)
+- CONCURRENT_REQUESTS_PER_DOMAIN: concurrent requests per domain (default: 2)
+- RETRY_ENABLED: enable retries for transient HTTP failures (default: true)
+- RETRY_TIMES: number of retry attempts for transient HTTP failures (default: 3)
 - METRICS_ENABLED: emit per-spider run metrics to JSONL file (default: true)
 - METRICS_OUTPUT_PATH: JSONL path for ingestion run metrics
 - VALIDATION_ENABLED: enable or disable Pydantic validation in the pipeline (default: true)
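
The new politeness variables could be resolved into Scrapy settings roughly as follows. This is a minimal sketch, not the repository's actual settings module, which this diff does not show: the names `DEFAULTS`, `_coerce`, and `politeness_settings` are hypothetical, and reading the values from environment variables is an assumption. Only the setting names and default values are taken from the README entries above.

```python
import os
from typing import Any, Dict, Mapping

# Defaults copied from the README entries added in this change.
DEFAULTS: Dict[str, Any] = {
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 1.0,
    "AUTOTHROTTLE_MAX_DELAY": 10.0,
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
    "DOWNLOAD_DELAY": 1.0,
    "CONCURRENT_REQUESTS": 8,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    "RETRY_ENABLED": True,
    "RETRY_TIMES": 3,
}

_TRUTHY = {"1", "true", "yes", "on"}


def _coerce(raw: str, default: Any) -> Any:
    """Parse a raw string into the same type as the default value."""
    # bool must be checked first because bool is a subclass of int.
    if isinstance(default, bool):
        return raw.strip().lower() in _TRUTHY
    return type(default)(raw)


def politeness_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Return throttle/retry settings, letting env values override the defaults."""
    return {
        name: _coerce(env[name], default) if name in env else default
        for name, default in DEFAULTS.items()
    }


if __name__ == "__main__":
    # Example: slow a crawl down further for one fragile source.
    overrides = {"AUTOTHROTTLE_MAX_DELAY": "30", "CONCURRENT_REQUESTS_PER_DOMAIN": "1"}
    for key, value in politeness_settings(overrides).items():
        print(f"{key} = {value!r}")
```

Because the keys match Scrapy's own setting names, the resulting dict could be merged into a project's settings or passed as the settings of a crawler process. Whether this project does so is not confirmed by the diff.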