Commit bf88b5e

Add AutoThrottle and retry settings to enhance crawling efficiency and politeness
1 parent e7dc92b commit bf88b5e

3 files changed

Lines changed: 33 additions & 0 deletions

.env.example

Lines changed: 9 additions & 0 deletions
@@ -7,6 +7,15 @@ LOCAL_DEDUP_FALLBACK_ENABLED=true
 LOCAL_DEDUP_STATE_PATH=./out/ingestion.seen.json
 INCREMENTAL_ENABLED=true
 STATE_STORE_PATH=./out/source_state.sqlite
+AUTOTHROTTLE_ENABLED=true
+AUTOTHROTTLE_START_DELAY=1
+AUTOTHROTTLE_MAX_DELAY=10
+AUTOTHROTTLE_TARGET_CONCURRENCY=1
+DOWNLOAD_DELAY=1
+CONCURRENT_REQUESTS=8
+CONCURRENT_REQUESTS_PER_DOMAIN=2
+RETRY_ENABLED=true
+RETRY_TIMES=3
 METRICS_ENABLED=true
 METRICS_OUTPUT_PATH=./out/metrics/ingestion_metrics.jsonl
 VALIDATION_ENABLED=true

README.md

Lines changed: 11 additions & 0 deletions
@@ -13,6 +13,7 @@ What this project currently does:
 - Normalizes output records into a common ingestion envelope.
 - Publishes events to Kafka (production path) or JSONL (local development path).
 - Applies local deduplication using stable IDs with fallback fingerprint dedup when IDs are missing.
+- Uses conservative AutoThrottle and retry defaults so crawls stay polite on shared/public sources.
 - Persists source crawl state in SQLite so repeated runs are efficient and idempotent.

 What this repository does not include yet:
@@ -57,6 +58,7 @@ Completed:
 - [x] Structured item-ingestion logs (JSON log lines)
 - [x] Backfill execution mode in startup script (`--backfill`)
 - [x] Pydantic schema validation in ingestion pipeline (configurable mode)
+- [x] AutoThrottle and retry defaults for polite crawling
 - [x] Automated startup script with `--all`, `--spider`, `--reset-state`, `--skip-install`
 - [x] Docker and GitHub Actions scaffolding

@@ -175,6 +177,15 @@ The script uses the local JSONL sink by default. To route output to Kafka, set `
 - LOCAL_DEDUP_STATE_PATH: sidecar file that stores seen dedup keys
 - INCREMENTAL_ENABLED: turn source-level incremental crawling on/off
 - STATE_STORE_PATH: SQLite state file used for source watermarks and HTTP cache headers
+- AUTOTHROTTLE_ENABLED: enable Scrapy AutoThrottle (default: true)
+- AUTOTHROTTLE_START_DELAY: initial delay in seconds for AutoThrottle (default: 1)
+- AUTOTHROTTLE_MAX_DELAY: max delay in seconds for AutoThrottle (default: 10)
+- AUTOTHROTTLE_TARGET_CONCURRENCY: target concurrent requests per server (default: 1)
+- DOWNLOAD_DELAY: fixed download delay used alongside AutoThrottle (default: 1)
+- CONCURRENT_REQUESTS: total concurrent requests across the crawler (default: 8)
+- CONCURRENT_REQUESTS_PER_DOMAIN: concurrent requests per domain (default: 2)
+- RETRY_ENABLED: enable retries for transient HTTP failures (default: true)
+- RETRY_TIMES: number of retry attempts for transient HTTP failures (default: 3)
 - METRICS_ENABLED: emit per-spider run metrics to JSONL file (default: true)
 - METRICS_OUTPUT_PATH: JSONL path for ingestion run metrics
 - VALIDATION_ENABLED: enable or disable Pydantic validation in the pipeline (default: true)
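
To make the interaction between the throttle settings above more concrete, here is a minimal sketch of how the per-request delay could evolve. It is a simplified model based on the rules described in Scrapy's AutoThrottle documentation, not Scrapy's actual extension code, and the function name `next_delay` is hypothetical. The keyword defaults mirror the defaults added in this commit, with `min_delay` standing in for `DOWNLOAD_DELAY`, which acts as the lower bound on the delay when AutoThrottle is enabled.

```python
def next_delay(
    prev_delay: float,
    latency: float,
    status: int,
    *,
    target_concurrency: float = 1.0,  # AUTOTHROTTLE_TARGET_CONCURRENCY
    min_delay: float = 1.0,           # DOWNLOAD_DELAY (lower bound)
    max_delay: float = 10.0,          # AUTOTHROTTLE_MAX_DELAY (upper bound)
) -> float:
    """Simplified model of one AutoThrottle delay update (hypothetical helper)."""
    # Target delay: observed response latency divided by the target concurrency.
    target = latency / target_concurrency
    # Move halfway from the previous delay toward the target.
    candidate = (prev_delay + target) / 2.0
    # Keep the delay within [min_delay, max_delay].
    candidate = min(max(candidate, min_delay), max_delay)
    # A non-200 response is not allowed to decrease the delay.
    if status != 200 and candidate < prev_delay:
        return prev_delay
    return candidate


if __name__ == "__main__":
    delay = 1.0  # AUTOTHROTTLE_START_DELAY
    for latency, status in [(0.5, 200), (4.0, 200), (30.0, 200), (1.0, 503)]:
        delay = next_delay(delay, latency, status)
        print(f"latency={latency:>5}s status={status} -> delay={delay}s")
```

Under this model, a fast server never pushes the delay below `DOWNLOAD_DELAY`, a slow server never pushes it above `AUTOTHROTTLE_MAX_DELAY`, and an error response can hold the delay steady or raise it but not lower it.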

src/bioscope_ingestion/settings.py

Lines changed: 13 additions & 0 deletions
@@ -1,10 +1,23 @@
+from common.config import env_bool, env_int
+
 BOT_NAME = "bioscope_ingestion"

 SPIDER_MODULES = ["bioscope_ingestion.spiders"]
 NEWSPIDER_MODULE = "bioscope_ingestion.spiders"

 ROBOTSTXT_OBEY = True

+AUTOTHROTTLE_ENABLED = env_bool("AUTOTHROTTLE_ENABLED", True)
+AUTOTHROTTLE_START_DELAY = env_int("AUTOTHROTTLE_START_DELAY", 1)
+AUTOTHROTTLE_MAX_DELAY = env_int("AUTOTHROTTLE_MAX_DELAY", 10)
+AUTOTHROTTLE_TARGET_CONCURRENCY = float(env_int("AUTOTHROTTLE_TARGET_CONCURRENCY", 1))
+DOWNLOAD_DELAY = float(env_int("DOWNLOAD_DELAY", 1))
+CONCURRENT_REQUESTS = env_int("CONCURRENT_REQUESTS", 8)
+CONCURRENT_REQUESTS_PER_DOMAIN = env_int("CONCURRENT_REQUESTS_PER_DOMAIN", 2)
+RETRY_ENABLED = env_bool("RETRY_ENABLED", True)
+RETRY_TIMES = env_int("RETRY_TIMES", 3)
+RETRY_HTTP_CODES = [429, 500, 502, 503, 504, 522, 524, 408]
+
 ITEM_PIPELINES = {
     "bioscope_ingestion.pipelines.KafkaPipeline": 300,
 }
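
The `env_bool` and `env_int` helpers imported from `common.config` are not part of this commit, so their behavior is not visible here. As a rough illustration only, helpers of this shape might look like the sketch below; the accepted truthy and falsy strings, the handling of empty values, and the error behavior are all assumptions, not the project's actual implementation.

```python
import os

# Assumed spellings for boolean values; the real helper may accept others.
_TRUE_VALUES = {"1", "true", "yes", "on"}
_FALSE_VALUES = {"0", "false", "no", "off"}


def env_bool(name: str, default: bool) -> bool:
    """Read a boolean from the environment, falling back to `default`."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = raw.strip().lower()
    if value in _TRUE_VALUES:
        return True
    if value in _FALSE_VALUES:
        return False
    raise ValueError(f"{name} must be a boolean-like value, got {raw!r}")


def env_int(name: str, default: int) -> int:
    """Read an integer from the environment, falling back to `default`."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    # int() raises ValueError for non-integer strings such as "0.5".
    return int(raw.strip())
```

If the real `env_int` parses with `int()` as assumed here, then fractional values such as `DOWNLOAD_DELAY=0.5` or `AUTOTHROTTLE_TARGET_CONCURRENCY=1.5` would be rejected before the `float(...)` wrapper in `settings.py` is applied, so only whole-number values would be usable for those two settings.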
