# ScrapeGoat

Web data extraction terminal. Three engines. Full session control. Zero framework overhead.
ScrapeGoat is a full-stack web scraper with a tech-noir terminal UI. The frontend is a single self-contained HTML file. The backend is a pure Python server implementing six scraping classes – from fast TLS-fingerprinted HTTP requests to full Playwright Chromium browser automation – with built-in session management, proxy rotation, and domain blocking.
No framework. No build step. One command to run.
## Engines

| Engine | Class | Best For |
|---|---|---|
| Fetcher | `Fetcher` / `FetcherSession` | Static pages – fast HTTP with browser TLS fingerprint |
| Stealth | `StealthyFetcher` / `StealthySession` | Bot-protected pages – header rotation, timing jitter, IP spoofing |
| Dynamic | `DynamicFetcher` / `DynamicSession` | JS-rendered SPAs – full headless Chromium via Playwright |
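The UI's engine choice maps one-to-one onto these classes. A minimal sketch of that dispatch, using stub classes as stand-ins (the real classes live in `server.py` and take their own constructor arguments):

```python
# Stub classes standing in for the real engine classes in server.py,
# illustrating how a UI engine name could be resolved to a class.
class Fetcher: ...
class StealthyFetcher: ...
class DynamicFetcher: ...

ENGINES = {
    "fetcher": Fetcher,
    "stealth": StealthyFetcher,
    "dynamic": DynamicFetcher,
}

def make_engine(name: str):
    """Return the engine class for a UI engine name (case-insensitive)."""
    try:
        return ENGINES[name.lower()]
    except KeyError:
        raise ValueError(f"unknown engine: {name!r}")
```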
## Requirements

- Python 3.10+
- pip
## Quick Start

```sh
# Clone or unzip the project
cd scrapegoat/

# macOS / Linux
bash start.sh

# Windows
start.bat
```

Open http://localhost:7331 in your browser.
That's it. `start.sh` installs all dependencies and the Playwright Chromium binary, then starts the server in one step.
Manual setup:

```sh
pip install -r requirements.txt
playwright install chromium
python3 server.py
```

## Features

### Fingerprinting & stealth

- TLS fingerprint impersonation – 5 browser profiles (Chrome/Win, Chrome/Mac, Firefox, Safari, Edge) with exact `Sec-CH-UA`, `Sec-Fetch-*`, and `Accept` headers
- Dynamic loading – Playwright Chromium with stealth JS injection (removes `navigator.webdriver`; spoofs plugins, platform, permissions)
- Anti-bot stealth – per-request UA rotation, randomised header insertion order, timing jitter (0–400 ms), spoofed `X-Forwarded-For` / `X-Real-IP` headers
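The header-rotation and jitter tactics above can be sketched in a few lines. This is illustrative only – the names and exact header set are assumptions, not the project's actual `StealthyFetcher` internals:

```python
import random

# Illustrative UA pool; the real server ships 5 full browser profiles.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome-like",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari-like",
]

def stealth_headers() -> dict:
    """Rotate the UA and randomise header insertion order per request."""
    items = [
        ("User-Agent", random.choice(USER_AGENTS)),
        ("Accept", "text/html,application/xhtml+xml"),
        # Spoofed client IP, as in the X-Forwarded-For bullet above
        ("X-Forwarded-For", ".".join(str(random.randint(1, 254)) for _ in range(4))),
    ]
    random.shuffle(items)  # randomised insertion order
    return dict(items)

def jitter_delay() -> float:
    """Timing jitter: a 0-400 ms pause before each request."""
    return random.uniform(0.0, 0.4)
```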
### Sessions

- Persistent cookies – HTTP sessions use `http.cookiejar.CookieJar`; Playwright sessions save/restore `context.cookies()` across requests
- Named sessions – create a session ID in the UI; all requests with that ID share the same cookie state
- Session inspection – sidebar shows active sessions; click any to reuse, or clear individually or all at once
### Proxy rotation

- Proxy rotation – `ProxyRotator` with cyclic and random strategies
- Per-request override – bypass the rotator for a single request
- Live management – add/remove proxies from the UI without restarting the server
- All fetcher types – proxy support on Fetcher, StealthyFetcher, and DynamicFetcher
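A sketch of the two rotation strategies named above – cyclic and random. The class name matches `ProxyRotator` from the feature list, but this implementation is an assumption, not the server's code:

```python
import random
from itertools import cycle

class ProxyRotator:
    """Hands out proxies cyclically or at random from a shared pool."""
    def __init__(self, proxies, strategy="cyclic"):
        self.proxies = list(proxies)
        self.strategy = strategy
        self._cycle = cycle(self.proxies)

    def next(self):
        if not self.proxies:
            return None  # no proxy configured: connect directly
        if self.strategy == "random":
            return random.choice(self.proxies)
        return next(self._cycle)  # cyclic default
```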
### Domain blocking

- HTTP fetchers – domain check runs before any request leaves the process
- Playwright – blocked at the network level via `page.route()` abort
- Subdomain matching – blocking `example.com` also blocks all subdomains
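Subdomain matching is a suffix check on the hostname. A stdlib sketch of the rule described above (illustrative; the server's check may be implemented differently):

```python
from urllib.parse import urlparse

def is_blocked(url: str, blocked: set[str]) -> bool:
    """True if the URL's host is a blocked domain or any subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked)
```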
### Extraction

- Tag targeting – `<p>`, `<h1>`, `<h2>`, `<div>`, `<a>`, `.class`, `#id`, `<td>`, `<tr>`
- Custom selectors – `By.CLASS_NAME`, `By.CSS_SELECTOR`, `By.ID`, `By.LINK_TEXT`, `By.NAME`, `By.PARTIAL_LINK_TEXT`, `By.TAG_NAME`
- Smart deduplication – whitespace normalisation + set deduplication before returning
- Script/style stripping – `<script>` and `<style>` removed before extraction
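The "smart deduplication" step – normalise whitespace, then drop duplicates – can be sketched as follows (an assumption about ordering; the server may not preserve first-seen order):

```python
def clean_results(values):
    """Collapse whitespace runs, drop empties and duplicates, keep order."""
    seen, out = set(), []
    for v in values:
        norm = " ".join(v.split())  # whitespace normalisation
        if norm and norm not in seen:
            seen.add(norm)
            out.append(norm)
    return out
```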
## Export Formats

| Format | Description |
|---|---|
| CSV | Header + one quoted value per row |
| JSON | Pretty-printed array |
| XLSX | TSV – opens natively in Excel |
| SQL | CREATE TABLE + INSERT INTO statements |
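The SQL export emits a `CREATE TABLE` followed by one `INSERT` per row. A hedged sketch of that shape (table and column names here are illustrative, not the server's actual output schema):

```python
def to_sql(table: str, column: str, rows) -> str:
    """Render scraped values as CREATE TABLE + INSERT INTO statements."""
    esc = lambda v: v.replace("'", "''")  # escape single quotes for SQL strings
    stmts = [f"CREATE TABLE {table} ({column} TEXT);"]
    stmts += [f"INSERT INTO {table} ({column}) VALUES ('{esc(r)}');" for r in rows]
    return "\n".join(stmts)
```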
## Usage

1. Enter a target URL in the sidebar
2. Select an engine – Fetcher, Stealth, or Dynamic
3. Choose a browser profile (Fetcher and Stealth only)
4. Check tags to extract
5. Optionally set a custom selector with a `By.*` method
6. Optionally set a Session ID for persistent cookies
7. Add proxies and choose a rotation strategy if needed
8. Add blocked domains to suppress tracking or ad calls
9. Hit EXECUTE – watch the terminal stream the scrape log
10. Switch to the RAW DATA tab to browse results
11. Select an export format and click DOWNLOAD
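Behind the UI, those choices become a single POST to `/api/scrape`. The field names below are a sketch only – the authoritative request schema is in deployment-guide.md:

```python
import json

# Hypothetical request body for POST /api/scrape; field names are
# assumptions, not the documented schema (see deployment-guide.md).
payload = {
    "url": "https://example.com",
    "engine": "stealth",
    "tags": ["p", "a"],
    "session_id": "demo",
    "proxies": [],
    "blocked_domains": ["ads.example.net"],
}
body = json.dumps(payload).encode("utf-8")
```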
## Project Layout

```
scrapegoat/
├── scrapegoat.html      – Full frontend (HTML + CSS + JS, no build step)
├── server.py            – Backend engine (stdlib + bs4 + playwright)
├── requirements.txt     – Python dependencies
├── start.sh             – macOS / Linux launcher
├── start.bat            – Windows launcher
├── README.md            – This file
├── deployment-guide.md  – File placement, deployment procedures, API reference
└── tech-stack.md        – Architecture, class map, technology decisions
```

`scrapegoat.html` and `server.py` must live in the same directory.
## API

| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Serves `scrapegoat.html` |
| GET | `/api/health` | Health check |
| GET | `/api/profiles` | Lists browser fingerprint profiles |
| GET | `/api/sessions` | Lists active session IDs |
| POST | `/api/scrape` | Executes a scrape |
| POST | `/api/proxies` | Updates the proxy rotator pool |
| POST | `/api/sessions/clear` | Clears one or all sessions |
Full schemas are in `deployment-guide.md`.
## Configuration

All settings are constants in `server.py`. No `.env` file required.
| Setting | Default | Edit |
|---|---|---|
| Port | `7331` | `PORT = 7331` |
| Frontend path | `./scrapegoat.html` | `FRONTEND_PATH` constant |
| HTTP request timeout | 20 s | `Fetcher.__init__` default arg |
| Playwright timeout | 30 s | `DynamicFetcher.__init__` default arg |
| Stealth jitter | 400 ms max | `StealthyFetcher.__init__` default arg |
| Headless mode | `True` | `DynamicFetcher(headless=False)` |
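Since configuration is plain module-level constants, editing it is just editing `server.py`. An illustrative fragment (constant names beyond `PORT` and `FRONTEND_PATH` are from the table above; the surrounding code is assumed):

```python
# Top of server.py (illustrative) - change values here and restart.
PORT = 7331                          # listen port for the HTTP server
FRONTEND_PATH = "./scrapegoat.html"  # must sit next to server.py
```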
## Limitations

- No HTTP/3 – Python `urllib` uses HTTP/1.1; integrate `curl-cffi` for full HTTP/3 + TLS fingerprinting
- XPath selectors not supported – use `By.CSS_SELECTOR` instead
- In-memory sessions only – state is lost on server restart
- Single process – each Playwright scrape blocks one thread
## Roadmap

- `curl-cffi` for HTTP/2 + HTTP/3 TLS fingerprinting
- Persistent session storage (SQLite)
- Pagination and recursive crawl
- Scheduled jobs
- Captcha solver integration
## License

MIT – use it, fork it, scrape responsibly.
## Authors
David Spies