
🐐 ScrapeGoat

Web data extraction terminal. Three engines. Full session control. Zero framework overhead.

ScrapeGoat is a full-stack web scraper with a tech-noir terminal UI. The frontend is a single self-contained HTML file. The backend is a pure Python server implementing six scraping classes — from fast TLS-fingerprinted HTTP requests to full Playwright Chromium browser automation — with built-in session management, proxy rotation, and domain blocking.

No framework. No build step. One command to run.



Engines

| Engine  | Class                                 | Best For                                                           |
|---------|---------------------------------------|--------------------------------------------------------------------|
| Fetcher | Fetcher / FetcherSession              | Static pages — fast HTTP with browser TLS fingerprint              |
| Stealth | StealthyFetcher / StealthySession     | Bot-protected pages — header rotation, timing jitter, IP spoofing  |
| Dynamic | DynamicFetcher / DynamicSession       | JS-rendered SPAs — full headless Chromium via Playwright           |

Quick Start

Prerequisites

  • Python 3.10+
  • pip

Run

# Clone or unzip the project
cd scrapegoat/

# macOS / Linux
bash start.sh

# Windows
start.bat

Open http://localhost:7331 in your browser.

That's it. start.sh installs all dependencies, installs the Playwright Chromium binary, and starts the server in one step.

Manual Start

pip install -r requirements.txt
playwright install chromium
python3 server.py

Features

Fetching

  • TLS fingerprint impersonation — 5 browser profiles (Chrome/Win, Chrome/Mac, Firefox, Safari, Edge) with exact Sec-CH-UA, Sec-Fetch-*, Accept headers
  • Dynamic loading — Playwright Chromium with stealth JS injection (removes navigator.webdriver; spoofs plugins, platform, permissions)
  • Anti-bot stealth — per-request UA rotation, randomised header insertion order, timing jitter (0–400 ms), spoofed X-Forwarded-For / X-Real-IP headers
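The anti-bot techniques above can be sketched roughly as follows. This is a hypothetical illustration, not the actual StealthyFetcher implementation; the function names and the sample UA strings are invented for the example:

```python
import random
import time

# Invented sample pool; the real server ships 5 full browser profiles.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]

def build_stealth_headers():
    """Rotate the UA, spoof forwarding headers, and shuffle insertion order."""
    fake_ip = ".".join(str(random.randint(1, 254)) for _ in range(4))
    headers = {
        "User-Agent": random.choice(USER_AGENTS),   # per-request UA rotation
        "Accept": "text/html,application/xhtml+xml",
        "X-Forwarded-For": fake_ip,                 # spoofed client IP
        "X-Real-IP": fake_ip,
    }
    # Randomise header insertion order (Python dicts preserve insertion order)
    items = list(headers.items())
    random.shuffle(items)
    return dict(items)

def jitter(max_ms=400):
    """Sleep a random 0..max_ms milliseconds before sending the request."""
    time.sleep(random.uniform(0, max_ms) / 1000)
```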

Sessions

  • Persistent cookies — HTTP sessions use http.cookiejar.CookieJar; Playwright sessions save/restore context.cookies() across requests
  • Named sessions — create a session ID in the UI; all requests with that ID share the same cookie state
  • Session inspection — sidebar shows active sessions; click any to reuse, or clear individually or all at once

Proxies

  • Proxy rotation — ProxyRotator with cyclic and random strategies
  • Per-request override — bypass the rotator for a single request
  • Live management — add/remove proxies from the UI without restarting the server
  • All fetcher types — proxy support on Fetcher, StealthyFetcher, and DynamicFetcher
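ProxyRotator is named above; its internals are not shown in this README, so the sketch below is a plausible shape for the two strategies, not the real class:

```python
import itertools
import random

class ProxyRotator:
    """Minimal sketch of a rotator with cyclic and random strategies.
    The real ProxyRotator in server.py may differ."""

    def __init__(self, proxies, strategy="cyclic"):
        self.proxies = list(proxies)
        self.strategy = strategy
        self._cycle = itertools.cycle(self.proxies) if self.proxies else None

    def next(self):
        if not self.proxies:
            return None                      # no proxy configured
        if self.strategy == "random":
            return random.choice(self.proxies)
        return next(self._cycle)             # cyclic (default)

    def update(self, proxies):
        """Replace the pool live, without a restart."""
        self.proxies = list(proxies)
        self._cycle = itertools.cycle(self.proxies) if self.proxies else None
```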

Domain Blocking

  • HTTP fetchers — domain check runs before any request leaves the process
  • Playwright — blocked at the network level via page.route() abort
  • Subdomain matching — blocking example.com also blocks all subdomains
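The subdomain-matching rule amounts to a suffix check on the hostname. A sketch (function name is illustrative, not the server's):

```python
from urllib.parse import urlparse

def is_blocked(url: str, blocked_domains: set[str]) -> bool:
    """True if the URL's host is a blocked domain or any subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == d or host.endswith("." + d)
        for d in (b.lower() for b in blocked_domains)
    )
```

Note the `"." + d` suffix: it blocks ads.example.com without also blocking notexample.com.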

Extraction

  • Tag targeting — <p>, <h1>, <h2>, <div>, <a>, .class, #id, <td>, <tr>
  • Custom selectors — By.CLASS_NAME, By.CSS_SELECTOR, By.ID, By.LINK_TEXT, By.NAME, By.PARTIAL_LINK_TEXT, By.TAG_NAME
  • Smart deduplication — whitespace normalisation + set deduplication before returning
  • Script/style stripping — <script> and <style> removed before extraction
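The cleanup pipeline above (strip script/style, normalise whitespace, deduplicate) can be illustrated with a stdlib-only sketch; the server uses bs4, so this regex variant is an approximation, not the actual implementation:

```python
import re

# Matches <script>…</script> and <style>…</style> blocks, including contents.
_SCRIPT_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1>",
                           re.DOTALL | re.IGNORECASE)

def strip_script_style(html: str) -> str:
    return _SCRIPT_STYLE.sub("", html)

def dedupe(values):
    """Whitespace-normalise, drop empties, and deduplicate (order-preserving)."""
    seen, out = set(), []
    for v in values:
        norm = " ".join(v.split())
        if norm and norm not in seen:
            seen.add(norm)
            out.append(norm)
    return out
```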

Export

| Format | Description                              |
|--------|------------------------------------------|
| CSV    | Header + one quoted value per row        |
| JSON   | Pretty-printed array                     |
| XLSX   | TSV — opens natively in Excel            |
| SQL    | CREATE TABLE + INSERT INTO statements    |
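As an example of the SQL format in the table above, a generator could emit one CREATE TABLE plus one INSERT per row. The table and column names here are assumptions for illustration:

```python
def to_sql(rows, table="scraped_data", column="value"):
    """Render scraped strings as CREATE TABLE + INSERT INTO statements."""
    stmts = [f"CREATE TABLE {table} ({column} TEXT);"]
    for row in rows:
        escaped = row.replace("'", "''")   # escape single quotes for SQL
        stmts.append(f"INSERT INTO {table} ({column}) VALUES ('{escaped}');")
    return "\n".join(stmts)
```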

Usage

  1. Enter a target URL in the sidebar
  2. Select engine β€” Fetcher, Stealth, or Dynamic
  3. Choose a browser profile (Fetcher and Stealth only)
  4. Check tags to extract
  5. Optionally set a custom selector with a By.* method
  6. Optionally set a Session ID for persistent cookies
  7. Add proxies and choose rotation strategy if needed
  8. Add blocked domains to suppress tracking or ad calls
  9. Hit EXECUTE β€” watch the terminal stream the scrape log
  10. Switch to RAW DATA tab to browse results
  11. Select export format and click DOWNLOAD

Project Structure

scrapegoat/
├── scrapegoat.html      ← Full frontend (HTML + CSS + JS, no build step)
├── server.py            ← Backend engine (stdlib + bs4 + playwright)
├── requirements.txt     ← Python dependencies
├── start.sh             ← macOS / Linux launcher
├── start.bat            ← Windows launcher
├── README.md            ← This file
├── deployment-guide.md  ← File placement, deployment procedures, API reference
└── tech-stack.md        ← Architecture, class map, technology decisions

scrapegoat.html and server.py must live in the same directory.


API

| Method | Endpoint            | Description                         |
|--------|---------------------|-------------------------------------|
| GET    | /                   | Serves scrapegoat.html              |
| GET    | /api/health         | Health check                        |
| GET    | /api/profiles       | Lists browser fingerprint profiles  |
| GET    | /api/sessions       | Lists active session IDs            |
| POST   | /api/scrape         | Executes a scrape                   |
| POST   | /api/proxies        | Updates the proxy rotator pool      |
| POST   | /api/sessions/clear | Clears one or all sessions          |
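A POST /api/scrape call might be built like this with stdlib urllib. The payload field names below are assumptions — consult deployment-guide.md for the real schemas:

```python
import json
import urllib.request

# Hypothetical request body; field names are guesses, not the real schema.
payload = {
    "url": "https://example.com",
    "engine": "fetcher",          # assumed: "fetcher" | "stealth" | "dynamic"
    "tags": ["p", "h1"],
    "session_id": "demo",
}
req = urllib.request.Request(
    "http://localhost:7331/api/scrape",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would execute the scrape against a running server.
```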

Full schemas in deployment-guide.md.


Configuration

All settings are constants in server.py. No .env file required.

| Setting              | Default           | Edit                               |
|----------------------|-------------------|------------------------------------|
| Port                 | 7331              | PORT = 7331                        |
| Frontend path        | ./scrapegoat.html | FRONTEND_PATH constant             |
| HTTP request timeout | 20 s              | Fetcher.__init__ default arg       |
| Playwright timeout   | 30 s              | DynamicFetcher.__init__ default arg |
| Stealth jitter       | 400 ms max        | StealthyFetcher.__init__ default arg |
| Headless mode        | True              | DynamicFetcher(headless=False)     |

Limitations

  • No HTTP/3 — Python urllib speaks HTTP/1.1; integrate curl-cffi for full HTTP/3 + TLS fingerprinting
  • XPath selectors not supported — use By.CSS_SELECTOR instead
  • In-memory sessions only — state is lost on server restart
  • Single process — Playwright scrapes block one thread each

Roadmap

  • curl-cffi for HTTP/2 + HTTP/3 TLS fingerprinting
  • Persistent session storage (SQLite)
  • Pagination and recursive crawl
  • Scheduled jobs
  • Captcha solver integration

License

MIT — use it, fork it, scrape responsibly.

Authors

David Spies
