
🐐 ScrapeGoat

Web data extraction terminal. Three engines. Full session control. Zero framework overhead.

ScrapeGoat is a full-stack web scraper with a tech-noir terminal UI. The frontend is a single self-contained HTML file. The backend is a pure Python server implementing six scraping classes — from fast TLS-fingerprinted HTTP requests to full Playwright Chromium browser automation — with built-in session management, proxy rotation, and domain blocking.

No framework. No build step. One command to run.



Engines

| Engine  | Class                                 | Best For                                                           |
|---------|---------------------------------------|--------------------------------------------------------------------|
| Fetcher | Fetcher / FetcherSession              | Static pages — fast HTTP with browser TLS fingerprint              |
| Stealth | StealthyFetcher / StealthySession     | Bot-protected pages — header rotation, timing jitter, IP spoofing  |
| Dynamic | DynamicFetcher / DynamicSession       | JS-rendered SPAs — full headless Chromium via Playwright           |

Quick Start

Prerequisites

  • Python 3.10+
  • pip

Run

# Clone or unzip the project
cd scrapegoat/

# macOS / Linux
bash start.sh

# Windows
start.bat

Open http://localhost:7331 in your browser.

That's it. start.sh installs all dependencies, installs the Playwright Chromium binary, and starts the server in one step.

Manual Start

pip install -r requirements.txt
playwright install chromium
python3 server.py

Features

Fetching

  • TLS fingerprint impersonation — 5 browser profiles (Chrome/Win, Chrome/Mac, Firefox, Safari, Edge) with exact Sec-CH-UA, Sec-Fetch-*, Accept headers
  • Dynamic loading — Playwright Chromium with stealth JS injection (removes navigator.webdriver; spoofs plugins, platform, permissions)
  • Anti-bot stealth — per-request UA rotation, randomised header insertion order, timing jitter (0–400 ms), spoofed X-Forwarded-For / X-Real-IP headers
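The anti-bot techniques above can be sketched roughly as follows. This is a hypothetical illustration, not the actual StealthyFetcher implementation; the function names and the sample UA strings are invented for the example:

```python
import random
import time

# Invented sample pool; the real server ships 5 full browser profiles.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]

def build_stealth_headers():
    """Rotate the UA, spoof forwarding headers, and shuffle insertion order."""
    fake_ip = ".".join(str(random.randint(1, 254)) for _ in range(4))
    headers = {
        "User-Agent": random.choice(USER_AGENTS),   # per-request UA rotation
        "Accept": "text/html,application/xhtml+xml",
        "X-Forwarded-For": fake_ip,                 # spoofed client IP
        "X-Real-IP": fake_ip,
    }
    # Randomise header insertion order (Python dicts preserve insertion order)
    items = list(headers.items())
    random.shuffle(items)
    return dict(items)

def jitter(max_ms=400):
    """Sleep a random 0..max_ms milliseconds before sending the request."""
    time.sleep(random.uniform(0, max_ms) / 1000)
```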

Sessions

  • Persistent cookies — HTTP sessions use http.cookiejar.CookieJar; Playwright sessions save/restore context.cookies() across requests
  • Named sessions — create a session ID in the UI; all requests with that ID share the same cookie state
  • Session inspection — sidebar shows active sessions; click any to reuse, or clear individually or all at once

Proxies

  • Proxy rotation — ProxyRotator with cyclic and random strategies
  • Per-request override — bypass the rotator for a single request
  • Live management — add/remove proxies from the UI without restarting the server
  • All fetcher types — proxy support on Fetcher, StealthyFetcher, and DynamicFetcher
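ProxyRotator is named above; its internals are not shown in this README, so the sketch below is a plausible shape for the two strategies, not the real class:

```python
import itertools
import random

class ProxyRotator:
    """Minimal sketch of a rotator with cyclic and random strategies.
    The real ProxyRotator in server.py may differ."""

    def __init__(self, proxies, strategy="cyclic"):
        self.proxies = list(proxies)
        self.strategy = strategy
        self._cycle = itertools.cycle(self.proxies) if self.proxies else None

    def next(self):
        if not self.proxies:
            return None                      # no proxy configured
        if self.strategy == "random":
            return random.choice(self.proxies)
        return next(self._cycle)             # cyclic (default)

    def update(self, proxies):
        """Replace the pool live, without a restart."""
        self.proxies = list(proxies)
        self._cycle = itertools.cycle(self.proxies) if self.proxies else None
```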

Domain Blocking

  • HTTP fetchers — domain check runs before any request leaves the process
  • Playwright — blocked at the network level via page.route() abort
  • Subdomain matching — blocking example.com also blocks all subdomains
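The subdomain-matching rule amounts to a suffix check on the hostname. A sketch (function name is illustrative, not the server's):

```python
from urllib.parse import urlparse

def is_blocked(url: str, blocked_domains: set[str]) -> bool:
    """True if the URL's host is a blocked domain or any subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == d or host.endswith("." + d)
        for d in (b.lower() for b in blocked_domains)
    )
```

Note the `"." + d` suffix: it blocks ads.example.com without also blocking notexample.com.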

Extraction

  • Tag targeting — <p>, <h1>, <h2>, <div>, <a>, .class, #id, <td>, <tr>
  • Custom selectors — By.CLASS_NAME, By.CSS_SELECTOR, By.ID, By.LINK_TEXT, By.NAME, By.PARTIAL_LINK_TEXT, By.TAG_NAME
  • Smart deduplication — whitespace normalisation + set deduplication before returning
  • Script/style stripping — <script> and <style> removed before extraction
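The cleanup pipeline above (strip script/style, normalise whitespace, deduplicate) can be illustrated with a stdlib-only sketch; the server uses bs4, so this regex variant is an approximation, not the actual implementation:

```python
import re

# Matches <script>…</script> and <style>…</style> blocks, including contents.
_SCRIPT_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1>",
                           re.DOTALL | re.IGNORECASE)

def strip_script_style(html: str) -> str:
    return _SCRIPT_STYLE.sub("", html)

def dedupe(values):
    """Whitespace-normalise, drop empties, and deduplicate (order-preserving)."""
    seen, out = set(), []
    for v in values:
        norm = " ".join(v.split())
        if norm and norm not in seen:
            seen.add(norm)
            out.append(norm)
    return out
```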

Export

| Format | Description                              |
|--------|------------------------------------------|
| CSV    | Header + one quoted value per row        |
| JSON   | Pretty-printed array                     |
| XLSX   | TSV — opens natively in Excel            |
| SQL    | CREATE TABLE + INSERT INTO statements    |
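As an example of the SQL format in the table above, a generator could emit one CREATE TABLE plus one INSERT per row. The table and column names here are assumptions for illustration:

```python
def to_sql(rows, table="scraped_data", column="value"):
    """Render scraped strings as CREATE TABLE + INSERT INTO statements."""
    stmts = [f"CREATE TABLE {table} ({column} TEXT);"]
    for row in rows:
        escaped = row.replace("'", "''")   # escape single quotes for SQL
        stmts.append(f"INSERT INTO {table} ({column}) VALUES ('{escaped}');")
    return "\n".join(stmts)
```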

Usage

  1. Enter a target URL in the sidebar
  2. Select engine β€” Fetcher, Stealth, or Dynamic
  3. Choose a browser profile (Fetcher and Stealth only)
  4. Check tags to extract
  5. Optionally set a custom selector with a By.* method
  6. Optionally set a Session ID for persistent cookies
  7. Add proxies and choose rotation strategy if needed
  8. Add blocked domains to suppress tracking or ad calls
  9. Hit EXECUTE β€” watch the terminal stream the scrape log
  10. Switch to RAW DATA tab to browse results
  11. Select export format and click DOWNLOAD

Project Structure

scrapegoat/
├── scrapegoat.html      ← Full frontend (HTML + CSS + JS, no build step)
├── server.py            ← Backend engine (stdlib + bs4 + playwright)
├── requirements.txt     ← Python dependencies
├── start.sh             ← macOS / Linux launcher
├── start.bat            ← Windows launcher
├── README.md            ← This file
├── deployment-guide.md  ← File placement, deployment procedures, API reference
└── tech-stack.md        ← Architecture, class map, technology decisions

scrapegoat.html and server.py must live in the same directory.


API

| Method | Endpoint            | Description                         |
|--------|---------------------|-------------------------------------|
| GET    | /                   | Serves scrapegoat.html              |
| GET    | /api/health         | Health check                        |
| GET    | /api/profiles       | Lists browser fingerprint profiles  |
| GET    | /api/sessions       | Lists active session IDs            |
| POST   | /api/scrape         | Executes a scrape                   |
| POST   | /api/proxies        | Updates the proxy rotator pool      |
| POST   | /api/sessions/clear | Clears one or all sessions          |
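A POST /api/scrape call might be built like this with stdlib urllib. The payload field names below are assumptions — consult deployment-guide.md for the real schemas:

```python
import json
import urllib.request

# Hypothetical request body; field names are guesses, not the real schema.
payload = {
    "url": "https://example.com",
    "engine": "fetcher",          # assumed: "fetcher" | "stealth" | "dynamic"
    "tags": ["p", "h1"],
    "session_id": "demo",
}
req = urllib.request.Request(
    "http://localhost:7331/api/scrape",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would execute the scrape against a running server.
```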

Full schemas in deployment-guide.md.


Configuration

All settings are constants in server.py. No .env file required.

| Setting              | Default           | Edit                               |
|----------------------|-------------------|------------------------------------|
| Port                 | 7331              | PORT = 7331                        |
| Frontend path        | ./scrapegoat.html | FRONTEND_PATH constant             |
| HTTP request timeout | 20 s              | Fetcher.__init__ default arg       |
| Playwright timeout   | 30 s              | DynamicFetcher.__init__ default arg |
| Stealth jitter       | 400 ms max        | StealthyFetcher.__init__ default arg |
| Headless mode        | True              | DynamicFetcher(headless=False)     |

Limitations

  • No HTTP/3 — Python urllib speaks HTTP/1.1; integrate curl-cffi for full HTTP/3 + TLS fingerprinting
  • XPath selectors not supported — use By.CSS_SELECTOR instead
  • In-memory sessions only — state is lost on server restart
  • Single process — Playwright scrapes block one thread each

Roadmap

  • curl-cffi for HTTP/2 + HTTP/3 TLS fingerprinting
  • Persistent session storage (SQLite)
  • Pagination and recursive crawl
  • Scheduled jobs
  • Captcha solver integration

License

MIT — use it, fork it, scrape responsibly.

Authors

David Spies
