
GitHub .patch Email Scraper

A Python tool for extracting email addresses and usernames from GitHub .patch URLs.

Features

  • ✅ Extracts email and username from line 2 of .patch files
  • ✅ Automatic discovery of random GitHub commits
  • ✅ Skips noreply emails automatically
  • ✅ Deduplicates by email (skips if already in database)
  • ✅ Async HTTP fetching with retry logic
  • ✅ SQLite storage for collected emails
  • ✅ CLI interface with Typer
  • ✅ Unit tests

Quick Start

1. Install dependencies

python3 -m pip install -r requirements.txt

Or use the quick start script:

./quickstart.sh

2. Fetch a .patch URL

./scraper.sh fetch https://github.com/owner/repo/pull/123.patch

Or with the full Python command:

PYTHONPATH=src python3 -m scraper.cli fetch https://github.com/owner/repo/pull/123.patch

With authentication (recommended to avoid rate limits):

./scraper.sh fetch https://github.com/owner/repo/pull/123.patch --token YOUR_GITHUB_TOKEN

3. Auto-fetch random commits (NEW!)

Automatically discover and scrape random GitHub commits:

# Fetch 10 random commits
./scraper.sh auto-fetch --count 10 --token YOUR_GITHUB_TOKEN

# Fetch from popular repositories
./scraper.sh auto-fetch --count 20 --mode popular --token YOUR_TOKEN

# Filter by language and topic
./scraper.sh auto-fetch --count 15 --language python --topic web --token YOUR_TOKEN

Note: A token is highly recommended for auto-fetch to avoid rate limits!

4. List collected emails

./scraper.sh list-patches

How It Works

Manual Fetch

Fetch specific .patch URLs directly.
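
For reference, here is a minimal sketch of an async fetch with retries using httpx. The project's real implementation lives in src/scraper/fetcher.py and may differ; in particular, sending the token as a Bearer header is an assumption about how --token is applied.

import asyncio
from typing import Optional

import httpx

async def fetch_patch(url: str, token: Optional[str] = None) -> str:
    # Bearer-header auth is an assumption, not necessarily what the CLI does
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    # AsyncHTTPTransport(retries=...) retries failed connection attempts
    transport = httpx.AsyncHTTPTransport(retries=2)
    async with httpx.AsyncClient(transport=transport, follow_redirects=True) as client:
        response = await client.get(url, headers=headers)
        response.raise_for_status()
        return response.text

patch_text = asyncio.run(fetch_patch("https://github.com/owner/repo/pull/123.patch"))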

Auto-Fetch (Automatic Discovery)

The scraper can automatically discover random commits from GitHub:

  1. Random Mode: Searches GitHub for repos by language/topic, then fetches recent commits (see the sketch after this list)
  2. Popular Mode: Fetches from a curated list of popular repositories (Linux, Python, Node.js, etc.)
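
For illustration, a rough sketch of how random mode could drive GitHub's REST API. The /search/repositories and /repos/{owner}/{repo}/commits endpoints are GitHub's documented API; how discover.py actually combines them is an assumption.

from typing import List

import httpx

def discover_commit_patches(language: str, topic: str, token: str) -> List[str]:
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    with httpx.Client(base_url="https://api.github.com", headers=headers) as client:
        # Search repositories matching the language/topic filters
        repos = client.get(
            "/search/repositories",
            params={"q": f"language:{language} topic:{topic}", "per_page": 10},
        ).json()["items"]
        urls: List[str] = []
        for repo in repos:
            # List recent commits, then build github.com .patch URLs from each SHA
            commits = client.get(
                f"/repos/{repo['full_name']}/commits", params={"per_page": 3}
            ).json()
            urls.extend(f"{repo['html_url']}/commit/{c['sha']}.patch" for c in commits)
        return urls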

The scraper extracts information from line 2 of GitHub .patch files, which has this format:

From: username <email@domain.com>
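
A regex along these lines can pull both fields out of that line. This is a sketch only; the project's real extraction lives in src/scraper/parser.py.

import re
from typing import Optional, Tuple

# Matches the "From: username <email>" author line of a .patch file
FROM_LINE = re.compile(r"^From: (?P<username>.*?) <(?P<email>[^>]+)>$")

def parse_author(patch_text: str) -> Optional[Tuple[str, str]]:
    lines = patch_text.splitlines()
    if len(lines) < 2:
        return None
    match = FROM_LINE.match(lines[1])
    return (match.group("username"), match.group("email")) if match else None

For the example line above, parse_author returns ("username", "email@domain.com").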

Filtering Rules

  1. Skip if email exists: Emails are deduplicated automatically (see the sketch after this list)
  2. Skip noreply emails: Any email containing "noreply" is skipped
  3. Extract only line 2: Only the author information from line 2 is stored
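
Taken together, the rules reduce to a check like the following sketch (should_skip and email_exists are hypothetical names for illustration, not the project's actual API):

from typing import Callable, Optional

def should_skip(email: str, email_exists: Callable[[str], bool]) -> Optional[str]:
    # Rule 2: drop anything containing "noreply"
    if "noreply" in email.lower():
        return "noreply email"
    # Rule 1: drop emails already stored in the database
    if email_exists(email):
        return "email already in database"
    return None  # keep this email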

Storage

Data is stored in data/patches.db (SQLite) with this simple schema:

Column      Type     Description
id          INTEGER  Primary key
email       TEXT     Email address (unique)
username    TEXT     Username/display name
created_at  TEXT     ISO 8601 timestamp
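
A DDL sketch matching the table above. The table name patches and the exact constraints are assumptions, but a UNIQUE email column plus INSERT OR IGNORE is the standard SQLite route to the deduplication described under Filtering Rules.

import sqlite3

conn = sqlite3.connect("data/patches.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS patches (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        email TEXT NOT NULL UNIQUE,
        username TEXT,
        created_at TEXT NOT NULL
    )"""
)
# INSERT OR IGNORE silently skips rows that violate the UNIQUE constraint
conn.execute(
    "INSERT OR IGNORE INTO patches (email, username, created_at) VALUES (?, ?, ?)",
    ("dev@example.com", "developer", "2024-01-01T00:00:00Z"),
)
conn.commit()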

Example Output

Manual Fetch

$ ./scraper.sh fetch https://github.com/psf/requests/pull/6000.patch

Fetching https://github.com/psf/requests/pull/6000.patch...
✓ Saved to database (id=1)
  Email: developer@example.com
  Username: John Doe

Auto-Fetch

$ ./scraper.sh auto-fetch --count 5 --token YOUR_TOKEN

🚀 Auto-fetching 5 patches (mode: random)

🔍 Searching repos: language=python, topic=web
✓ Found 10 repositories

📥 Fetching: https://github.com/owner/repo/commit/abc123.patch
   ✓ Saved (id=1): dev@example.com (developer)
📥 Fetching: https://github.com/owner2/repo2/commit/def456.patch
   ⊘ Skipped: noreply email
📥 Fetching: https://github.com/owner3/repo3/commit/ghi789.patch
   ✓ Saved (id=2): user@domain.com (username)

Summary:
  ✓ Fetched: 2
  ⊘ Skipped: 1
  ✗ Errors: 0

📊 Total emails in database: 2

Skipping scenarios:

# Noreply email
⊘ Skipped: noreply email (noreply@github.com)

# Duplicate email
⊘ Skipped: email already in database (developer@example.com)

Running Tests

PYTHONPATH=src pytest -v

Or for a quick run:

PYTHONPATH=src pytest -q

Tech Stack

  • Python 3.9+ with async/await
  • httpx - Async HTTP client with retry logic
  • Typer + Rich - Modern CLI with beautiful output
  • SQLite - Embedded database
  • pytest - Testing framework

Architecture

src/scraper/
├── fetcher.py   # Async HTTP client with retry logic
├── parser.py    # Line 2 email/username extraction
├── storage.py   # SQLite storage with deduplication
├── discover.py  # GitHub API for auto-discovery (NEW!)
└── cli.py       # CLI interface

Rate Limits

GitHub enforces API rate limits:

  • Unauthenticated: 60 requests/hour (not recommended for auto-fetch)
  • Authenticated: 5,000 requests/hour

Always use --token for auto-fetch to avoid hitting limits quickly!
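
Before a long auto-fetch run, you can check your remaining quota against GitHub's documented /rate_limit endpoint (generic API usage, not a feature of this tool):

import httpx

response = httpx.get(
    "https://api.github.com/rate_limit",
    headers={"Authorization": "Bearer YOUR_GITHUB_TOKEN"},
)
core = response.json()["resources"]["core"]
print(f"{core['remaining']}/{core['limit']} requests remaining")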

Example Script

See examples/fetch_example.py for a complete working example:

python3 examples/fetch_example.py

Next Steps

  • Add batch processing (read URLs from a file)
  • Add worker queue for large-scale scraping
  • Export to CSV/JSON
  • Add web UI for browsing collected emails
  • Deploy as a Docker container

License

MIT
