MailMiner — Web Email Extractor

A production-grade, async web crawler that extracts emails from websites using intelligent prioritization, concurrency, and structured outputs — available via both CLI and live web app.

🌐 Live App: https://mail-miner.onrender.com/
💻 GitHub: https://github.com/IshaanPathak25/Mail-Miner


🚀 Overview

MailMiner is an advanced email extraction system designed to transform unstructured web content into structured, usable data.

Unlike basic scrapers, it features:

  • Async concurrent crawling
  • Priority-based URL traversal
  • Source-aware email tracking
  • Analytics and crawl reporting
  • CLI + Web interface

⚡ Features

🔍 Intelligent Crawling

  • Async crawling using asyncio + aiohttp
  • Configurable worker pool for concurrency
  • Domain-restricted crawling
  • Depth and page-limit control
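
The configurable worker-pool pattern described above can be sketched as follows. This is an illustrative, dependency-free version: the real crawler fetches pages with aiohttp, so `fetch` is stubbed out here, and names like `crawl` and `NUM_WORKERS` are assumptions rather than the project's actual identifiers.

```python
import asyncio

NUM_WORKERS = 4  # configurable concurrency level


async def fetch(url: str) -> str:
    # Stand-in for an aiohttp GET; returns fake page content.
    await asyncio.sleep(0)
    return f"<html>{url}</html>"


async def worker(queue: asyncio.Queue, results: list) -> None:
    # Each worker pulls URLs from a shared queue until the pool is shut down.
    while True:
        url = await queue.get()
        try:
            results.append((url, await fetch(url)))
        finally:
            queue.task_done()


async def crawl(urls: list[str]) -> list[tuple[str, str]]:
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results: list = []
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(NUM_WORKERS)]
    await queue.join()   # wait until every queued URL has been processed
    for w in workers:
        w.cancel()       # shut the pool down
    return results


pages = asyncio.run(crawl(["https://example.com/a", "https://example.com/b"]))
```

The same queue-plus-workers shape extends naturally to depth and page-limit control: workers push newly discovered links back onto the queue until a limit is hit.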

🧠 Priority-Based Extraction

  • High-value pages prioritized:
    • contact, faculty, directory, staff
  • Low-value pages deprioritized:
    • gallery, events, news
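
The keyword-based prioritization above can be sketched with a min-heap frontier. The keyword tuples and numeric weights here are assumptions for illustration, not the project's actual configuration.

```python
import heapq

HIGH_VALUE = ("contact", "faculty", "directory", "staff")
LOW_VALUE = ("gallery", "events", "news")


def priority(url: str) -> int:
    """Lower number = crawled earlier."""
    path = url.lower()
    if any(word in path for word in HIGH_VALUE):
        return 0  # visit first
    if any(word in path for word in LOW_VALUE):
        return 2  # visit last
    return 1      # neutral

# A min-heap frontier pops high-value pages before low-value ones.
frontier: list[tuple[int, str]] = []
for url in ("https://x.edu/news", "https://x.edu/contact", "https://x.edu/about"):
    heapq.heappush(frontier, (priority(url), url))

order = [heapq.heappop(frontier)[1] for _ in range(len(frontier))]
```

With this frontier, `/contact` is visited before `/about`, and `/news` last.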

📧 Email Extraction

  • Extracts from:
    • Page text (regex)
    • mailto: links
  • Source tracking: email → source URL(s)
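
The two extraction paths above (page-text regex and `mailto:` links) can be sketched like this. The project parses HTML with BeautifulSoup; the standard library's `html.parser` is used here only to keep the example dependency-free, and the regex is a common illustrative pattern, not necessarily the one the project ships.

```python
import re
from html.parser import HTMLParser

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class MailtoParser(HTMLParser):
    """Collects addresses from mailto: links in <a href=...> tags."""

    def __init__(self):
        super().__init__()
        self.emails: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith("mailto:"):
                # Drop any ?subject=... query suffix.
                self.emails.add(value[len("mailto:"):].split("?")[0])


def extract_emails(page: str) -> set[str]:
    parser = MailtoParser()
    parser.feed(page)
    return set(EMAIL_RE.findall(page)) | parser.emails


page = '<p>Write to admin@nitt.edu</p><a href="mailto:dean@nitt.edu?subject=Hi">Dean</a>'
found = extract_emails(page)
```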

📊 Structured Outputs

  • Excel export (.xlsx)
  • Crawl report (report.json)
  • Crawl log (crawl_log.json)
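
A minimal sketch of how a `report.json` summary of this shape could be assembled. The field names follow the report format shown later in this README, but `build_report` itself is a hypothetical helper, and counting domains over unique addresses is an assumption about the aggregation.

```python
import json
from collections import Counter


def build_report(pages_crawled: int, emails: list[str]) -> str:
    unique = sorted(set(emails))
    # Tally email domains over the deduplicated addresses.
    domains = Counter(e.split("@", 1)[1] for e in unique)
    report = {
        "total_pages_crawled": pages_crawled,
        "total_emails_found": len(emails),
        "unique_emails": len(unique),
        "top_email_domains": dict(domains.most_common(5)),
    }
    return json.dumps(report, indent=2)


print(build_report(2, ["a@nitt.edu", "a@nitt.edu", "b@example.com"]))
```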

🖥️ Dual Interface

CLI Mode

python main.py https://example.com

Web App

  • Clean UI
  • Download Options:
    • Excel
    • Excel + report
    • Excel + report + crawl log

🧰 Tech Stack

  • Backend: Python, FastAPI
  • Concurrency: asyncio, aiohttp
  • Parsing: BeautifulSoup, lxml
  • Data Processing: Regex
  • Export: openpyxl
  • Deployment: Render
  • Frontend: HTML

🚀 Usage

CLI Usage:

  • Basic

    python main.py https://example.com
  • Full Control

    python main.py https://example.com \
    --depth 3 \
    --max-pages 50 \
    --workers 12 \
    --output-dir ./results \
    --verbose
  • Summary Only

    python main.py https://example.com --only-summary
  • Top Pages Insight

    python main.py https://example.com --top-pages 5

📊 Output Format

  1. Excel File (columns: Email, Source URL)

  2. Crawl Report (report.json)

    {
       "total_pages_crawled": 20,
       "total_emails_found": 321,
       "unique_emails": 300,
       "top_email_domains": {
         "nitt.edu": 120
       }
    }
  3. Crawl Log (crawl_log.json)

    [
      {
        "url": "...",
        "depth": 2,
        "emails_found": 12
      }
    ]

🧠 How It Works

Pipeline

  1. Fetch — Async page retrieval
  2. Parse — Extract emails via regex + HTML parsing
  3. Clean — Normalize and deduplicate
  4. Track — Map emails to source pages
  5. Prioritize — Rank URLs dynamically
  6. Export — Generate structured outputs
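
Steps 3 and 4 of the pipeline (Clean and Track) can be sketched as below: normalize raw matches, then map each email to the set of pages where it appeared. The function names and normalization rules are illustrative, not taken from the project's modules.

```python
from collections import defaultdict


def normalize(email: str) -> str:
    # Trim whitespace and stray trailing/leading dots, lowercase the address.
    return email.strip().strip(".").lower()


def track(findings: list[tuple[str, str]]) -> dict[str, set[str]]:
    """findings: (raw_email, source_url) pairs -> {email: {source URLs}}."""
    sources: dict[str, set[str]] = defaultdict(set)
    for raw, url in findings:
        sources[normalize(raw)].add(url)
    return dict(sources)


mapping = track([
    ("Dean@nitt.edu", "https://x.edu/contact"),
    ("dean@nitt.edu.", "https://x.edu/staff"),
])
```

Deduplication falls out of the dictionary keys: both raw variants above collapse to one address with two source URLs.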

📁 Project Structure

web-email-extractor/
│
├── app.py                 # FastAPI web server
├── main.py                # CLI entry point
├── requirements.txt
├── Procfile
│
├── templates/             # Web UI
│   └── index.html
│
└── extractor/
    ├── crawler.py
    ├── fetcher.py
    ├── parser.py
    ├── cleaner.py
    ├── exporter.py
    ├── validator.py
    └── cli.py

⚠️ Limitations

  • No JavaScript rendering (static HTML only)
  • Performance depends on target website structure
  • Free-tier deployment may timeout on large crawls

🔮 Future Improvements

  • Adaptive crawling (self-learning priorities)
  • JavaScript rendering (Playwright fallback)
  • Email classification (personal vs institutional)
  • UI-based result preview before download

🧪 Example Use Cases

  • Academic institution email extraction
  • Lead generation
  • Directory scraping
  • Data collection for research

📦 Installation

git clone https://github.com/IshaanPathak25/web-email-extractor
cd web-email-extractor
pip install -r requirements.txt

▶️ Run Locally

CLI

python main.py https://example.com

Web App

uvicorn app:app --reload

🧠 Key Highlights

  • Async architecture for high efficiency
  • Priority-based crawling for smarter extraction
  • Source-aware email tracking
  • Deployable web interface
  • Modular, scalable design
