A production-grade, async web crawler that extracts emails from websites using intelligent prioritization, concurrency, and structured outputs — available via both CLI and live web app.
🌐 Live App: https://mail-miner.onrender.com/
💻 GitHub: https://github.com/IshaanPathak25/Mail-Miner
MailMiner is an advanced email extraction system designed to transform unstructured web content into structured, usable data.
Unlike basic scrapers, it features:
- Async concurrent crawling
- Priority-based URL traversal
- Source-aware email tracking
- Analytics and crawl reporting
- CLI + Web interface
- Async crawling using asyncio + aiohttp
- Configurable worker pool for concurrency
- Domain-restricted crawling
- Depth and page-limit control
- High-value pages prioritized: contact, faculty, directory, staff
- Low-value pages deprioritized: gallery, events, news
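The keyword-based prioritization above can be sketched as a simple URL scorer. This is an illustrative sketch only; the constant and function names are made up and not taken from the MailMiner source:

```python
# Illustrative sketch of keyword-based URL prioritization;
# HIGH_VALUE/LOW_VALUE and score_url are hypothetical names.
HIGH_VALUE = ("contact", "faculty", "directory", "staff")
LOW_VALUE = ("gallery", "events", "news")

def score_url(url: str) -> int:
    """Return a crawl priority: higher scores are fetched sooner."""
    path = url.lower()
    if any(keyword in path for keyword in HIGH_VALUE):
        return 2  # pages likely to list email addresses
    if any(keyword in path for keyword in LOW_VALUE):
        return 0  # pages that rarely contain emails
    return 1      # neutral pages keep a default priority
```

A crawler can feed these scores into a priority queue so that contact-like pages are dequeued before galleries and news archives.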
- Extracts from:
  - Page text (regex)
  - mailto: links
- Source tracking: email → source URL(s)
- Excel export (.xlsx)
- Crawl report (report.json)
- Crawl log (crawl_log.json)
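The two extraction paths above (a regex pass over page text plus mailto: link harvesting) can be combined in a few lines. A minimal sketch assuming BeautifulSoup; the function name is illustrative, not the project's actual API:

```python
# Illustrative sketch of regex + mailto: email extraction;
# extract_emails is a hypothetical name, not MailMiner's API.
import re

from bs4 import BeautifulSoup

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def extract_emails(html: str) -> set:
    """Collect emails from visible text and mailto: links, deduplicated."""
    soup = BeautifulSoup(html, "html.parser")
    emails = set(EMAIL_RE.findall(soup.get_text(" ")))
    # mailto: links can hold addresses that never appear in visible text
    for anchor in soup.select('a[href^="mailto:"]'):
        address = anchor["href"][len("mailto:"):].split("?")[0]
        if EMAIL_RE.fullmatch(address):
            emails.add(address)
    return emails
```

Returning a set gives deduplication for free; source tracking would additionally record the page URL alongside each address.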
- CLI: python main.py https://example.com
- Clean web UI
- Download Options:
- Excel
- Excel + report
- Excel + report + crawl log
- Backend: Python, FastAPI
- Concurrency: asyncio, aiohttp
- Parsing: BeautifulSoup, lxml
- Data Processing: Regex
- Export: openpyxl
- Deployment: Render
- Frontend: HTML
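As a sketch of the openpyxl export step, writing Email / Source URL pairs in the documented column layout (the helper name and signature are hypothetical, not MailMiner's actual code):

```python
# Illustrative sketch of the Excel export via openpyxl;
# export_emails is a hypothetical helper name.
from openpyxl import Workbook

def export_emails(rows, path="emails.xlsx"):
    """Write (email, source URL) pairs to an .xlsx file."""
    wb = Workbook()
    ws = wb.active
    ws.title = "Emails"
    ws.append(["Email", "Source URL"])  # header row matches the documented output
    for email, source in rows:
        ws.append([email, source])
    wb.save(path)
```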
- Basic

  python main.py https://example.com

- Full Control

  python main.py https://example.com \
    --depth 3 \
    --max-pages 50 \
    --workers 12 \
    --output-dir ./results \
    --verbose

- Summary Only

  python main.py https://example.com --only-summary

- Top Pages Insight

  python main.py https://example.com --top-pages 5
- Excel File: two columns, Email | Source URL

- Crawl Report (report.json)

  {
    "total_pages_crawled": 20,
    "total_emails_found": 321,
    "unique_emails": 300,
    "top_email_domains": { "nitt.edu": 120 }
  }

- Crawl Log (crawl_log.json)

  [
    { "url": "...", "depth": 2, "emails_found": 12 }
  ]
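Because report.json is plain JSON in the shape shown above, downstream tooling can consume it directly. A small illustrative helper (not part of MailMiner) that renders a one-line summary:

```python
# Illustrative consumer of the report.json shape shown above;
# summarize_report is a hypothetical helper, not part of MailMiner.
import json

def summarize_report(path="report.json"):
    """Return a one-line summary of a crawl report."""
    with open(path) as f:
        report = json.load(f)
    domains = report["top_email_domains"]
    top = max(domains, key=domains.get)  # domain with the most emails
    return (f'{report["unique_emails"]} unique emails from '
            f'{report["total_pages_crawled"]} pages (top domain: {top})')
```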
- Fetch — Async page retrieval
- Parse — Extract emails via regex + HTML parsing
- Clean — Normalize and deduplicate
- Track — Map emails to source pages
- Prioritize — Rank URLs dynamically
- Export — Generate structured outputs
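The Fetch → Parse stages with a fixed worker pool can be sketched with asyncio alone. The fetch and parse callables are injected so this example stays offline; in the real crawler, fetch would issue an aiohttp request. Names here are illustrative, not MailMiner's internals:

```python
# Illustrative sketch of an async worker-pool pipeline; crawl and its
# parameters are hypothetical names, not MailMiner's actual API.
import asyncio

async def crawl(seed_urls, fetch, parse, workers=4):
    """Fetch and parse every queued URL with a fixed pool of workers."""
    queue = asyncio.Queue()
    for url in seed_urls:
        queue.put_nowait(url)
    results = {}

    async def worker():
        while True:
            url = await queue.get()
            try:
                html = await fetch(url)     # an aiohttp request in a real crawler
                results[url] = parse(html)  # e.g. regex + mailto: extraction
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    await queue.join()          # wait until every queued URL is processed
    for task in tasks:
        task.cancel()           # workers idle on queue.get(); shut them down
    return results
```

A priority queue (asyncio.PriorityQueue) in place of the plain queue is what lets high-value pages such as contact or faculty listings be fetched first.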
web-email-extractor/
│
├── app.py # FastAPI web server
├── main.py # CLI entry point
├── requirements.txt
├── Procfile
│
├── templates/ # Web UI
│ └── index.html
│
└── extractor/
├── crawler.py
├── fetcher.py
├── parser.py
├── cleaner.py
├── exporter.py
├── validator.py
└── cli.py
- No JavaScript rendering (static HTML only)
- Performance depends on target website structure
- Free-tier deployment may timeout on large crawls
- Adaptive crawling (self-learning priorities)
- JavaScript rendering (Playwright fallback)
- Email classification (personal vs institutional)
- UI-based result preview before download
- Academic institution email extraction
- Lead generation
- Directory scraping
- Data collection for research
git clone https://github.com/IshaanPathak25/web-email-extractor
cd web-email-extractor
pip install -r requirements.txt

Run the CLI:

python main.py https://example.com

Run the web app:

uvicorn app:app --reload

- Async architecture for high efficiency
- Priority-based crawling for smarter extraction
- Source-aware email tracking
- Deployable web interface
- Modular, scalable design