This project contains two web crawlers for collecting fact-check articles from:
- Snopes
- PolitiFact
The crawlers extract:
- Article metadata
- Claims
- Verdict labels
- Article content
- Sources / references
- Hyperlinks
The extracted articles are saved as structured JSON datasets for downstream NLP and fact-checking experiments.
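
As an illustration of downstream use, the sketch below tallies verdict labels from a saved dataset. It assumes each dataset file holds a JSON list of article objects with the verdict stored under `article_data.rating`, as in the example structures later in this README. `load_articles` and `count_ratings` are hypothetical helper names, not part of the crawlers.

```python
import json
from collections import Counter
from pathlib import Path


def load_articles(path):
    """Read a dataset file; assumes the top level is a JSON list of articles."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def count_ratings(articles):
    """Count verdict labels, treating a missing or empty rating as "Unknown"."""
    return Counter(
        article.get("article_data", {}).get("rating") or "Unknown"
        for article in articles
    )
```

For example, `count_ratings(load_articles("DATASET/snopes_articles.json"))` would return a `Counter` keyed by rating, provided the file exists at that path and matches the assumed layout.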
The Snopes crawler extracts:
- Title
- Author
- Date
- Byline
- Claim reviewed
- Verdict / rating
- Article content
- Sources section
- Source hyperlinks
Supports:
- Pagination crawling
- Timestamp filtering
- Category filtering
The PolitiFact crawler extracts:
- Speaker
- Claim
- Verdict label
- Author
- Date
- "If Your Time Is Short"
- Full article content
- "Our Ruling"
- Sources section
- Source hyperlinks
Supports:
- Pagination crawling
- Timestamp filtering
- Automatic English-only filtering
Install dependencies:
pip install requests beautifulsoup4

Project structure:

PROJECT/
│
├── SnopesCrawl.py
├── PolitifactCrawl.py
├── DATASET/
│ ├── snopes_articles.json
│ └── politifact_articles.json
└── README.md
Snopes crawler (each page contains 20 articles):

python SnopesCrawl.py

Example with timestamp filtering:

python SnopesCrawl.py --time-stamp 20250101

Example with page limit:

python SnopesCrawl.py --limit 50

PolitiFact crawler (each page contains 30 articles):

python PolitifactCrawl.py

Example with timestamp filtering:

python PolitifactCrawl.py --time-stamp 20250101

Example with page limit:

python PolitifactCrawl.py --limit 50

| Argument | Description |
|---|---|
| `--output-dir` | Output dataset directory |
| `--time-stamp` | Keep only articles published after a date (YYYYMMDD) |
| `--limit` | Maximum number of pages to crawl |
| `--category` | Article category (used mainly for Snopes) |
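
The arguments above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the scripts' actual code: `build_parser` is a hypothetical name, and the defaults (including `DATASET` as the output directory, taken from the project layout) are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(description="Fact-check crawler (sketch)")
    # Default directory is an assumption based on the project layout.
    parser.add_argument("--output-dir", default="DATASET",
                        help="Output dataset directory")
    parser.add_argument("--time-stamp", default=None,
                        help="Keep articles published after this date (YYYYMMDD)")
    parser.add_argument("--limit", type=int, default=None,
                        help="Maximum number of pages to crawl")
    parser.add_argument("--category", default=None,
                        help="Article category (used mainly for Snopes)")
    return parser


# argparse exposes --output-dir as args.output_dir and --time-stamp as args.time_stamp.
args = build_parser().parse_args(["--time-stamp", "20250101", "--limit", "50"])
print(args.output_dir, args.time_stamp, args.limit, args.category)
```

Keeping `--time-stamp` as a string and converting it later leaves date validation to the filtering step; parsing it here with a custom `type=` function would be an equally reasonable choice.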
Example PolitiFact article structure:
{
"speaker": "TikTok posts",
"date": "May 28, 2026",
"article_url": "...",
"article_data": {
"claim": "...",
"rating": "False",
"author": "Maria Briceño",
"content": {
"if_your_time_is_short": [],
"article_content": [],
"our_ruling": []
},
"sources": [
{
"raw_text": "...",
"links": "https://..."
}
]
}
}

Example Snopes article structure:
{
"title": "Thieves aren't using perfume to knock out victims, despite persistent rumors",
"article_url": "https://www.snopes.com/fact-check/thieves-perfume-shock-victims/",
"author": "Joey Esposito",
"date": "May 28, 2026",
"byline": "Versions of this longstanding urban legend have lurked on the web since the 1990s.",
"article_data": {
"claim": "Thieves operating in public places are using drug-filled perfume bottles to render their victims unconscious.",
"rating": "False",
"author": "Joey Esposito",
"datePublished": "2026-05-28T08:00:06Z",
"content": "Full article text...",
"sources": [
"CDC Health-Related Hoaxes & Rumors. https://webharvest.gov/...",
"Drugs@FDA: FDA-Approved Drugs. https://www.accessdata.fda.gov/...",
"\"Hydroxyzine (Oral Route).\" Mayo Clinic, https://www.mayoclinic.org/..."
]
}
}

The crawlers can keep only articles published after a given date.
Example:
python PolitifactCrawl.py --time-stamp 20250101

This collects only articles published after:
2025-01-01
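
The date comparison behind this filter might look like the sketch below. It assumes article dates are stored in the display form shown in the example structures ("May 28, 2026") and that "after" means strictly later than the cutoff day. The helper names are hypothetical, and the crawlers' real logic may differ.

```python
from datetime import datetime


def parse_time_stamp(value):
    """Convert a YYYYMMDD string such as "20250101" into a datetime."""
    return datetime.strptime(value, "%Y%m%d")


def parse_article_date(text):
    """Parse a display date such as "May 28, 2026" (English month names)."""
    return datetime.strptime(text, "%B %d, %Y")


def is_newer(article, time_stamp):
    """Return True if the article's date is strictly after the cutoff (assumed)."""
    return parse_article_date(article["date"]) > parse_time_stamp(time_stamp)


articles = [
    {"date": "May 28, 2026"},
    {"date": "December 31, 2024"},
]
# Keep only the articles dated after 2025-01-01.
kept = [a for a in articles if is_newer(a, "20250101")]
print(kept)
```

If listing pages are ordered newest first, a crawler could also stop paginating once it meets an article older than the cutoff, though whether these scripts do so is not stated here.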