This project automates the workflow of finding YouTube videos related to an article, extracting transcripts, and saving metadata into structured JSON files.
It combines:
- Scraping (
scrapper.py) - Article parsing (TXT, PDF, DOCX)
- Keyword generation via OpenRouter GPT API
- YouTube search & transcripts (via
yt_dlp+youtube_transcript_api) - Logging system (avoid re-processing the same videos)
-
Load research articles (
.txt,.pdf,.docx) -
Extract a smart keyword phrase using OpenRouter GPT
-
Search YouTube for top 3 relevant videos
-
Fetch transcripts:
- First try
youtube_transcript_api - Fallback to
yt_dlpsubtitles - Auto-translate to English if needed
- First try
-
Save results as JSON (
results/<article>.json) -
Maintain logs:
- Processed videos →
logs/videos.log - Scraper logs →
logs/scraper.log
- Processed videos →
.
├── Article/ # Input folder (research articles)
│ └── example.pdf
├── results/ # Output folder (JSON metadata & transcripts)
├── logs/ # Logging (processed videos & scraper logs)
│ ├── videos.log
│ └── scraper.log
├── scrapper.py # Custom scraper script (provided separately)
├── main.py # Main script (this file)
└── README.md # Documentation
-
Clone repo and install dependencies:
git clone <repo-url> cd <repo> pip install -r requirements.txt
-
Create
.envfile with your OpenRouter API key:OPENROUTER_API_KEY=your_api_key_here
-
Add at least one article file (
.txt,.pdf, or.docx) inArticle/.
Run the main script:
python main.pySteps performed:
- Runs the web scraper (
scrapper.py) - Loads the first article from
Article/ - Extracts best YouTube search phrase
- Fetches top videos & transcripts
- Saves results to
results/<article>.json
[
{
"title": "AI Research Breakthrough 2025",
"url": "https://www.youtube.com/watch?v=abcd1234",
"views": 120394,
"published": "20250801",
"duration": 600,
"channel": "AI News",
"transcript": "In this video, we explore the latest research..."
}
]logs/videos.log→ Keeps track of already processed videos (avoids duplicates)logs/scraper.log→ Handled internally byscrapper.py
- If no
OPENROUTER_API_KEYis found, it defaults to"latest research update"as the query. - If no transcript is available, a placeholder message is saved.
scrapper.pymust be implemented with a functionscrape_links()(imported inmain.py).
-
Python 3.8+
-
Libraries:
yt-dlp deep-translator youtube-transcript-api python-dotenv PyPDF2 python-docx requests