Live Project: This backend powers threads.anshumani.com, a web application that visualizes invisible threads across podcasts and essays.
Extracts "invisible threads" from podcasts and essays: non-obvious insights that connect multiple conversations. This repository contains the backend pipeline behind the frontend visualization at threads.anshumani.com.
Works with two content types:
- YouTube podcasts (transcript-based) - links insights to exact video timestamps
- Essays/blogs (website-based) - links insights to source URLs
Lenny's Podcast:
- 465 high-quality insights extracted from 13,513 chunks (3.4% rate, avg novelty 8.1/10)
- 20 invisible threads connecting 116 insights across 75+ episodes
  - 8 major threads (3+ insights each)
  - 12 emerging threads (2 insights each)
- 25% coverage - 1 in 4 insights forms part of a cross-episode thread
- Direct YouTube links - each insight links to the exact timestamp in the source video
Paul Graham Essays:
- 270 high-quality insights extracted from 4,415 chunks (6.1% rate)
- 22 invisible threads connecting 170 insights across 70+ essays
- 63% coverage - higher density than podcasts because written content is denser
- Direct essay links - each insight links to the source essay URL
Total across both sources:
- 735 insights discovered
- 42 invisible threads found
- 286 insights connected across conversations
This pipeline powers Invisible Threads, a web application where you can:
- Browse all 20 discovered threads
- Explore insights from Lenny's Podcast (465 insights across 303 episodes)
- Read insights from Paul Graham's Essays (270 insights across 228 essays)
- Click through to exact timestamps in YouTube videos or essay sources
- Discover non-obvious patterns across hundreds of conversations
Frontend Repository: The web interface is built separately and consumes the JSON files generated by this pipeline.
```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Set up Modal account (for GPU inference)
# Sign up at https://modal.com and run:
modal setup

# 3. Prepare your database
# You'll need a SQLite database with:
# - chunks table: text chunks with document_id (+ timestamp_start for videos)
# - documents table: metadata with video_url or essay_url

# 4a. Run pipeline for podcasts/videos
modal run modal_extract.py --db your_database.db
python find_threads_v2.py --input data/modal_extraction_*.json
modal run name_clusters.py --input data/threads_*.json

# 4b. Run pipeline for essays/blogs
modal run modal_extract_pg.py --db your_database.db
python find_threads_v2.py --input data/pg_extraction_*.json --min-episodes 2
modal run name_clusters.py --input data/threads_*.json --min-episodes 2
```

| File | Purpose | Command |
|---|---|---|
| `modal_extract.py` | Extract insights from podcasts | `modal run modal_extract.py --db your_database.db` |
| `modal_extract_pg.py` | Extract insights from essays/blogs | `modal run modal_extract_pg.py --db your_database.db` |
| `find_threads_v2.py` | Discover connected threads | `python find_threads_v2.py --input data/modal_extraction_*.json` |
| `name_clusters.py` | Name discovered threads | `modal run name_clusters.py --input data/threads_*.json` |
| `check_thread_quality.py` | Validate thread quality | `modal run check_thread_quality.py --input data/named_threads_*.json` |
| `add_thread_descriptions.py` | Add descriptions to 2-insight threads | `modal run add_thread_descriptions.py` |
| File | Purpose |
|---|---|
| `enrich_with_video.py` | Add video URLs to existing podcast data |
| `create_final_export.py` | Create curated final output |
| `create_clean_threads_v2.py` | Clean and deduplicate threads |
| `fix_pg_threads.py` | Fix Paul Graham essay threads |
| `merge_pairs.py` | Merge thread pairs |
| `list_threads.py` | List all threads in a file |
| File | Status |
|---|---|
| `find_debates.py` | Experimental - debate detection has a high false-positive rate |
| `validate_debates.py` | Helper for debate validation |
| `find_threads.py` | Legacy - replaced by `find_threads_v2.py` |
Final Output (in `data/` directory):

Lenny's Podcast:
- `threads_final.json` - 20 curated threads (116 insights). USE THIS FOR PRODUCTION
- `modal_extraction_20260120_024600.json` - 465 extracted insights with YouTube timestamps

Paul Graham Essays:
- `pg_threads_final.json` - 22 curated threads (170 insights)
- `pg_extraction_*.json` - 270 extracted insights with essay URLs

Intermediate Files (in `data/` directory):
- `threads_v2_*.json` - raw threading output before curation
- `named_threads_v2_*.json` - threads with LLM-generated names
- `quality_check_*.json` - quality validation results
Note: Data files are excluded from git (see .gitignore). You'll need to run the pipeline to generate them.
| File | Contents |
|---|---|
| `README.md` | This file - quick start and pipeline overview |
| `PROJECT_LOG.md` | Complete project history, decisions, what worked and what didn't |
| `FINAL_SUMMARY.md` | Executive summary of results and technical decisions |
The pipeline requires a SQLite database with the following schema:
`chunks` table:
- `id` - unique chunk identifier
- `text` - chunk text (transcript or essay)
- `document_id` - reference to the source document
- `timestamp_start` - (optional) starting timestamp for videos (HH:MM:SS format)

`documents` table:
- `id` - unique document identifier
- `title` - document title (episode or essay)
- `video_url` OR `essay_url` - source URL
- Additional metadata fields as needed
Examples:
- `lennys_full.db` - 13,513 chunks from 303 podcast episodes
- `pg_essays.db` - 4,415 chunks from 228 Paul Graham essays
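As a minimal sketch, the following Python creates a compatible database; column types and constraints are assumptions, since the pipeline only requires the fields listed above:

```python
import sqlite3

# Minimal schema the pipeline expects; types/constraints are assumptions.
conn = sqlite3.connect("your_database.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS documents (
    id          INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    video_url   TEXT,   -- for podcasts
    essay_url   TEXT    -- for essays/blogs
);
CREATE TABLE IF NOT EXISTS chunks (
    id              INTEGER PRIMARY KEY,
    text            TEXT NOT NULL,
    document_id     INTEGER REFERENCES documents(id),
    timestamp_start TEXT    -- optional, HH:MM:SS for videos
);
""")
conn.commit()
conn.close()
```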
- Podcasts: 500 words per chunk (conversational, less dense)
- Essays: 150 words per chunk (dense, polished writing)
Smaller chunks for essays allow each distinct idea to be evaluated separately. At 500 words, essays often contain 3-4 ideas and the LLM must pick one.
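For illustration, here is a word-count chunker matching those sizes; the helper and sample text are hypothetical, not code from this repo:

```python
def chunk_words(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of roughly chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

transcript = "word " * 1200   # stand-in for a podcast transcript
essay = "word " * 600         # stand-in for an essay

print(len(chunk_words(transcript, 500)))  # 3 chunks of ~500 words
print(len(chunk_words(essay, 150)))       # 4 chunks of ~150 words
```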
- Extract insights first - strict LLM filtering (each insight must be SPECIFIC, NON-OBVIOUS, and ACTIONABLE)
- Extract topics from insights - the LLM distills the core topic/claim from each insight
- Embed topics (not insights) - captures conceptual similarity instead of vocabulary overlap
- Graph-based threading - Louvain community detection finds natural clusters (see the sketch after this list)
- Multi-episode filtering - threads must span 2+ different episodes (min_episodes=2)
- Size threshold - accept threads with 2+ insights (min_size=2)
- Same-guest filtering - remove 2-insight threads from the same guest (likely duplicates)
- Deduplication - keep only the best insight per episode per thread
- Manual curation - LLM naming failed, so thread names were curated by hand
- Video linking - each insight includes `timestamp_url` for direct YouTube playback
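A minimal sketch of the embed-and-cluster steps using the listed dependencies (`sentence-transformers`, `networkx`, `python-louvain`, `scikit-learn`); the model name, similarity threshold, and sample topics are assumptions:

```python
import networkx as nx
import community as community_louvain  # from the python-louvain package
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# LLM-extracted topics (illustrative examples, not real pipeline output).
topics = [
    "Ship early to learn from real users",
    "Launch before the product feels ready",
    "Hire slowly and deliberately",
    "Early hires define company culture",
]

# Embed the topics, not the full insights.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
sims = cosine_similarity(model.encode(topics))

# Connect topics whose similarity clears an assumed threshold.
THRESHOLD = 0.6
g = nx.Graph()
g.add_nodes_from(range(len(topics)))
for i in range(len(topics)):
    for j in range(i + 1, len(topics)):
        if sims[i, j] >= THRESHOLD:
            g.add_edge(i, j, weight=float(sims[i, j]))

# Louvain community detection: each community is a candidate thread.
partition = community_louvain.best_partition(g)
print(partition)  # {topic index: community id}
```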
Problem discovered: embedding full insights captures vocabulary overlap, not conceptual similarity.
- The 99th-percentile insight similarity was only 0.496
- Different guests use different words for the same concept, so embedding similarity stays low
- High-similarity pairs were near-duplicates, not conceptual connections
Solution: extract the topic/claim from each insight, embed the topic, and cluster by topic similarity.
- This captures conceptual similarity regardless of vocabulary
- Coverage improved from 8% to 54% (before filtering)
- Final output: 25% coverage after quality filtering
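To illustrate the idea, the snippet below compares the similarity of two full insights against the similarity of their LLM-extracted topics; the model, texts, and topics are made up, so exact scores will vary:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

# Two guests voicing the same concept in very different vocabulary.
insight_a = ("We threw out the roadmap and shipped whatever users "
             "were screaming about that week.")
insight_b = ("Customer complaints are the highest-signal input you "
             "have for prioritization.")

# The LLM-extracted topics converge on the shared claim.
topic_a = "Prioritize product work based on direct user feedback"
topic_b = "Use customer feedback to drive prioritization decisions"

print(cosine_similarity(model.encode([insight_a, insight_b]))[0, 1])  # low
print(cosine_similarity(model.encode([topic_a, topic_b]))[0, 1])      # high
```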
- Multi-episode requirement - min 2 different episodes (no single-source "threads")
- Same-guest filtering - remove 2-insight threads from the same guest (3 filtered out)
- Deduplication - max 1 insight per episode per thread
- Remove NO_CLEAR_THREAD - drop clusters where the LLM couldn't find a coherent theme
- Manual name curation - the LLM produced ALL_CAPS_WITH_UNDERSCORES names, so names were rewritten by hand
Result: 8 major threads (3+ insights) + 12 emerging threads (2 insights) = 20 total threads
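A sketch of these filters as one function; the insight keys (`episode_id`, `guest`, `novelty`) are assumed, not the repo's actual field names:

```python
def filter_thread(insights, min_episodes=2, min_size=2):
    """Apply the quality filters to one candidate thread.

    Each insight is a dict with assumed keys: episode_id, guest, novelty.
    Returns the surviving insights, or None if the thread is dropped.
    """
    # Deduplication: keep only the best insight per episode.
    best = {}
    for ins in insights:
        ep = ins["episode_id"]
        if ep not in best or ins["novelty"] > best[ep]["novelty"]:
            best[ep] = ins
    kept = list(best.values())

    # After dedup, len(kept) equals the number of distinct episodes,
    # so one check covers both the size and multi-episode requirements.
    if len(kept) < max(min_size, min_episodes):
        return None

    # Same-guest filtering: drop 2-insight threads from a single guest.
    if len(kept) == 2 and len({i["guest"] for i in kept}) == 1:
        return None
    return kept
```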
Every insight includes:
- `video_url` - base YouTube URL
- `timestamp_start` - HH:MM:SS timestamp
- `timestamp_url` - clickable link like `https://youtube.com/watch?v=xyz&t=1633`

This is included by default in `modal_extract.py`. For existing data, run `enrich_with_video.py`.
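A sketch of how such a link can be built from the two fields above (the helper name is illustrative):

```python
def build_timestamp_url(video_url: str, timestamp_start: str) -> str:
    """Turn an HH:MM:SS timestamp into a clickable YouTube deep link."""
    h, m, s = (int(part) for part in timestamp_start.split(":"))
    seconds = h * 3600 + m * 60 + s
    sep = "&" if "?" in video_url else "?"
    return f"{video_url}{sep}t={seconds}"

# 00:27:13 -> 1633 seconds, matching the example link above.
print(build_timestamp_url("https://youtube.com/watch?v=xyz", "00:27:13"))
```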
Every insight includes:
- `essay_url` - direct link to the source essay
- `title` - essay title

This is included by default in `modal_extract_pg.py`.
The pipeline works for both content types with one key adaptation:
| Aspect | Podcasts | Essays |
|---|---|---|
| Chunk size | 500 words | 150 words |
| Extraction rate | 3.4% | 6.1% |
| Coverage | 25% | 63% |
| Content style | Conversational | Dense/polished |
| Links | YouTube timestamps | Essay URLs |
| Multiple voices | Yes (guests) | No (single author) |
| Extractor | `modal_extract.py` | `modal_extract_pg.py` |
| Database | `lennys_full.db` | `pg_essays.db` |
Why smaller chunks for essays?
Essays are dense: at 500 words, a chunk often contains 3-4 distinct ideas, and the LLM has to pick ONE, potentially missing the others. Smaller chunks (150 words) let each idea be evaluated separately.
Why higher yield for essays?
Written content is more intentional with fewer filler words and tangents compared to conversational podcasts.
Attempted but failed. The debate detection approach had a fatal flaw:
Mechanism (sketched below):
- Extract TOPIC + STANCE from each insight (LLM)
- Group insights by topic similarity (embeddings)
- Check if stances within group are opposed (LLM)
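In outline, the mechanism looked roughly like the sketch below; the function names and `stance` key are assumptions, and the LLM call is stubbed:

```python
from itertools import combinations

def check_opposition(stance_a: str, stance_b: str) -> bool:
    """Stubbed LLM call: decide whether two stances genuinely conflict.

    This is the step that failed: differences in emphasis or wording
    were routinely flagged as opposition.
    """
    raise NotImplementedError("LLM call (ran on Modal in practice)")

def find_debate_candidates(topic_groups: dict[str, list[dict]]) -> list[tuple]:
    """For each topic group, flag insight pairs whose stances oppose.

    topic_groups maps a topic string to insights carrying a `stance` key.
    """
    candidates = []
    for topic, insights in topic_groups.items():
        for a, b in combinations(insights, 2):
            if check_opposition(a["stance"], b["stance"]):
                candidates.append((topic, a, b))
    return candidates
```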
Problem: the LLM opposition checker treated ANY difference as opposition:
- "Prioritize user feedback" vs "Don't solely rely on feedback" was marked as opposition
- In reality, both recommend using feedback, just with different emphasis
- Even identical positions with different wording were marked as oppositions

Evidence: 9.5% of checked pairs were marked as "genuine opposition", but 95%+ of those were false positives.
Conclusion: The approach successfully groups related insights but cannot distinguish:
- Genuine opposition ("Do X" vs "Don't do X")
- Different emphasis ("Prioritize X" vs "Balance X and Y")
- Complementary views ("Focus on X" vs "Also consider Y")
Result: Debates removed from final output. Only threads retained.
- Python 3.8+
- Modal account (for GPU inference): https://modal.com
- Dependencies in `requirements.txt`:
  - `modal` - GPU inference platform
  - `sentence-transformers` - embedding model
  - `networkx` - graph algorithms
  - `python-louvain` - community detection
  - `scikit-learn` - similarity computations
```bash
# Clone the repository
git clone https://github.com/baboonzero/invisible-threads.git
cd invisible-threads

# Install dependencies
pip install -r requirements.txt

# Set up Modal (requires account)
modal setup
```