Discovers "invisible threads" - non-obvious insights that connect across 300+ podcast episodes or a collection of essays using LLM extraction and graph-based clustering. Extract insights, find connections, discover patterns.

Invisible Threads

🌐 Live Project: This backend powers threads.anshumani.com - a web application that visualizes invisible threads across podcasts and essays.

Invisible Threads - Find the Signal in the Noise

About

Extracts "invisible threads" from podcasts and essays - non-obvious insights that connect multiple conversations. This repository contains the backend pipeline that powers the frontend visualization at threads.anshumani.com.

Works with two content types:

  • 🎧 YouTube podcasts (transcript-based) - Links insights to exact video timestamps
  • 📝 Essays/blogs (website-based) - Links insights to source URLs

Final Results

🎧 Lenny's Podcast

  • 465 high-quality insights extracted from 13,513 chunks (3.4% rate, avg novelty 8.1/10)
  • 20 invisible threads connecting 116 insights across 75+ episodes
    • 8 major threads (3+ insights each)
    • 12 emerging threads (2 insights each)
  • 25% coverage - 1 in 4 insights forms part of a cross-episode thread
  • Direct YouTube links - each insight links to the exact timestamp in the source video

πŸ“ Paul Graham Essays

  • 270 high-quality insights extracted from 4,415 chunks (6.1% rate)
  • 22 invisible threads connecting 170 insights across 70+ essays
  • 63% coverage - higher density than podcasts due to written content
  • Direct essay links - each insight links to the source essay URL

Total across both sources:

  • 735 insights discovered
  • 42 invisible threads found
  • 286 insights connected across conversations

🌐 Live Website

This pipeline powers Invisible Threads, a web application where you can:

  • πŸ” Browse all 20 discovered threads
  • 🎧 Explore insights from Lenny's Podcast (465 insights across 303 episodes)
  • πŸ“ Read insights from Paul Graham's Essays (270 insights across 228 essays)
  • πŸ”— Click through to exact timestamps in YouTube videos or essay sources
  • 🧠 Discover non-obvious patterns across hundreds of conversations

Frontend Repository: The web interface is built separately and consumes the JSON files generated by this pipeline.

Quick Start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Set up Modal account (for GPU inference)
# Sign up at https://modal.com and run:
modal setup

# 3. Prepare your database
# You'll need a SQLite database with:
# - chunks table: text chunks with document_id (+ timestamp_start for videos)
# - documents table: metadata with video_url or essay_url

# 4a. Run pipeline for podcasts/videos
modal run modal_extract.py --db your_database.db
python find_threads_v2.py --input data/modal_extraction_*.json
modal run name_clusters.py --input data/threads_*.json

# 4b. Run pipeline for essays/blogs
modal run modal_extract_pg.py --db your_database.db
python find_threads_v2.py --input data/pg_extraction_*.json --min-episodes 2
modal run name_clusters.py --input data/threads_*.json --min-episodes 2

Pipeline Scripts

Core Pipeline (run in order)

| File | Purpose | Command |
| --- | --- | --- |
| modal_extract.py | Extract insights from podcasts | modal run modal_extract.py --db your_database.db |
| modal_extract_pg.py | Extract insights from essays/blogs | modal run modal_extract_pg.py --db your_database.db |
| find_threads_v2.py | Discover connected threads | python find_threads_v2.py --input data/modal_extraction_*.json |
| name_clusters.py | Name discovered threads | modal run name_clusters.py --input data/threads_*.json |
| check_thread_quality.py | Validate thread quality | modal run check_thread_quality.py --input data/named_threads_*.json |
| add_thread_descriptions.py | Add descriptions to 2-insight threads | modal run add_thread_descriptions.py |

Utility Scripts

| File | Purpose |
| --- | --- |
| enrich_with_video.py | Add video URLs to existing podcast data |
| create_final_export.py | Create curated final output |
| create_clean_threads_v2.py | Clean and deduplicate threads |
| fix_pg_threads.py | Fix Paul Graham essay threads |
| merge_pairs.py | Merge thread pairs |
| list_threads.py | List all threads in a file |

Experimental/Legacy

| File | Status |
| --- | --- |
| find_debates.py | Experimental - debate detection has a high false positive rate |
| validate_debates.py | Helper for debate validation |
| find_threads.py | Legacy - replaced by find_threads_v2.py |

Output Data

Final Output (in data/ directory):

Lenny's Podcast:

  • threads_final.json - 20 curated threads (116 insights) - USE THIS FOR PRODUCTION
  • modal_extraction_20260120_024600.json - 465 extracted insights with YouTube timestamps

Paul Graham Essays:

  • pg_threads_final.json - 22 curated threads (170 insights)
  • pg_extraction_*.json - 270 extracted insights with essay URLs

Intermediate Files (in data/ directory):

  • threads_v2_*.json - Raw threading output before curation
  • named_threads_v2_*.json - Threads with LLM-generated names
  • quality_check_*.json - Quality validation results

Note: Data files are excluded from git (see .gitignore). You'll need to run the pipeline to generate them.

Documentation

| File | Contents |
| --- | --- |
| README.md | This file - quick start and pipeline overview |
| PROJECT_LOG.md | Complete project history, decisions, what worked/didn't work |
| FINAL_SUMMARY.md | Executive summary of results and technical decisions |

Database Requirements

The pipeline requires a SQLite database with the following schema:

chunks table

  • id - Unique chunk identifier
  • text - Chunk text (transcript or essay)
  • document_id - Reference to source document
  • timestamp_start - (Optional) Starting timestamp for videos (HH:MM:SS format)

documents table

  • id - Unique document identifier
  • title - Document title (episode or essay)
  • video_url OR essay_url - Source URL
  • Additional metadata fields as needed

Examples:

  • lennys_full.db - 13,513 chunks from 303 podcast episodes
  • pg_essays.db - 4,415 chunks from 228 Paul Graham essays
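For reference, here is a minimal SQLite setup matching the schema described above. Column types and constraints are assumptions; only the table and column names come from this README.

```python
import sqlite3

# Build a minimal in-memory database matching the documented schema.
# Column types are assumptions; the README only specifies column names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    video_url TEXT,          -- for podcasts
    essay_url TEXT           -- for essays
);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    text TEXT NOT NULL,
    document_id INTEGER REFERENCES documents(id),
    timestamp_start TEXT     -- optional, HH:MM:SS, videos only
);
""")
conn.execute(
    "INSERT INTO documents (id, title, video_url) "
    "VALUES (1, 'Example Episode', 'https://youtube.com/watch?v=xyz')"
)
conn.execute(
    "INSERT INTO chunks (id, text, document_id, timestamp_start) "
    "VALUES (1, 'transcript chunk...', 1, '00:27:13')"
)
rows = conn.execute(
    "SELECT text, timestamp_start FROM chunks WHERE document_id = 1"
).fetchall()
print(rows)  # [('transcript chunk...', '00:27:13')]
```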

Key Difference: Chunk Size

  • Podcasts: 500 words per chunk (conversational, less dense)
  • Essays: 150 words per chunk (dense, polished writing)

Smaller chunks for essays allow each distinct idea to be evaluated separately. At 500 words, essays often contain 3-4 ideas and the LLM must pick one.
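The two chunk sizes can be reproduced with a simple word-count chunker. This is a sketch; the actual chunking code used to build the databases is not among this repo's documented scripts.

```python
def chunk_words(text, chunk_size):
    """Split text into chunks of roughly `chunk_size` words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

essay = "one idea " * 300  # 600 words of filler text
print(len(chunk_words(essay, 150)))  # essay-style 150-word chunks -> 4
print(len(chunk_words(essay, 500)))  # podcast-style 500-word chunks -> 2
```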

Key Approach

  1. Extract insights first - strict LLM filtering (must be SPECIFIC + NON-OBVIOUS + ACTIONABLE)
  2. Extract topics from insights - LLM extracts the core topic/claim from each insight
  3. Embed topics (not insights) - this captures conceptual similarity rather than vocabulary overlap
  4. Graph-based threading - Louvain community detection finds natural clusters
  5. Multi-episode filtering - threads must span 2+ different episodes (min_episodes=2)
  6. Size threshold - accept threads with 2+ insights (min_size=2)
  7. Same-guest filtering - remove 2-insight threads from the same guest (likely duplicates)
  8. Deduplication - keep only the best insight per episode per thread
  9. Manual curation - LLM naming failed, so thread names were curated by hand
  10. Video linking - each insight includes a timestamp_url for direct YouTube playback
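Steps 3-4 can be sketched as follows, using random stand-in vectors in place of sentence-transformers topic embeddings. The similarity threshold and toy data are illustrative, not the values used by find_threads_v2.py.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities

rng = np.random.default_rng(0)

# Stand-in topic embeddings: two tight concept clusters of five topics each.
# (In the real pipeline these come from embedding the extracted topics.)
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
emb = np.vstack([c + 0.05 * rng.standard_normal((5, 2)) for c in centers])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Build a similarity graph: connect topics whose cosine similarity
# clears a threshold (illustrative value).
sim = emb @ emb.T
G = nx.Graph()
G.add_nodes_from(range(len(emb)))
for i in range(len(emb)):
    for j in range(i + 1, len(emb)):
        if sim[i, j] > 0.8:
            G.add_edge(i, j, weight=float(sim[i, j]))

# Louvain community detection: each community is a candidate thread.
threads = louvain_communities(G, weight="weight", seed=0)
print(sorted(sorted(c) for c in threads))
```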

Why Topic-Based Embedding?

Problem discovered: Embedding full insights captures vocabulary overlap, not conceptual similarity.

  • 99th percentile insight similarity was only 0.496
  • Different guests using different words for the same concept → low embedding similarity
  • High similarity = near-duplicates, not conceptual connections

Solution: Extract the topic/claim from each insight → embed the topic → cluster by topic similarity.

  • This captures conceptual similarity regardless of vocabulary
  • Improved coverage from 8% to 54% (before filtering)
  • Final output: 20% coverage after quality filtering

Quality Filtering

  1. Multi-episode requirement: Min 2 different episodes (no single-source "threads")
  2. Same-guest filtering: Remove 2-insight threads from same guest (3 filtered out)
  3. Deduplication: Max 1 insight per episode per thread
  4. Remove NO_CLEAR_THREAD: LLM couldn't find coherent theme
  5. Manual name curation: LLM produced ALL_CAPS_WITH_UNDERSCORES

Result: 8 major threads (3+ insights) + 12 emerging threads (2 insights) = 20 total threads
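Filters 1-3 above can be sketched as follows. Field names like episode, guest, and novelty are assumptions about the pipeline's internal record format.

```python
# Toy insight records for one candidate thread; field names are
# assumptions about the pipeline's internal format.
thread = [
    {"episode": "ep1", "guest": "A", "novelty": 9},
    {"episode": "ep1", "guest": "A", "novelty": 7},  # duplicate episode
    {"episode": "ep2", "guest": "B", "novelty": 8},
]

def filter_thread(insights, min_episodes=2):
    # Deduplication: keep only the best insight per episode.
    best = {}
    for ins in insights:
        ep = ins["episode"]
        if ep not in best or ins["novelty"] > best[ep]["novelty"]:
            best[ep] = ins
    kept = list(best.values())
    # Multi-episode requirement: drop single-source "threads".
    if len({i["episode"] for i in kept}) < min_episodes:
        return None
    # Same-guest filtering: a 2-insight thread from one guest is
    # likely a duplicate, not a cross-conversation connection.
    if len(kept) == 2 and len({i["guest"] for i in kept}) == 1:
        return None
    return kept

print(filter_thread(thread))  # keeps the novelty-9 ep1 insight plus ep2's
```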

Source Links

For Podcasts

Every insight includes:

  • video_url - base YouTube URL
  • timestamp_start - HH:MM:SS timestamp
  • timestamp_url - clickable link like https://youtube.com/watch?v=xyz&t=1633

This is included by default in modal_extract.py. For existing data, run enrich_with_video.py.
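The timestamp_url construction amounts to converting HH:MM:SS into seconds and appending a t= parameter. This is a sketch under that assumption; the actual implementation in enrich_with_video.py may differ.

```python
def timestamp_url(video_url, timestamp):
    """Convert an HH:MM:SS timestamp into a YouTube deep link (t= seconds)."""
    h, m, s = (int(part) for part in timestamp.split(":"))
    seconds = h * 3600 + m * 60 + s
    # Append with & if the URL already has query parameters, else with ?.
    sep = "&" if "?" in video_url else "?"
    return f"{video_url}{sep}t={seconds}"

print(timestamp_url("https://youtube.com/watch?v=xyz", "00:27:13"))
# https://youtube.com/watch?v=xyz&t=1633
```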

For Essays

Every insight includes:

  • essay_url - direct link to the source essay
  • title - essay title

This is included by default in modal_extract_pg.py.

Podcasts vs Essays: Key Differences

The pipeline works for both content types with one key adaptation:

| Aspect | Podcasts | Essays |
| --- | --- | --- |
| Chunk size | 500 words | 150 words |
| Extraction rate | 3.4% | 6.1% |
| Coverage | 25% | 63% |
| Content style | Conversational | Dense/polished |
| Links | YouTube timestamps | Essay URLs |
| Multiple voices | Yes (guests) | No (single author) |
| Extractor | modal_extract.py | modal_extract_pg.py |
| Database | lennys_full.db | pg_essays.db |

Why smaller chunks for essays?

Essays are dense - at 500 words, a chunk often contains 3-4 distinct ideas, and the LLM has to pick ONE, potentially missing the others. Smaller chunks (150 words) let each idea be evaluated separately.

Why higher yield for essays?

Written content is more intentional, with fewer filler words and tangents than conversational podcasts.

What About Debates?

Attempted but failed. The debate detection approach had a fatal flaw:

Mechanism:

  1. Extract TOPIC + STANCE from each insight (LLM)
  2. Group insights by topic similarity (embeddings)
  3. Check if stances within group are opposed (LLM)

Problem: The LLM opposition checker treats ANY difference as opposition:

  • "Prioritize user feedback" vs "Don't solely rely on feedback" → marked as opposition
  • Reality: Both recommend using feedback, just different emphasis
  • Even identical positions with different wording were marked as oppositions

Evidence: 9.5% of checked pairs marked as "genuine opposition", but 95%+ were false positives.

Conclusion: The approach successfully groups related insights but cannot distinguish:

  • Genuine opposition ("Do X" vs "Don't do X")
  • Different emphasis ("Prioritize X" vs "Balance X and Y")
  • Complementary views ("Focus on X" vs "Also consider Y")

Result: Debates removed from final output. Only threads retained.

Requirements

  • Python 3.8+
  • Modal account (for GPU inference) - https://modal.com
  • Dependencies in requirements.txt:
    • modal - GPU inference platform
    • sentence-transformers - Embedding model
    • networkx - Graph algorithms
    • python-louvain - Community detection
    • scikit-learn - Similarity computations

Installation

# Clone the repository
git clone https://github.com/baboonzero/invisible-threads.git
cd invisible-threads

# Install dependencies
pip install -r requirements.txt

# Set up Modal (requires account)
modal setup
