Want the easiest path? Double-click `start_export_studio_gui.bat` (Windows) or run `start_export_studio_gui.sh` (macOS/Linux) to auto-create a virtualenv, install dependencies, and launch the GUI. Then jump to Explore Your Data below.
- Go to ChatGPT Settings → Data Controls → Export Data
- Wait for the email with your export ZIP
- Download the ZIP file (typically named `chat-data-export-yyyy-mm-dd.zip`)
```
python3 export_studio.py import /path/to/your-export.zip
```

This will:
- Extract the ZIP
- Find `conversations.json`
- Parse and normalize all conversations
- Store in SQLite database with FTS5 index
- Extract metadata (intent, topics, flags)
```
# List recent conversations
python3 export_studio.py list --limit 20

# Search for specific topics
python3 export_studio.py search "machine learning" --limit 10

# Search for code snippets
python3 export_studio.py search "python def" --limit 5
```

Launch the GUI:

```
python3 export_studio.py gui
```

The GUI lets you:
- Browse all conversations
- Search with instant results
- View messages in chronological order
- Export datasets with one click
Before using chunks for semantic search or embeddings:
```
python3 export_studio.py chunk
```

This creates overlapping chunks of 800-1200 tokens (estimated), suitable for:
- Retrieval Augmented Generation (RAG)
- Embedding generation
- Context-aware search
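The actual chunker lives in `export_studio.py`; as a minimal sketch of the idea, assuming a crude ~4-characters-per-token estimate (the function names and constants below are illustrative, not the tool's real code):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: about 4 characters per token for English text.
    return max(1, len(text) // 4)

def chunk_text(text: str, max_tokens: int = 1000, overlap: float = 0.15) -> list[str]:
    """Split text into overlapping character windows sized to ~max_tokens tokens."""
    max_chars = max_tokens * 4             # token budget -> character budget
    step = int(max_chars * (1 - overlap))  # advancing by less than a window leaves a 15% overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + max_chars]
        if piece.strip():
            chunks.append(piece)
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

Because adjacent chunks share their tail and head, a sentence cut at a boundary still appears whole in at least one chunk.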
```
python3 export_studio.py export corpus ./my_corpus
```

Creates:

- `corpus.jsonl`: structured records with role, intent, topics, timestamps
- `corpus.txt`: plain-text format with separators
- `manifest.json`: export metadata and checksums
Use cases:
- Training language models
- Fine-tuning on your writing style
- Analysis and statistics
```
python3 export_studio.py export ssr ./my_ssr
```

Creates:

- `ssr.jsonl`: full Structured Semantic Records with all metadata
- Schema version tracked for reproducibility
SSR includes:
- Stable IDs
- Parent-child relationships
- Intent classification
- Topic extraction
- Content hashes
- Temporal information
```
python3 export_studio.py export pairs ./my_pairs
```

Creates:

- `pairs.jsonl`: question-answer pairs mined from conversations
Format:
```json
{
  "id": "pair_xxx_yyy",
  "a": "user question",
  "b": "assistant answer",
  "label": 1,
  "type": "qa",
  "meta": {"conversation_id": "...", "intent": "question"}
}
```

Use cases:
- Supervised fine-tuning
- Question-answering models
- Instruction following
```
python3 export_studio.py export triples ./my_triples
```

Creates:

- `triples.jsonl`: anchor, positive, negative triplets
Format:
```json
{
  "anchor": "user message",
  "positive": "correct assistant response",
  "negative": "unrelated response from different conversation",
  "meta": {"anchor_id": "...", "pos_id": "..."}
}
```

Use cases:
- Contrastive learning
- Embedding model training
- Semantic similarity models
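A quick way to sanity-check a triples file before training: each anchor should sit closer to its positive than to its negative. The sketch below uses token-overlap (Jaccard) similarity as a crude stand-in for an embedding model:

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def triple_accuracy(triples) -> float:
    """Fraction of triples where the anchor is more similar to the positive
    than to the negative (a rough data-quality signal, not a model metric)."""
    if not triples:
        return 0.0
    hits = sum(
        1 for t in triples
        if jaccard(t["anchor"], t["positive"]) > jaccard(t["anchor"], t["negative"])
    )
    return hits / len(triples)
```

A low score suggests the negatives are accidentally on-topic, which weakens contrastive training.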
Specify a custom database:

```
python3 export_studio.py --db /path/to/my_database.db list
```

Default: `export_studio.db` in the current directory.
Force re-import (if you've edited the export):

```
python3 export_studio.py import my-export.zip --force
```

Without `--force`, duplicate exports (same hash) are skipped.
Metadata is extracted automatically using deterministic heuristics:
Intent Detection:
- `question`: contains "?" or starts with interrogatives (what, why, how)
- `instruction`: starts with imperative verbs (build, create, make)
- `explanation`: contains because/therefore/means
- `plan`: contains plan/roadmap/milestone keywords
- `other`: default fallback
Flags:
- `is_question`: question marks or interrogative starters
- `is_code`: code fences (`` ``` ``) or high keyword density
- `is_list`: multiple lines starting with `-`, `*`, or numbers
- `has_steps`: numbered steps or "Step N" patterns
Topics:
- Top 10 keywords after removing stopwords
- Deterministic, reproducible
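As a sketch of how deterministic intent heuristics like these work (the keyword tuples below are illustrative guesses, not the tool's actual lists):

```python
INTERROGATIVES = ("what", "why", "how", "when", "where", "who")
IMPERATIVES = ("build", "create", "make", "write", "add", "fix")

def detect_intent(text: str) -> str:
    """Classify a message with fixed keyword rules; same input, same output."""
    lowered = text.strip().lower()
    words = lowered.split()
    first = words[0] if words else ""
    if "?" in lowered or first in INTERROGATIVES:
        return "question"
    if first in IMPERATIVES:
        return "instruction"
    if any(k in lowered for k in ("because", "therefore", "means")):
        return "explanation"
    if any(k in lowered for k in ("plan", "roadmap", "milestone")):
        return "plan"
    return "other"
```

Because there is no randomness or model involved, re-running the import always reproduces the same metadata.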
The PIIRedactor class automatically detects and redacts:
- Email addresses → `[REDACTED_EMAIL_N]`
- Phone numbers → `[REDACTED_PHONE]`
- SSN patterns → `[REDACTED_SSN]`
Future enhancement: Add --redact flag to export commands.
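Until then, a standalone sketch of the same idea (the regexes below are illustrative, not `PIIRedactor`'s actual patterns, and real-world PII matching needs considerably more care):

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.\w+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")  # 3-3-4 digit groups
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")              # 3-2-4 digit groups

def redact(text: str) -> str:
    """Replace emails (numbered per distinct address), SSNs, then phone numbers."""
    seen = {}
    def number_email(match):
        addr = match.group(0)
        seen.setdefault(addr, len(seen) + 1)  # same address -> same number
        return f"[REDACTED_EMAIL_{seen[addr]}]"
    text = EMAIL_RE.sub(number_email, text)
    text = SSN_RE.sub("[REDACTED_SSN]", text)
    text = PHONE_RE.sub("[REDACTED_PHONE]", text)
    return text
```

Numbering distinct emails preserves "same sender" structure in the redacted text without revealing the address.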
```
pip install pyinstaller
pyinstaller export_studio.spec
```

Output: `dist/ExportStudio.exe` (single-file executable)

```
ExportStudio.exe gui
ExportStudio.exe import my-export.zip
ExportStudio.exe list
```

The database schema supports projects for organizing conversations:
- Group related conversations
- Track exports per project
- Version control your datasets
Default: 800-1200 tokens, 15% overlap
Good for:
- Embedding generation (512-1024 token models)
- RAG retrieval
- Context windows
Recommended workflow:
- Import export ZIP
- Review conversations in GUI
- Chunk conversations
- Export corpus for analysis
- Export pairs/triples for training
- Export SSR for archival
FTS5 supports:
- Phrase search: `"exact phrase"`
- Boolean: `python AND machine learning`
- Prefix: `embed*` matches embed, embeddings, embedded
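You can try the same syntax directly against any FTS5 table using Python's built-in `sqlite3` module. The `messages`/`content` names below are illustrative, not necessarily the tool's real schema:

```python
import sqlite3

# In-memory FTS5 table just to demonstrate the query syntax.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE messages USING fts5(content)")
conn.executemany("INSERT INTO messages(content) VALUES (?)", [
    ("python machine learning basics",),
    ("word embeddings explained",),
    ("an exact phrase appears here",),
])

def search(query: str) -> list[str]:
    """Run an FTS5 MATCH query and return matching message texts."""
    rows = conn.execute(
        "SELECT content FROM messages WHERE messages MATCH ?", (query,)
    ).fetchall()
    return [r[0] for r in rows]
```

Passing the query string as a bound parameter (rather than interpolating it) keeps the FTS5 syntax intact while avoiding SQL injection.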
Every export includes:
- Input hash (source data)
- Config hash (parameters)
- Output hash (generated data)
- Timestamps
- Record counts
This ensures:
- Reproducible pipelines
- Traceable artifacts
- Auditable datasets
Ensure your ZIP contains `conversations.json` at the root or in a subdirectory.
Check:
- Was the import successful?
- Are messages in the database? Check with `python3 export_studio.py list`
- Try simpler queries first
Ensure Tkinter is installed:
- Ubuntu: `sudo apt-get install python3-tk`
- macOS: included with Python
- Windows: Included with Python
Close other connections to the database. Only one write connection at a time.
- Import: ~1000 conversations/second
- Search: FTS5 is fast, even with millions of messages
- Chunking: ~500 conversations/second
- Export: Limited by disk I/O
For very large exports (>100k conversations):
- Consider batching exports
- Use SSD storage
- Increase system memory
- ✅ Import your export
- ✅ Explore with GUI or CLI
- ✅ Chunk for RAG
- ✅ Export datasets
- 🚧 Add local embeddings (future enhancement)
- 🚧 Implement hybrid search (future enhancement)
- 🚧 Train custom models on your data
- Documentation: README.md
- Issues: GitHub Issues
- Source: GitHub Repository
Privacy First: All processing happens locally. No data leaves your machine.