A Retrieval-Augmented Generation (RAG) system that aggregates peer experiences from running communities to help runners find training strategies, injury prevention tips, and race preparation advice.
PaceWise uses a large language model to synthesize insights from 150+ running discussion threads, covering everything from beginner 5K training to ultra marathon preparation, with extensive coverage of the Lydiard training method.
- Comprehensive Coverage: 600+ runner experiences across all distance categories
- Lydiard Method: Detailed content on all five Lydiard training phases (base building, hills, anaerobic, coordination, taper)
- Safety Guardrails: Automatic detection of serious injury concerns with professional referral guidance
- Semantic Search: ChromaDB vector database for intelligent retrieval
- Modern UI: Clean, athletic-themed Streamlit interface
- LLM: Groq API (Llama 3.3 70B Versatile)
- Vector Database: ChromaDB (local persistence)
- Embeddings: HuggingFace sentence-transformers (all-MiniLM-L6-v2)
- Orchestration: LangChain
- UI: Streamlit with custom CSS
- Deployment: Streamlit Cloud ready
- 1-Mile to 5K (beginners)
- 10K training (intermediate)
- Half Marathon (13.1 miles)
- Marathon (26.2 miles)
- Ultra Marathon (50K, 50-mile, 100-mile)
- Training plans and weekly mileage progression
- Lydiard method (all five phases)
- Speed work and interval training
- Long run strategies
- Injury prevention and recovery
- Nutrition and hydration
- Race day preparation and pacing
- Mental strategies for distance running
- Cross-training and strength work
- r/running (general running discussion)
- r/AdvancedRunning (competitive runners)
- r/firstmarathon (marathon beginners)
- r/ultrarunning (ultra distance specialists)
- Python 3.8+
- Groq API key (get from https://console.groq.com)
- Clone or download this repository
- Create virtual environment
    python3 -m venv venv
    source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies
    pip install -r requirements.txt
- Configure API key
  Create .streamlit/secrets.toml:
    GROQ_API_KEY = "your_groq_api_key_here"
- Generate dataset
    python enhanced_dataset_generator.py
  This creates running_data.json with 150+ threads and 600+ responses.
- Ingest data and create vector store
    python ingest.py
  This processes the JSON and creates the ChromaDB vector database in chroma_db/.
- Run the application
    streamlit run app_modern.py
  The app will open in your browser at http://localhost:8501.
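If the app fails to start, it can help to confirm that each setup step left its artifact behind. The script below is a minimal sketch and is not part of the repository: the function name `missing_setup_steps` and the hint strings are made up for illustration, and it only checks that the paths named in the steps above exist, not that their contents are valid.

```python
from pathlib import Path

# Artifacts named in the setup steps, in the order they are created.
# The hint text is illustrative only.
EXPECTED = [
    (".streamlit/secrets.toml", "create it and add GROQ_API_KEY"),
    ("running_data.json", "run: python enhanced_dataset_generator.py"),
    ("chroma_db", "run: python ingest.py"),
]


def missing_setup_steps(root="."):
    """Return (path, hint) pairs for setup artifacts not found under root."""
    root = Path(root)
    return [(rel, hint) for rel, hint in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    problems = missing_setup_steps()
    for rel, hint in problems:
        print(f"missing {rel}: {hint}")
    if not problems:
        print("setup looks complete; run: streamlit run app_modern.py")
```

Run it from the repository root before launching Streamlit.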
Lydiard Training:
- "What is Lydiard base building and how long should it last?"
- "How do I execute Lydiard hill circuits?"
- "Explain the Lydiard anaerobic phase for marathoners"
General Training:
- "How should I build weekly mileage for my first marathon?"
- "What's the best pacing strategy for marathon?"
- "How many gels should I take during a marathon?"
Injury Prevention:
- "How can I prevent IT band syndrome?"
- "What strengthening exercises prevent runner's knee?"
- "Dealing with plantar fasciitis during training"
PaceWise includes automatic safety guardrails that detect high-risk keywords indicating serious injury:
- Stress fractures
- Severe pain symptoms
- Cardiac concerns
- Overtraining red flags
When triggered, the system provides professional referral guidance instead of peer advice.
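The guardrail described above can be sketched as a keyword check that runs before any retrieval. This is an illustration under assumptions, not the code in rag_engine.py: the phrases in `INJURY_RED_FLAGS`, the referral wording, and the helper names `find_red_flags` and `answer` are all hypothetical.

```python
import re

# Hypothetical keyword list; the project's real INJURY_RED_FLAGS may differ.
INJURY_RED_FLAGS = [
    "stress fracture",
    "severe pain",
    "chest pain",
    "fainted",
    "overtraining syndrome",
]

# Hypothetical referral wording.
REFERRAL_MESSAGE = (
    "This may be a serious issue. Please see a doctor or physical therapist "
    "before continuing training."
)


def find_red_flags(query):
    """Return the red-flag phrases found in the query, case-insensitively."""
    text = query.lower()
    # Only a leading word boundary is required, so plurals such as
    # "stress fractures" still match.
    return [
        flag for flag in INJURY_RED_FLAGS
        if re.search(r"\b" + re.escape(flag), text)
    ]


def answer(query, peer_answer_fn):
    """Return referral guidance if any red flag matches, else the peer answer."""
    flags = find_red_flags(query)
    if flags:
        return {"type": "referral", "flags": flags, "message": REFERRAL_MESSAGE}
    return {"type": "peer", "flags": [], "message": peer_answer_fn(query)}
```

Plain keyword matching is easy to audit and extend, but it misses paraphrases ("my shin bone might be cracked"), so the list needs to be broad.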
pacewise/
├── requirements.txt # Python dependencies
├── enhanced_dataset_generator.py # Dataset generation script
├── running_data.json # Generated discussion data
├── ingest.py # Vector store creation
├── rag_engine.py # Core RAG logic
├── app_modern.py # Streamlit application
├── .streamlit/
│ └── secrets.toml # API keys (gitignored)
├── chroma_db/ # Vector database (gitignored)
├── .gitignore # Git exclusions
├── README.md # This file
└── DEPLOYMENT.md # Cloud deployment guide
The enhanced_dataset_generator.py script creates realistic running discussion threads with:
- Authentic runner terminology (PR, BQ, tempo runs, etc.)
- Specific paces, distances, and workout structures
- Progressive training advice (not just static tips)
- Lydiard method coverage across all training phases
The ingest.py script:
- Loads thread data from JSON
- Chunks content for optimal retrieval
- Generates embeddings using HuggingFace transformers
- Persists to ChromaDB for fast semantic search
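The chunking step can be sketched as overlapping word windows that keep the thread ID attached to each chunk. This is a simplified stand-in, not the logic in ingest.py: the `thread_id` and `text` field names, the default sizes, and both function names are assumptions, and the actual script may use a LangChain text splitter instead.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into windows of chunk_size words, each sharing
    `overlap` words with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window reaches the end of the text, so the tail
        # is not emitted again as a shorter duplicate chunk.
        if start + chunk_size >= len(words):
            break
    return chunks


def thread_to_records(thread, **kwargs):
    """Attach the thread ID to every chunk so answers can cite their source.

    `thread` is assumed to look like {"thread_id": ..., "text": ...}.
    """
    return [
        {
            "id": f"{thread['thread_id']}-{i}",
            "thread_id": thread["thread_id"],
            "text": chunk,
        }
        for i, chunk in enumerate(chunk_words(thread["text"], **kwargs))
    ]
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighboring chunk, at the cost of some duplicated storage.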
When a user asks a question:
- Safety guardrails check for high-risk keywords
- Query is embedded using the same model
- ChromaDB retrieves top-k relevant chunks
- Groq LLM synthesizes answer from retrieved context
- Sources are cited with thread IDs
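The retrieval and citation steps above can be illustrated without any external services. The sketch below ranks records by cosine similarity in plain Python and labels each retrieved chunk with its thread ID in the prompt. It stands in for what ChromaDB and the prompt template do in the real app: the record layout, the prompt wording, and the function names are assumptions, and the embeddings here are toy vectors rather than all-MiniLM-L6-v2 output.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, records, k=3):
    """Return the k records whose embeddings are most similar to the query."""
    ranked = sorted(
        records,
        key=lambda r: cosine(query_vec, r["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, chunks):
    """Label each chunk with its thread ID so the model can cite it."""
    context = "\n\n".join(f"[{c['thread_id']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the runner experiences below. "
        "Cite thread IDs in square brackets.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


if __name__ == "__main__":
    # Toy two-dimensional embeddings, for illustration only.
    records = [
        {"thread_id": "t1", "text": "Build base mileage slowly.", "embedding": [1.0, 0.0]},
        {"thread_id": "t2", "text": "Take gels every 30-45 minutes.", "embedding": [0.0, 1.0]},
    ]
    hits = top_k([0.9, 0.1], records, k=1)
    print(build_prompt("How do I build mileage?", hits))
```

The resulting prompt string is what would be sent to the LLM. Carrying the thread ID alongside each chunk is what makes the final citation step possible.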
Streamlit provides:
- Clean, athletic-themed interface
- Example queries for easy exploration
- Source citation display
- Safety warnings when appropriate
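Run the RAG engine on its own:
    python rag_engine.py
Regenerate the dataset:
    python enhanced_dataset_generator.py
    python ingest.py # Re-create vector store
To change the prompt, edit the _create_prompt_template() method in rag_engine.py.
To change the safety keywords, update the INJURY_RED_FLAGS list in the RunningRAGEngine class.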
See DEPLOYMENT.md for detailed Streamlit Cloud deployment instructions.
Quick steps:
- Push code to GitHub (excluding secrets and chroma_db)
- Deploy on Streamlit Cloud
- Add GROQ_API_KEY to Streamlit secrets
- App will automatically run ingestion on first load
- Not Medical Advice: This system provides peer experiences only. Always consult qualified professionals for medical concerns.
- Synthetic Data: Current dataset is generated, not scraped from real forums
- Knowledge Cutoff: LLM has training cutoff; may not reflect latest running science
- Context Window: Limited to retrieved chunks; may miss broader discussion context
- Real forum data scraping (with proper permissions)
- Integration with training log APIs (Strava, Garmin)
- Personalized recommendations based on user profile
- Workout plan generation
- Progress tracking integration
This project is for educational and personal use. Respect copyright and terms of service when adapting for real forum data.
- Groq for fast LLM inference
- LangChain for RAG orchestration
- ChromaDB for vector storage
- HuggingFace for embedding models
- Running community for inspiring the content
For issues or questions:
- Check DEPLOYMENT.md for deployment issues
- Verify API key configuration
- Ensure vector store was created successfully
- Check Streamlit logs for errors
Built with dedication to the running community. Keep moving forward.