olneyjR/pace-wise-RAG

PaceWise - Running & Marathon Training Consensus Engine

A Retrieval-Augmented Generation (RAG) system that aggregates peer experiences from running communities to help runners find training strategies, injury prevention tips, and race preparation advice.

Overview

PaceWise retrieves relevant passages from 150+ running discussion threads and uses an LLM to synthesize them into a single answer. The threads cover everything from beginner 5K training to ultramarathon preparation, with extensive coverage of the Lydiard training method.

Key Features

  • Comprehensive Coverage: 600+ runner experiences across all distance categories
  • Lydiard Method: Detailed content on all five Lydiard training phases (base building, hills, anaerobic, coordination, taper)
  • Safety Guardrails: Automatic detection of serious injury concerns with professional referral guidance
  • Semantic Search: ChromaDB vector database for intelligent retrieval
  • Modern UI: Clean, athletic-themed Streamlit interface

Tech Stack

  • LLM: Groq API (Llama 3.3 70B Versatile)
  • Vector Database: ChromaDB (local persistence)
  • Embeddings: HuggingFace sentence-transformers (all-MiniLM-L6-v2)
  • Orchestration: LangChain
  • UI: Streamlit with custom CSS
  • Deployment: Streamlit Cloud ready

Dataset Coverage

Distance Categories

  • 1-Mile to 5K (beginners)
  • 10K training (intermediate)
  • Half Marathon (13.1 miles)
  • Marathon (26.2 miles)
  • Ultra Marathon (50K, 50-mile, 100-mile)

Topic Areas

  • Training plans and weekly mileage progression
  • Lydiard method (all five phases)
  • Speed work and interval training
  • Long run strategies
  • Injury prevention and recovery
  • Nutrition and hydration
  • Race day preparation and pacing
  • Mental strategies for distance running
  • Cross-training and strength work

Simulated Communities

  • r/running (general running discussion)
  • r/AdvancedRunning (competitive runners)
  • r/firstmarathon (marathon beginners)
  • r/ultrarunning (ultra distance specialists)

Installation

Prerequisites

Setup Instructions

  1. Clone or download this repository

  2. Create virtual environment

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

  3. Install dependencies

pip install -r requirements.txt

  4. Configure API key

Create .streamlit/secrets.toml:

GROQ_API_KEY = "your_groq_api_key_here"

  5. Generate dataset

python enhanced_dataset_generator.py

This creates running_data.json with 150+ threads and 600+ responses.

  6. Ingest data and create vector store

python ingest.py

This processes the JSON and creates the ChromaDB vector database in chroma_db/.

  7. Run the application

streamlit run app_modern.py

The app will open in your browser at http://localhost:8501.

Usage

Example Queries

Lydiard Training:

  • "What is Lydiard base building and how long should it last?"
  • "How do I execute Lydiard hill circuits?"
  • "Explain the Lydiard anaerobic phase for marathoners"

General Training:

  • "How should I build weekly mileage for my first marathon?"
  • "What's the best pacing strategy for marathon?"
  • "How many gels should I take during a marathon?"

Injury Prevention:

  • "How can I prevent IT band syndrome?"
  • "What strengthening exercises prevent runner's knee?"
  • "Dealing with plantar fasciitis during training"

Safety Features

PaceWise includes automatic safety guardrails that detect high-risk keywords indicating serious injury:

  • Stress fractures
  • Severe pain symptoms
  • Cardiac concerns
  • Overtraining red flags

When triggered, the system provides professional referral guidance instead of peer advice.
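
As a rough illustration of the guardrail described above, here is a minimal keyword-matching sketch. The phrases, the referral message, and the function names find_red_flags and guard are assumptions made for this example. The actual INJURY_RED_FLAGS list and matching logic live in rag_engine.py and may differ.

```python
import re
from typing import List, Optional

# Hypothetical red-flag phrases for illustration only; the project's real
# INJURY_RED_FLAGS list is defined in rag_engine.py and may differ.
INJURY_RED_FLAGS = [
    "stress fracture",
    "severe pain",
    "chest pain",
    "heart palpitations",
    "overtraining syndrome",
]

# Hypothetical wording; the real referral text is not shown in this README.
REFERRAL_MESSAGE = (
    "This sounds like it may need professional attention. "
    "Please consult a doctor or physical therapist before continuing to train."
)


def find_red_flags(query: str, flags: List[str] = INJURY_RED_FLAGS) -> List[str]:
    """Return the red-flag phrases found in the query (case-insensitive)."""
    # Lowercase and collapse runs of whitespace so "Stress  Fracture" still matches.
    normalized = re.sub(r"\s+", " ", query.lower())
    return [flag for flag in flags if flag in normalized]


def guard(query: str) -> Optional[str]:
    """Return a referral message if the query trips a red flag, else None."""
    return REFERRAL_MESSAGE if find_red_flags(query) else None
```

A caller would check guard(query) first and only proceed to retrieval when it returns None. Plain substring matching is easy to audit but can miss paraphrases, which is one reason the keyword list is meant to be extended.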

Project Structure

pacewise/
├── requirements.txt                  # Python dependencies
├── enhanced_dataset_generator.py     # Dataset generation script
├── running_data.json                 # Generated discussion data
├── ingest.py                         # Vector store creation
├── rag_engine.py                     # Core RAG logic
├── app_modern.py                     # Streamlit application
├── .streamlit/
│   └── secrets.toml                  # API keys (gitignored)
├── chroma_db/                        # Vector database (gitignored)
├── .gitignore                        # Git exclusions
├── README.md                         # This file
└── DEPLOYMENT.md                     # Cloud deployment guide

How It Works

1. Data Generation

The enhanced_dataset_generator.py script creates realistic running discussion threads with:

  • Authentic runner terminology (PR, BQ, tempo runs, etc.)
  • Specific paces, distances, and workout structures
  • Progressive training advice (not just static tips)
  • Lydiard method coverage across all training phases

2. Vector Store Creation

The ingest.py script:

  • Loads thread data from JSON
  • Chunks content for optimal retrieval
  • Generates embeddings using HuggingFace transformers
  • Persists to ChromaDB for fast semantic search
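
The loading and chunking steps above might look roughly like the following sketch. It uses only the standard library and leaves out embedding and persistence. The JSON field names (thread_id, title, responses, text), the chunk sizes, and the function names are assumptions for illustration, not the schema or settings that ingest.py actually uses.

```python
import json
from typing import Dict, Iterator, List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def thread_to_chunks(thread: Dict, **kwargs) -> Iterator[Dict]:
    """Yield one record per chunk, keeping the thread ID as metadata for citation."""
    # Assumed schema: {"thread_id": ..., "title": ..., "responses": [{"text": ...}]}
    body = "\n\n".join([thread["title"]] + [r["text"] for r in thread["responses"]])
    for i, chunk in enumerate(chunk_text(body, **kwargs)):
        yield {
            "id": f'{thread["thread_id"]}-{i}',
            "text": chunk,
            "metadata": {"thread_id": thread["thread_id"]},
        }


def load_chunks(path: str) -> List[Dict]:
    """Load a JSON list of threads and flatten it into chunk records."""
    with open(path, encoding="utf-8") as f:
        threads = json.load(f)
    return [chunk for thread in threads for chunk in thread_to_chunks(thread)]
```

Carrying the thread ID in each chunk's metadata is what allows answers to cite their source threads later in the pipeline.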

3. RAG Query Pipeline

When a user asks a question:

  1. Safety guardrails check for high-risk keywords
  2. Query is embedded using the same model
  3. ChromaDB retrieves top-k relevant chunks
  4. Groq LLM synthesizes answer from retrieved context
  5. Sources are cited with thread IDs
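
To make the retrieval step concrete, here is a small pure-Python sketch of top-k selection by cosine similarity over an in-memory list of vectors. It stands in for what the vector database does internally. The function names and the toy two-dimensional vectors are assumptions for illustration, and the distance metric the project's ChromaDB collection actually uses may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    store: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (thread_id, score) pairs most similar to the query vector."""
    scored = [(thread_id, cosine_similarity(query_vec, vec)) for thread_id, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy example: real embeddings from all-MiniLM-L6-v2 have 384 dimensions.
store = [("t1", [1.0, 0.0]), ("t2", [0.0, 1.0]), ("t3", [1.0, 1.0])]
results = top_k([1.0, 0.0], store, k=2)
```

The query must be embedded with the same model as the stored chunks, otherwise the vectors do not share a space and the similarity scores are meaningless.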

4. UI Layer

Streamlit provides:

  • Clean, athletic-themed interface
  • Example queries for easy exploration
  • Source citation display
  • Safety warnings when appropriate

Development

Testing RAG Engine

python rag_engine.py

Regenerating Dataset

python enhanced_dataset_generator.py
python ingest.py  # Re-create vector store

Modifying Prompt Template

Edit the _create_prompt_template() method in rag_engine.py.
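
For orientation, a prompt template for this kind of system could be as simple as the sketch below. The wording, the PROMPT_TEMPLATE constant, and the build_prompt helper are hypothetical. The real template is whatever _create_prompt_template() returns, and it may be built with LangChain's prompt classes rather than plain string formatting.

```python
from typing import Dict, List

# Hypothetical template text; not the wording used in rag_engine.py.
PROMPT_TEMPLATE = """You summarize peer experiences from running forums.
Answer using only the context below and cite thread IDs in square brackets.
If the context does not cover the question, say so.

Context:
{context}

Question: {question}
"""


def build_prompt(question: str, chunks: List[Dict[str, str]]) -> str:
    """Fill the template with retrieved chunks, each prefixed by its thread ID."""
    context = "\n\n".join(f"[{c['thread_id']}] {c['text']}" for c in chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

Prefixing each chunk with its thread ID gives the model something concrete to cite, which supports the source citation display in the UI.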

Adding Safety Keywords

Update the INJURY_RED_FLAGS list in the RunningRAGEngine class.

Deployment

See DEPLOYMENT.md for detailed Streamlit Cloud deployment instructions.

Quick steps:

  1. Push code to GitHub (excluding secrets and chroma_db)
  2. Deploy on Streamlit Cloud
  3. Add GROQ_API_KEY to Streamlit secrets
  4. App will automatically run ingestion on first load
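
One way the first-load ingestion in step 4 could work is to check whether the vector store directory exists before building it. The sketch below is an assumption about that behavior, not code from the repository. The ensure_vector_store name and the run_ingestion callable are hypothetical.

```python
from pathlib import Path
from typing import Callable


def ensure_vector_store(db_dir: str, run_ingestion: Callable[[], None]) -> bool:
    """Run ingestion only if db_dir is missing or empty.

    Returns True if ingestion ran, False if an existing store was reused.
    """
    path = Path(db_dir)
    if path.is_dir() and any(path.iterdir()):
        return False
    run_ingestion()
    return True
```

In the app this might be called once at startup, for example ensure_vector_store("chroma_db", build_store), where build_store is a hypothetical function wrapping the logic in ingest.py. Because chroma_db/ is gitignored, a fresh cloud deployment starts without it and would take the ingestion branch.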

Limitations

  • Not Medical Advice: This system provides peer experiences only. Always consult qualified professionals for medical concerns.
  • Synthetic Data: Current dataset is generated, not scraped from real forums
  • Knowledge Cutoff: LLM has training cutoff; may not reflect latest running science
  • Context Window: Limited to retrieved chunks; may miss broader discussion context

Future Enhancements

  • Real forum data scraping (with proper permissions)
  • Integration with training log APIs (Strava, Garmin)
  • Personalized recommendations based on user profile
  • Workout plan generation
  • Progress tracking integration

License

This project is for educational and personal use. Respect copyright and terms of service when adapting for real forum data.

Acknowledgments

  • Groq for fast LLM inference
  • LangChain for RAG orchestration
  • ChromaDB for vector storage
  • HuggingFace for embedding models
  • Running community for inspiring the content

Support

For issues or questions:

  1. Check DEPLOYMENT.md for deployment issues
  2. Verify API key configuration
  3. Ensure vector store was created successfully
  4. Check Streamlit logs for errors

Built with dedication to the running community. Keep moving forward.
