Next Step Development Plan v0.2.0 #2

@fbkaragoz

Stage 1: Attention Visualization

  • Add NF_MSG_ATTENTION_PATTERN packet type (token→token edges per head)
  • Implement attention matrix sampling in C++ probe
  • Frontend: render token graphs with edge weights (not neuron scatter)
  • Target: visualize multi-head attention without full connectivity assumption
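
A minimal sketch of how the sparse token→token edges for one head might be sampled and packed. Everything here is an assumption made for illustration: the numeric value of `NF_MSG_ATTENTION_PATTERN`, the header layout, and the `top_k` / `min_weight` sampling rule are not defined in this issue, and the real sampling would live in the C++ probe.

```python
import struct

# Hypothetical ID: the real value of NF_MSG_ATTENTION_PATTERN is not defined yet.
NF_MSG_ATTENTION_PATTERN = 0x10

def sample_edges(attn, top_k=2, min_weight=0.05):
    """Keep at most top_k strongest keys per query token and drop weak edges."""
    edges = []
    for q, row in enumerate(attn):
        ranked = sorted(range(len(row)), key=lambda k: row[k], reverse=True)[:top_k]
        for k in ranked:
            if row[k] >= min_weight:
                edges.append((q, k, row[k]))
    return edges

def pack_attention_packet(layer, head, edges):
    """Assumed layout: u8 type, u16 layer, u16 head, u32 edge count, then edges."""
    header = struct.pack("<BHHI", NF_MSG_ATTENTION_PATTERN, layer, head, len(edges))
    # Each edge: u16 query token, u16 key token, f32 attention weight.
    body = b"".join(struct.pack("<HHf", q, k, w) for q, k, w in edges)
    return header + body

# One causal attention head over three tokens (rows = queries, columns = keys).
attn = [
    [1.00, 0.00, 0.00],
    [0.30, 0.70, 0.00],
    [0.02, 0.18, 0.80],
]
edges = sample_edges(attn)
packet = pack_attention_packet(layer=0, head=3, edges=edges)
```

Sending only the strongest edges per query keeps the packet size linear in sequence length rather than quadratic, which is the point of not assuming full connectivity.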

Stage 2: Performance Profiling

  • Measure actual training overhead (currently untested in a real training loop)
  • Add configurable backpressure (drop-oldest vs drop-newest)
  • Benchmark ring buffer contention under high-frequency hooks
  • Target: validate <5% overhead claim with real models
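
The two backpressure policies can be sketched as a bounded queue that counts drops, plus a helper for the overhead figure. This is an illustration only: `BoundedQueue` and `overhead_percent` are hypothetical names, and the real ring buffer lives in the C++ probe.

```python
from collections import deque

class BoundedQueue:
    """Bounded FIFO with a configurable policy for what to drop when full."""

    def __init__(self, capacity, policy="drop-oldest"):
        if policy not in ("drop-oldest", "drop-newest"):
            raise ValueError(f"unknown policy: {policy}")
        self.capacity = capacity
        self.policy = policy
        self.items = deque()
        self.dropped = 0

    def push(self, item):
        """Return True if the item was stored, False if it was dropped."""
        if len(self.items) < self.capacity:
            self.items.append(item)
            return True
        self.dropped += 1
        if self.policy == "drop-oldest":
            self.items.popleft()      # evict the stalest packet
            self.items.append(item)
            return True
        return False                  # drop-newest: reject the incoming packet

def overhead_percent(baseline_s, instrumented_s):
    """Relative slowdown of an instrumented run versus a baseline run."""
    return 100.0 * (instrumented_s - baseline_s) / baseline_s

oldest = BoundedQueue(3, "drop-oldest")
newest = BoundedQueue(3, "drop-newest")
for i in range(5):
    oldest.push(i)
    newest.push(i)
```

Drop-oldest favors a live view that always shows the latest state; drop-newest preserves a contiguous prefix, which may suit recording. Tracking `dropped` makes the cost of either policy visible in benchmarks.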

Stage 3: MoE Support

  • Add expert routing packet (NF_MSG_EXPERT_ROUTING)
  • Track per-expert utilization and load balancing
  • Frontend: expert utilization heatmap per layer
  • Target: debug MoE expert collapse during training
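
One way per-expert utilization and a collapse signal could be computed from routing decisions is sketched below. The function names and the use of normalized entropy as the load-balance metric are assumptions, not part of the existing protocol.

```python
import math
from collections import Counter

def expert_utilization(assignments, num_experts):
    """Fraction of tokens routed to each expert (top-1 routing assumed)."""
    if not assignments:
        return [0.0] * num_experts
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(e, 0) / total for e in range(num_experts)]

def normalized_entropy(fractions):
    """1.0 means perfectly balanced load; values near 0 suggest collapse."""
    if len(fractions) < 2:
        raise ValueError("need at least two experts")
    h = -sum(p * math.log(p) for p in fractions if p > 0)
    return h / math.log(len(fractions))

balanced = expert_utilization([0, 1, 2, 3] * 4, num_experts=4)
collapsed = expert_utilization([0] * 15 + [1], num_experts=4)
```

Here `balanced` gives every expert a quarter of the tokens, while `collapsed` sends 15 of 16 tokens to expert 0. Plotting the normalized entropy per layer over training steps would complement the heatmap with a single scalar to alert on.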

Stage 4: Production Hardening

  • Handle packet version mismatches (currently no graceful fallback)
  • Add reconnection logic with state sync (NF_OP_STATE_SNAPSHOT exists but unused)
  • Clean shutdown on Python interpreter exit (currently commented out)
  • Add CI/CD for build matrix (Python 3.9-3.12, CUDA/CPU variants)
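
A graceful fallback on version mismatch might look like the sketch below: the decoder returns a status instead of raising, so a client can skip packets it does not understand. The header layout, the `MAGIC` bytes, and the major/minor scheme are all hypothetical; the actual wire format is not described in this issue.

```python
import struct

# Hypothetical header: 4-byte magic, u8 major, u8 minor, u32 payload length.
HEADER = struct.Struct("<4sBBI")
MAGIC = b"NFPK"          # placeholder, not the real magic bytes
SUPPORTED_MAJOR = 0      # assumption: v0.x clients accept any 0.y packet

def encode(major, minor, payload):
    return HEADER.pack(MAGIC, major, minor, len(payload)) + payload

def decode(buf):
    """Return (status, payload); payload is None unless status is 'ok'."""
    if len(buf) < HEADER.size:
        return ("incomplete", None)
    magic, major, minor, length = HEADER.unpack_from(buf)
    if magic != MAGIC:
        return ("bad-magic", None)
    if major != SUPPORTED_MAJOR:
        # Do not guess at an unknown layout; let the caller skip or warn.
        return ("unsupported-version", None)
    payload = buf[HEADER.size:HEADER.size + length]
    if len(payload) < length:
        return ("incomplete", None)
    return ("ok", payload)

same_major = decode(encode(0, 2, b"abc"))
newer_major = decode(encode(1, 0, b"abc"))
```

Treating minor versions as compatible and major versions as breaking is one common convention; the length field lets a receiver skip past a packet it cannot parse.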

Stage 5: Multi-Client & Recording

  • Server-side packet recording to disk (binary log replay)
  • Multi-client broadcast validation (currently untested)
  • Playback mode: load a recorded session without a training loop
  • Target: offline analysis and debugging
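
Recording and replay could be as simple as a length-prefixed binary log, sketched below. The record layout (timestamp plus payload length) and the function names are assumptions for illustration; the server's actual format is still to be designed.

```python
import io
import struct

# Assumed record header: f64 timestamp in seconds, u32 payload length.
RECORD_HEADER = struct.Struct("<dI")

def write_record(stream, timestamp, payload):
    """Append one packet to the log, prefixed with its timestamp and size."""
    stream.write(RECORD_HEADER.pack(timestamp, len(payload)))
    stream.write(payload)

def read_records(stream):
    """Yield (timestamp, payload) pairs until EOF or a truncated record."""
    while True:
        header = stream.read(RECORD_HEADER.size)
        if len(header) < RECORD_HEADER.size:
            return
        timestamp, length = RECORD_HEADER.unpack(header)
        payload = stream.read(length)
        if len(payload) < length:
            return  # log was cut off mid-record, e.g. by a crash
        yield timestamp, payload

# An in-memory stream stands in for a file opened in binary mode.
log = io.BytesIO()
write_record(log, 0.0, b"a")
write_record(log, 0.5, b"bc")
log.seek(0)
replayed = list(read_records(log))
```

Storing timestamps lets playback mode reproduce the original pacing or fast-forward, and stopping cleanly at a truncated record keeps logs from crashed runs usable.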

Stage 6: Documentation & Examples

  • Real PyTorch hook example (not just simulator)
  • Transformer-specific integration guide (Hugging Face, NanoGPT)
  • Video walkthrough of debugging a training run
  • Protocol specification as standalone doc

Labels

documentation (Improvements or additions to documentation), enhancement (New feature or request)