This project implements an instrumented causal language model wrapper and server that generates text token by token and captures a per-token trace of model internals.
- Instrumented Model: A wrapper around a Hugging Face causal language model that captures logits, hidden states, and attention matrices.
- Streaming API: A WebSocket-based API for real-time streaming of generated tokens and traces.
- Persistence: Session traces are saved to compressed artifacts for later analysis.
- Intervention API: An API for modifying previous generations and re-running them.
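A per-token trace record captured by the wrapper might look like the following sketch. The field names and shapes here are assumptions for illustration, not the project's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TokenTrace:
    """Hypothetical per-token trace emitted by the instrumented wrapper."""
    step: int                  # generation step index
    token_id: int              # sampled token id
    token_text: str            # decoded token
    logprob: float             # log-probability of the sampled token
    top_logits: List[float] = field(default_factory=list)    # top-k raw logits
    hidden_norms: List[float] = field(default_factory=list)  # per-layer hidden-state norms
    attn_entropy: List[float] = field(default_factory=list)  # per-layer attention entropy

# Example: a trace for one generated token
trace = TokenTrace(step=0, token_id=15496, token_text="Hello", logprob=-1.2)
print(trace.step, trace.token_text)
```

Storing summaries (norms, entropies) rather than full hidden-state and attention tensors keeps per-session artifacts small; the real wrapper may persist the full tensors instead.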
Clone the repository and install the required dependencies:

```bash
git clone https://github.com/your-username/instrumented-llm.git
cd instrumented-llm
pip install -r requirements.txt
```

Start the FastAPI server using Uvicorn:
```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000
```

To run the test suite, use the following command:
```bash
python -m unittest discover tests
```

Example request to the `/api/generate` endpoint:

```bash
curl -X 'POST' \
  'http://localhost:8000/api/generate' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "Hello, world!",
    "max_new_tokens": 10
  }'
```

To stream tokens instead of waiting for the full completion:

- Make a request to the `/api/generate` endpoint with `"stream": true` to get a WebSocket URL.
- Connect to the WebSocket URL to receive the streaming results.
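Each WebSocket message presumably carries one generated token plus its trace as JSON. The exact frame schema is an assumption here, but a minimal client-side parse might look like:

```python
import json

def parse_frame(raw: str) -> dict:
    """Parse one streamed WebSocket frame.

    Assumed (hypothetical) schema: {"token": str, "index": int, "done": bool}.
    """
    frame = json.loads(raw)
    # Minimal validation of the assumed fields
    if not isinstance(frame.get("index"), int):
        raise ValueError("frame missing integer 'index'")
    return frame

# Example frame as the server might stream it (schema is an assumption)
raw = '{"token": " world", "index": 1, "done": false}'
frame = parse_frame(raw)
print(frame["token"])  # -> " world"
```

In a real client this would run inside the receive loop of a WebSocket library (e.g. `websockets`), appending tokens until a frame with `"done": true` arrives.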
- `GET /api/session/{session_id}/metadata`: Get the metadata for a session.
- `GET /api/session/{session_id}/artifact`: Get the path to the session artifact.
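The compressed session artifact could, for instance, be a gzip-compressed JSON file of per-token traces. This round-trip sketch assumes that format; the project's real serialization may differ:

```python
import gzip
import json
import os
import tempfile

def save_artifact(path: str, traces: list) -> None:
    """Write session traces as gzip-compressed JSON (assumed format)."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump({"traces": traces}, f)

def load_artifact(path: str) -> list:
    """Read traces back from a compressed artifact."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)["traces"]

# Round-trip example with a single dummy trace
traces = [{"step": 0, "token": "Hello", "logprob": -1.2}]
path = os.path.join(tempfile.mkdtemp(), "session.json.gz")
save_artifact(path, traces)
print(load_artifact(path) == traces)  # -> True
```

Gzip-compressed JSON keeps the artifact human-inspectable (`zcat session.json.gz`) while shrinking the repetitive numeric trace data considerably.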