A self-improving harness framework built as a PyTorch extension. Composable tensor ops + trainable experience + autograd = portable Modules that improve through training.
The result: portable, self-improving agents that share the same infrastructure but learn different skills through training. Like Claude Code slash commands, but standardized — typed tensors, autograd, shape contracts, trainable experience.
An LLM-powered coder composed entirely from reusable ops — no custom logic except a termination validator. Two chained experts collaborate: Expert 1 decides WHAT to do, Expert 2 knows HOW to do it.
# [1, 30] live terminal text
capture_op = ft_tmux_capture_pane(workspace)
# [1, 30] → "text:hint"|"ctrl:hint"
decision_expert = ft_expert(capture_op, decision_exp)
# [1, 30] → "echo hello"|"Enter"
cmd_expert = ft_expert(decision_expert, cmd_exp)
# [1, 30] routes by prefix
switched_send = ft_switch(decision_expert, [("text",…),("ctrl",…)])
# [1, 30]
validator = ft_coder_validator(ft_sequential(switched_send, sleep, capture))
# [1] reduced
output = ft_recurrent(validator)
Every op except ft_coder_validator is shared infrastructure reusable across all harness models.
Instead of one monolithic expert + custom parsers:
# BAD: custom parsing, not reusable, not trainable
decision = parse_action(expert_output) # brittle regex
cmd = extract_command(expert_output) # another custom parser
Chain two experts — each with its own trainable experience:
# GOOD: composable, trainable, extensible
decision_expert = ft_expert(capture_op, decision_experience, topk=2) # WHAT
cmd_expert = ft_expert(decision_expert, cmd_experience, topk=2) # HOW
Expert 1's output feeds directly in as Expert 2's input. New action types don't require changing Expert 2 — just add experience entries to Expert 1.
# Decision expert: terminal observation → action_type:hint
decision_experience = make_tensor([
["no other text after the prompt", "command line is empty", "text:type a shell command"],
["prompt is followed by an echo command", "command line already has content", "ctrl:press Enter to execute"],
], tmpdir)
# Cmd expert: hint → real command
cmd_experience = make_tensor([
["text:type a shell command\ntask: list the directory", "need ls", "ls"],
["text:type a shell command\ntask: print hello", "need echo", "echo hello world"],
["ctrl:press Enter to execute", "press Enter", "Enter"],
], tmpdir)
These are [N, 3] QKV tensors — trainable via backward + StSGD. More experience = better generalization.
A harness Module is defined by:
- Experience tensors — the learned skill (trainable via autograd)
- Validator — the termination condition (harness-specific)
- Pipeline structure — identical across all modules (shared ops)
Same structure + different experience = different behavior:
| Module | Experience 1 | Experience 2 | Validator |
|---|---|---|---|
| Coder | terminal → action_type | hint → shell cmd | prompt returns to idle |
| Debugger | error → strategy | strategy → debug cmd | tests pass |
| Deployer | status → action | action → deploy cmd | service healthy |
python -m experience.future_tensor.test.test_llm_coder_simulator
Programs do three things: run steps in sequence, branch, and loop. We compact all three into typed tensor ops:
| Control flow | Op | What it does |
|---|---|---|
| Sequential | ft_sequential | Do A then B |
| Branch | ft_switch | If X then A else B |
| Loop | ft_recurrent | Repeat until validator passes |
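To make the mapping concrete, here is a minimal, framework-free sketch of the three control-flow shapes as higher-order functions over plain callables. The names mirror ft_sequential / ft_switch / ft_recurrent but this is illustrative only — the real ops operate on lazy FutureTensors, not Python functions.

```python
def sequential(*steps):
    """Do A then B: thread the value through each step in order."""
    def run(x):
        for step in steps:
            x = step(x)
        return x
    return run

def switch(router, branches):
    """If X then A else B: route by the prefix before ':' (as the coder pipeline does)."""
    def run(x):
        prefix, _, rest = router(x).partition(":")
        return dict(branches)[prefix](rest)
    return run

def recurrent(step, validator, max_iters=10):
    """Repeat until validator passes (bounded by max_iters)."""
    def run(x):
        for _ in range(max_iters):
            x = step(x)
            if validator(x):
                break
        return x
    return run
```

The point is that each construct is a value you compose, not a statement you write — which is what lets an LLM assemble agents by slot-filling.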
This makes agent generation trivial for LLMs. Instead of writing correct imperative code (hard — shared state, error handling, async coordination), the LLM composes a graph from ~10 typed ops (easy — slot-filling with shape verification). Three properties make this uniquely LLM-friendly:
- Structure vs behavior separation. The LLM generates the graph (structure) once. Behavior is determined by experience tensors, which self-improve through training. The LLM doesn't need to get behavior right at generation time.
- Shape contracts = static verification. If shapes don't match, the composition is wrong — detectable before execution. LLMs make fewer structural errors than behavioral errors.
- Fault-tolerant generation. Even poor seed experience gets corrected by training. The graph just needs to be structurally correct.
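A sketch of what "shape contracts = static verification" can look like: each op declares the shape it consumes and the shape it produces, so a composition can be type-checked before any LLM call runs. The `check_pipeline` helper and the declared-shape triples are hypothetical illustrations, not the project's actual API.

```python
def check_pipeline(ops, input_shape):
    """Walk the declared (name, in_shape, out_shape) contracts and
    fail fast if adjacent ops don't line up."""
    shape = input_shape
    for name, in_shape, out_shape in ops:
        if shape != in_shape:
            raise TypeError(f"{name}: expected shape {in_shape}, got {shape}")
        shape = out_shape
    return shape

# Mirrors the coder pipeline above: capture and experts preserve [1, 30],
# ft_recurrent reduces the last dim to [1].
coder_contracts = [
    ("capture",   [1, 30], [1, 30]),
    ("decision",  [1, 30], [1, 30]),
    ("recurrent", [1, 30], [1]),
]
```

A structurally wrong graph fails here, before execution — which is exactly the class of error LLMs rarely make.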
Each tensor element is a text file on disk. Numeric coefficients flow through standard PyTorch autograd while symbolic content (code, translations, commands) lives in files.
{relative_to}/{tensor_uid}/
├── shape # JSON: [2, 3]
├── storage/
│ ├── 0/data # Element at flat index 0
│ ├── 1/data # Element at flat index 1
│ └── 1/1/data # Multi-digit index 11
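A small sketch of reading and writing this layout, assuming (per the tree above) that a flat index is stored one decimal digit per directory level, so index 11 lives at `storage/1/1/data`. Helper names are illustrative, not the project's API.

```python
import json
from pathlib import Path

def write_shape(root, shape):
    """Write the JSON shape file at the tensor root."""
    Path(root).mkdir(parents=True, exist_ok=True)
    Path(root, "shape").write_text(json.dumps(shape))

def write_element(root, flat_index, text):
    """storage/<d>/<d>/.../data — one directory per decimal digit."""
    d = Path(root, "storage", *str(flat_index))
    d.mkdir(parents=True, exist_ok=True)
    (d / "data").write_text(text)

def read_element(root, flat_index):
    return Path(root, "storage", *str(flat_index), "data").read_text()
```

Because every element is an ordinary text file, entries are inspectable, diffable, and patchable with standard tools.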
An [N, 3] tensor where each row is (query, key, value):
- Query: semantic keywords for Jaccard similarity retrieval
- Key: source domain content
- Value: target domain content
Starts empty, populated during training. The backward pass computes diffs, the optimizer patches entries.
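A minimal sketch of top-k retrieval over an [N, 3] (query, key, value) table using Jaccard similarity on whitespace tokens, as described above. The real retrieval runs inside ft_expert / the StMoe function; this version exists only to show the mechanics.

```python
def jaccard(a, b):
    """Jaccard similarity over whitespace-separated token sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def retrieve(experience, query, topk=2):
    """Return the topk (query, key, value) rows whose query field
    best matches the incoming query."""
    scored = sorted(experience, key=lambda row: jaccard(row[0], query), reverse=True)
    return scored[:topk]
```

The retrieved rows are what the LLM sees as in-context exemplars, so training the experience tensor directly changes retrieval behavior.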
A FutureTensor is a scalar torch.Tensor monkey-patched with ft_* attributes — a reference to a pending computation. Elements materialize on-demand via LLM calls.
| Op | Purpose |
|---|---|
| ft_expert | Query + retrieval + LLM translation (chainable) |
| ft_switch | Lazy control-flow branch selection |
| ft_sequential | Sequential evaluation with early-return on error |
| ft_recurrent | Generate-validate retry loop (reduces last dim) |
| ft_tmux_* | Terminal session ops (create, send_text, send_ctrl, capture) |
| ft_slice / ft_unsqueeze | Shape manipulation with autograd |
| Channel | Carries | Computed by |
|---|---|---|
| Numeric | Float coefficients (bfloat16) | Standard autograd |
| Symbolic | Unified diffs in files | LLM compares actual vs expected |
Optimizer (StSGD) applies both: numeric SGD update + patch CLI applies diffs to experience files.
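One dual-channel step can be sketched as follows. This is a simplification under stated assumptions: the numeric channel is plain SGD on a scalar, and the symbolic channel computes a unified diff with `difflib` but then just rewrites the file — the real StSGD applies diffs through the patch CLI.

```python
import difflib
from pathlib import Path

def st_sgd_step(value, grad, lr, path, expected_text):
    """One dual-channel update: numeric SGD + simplified symbolic patch."""
    new_value = value - lr * grad                      # numeric channel: SGD
    actual = path.read_text()
    diff = "".join(difflib.unified_diff(               # symbolic channel: text diff
        actual.splitlines(keepends=True),
        expected_text.splitlines(keepends=True),
        fromfile="actual", tofile="expected"))
    path.write_text(expected_text)                     # stand-in for `patch` apply
    return new_value, diff
```

The diff is the symbolic analogue of a gradient: it records the direction the entry should move, and applying it is the optimizer step.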
The LLM reflects on its own reflections:
backward_dispatch_start = torch.ones((), dtype=torch.bfloat16, requires_grad=True)
anchored = need_reflection(input_ft, backward_dispatch_start)
output = model(anchored)
loss = ft_mean(output)
loss.backward(create_graph=True)
records = []
with dispatch_policy(TracePolicy(records)):
backward_dispatch_start.grad.backward()
# records: list of ReflectionRecord(fn, inputs, output, timestamp)
- Static DAG, not imperative orchestration — compose a graph of lazy tensors, then pull from the output
- Chain experts, don't parse — two chained ft_expert ops replace custom parsers
- No shared mutable state — coordination through tensor coordinates
- Observe the world directly — read live state, no shadow copies
- Self-improving — experience accumulates knowledge via backward + optimizer step
- Experience starts empty — learned entirely at runtime, cold-start supported
- LLM as compute kernel — replaces matrix multiplication with semantic reasoning
- Stage 0: Tmux-forced coding agent capture (immediate). Hook Claude Code (via MCP or shell wrapper) to route all bash commands through tmux sessions. Capture every terminal interaction as (observation, decision, command) triples and write them directly into experience tensors for harness model cold start. No harness model needed — pure data harvesting from real coding sessions.
- Adapter mechanism for existing coding agents. Two directions:
  - Agent → Harness Module: wrap Claude Code / OpenCode / OpenClaw / Hermes as a Harness Module — their tool interfaces become ft_* ops in the compute graph, gaining autograd and self-improvement for free.
  - Harness Module → Agent skill: export a trained Harness Module as a coding agent tool/skill — any agent can invoke it as a composable capability without knowing the internals.
- Ground-truth terminal interactions sharing. Two mechanisms:
- Capture streams hub: record live terminal sessions into standardized experience tensors via tmux capture. A central hub aggregates streams from multiple developers/agents into a shared experience pool.
- Learn from existing interactions: train harness agents on recorded ground-truth sessions — enabling experience transfer across agents. One agent's successful terminal interaction becomes another agent's seed experience.
- Meta tasks learning. For each new code repo, bootstrap experience through self-supervised tasks that require no human labels:
- Masked code reconstruction: mask a code region, train the agent to reconstruct it from surrounding context. Teaches code structure and local patterns.
- Docstring ↔ code: given a docstring, generate the implementation (and vice versa). Teaches intent-to-code mapping.
- Code coverage by tests: given a function, generate tests that maximize coverage. Teaches behavioral understanding.
- Runtime stack prediction: given a call site, predict the runtime call stack. Teaches control flow and dependency reasoning.
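The first of these tasks can be sketched as a pair generator: mask a line span and keep the original text as the reconstruction target. The `<MASK>` token and line-level masking are illustrative assumptions, not the project's actual scheme.

```python
def mask_region(source, start_line, end_line, token="<MASK>"):
    """Produce a (masked_code, target) self-supervised pair by replacing
    lines [start_line, end_line) with a single mask token."""
    lines = source.splitlines()
    target = "\n".join(lines[start_line:end_line])
    masked = "\n".join(lines[:start_line] + [token] + lines[end_line:])
    return masked, target
```

Sliding this over a repository yields unlimited labeled pairs with no human annotation, which is what makes the task suitable for per-repo bootstrapping.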
Four stages of increasing autonomy — each stage removes one human-in-the-loop dependency:
| Stage | Forward | Experience Update | Graph Update |
|---|---|---|---|
| 0 | Claude Code in tmux | captured terminal sessions → experience | manual graph |
| 1 | 0th reflection | coding agent directly | coding agent directly |
| 2 | 0th reflection | 1st reflection (autograd) | coding agent guided by 2nd reflection trace |
| 3 | 0th reflection | 1st reflection (autograd) | bootstrapped harness model guided by 2nd reflection trace |
Stage 0: Ground-truth capture for cold start. Force Claude Code (or any coding agent) to run all bash commands through tmux sessions instead of direct shell execution. Capture the full terminal output stream — every command typed, every response observed, every decision made. These captured sessions become ground-truth experience entries for harness model cold start: terminal observation → decision → command triples, written directly into experience tensors. No harness model runs yet — this stage purely harvests real-world coding interactions as training data.
Stage 1: Manual iteration. The harness runs forward (0th reflection). A coding agent (Claude Code, etc.) inspects results and directly edits experience tensors and the compute graph. This is the current working state — human-assisted improvement.
Stage 2: Self-improving experience, assisted graph evolution. Forward runs as before. The 1st derivative (backward pass) automatically updates experience via StSGD — no human needed for experience improvement. For graph structure changes, a coding agent reads the 2nd derivative trace (ReflectionRecord list) and decides which ops to add/remove/rewire.
Stage 3: Fully autonomous. Same as Stage 2, but the coding agent for graph updates is itself a trained harness model — a "meta-harness" whose experience is "how to improve compute graphs given 2nd derivative traces." The system bootstraps its own architecture search.
Current status: Stage 0 ready (tmux capture infrastructure works); Stage 1 in progress — forward pass, experience tensors, compute graph composition, speculative multi-action dispatch, and 2nd derivative trace recording all work. Experience updates and graph updates still require a coding agent. Stages 2 and 3 are TODO.
Base LLM weights carry two things: general capabilities (reasoning, reflection) and memories (facts, patterns, code idioms). Most of the parameters are memories.
This project externalizes memory into trainable experience tensors — retrievable, patchable, shareable, and composable. The base LLM only needs to be good at reasoning and reflection. Everything else lives in experience.
If Stage 3's self-referencing succeeds — a harness model improving its own compute graph via 2nd derivative traces — the system becomes an open-ended self-improver. Reasoning reflects on reasoning, experience accumulates without bound, and architecture evolves autonomously. That's the path.
Two LLM backends: raw_llm_api (OpenAI-compatible, lightweight) and coding_agent (Claude Agent SDK with file tools), dispatched concurrently via asyncio.gather. Autograd functions in symbolic_tensor/function/ implement StMoe (query → retrieval → LLM translate), attention, slicing (symlink views vs copies), stack, fork, merge, and edit-distance loss. Tensor utilities (tensor_util/) provide make_tensor, get_diff_tensor, patch_tensor (unified diffs, cold-start support), and Pythonic methods registered on torch.Tensor (st_pack, st_patch, st_get_diff, st_view_slicer[...], etc.).
pip install torch openai claude-agent-sdk Levenshtein
Requires Python 3.13+.
experience/
├── symbolic_tensor/
│ ├── function/ # Autograd Functions: st_moe, st_attention, st_stack, slice_*, merge, fork, copy, loss
│ ├── tensor_util/ # Symbolic tensor primitives: make, slice, assign, diff, patch, dense/sparse
│ ├── module/ # nn.Module wrappers: StMoeModule, WithDenseView
│ ├── optimizer/ # StSGD: dual-channel (numeric + symbolic patch) optimizer
│ ├── data_loader/ # Batch data loading from files
│ └── test/ # Integration tests
├── future_tensor/
│ ├── future_tensor.py # FutureTensor factory: lazy async scalar tensor with ft_* attributes
│ ├── status.py # Status tagged union (confidence, scbf, kContextOverflow, etc.)
│ ├── function/ # Autograd ops: slice, unsqueeze, recurrent, expert, switch, sequential, tmux
│ └── backward_dispatch/ # 2nd-derivative framework: policies, dispatcher, GradFn wrappers
├── llm_client/ # LLM backends: raw API (OpenAI-compatible) and coding agent (Claude SDK)
├── sparse_util/ # Sparse coordinate operations (transpose, convert)
├── fs_util/ # File system utilities (directory packing, path enumeration, text merger)
├── test/ # End-to-end tests and benchmarks
└── example/ # Training demos