A self-improving harness framework built as a PyTorch extension. Composable tensor ops + trainable experience + autograd = portable Modules that improve through training.
The result: portable, self-improving agents that share the same infrastructure but learn different skills through training. Like Claude Code slash commands, but standardized — typed tensors, autograd, shape contracts, trainable experience.
An LLM-powered coder composed entirely from reusable ops — no custom logic except a termination validator. Two chained experts collaborate: Expert 1 decides WHAT to do, Expert 2 knows HOW to do it.
# [1, 30] live terminal text
capture_op = ft_tmux_capture_pane(workspace)
# [1, 30] → "text:hint"|"ctrl:hint"
decision_expert = ft_expert(capture_op, decision_exp)
# [1, 30] → "echo hello"|"Enter"
cmd_expert = ft_expert(decision_expert, cmd_exp)
# [1, 30] routes by prefix
switched_send = ft_switch(decision_expert, [("text",…),("ctrl",…)])
# [1, 30]
validator = ft_coder_validator(ft_sequential(switched_send, sleep, capture))
# [1] reduced
output = ft_recurrent(validator)
Every op except ft_coder_validator is shared infrastructure reusable across all harness models.
Instead of one monolithic expert + custom parsers:
# BAD: custom parsing, not reusable, not trainable
decision = parse_action(expert_output) # brittle regex
cmd = extract_command(expert_output)    # another custom parser
Chain two experts — each with its own trainable experience:
# GOOD: composable, trainable, extensible
decision_expert = ft_expert(capture_op, decision_experience, topk=2) # WHAT
cmd_expert = ft_expert(decision_expert, cmd_experience, topk=2)    # HOW
Expert 1's output feeds directly as Expert 2's input. New action types don't require changing Expert 2 — just add experience entries to Expert 1.
# Decision expert: terminal observation → action_type:hint
decision_experience = make_tensor([
["提示符后面没有其他文字", "命令行为空", "text:输入shell命令"],
["提示符后面有echo命令", "命令行已有内容", "ctrl:按回车执行"],
], tmpdir)
# Cmd expert: hint → real command
cmd_experience = make_tensor([
["text:输入shell命令\n任务是列出目录", "需要ls", "ls"],
["text:输入shell命令\n任务是输出hello", "需要echo", "echo hello world"],
["ctrl:按回车执行", "按Enter", "Enter"],
], tmpdir)
These are [N, 3] QKV tensors — trainable via backward + StSGD. More experience = better generalization.
A harness Module is defined by:
- Experience tensors — the learned skill (trainable via autograd)
- Validator — the termination condition (harness-specific)
- Pipeline structure — identical across all modules (shared ops)
Same structure + different experience = different behavior:
| Module | Experience 1 | Experience 2 | Validator |
|---|---|---|---|
| Coder | terminal → action_type | hint → shell cmd | prompt returns to idle |
| Debugger | error → strategy | strategy → debug cmd | tests pass |
| Deployer | status → action | action → deploy cmd | service healthy |
python -m experience.future_tensor.test.test_llm_coder_simulator
Programs are built from three control-flow forms: sequence, branch, and loop. We compact all three into typed tensor ops:
| Control flow | Op | What it does |
|---|---|---|
| Sequential | `ft_sequential` | Do A then B |
| Branch | `ft_switch` | If X then A else B |
| Loop | `ft_recurrent` | Repeat until validator passes |
This makes agent generation trivial for LLMs. Instead of writing correct imperative code (hard — shared state, error handling, async coordination), the LLM composes a graph from ~10 typed ops (easy — slot-filling with shape verification). Three properties make this uniquely LLM-friendly:
- Structure vs behavior separation. The LLM generates the graph (structure) once. Behavior is determined by experience tensors, which self-improve through training. The LLM doesn't need to get behavior right at generation time.
- Shape contracts = static verification. If shapes don't match, the composition is wrong — detectable before execution (see the sketch after this list). LLMs make fewer structural errors than behavioral errors.
- Fault-tolerant generation. Even poor seed experience gets corrected by training. The graph just needs to be structurally correct.
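To make the second property concrete, here is a minimal sketch of shape-contract checking; the `Contract` type and `check_pipeline` helper are illustrative, not this repo's API.

```python
from dataclasses import dataclass

# Hypothetical sketch of static shape verification; not this repo's API.
@dataclass
class Contract:
    in_shape: list[int]   # shape the op consumes
    out_shape: list[int]  # shape the op produces

def check_pipeline(contracts: list[Contract]) -> None:
    """Raise before any LLM call runs if adjacent ops disagree on shape."""
    for a, b in zip(contracts, contracts[1:]):
        if a.out_shape != b.in_shape:
            raise ValueError(f"shape mismatch: {a.out_shape} vs {b.in_shape}")

# Mirrors the coder pipeline above: capture -> expert -> recurrent reduce.
check_pipeline([
    Contract([1, 30], [1, 30]),  # ft_tmux_capture_pane
    Contract([1, 30], [1, 30]),  # ft_expert
    Contract([1, 30], [1]),      # ft_recurrent reduces the last dim
])
```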
Each tensor element is a text file on disk. Numeric coefficients flow through standard PyTorch autograd while symbolic content (code, translations, commands) lives in files.
{relative_to}/{tensor_uid}/
├── shape # JSON: [2, 3]
├── storage/
│ ├── 0/data # Element at flat index 0
│ ├── 1/data # Element at flat index 1
│ └── 1/1/data # Multi-digit index 11
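The layout implies a simple lookup rule: flatten the multi-dimensional index, then take one directory per decimal digit. A sketch, assuming row-major flattening (the helper is ours, not part of the repo):

```python
from pathlib import Path

# Assumed convention: row-major flat index, one directory per decimal digit.
def element_path(root: Path, shape: list[int], index: list[int]) -> Path:
    flat = 0
    for dim, i in zip(shape, index):
        flat = flat * dim + i        # row-major flattening
    digits = "/".join(str(flat))     # e.g. flat index 11 -> "1/1"
    return root / "storage" / digits / "data"

print(element_path(Path("tensor_uid"), [2, 3], [0, 1]))  # .../storage/1/data
print(element_path(Path("tensor_uid"), [4, 3], [3, 2]))  # .../storage/1/1/data
```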
An [N, 3] tensor where each row is (query, key, value):
- Query: semantic keywords for Jaccard similarity retrieval
- Key: source domain content
- Value: target domain content
Starts empty, populated during training. The backward pass computes diffs, the optimizer patches entries.
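Jaccard retrieval itself is simple; the sketch below illustrates the idea with naive whitespace tokenization (the real op may tokenize and score differently):

```python
# Illustrative top-k Jaccard retrieval over (query, key, value) rows.
def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve(rows: list[tuple[str, str, str]], query: str, topk: int = 2):
    q = set(query.split())  # naive whitespace tokenization
    return sorted(rows, key=lambda r: jaccard(q, set(r[0].split())),
                  reverse=True)[:topk]

rows = [
    ("no other text after the prompt", "the command line is empty",
     "text:type a shell command"),
    ("an echo command after the prompt", "the command line already has content",
     "ctrl:press Enter to execute"),
]
print(retrieve(rows, "prompt with an echo command after it", topk=1))
```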
A FutureTensor is a scalar torch.Tensor monkey-patched with ft_* attributes — a reference to a pending computation. Elements materialize on-demand via LLM calls.
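Conceptually, the pattern looks like the sketch below; the attribute and helper names are illustrative, the real factory lives in future_tensor/future_tensor.py.

```python
import torch

# Conceptual sketch of a FutureTensor; attribute names are illustrative.
def make_future(thunk):
    t = torch.ones((), dtype=torch.bfloat16, requires_grad=True)  # scalar carrier
    t.ft_thunk = thunk    # pending computation (an LLM call in practice)
    t.ft_value = None     # symbolic result, filled on demand
    return t

def ft_materialize(t):
    if t.ft_value is None:
        t.ft_value = t.ft_thunk()  # pull once; later reads hit the cache
    return t.ft_value

fut = make_future(lambda: "echo hello")  # stand-in for an LLM-backed op
print(ft_materialize(fut))               # "echo hello"
```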
| Op | Purpose |
|---|---|
| `ft_expert` | Query + retrieval + LLM translation (chainable) |
| `ft_switch` | Lazy control-flow branch selection |
| `ft_sequential` | Sequential evaluation with early-return on error |
| `ft_recurrent` | Generate-validate retry loop (reduces last dim) |
| `ft_tmux_*` | Terminal session ops (create, send_text, send_ctrl, capture) |
| `ft_slice` / `ft_unsqueeze` | Shape manipulation with autograd |
| Channel | Carries | Computed by |
|---|---|---|
| Numeric | Float coefficients (bfloat16) | Standard autograd |
| Symbolic | Unified diffs in files | LLM compares actual vs expected |
Optimizer (StSGD) applies both: numeric SGD update + patch CLI applies diffs to experience files.
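One dual-channel step amounts to the sketch below; the function name, file paths, and exact `patch` invocation are assumptions, the real optimizer lives in symbolic_tensor/optimizer/.

```python
import subprocess
from pathlib import Path
import torch

# Illustrative dual-channel update; names and paths are assumptions.
def dual_channel_step(coeff: torch.Tensor, lr: float,
                      diff_text: str, entry_file: Path) -> None:
    # Numeric channel: plain SGD on the float coefficient.
    with torch.no_grad():
        coeff -= lr * coeff.grad
    coeff.grad = None
    # Symbolic channel: apply the unified diff produced by backward
    # to the experience file on disk via the `patch` CLI.
    subprocess.run(["patch", str(entry_file)],
                   input=diff_text, text=True, check=True)
```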
The LLM reflects on its own reflections:
second_derivative_start = torch.ones((), dtype=torch.bfloat16, requires_grad=True)
anchored = need_2nd_derivative(input_ft, second_derivative_start)
output = model(anchored)
loss = ft_mean(output)
loss.backward(create_graph=True)
records = []
with dispatch_policy(TracePolicy(records)):
second_derivative_start.grad.backward()
# records: list of ReflectionRecord(fn, inputs, output, timestamp)
- Static DAG, not imperative orchestration — compose a graph of lazy tensors, then pull from output
- Chain experts, don't parse — two chained `ft_expert` ops replace custom parsers
- No shared mutable state — coordination through tensor coordinates
- Observe the world directly — read live state, no shadow copies
- Self-improving — experience accumulates knowledge via backward + optimizer step
- Experience starts empty — learned entirely at runtime, cold-start supported
- LLM as compute kernel — replaces matrix multiplication with semantic reasoning
- Adapter mechanism for existing coding agents. Two directions:
  - Agent → Harness Module: wrap Claude Code / OpenCode / OpenClaw / Hermes as a Harness Module — their tool interfaces become `ft_*` ops in the compute graph, gaining autograd and self-improvement for free.
  - Harness Module → Agent skill: export a trained Harness Module as a coding agent tool/skill — any agent can invoke it as a composable capability without knowing the internals.
- Ground-truth terminal interactions sharing. Two mechanisms:
- Capture streams hub: record live terminal sessions into standardized experience tensors via tmux capture. A central hub aggregates streams from multiple developers/agents into a shared experience pool.
- Learn from existing interactions: train harness agents on recorded ground-truth sessions — enabling experience transfer across agents. One agent's successful terminal interaction becomes another agent's seed experience.
- Meta tasks learning. For each new code repo, bootstrap experience through self-supervised tasks that require no human labels:
- Masked code reconstruction: mask a code region, train the agent to reconstruct it from surrounding context (see the sketch after this list). Teaches code structure and local patterns.
- Docstring ↔ code: given a docstring, generate the implementation (and vice versa). Teaches intent-to-code mapping.
- Code coverage by tests: given a function, generate tests that maximize coverage. Teaches behavioral understanding.
- Runtime stack prediction: given a call site, predict the runtime call stack. Teaches control flow and dependency reasoning.
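As an example of how cheap these labels are, the first task's training pairs could be generated along these lines (the function and mask-token names are ours, not the repo's API):

```python
import random

MASK = "<MASK>"  # hypothetical mask token

def make_masked_pair(source: str, span: int = 3) -> tuple[str, str]:
    """Mask `span` consecutive lines; return (context_with_mask, target)."""
    lines = source.splitlines()
    start = random.randrange(max(1, len(lines) - span))
    target = "\n".join(lines[start:start + span])   # label: the masked region
    context = "\n".join(lines[:start] + [MASK] + lines[start + span:])
    return context, target

# Every file in a fresh repo yields self-supervised pairs, no human labels.
src = "def add(a, b):\n    # sum two ints\n    return a + b\n"
context, target = make_masked_pair(src, span=1)
```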
Three stages of increasing autonomy — each stage removes one human-in-the-loop dependency:
| Stage | Forward | Experience Update | Graph Update |
|---|---|---|---|
| 1 | 0th reflection | coding agent directly | coding agent directly |
| 2 | 0th reflection | 1st reflection (autograd) | coding agent guided by 2nd reflection trace |
| 3 | 0th reflection | 1st reflection (autograd) | bootstrapped harness model guided by 2nd reflection trace |
Stage 1: Manual iteration. The harness runs forward (0th reflection). A coding agent (Claude Code, etc.) inspects results and directly edits experience tensors and the compute graph. This is the current working state — human-assisted improvement.
Stage 2: Self-improving experience, assisted graph evolution. Forward runs as before. The 1st derivative (backward pass) automatically updates experience via StSGD — no human needed for experience improvement. For graph structure changes, a coding agent reads the 2nd derivative trace (ReflectionRecord list) and decides which ops to add/remove/rewire.
Stage 3: Fully autonomous. Same as Stage 2, but the coding agent for graph updates is itself a trained harness model — a "meta-harness" whose experience is "how to improve compute graphs given 2nd derivative traces." The system bootstraps its own architecture search.
Current status: Stage 1 is implemented — forward pass, experience tensors, compute graph composition, and 2nd derivative trace recording all work. Experience updates and graph updates still require a coding agent. Stages 2 and 3 are TODO.
Base LLM weights carry two things: general capabilities (reasoning, reflection) and memories (facts, patterns, code idioms). Most of the parameters are memories.
This project externalizes memory into trainable experience tensors — retrievable, patchable, shareable, and composable. The base LLM only needs to be good at reasoning and reflection. Everything else lives in experience.
If Stage 3's self-referencing succeeds — a harness model improving its own compute graph via 2nd derivative traces — the system becomes an open-ended self-improver. Reasoning reflects on reasoning, experience accumulates without bound, and architecture evolves autonomously. That's the path.
Two LLM backends: raw_llm_api (OpenAI-compatible, lightweight) and coding_agent (Claude Agent SDK with file tools), dispatched concurrently via asyncio.gather. Autograd functions in symbolic_tensor/function/ implement StMoe (query → retrieval → LLM translate), attention, slicing (symlink views vs copies), stack, fork, merge, and edit-distance loss. Tensor utilities (tensor_util/) provide make_tensor, get_diff_tensor, patch_tensor (unified diffs, cold-start support), and Pythonic methods registered on torch.Tensor (st_pack, st_patch, st_get_diff, st_view_slicer[...], etc.).
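The concurrent dispatch is the standard asyncio.gather pattern; the backend signatures below are placeholders, llm_client's real interfaces differ.

```python
import asyncio

# Placeholder backends; llm_client's real call signatures differ.
async def raw_llm_api(prompt: str) -> str:
    await asyncio.sleep(0.1)   # stands in for an OpenAI-compatible request
    return f"raw: {prompt}"

async def coding_agent(prompt: str) -> str:
    await asyncio.sleep(0.1)   # stands in for a Claude Agent SDK session
    return f"agent: {prompt}"

async def dispatch(prompt: str) -> list[str]:
    # Both backends run concurrently; results arrive as one list.
    return await asyncio.gather(raw_llm_api(prompt), coding_agent(prompt))

print(asyncio.run(dispatch("echo hello")))
```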
pip install torch openai claude-agent-sdk Levenshtein
Requires Python 3.13+.
experience/
├── symbolic_tensor/
│ ├── function/ # Autograd Functions: st_moe, st_attention, st_stack, slice_*, merge, fork, copy, loss
│ ├── tensor_util/ # Symbolic tensor primitives: make, slice, assign, diff, patch, dense/sparse
│ ├── module/ # nn.Module wrappers: StMoeModule, WithDenseView
│ ├── optimizer/ # StSGD: dual-channel (numeric + symbolic patch) optimizer
│ ├── data_loader/ # Batch data loading from files
│ └── test/ # Integration tests
├── future_tensor/
│ ├── future_tensor.py # FutureTensor factory: lazy async scalar tensor with ft_* attributes
│ ├── status.py # Status tagged union (confidence, scbf, kContextOverflow, etc.)
│ ├── function/ # Autograd ops: slice, unsqueeze, recurrent, expert, switch, sequential, tmux
│ └── second_derivative/ # 2nd-derivative framework: policies, dispatcher, GradFn wrappers
├── llm_client/ # LLM backends: raw API (OpenAI-compatible) and coding agent (Claude SDK)
├── sparse_util/ # Sparse coordinate operations (transpose, convert)
├── fs_util/ # File system utilities (directory packing, path enumeration, text merger)
├── test/ # End-to-end tests and benchmarks
└── example/ # Training demos