ercasta/harneskills

HarneSkills

A generic orchestration engine for narrow domains, driven by a probabilistic knowledge base that doubles as the grammar for both reasoning and language.


The Central Idea

HarneSkills is built on one observation: reasoning and language are the same operation — parsing over a grammar.

A production rule is a symbol, a context, and a probability distribution over complete replacements. This single object simultaneously describes a linguistic rule (how a structure expands), a causal rule (how a condition produces effects), and a logical inference rule (how a premise licenses conclusions). These are not analogies — they are the same formal object viewed from different directions.

The Knowledge Base is a probabilistic context-free grammar over domain concepts. Every "skill" is a production rule. All operations — NLP parsing, hypothesis formation, code generation, code understanding — are parsing over this grammar in different directions: top-down for generation and planning, bottom-up for recognition and diagnosis.

The grammar drives computation. Given an observed effect, the planner inverts the causal branch distribution to rank candidate causes as hypotheses. Given a goal, it applies production rules top-down to derive what must be built or fixed. There is one mechanism; direction and starting point vary.


Architecture — The Fetch-Execute Loop

The system runs as a fetch-execute loop: a dumb Python driver, a stateless Reasoner, and a Plan object as the program counter.

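plan = Plan()                      # program counter: carries all reasoning state
while True:
    # The Reasoner is stateless: it returns the next action and an updated Plan.
    action, plan = reasoner.step(world_model, kb, plan, goal)
    if action is None:
        break    # plan.status is "goal_reached" or "impasse"
    # The Dispatcher is the only component that crosses the tool boundary.
    result = dispatcher.call(action, world_model)
    world_model.apply(result)      # fold the mapped result back into session state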
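
To make the control flow concrete, here is a minimal runnable sketch of the same loop with stub components. ToyReasoner, ToyDispatcher, ToyWorldModel, the `causes` list on each KB entry, and the shape of Plan are hypothetical stand-ins chosen for illustration; they are not the project's classes.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    status: str = "running"
    trail: list = field(default_factory=list)  # actions chosen so far


class ToyWorldModel(dict):
    def apply(self, result):
        self.update(result)


class ToyReasoner:
    """Stateless: all state travels in the Plan it receives and returns."""

    def step(self, world_model, kb, plan, goal):
        if goal in world_model:
            return None, Plan("goal_reached", plan.trail)
        # Pick the first untried tool whose postconditions include the goal.
        for tool, rule in kb.items():
            if goal in rule["causes"] and tool not in plan.trail:
                return tool, Plan("running", plan.trail + [tool])
        return None, Plan("impasse", plan.trail)


class ToyDispatcher:
    def call(self, action, world_model):
        # Pretend that the only tool that does anything is "apply_fix".
        return {"tests_pass": True} if action == "apply_fix" else {}


def run(kb, goal):
    world_model = ToyWorldModel()
    reasoner, dispatcher = ToyReasoner(), ToyDispatcher()
    plan = Plan()
    while True:
        action, plan = reasoner.step(world_model, kb, plan, goal)
        if action is None:
            break
        result = dispatcher.call(action, world_model)
        world_model.apply(result)
    return plan
```

Because the stub Reasoner keeps nothing on `self`, a run can be replayed by passing the same world model, KB, Plan, and goal back into `step`.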

WorldModel — the unified foundation. Both the Domain Model (current session state) and the Knowledge Base (accumulated prior knowledge) are instances of the same structure: a graph of scoped, confidence-weighted, embedding-carrying facts with typed edges. Every fact records its source, provenance, and derivation. Contradictions are first-class — two conflicting facts under different scopes both survive as evidence. Facts live at the coarsest granularity useful for planning; finer structure is opaque Python inside Fact.value, extracted by tools only when a decision requires it.
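
As an illustration of scoped facts with first-class contradictions, the sketch below stores facts under a (key, scope) pair so that conflicting values from different scopes coexist. The Fact fields and the FactStore class are assumptions made for this example; embeddings and typed edges are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Fact:
    key: str
    value: Any          # opaque payload, inspected only by tools
    scope: str          # e.g. "static_analysis" or "runtime_trace"
    confidence: float
    source: str         # which tool or corpus produced the fact


class FactStore:
    def __init__(self):
        self._facts = {}  # (key, scope) -> Fact

    def add(self, fact):
        self._facts[(fact.key, fact.scope)] = fact

    def lookup(self, key):
        """All facts for a key across scopes, most confident first."""
        hits = [f for f in self._facts.values() if f.key == key]
        return sorted(hits, key=lambda f: -f.confidence)

    def contradictions(self, key):
        """Facts for a key whose values disagree; empty if they all agree."""
        hits = self.lookup(key)
        return hits if len({repr(f.value) for f in hits}) > 1 else []
```

Adding `Fact("return_type", "int", "static_analysis", 0.9, "mypy")` and `Fact("return_type", "str", "runtime_trace", 0.6, "pytest")` leaves both in the store, and `contradictions("return_type")` returns the pair as evidence rather than silently overwriting one of them.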

Knowledge Base — a probabilistic AND-OR hypergraph of production rules. OR-nodes are branch points (exactly one branch fires); AND-nodes are branch interiors (all targets co-occur). Context is encoded in the node name. Relation-type terminals (causes, requires, has_part, is_a, precedes, resolves, annotates) identify the reasoning lane. Tools are KB entries: each tool has causes-lane rules (postconditions) and requires-lane rules (preconditions), so the Reasoner discovers tools by navigating the KB.
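
The following sketch shows how tool discovery through the KB might look. The `(tool, lane)` key layout and the fact names are hypothetical; only the idea that each tool carries a causes-lane rule (postconditions) and a requires-lane rule (preconditions) comes from the description above.

```python
# Hypothetical KB fragment; the key layout is an assumption for this example.
KB = {
    ("locate_function", "causes"): {
        "branches": [{"targets": ["location_known"], "probability": 1.0}]},
    ("locate_function", "requires"): {
        "branches": [{"targets": ["symbol_named"], "probability": 1.0}]},
    ("apply_fix", "causes"): {
        "branches": [{"targets": ["source_patched"], "probability": 1.0}]},
    ("apply_fix", "requires"): {
        "branches": [{"targets": ["location_known", "pattern_known"],
                      "probability": 1.0}]},
}


def tools_producing(kb, effect):
    """Tools whose causes-lane rule can yield `effect`, most likely first."""
    found = []
    for (tool, lane), rule in kb.items():
        if lane != "causes":
            continue
        # OR-node: sum the probability of every branch containing the effect.
        p = sum(b["probability"] for b in rule["branches"]
                if effect in b["targets"])
        if p > 0:
            found.append((tool, p))
    return sorted(found, key=lambda tp: -tp[1])


def missing_preconditions(kb, tool, known_facts):
    """Requires-lane targets of `tool` that are not yet known."""
    rule = kb.get((tool, "requires"), {"branches": []})
    # AND-node: every target inside a branch must hold.
    needed = {t for b in rule["branches"] for t in b["targets"]}
    return sorted(needed - set(known_facts))
```

A planner built this way needs no tool registry of its own: asking which tools produce `source_patched` and what `apply_fix` still lacks are both ordinary KB lookups.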

Reasoner — a stateless parser over the KB grammar. Given the current WorldModel, KB, Plan, and goal, it returns the next action and an updated Plan. The Plan object carries all reasoning state between calls (hypotheses, scheduled actions, evidence trail). Hypothesis formation is bottom-up parsing: observing an effect, the Reasoner inverts the causes-lane distribution to rank candidate causes, then schedules diagnostic actions to confirm or disconfirm them.

Dispatcher — the only component that crosses the tool boundary. A function-call mediator: execute one action, run it through the two-layer inbound pipeline (mechanical adapter → heuristic mapper), and return the result. All tool outputs return through a mapper before anyone sees them.

Mappers — translate between tool-specific formats and the domain model. Two per tool: an outbound mapper (domain model + action → tool input) and an inbound mapper (tool output → domain model delta + typed residuals). When a mapper cannot resolve a slot, it emits a typed Residual rather than guessing.
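
A minimal sketch of an inbound mapper that follows this rule. The Residual fields, the `matches` output format of the imagined locate tool, and the slot names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Residual:
    slot: str
    kind: str         # "missing" or "ambiguous" in this sketch
    detail: str = ""


def inbound_map(tool_output):
    """Map a hypothetical locate tool's raw output to (delta, residuals)."""
    delta, residuals = {}, []
    matches = tool_output.get("matches", [])
    if len(matches) == 1:
        delta["location.file"] = matches[0]["file"]
        if "line" in matches[0]:
            delta["location.line"] = matches[0]["line"]
        else:
            residuals.append(Residual("location.line", "missing"))
    elif not matches:
        residuals.append(Residual("location.file", "missing", "no match"))
    else:
        # Several candidates: report the ambiguity instead of picking one.
        files = ", ".join(m["file"] for m in matches)
        residuals.append(Residual("location.file", "ambiguous", files))
    return delta, residuals
```

The mapper never picks the first of several matches. An ambiguous result yields an empty delta plus a Residual naming the candidates, which the planner can then route to a disambiguating action.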

Tools — analysis tools, transformation tools, reasoning tools. Internal tools (pure Python) and external tools (subprocess, API calls) are registered identically. The Reasoner discovers tools via their KB causes/requires rules.


Knowledge Representation — Production Rules

Every KB entry is a production rule: a named non-terminal (the KB key) and a probability distribution over complete alternatives (branches). The branch probabilities of every rule sum to 1.0. Context is encoded in the node name: distinct contexts get distinct names, and an impossible transition is represented by the absence of a rule.

# Cause with probabilistic effects
kb.put("cause.discount_exceeds_price", {
    "branches": [
        {"targets": ["negative_result"],  "probability": 0.85},
        {"targets": ["overflow_error"],   "probability": 0.15},
    ],
    "embedding": {"arithmetic": 0.9, "boundary_violation": 0.8},
})

# Concept with probabilistic decomposition
kb.put("component.homepage_with_news", {
    "branches": [
        {"targets": ["news_list", "hero_section", "auth"],
         "probability": 0.70, "embedding_delta": {"auth_required": 0.9}},
        {"targets": ["news_list", "hero_section"],          "probability": 0.20},
        {"targets": ["news_list"],                          "probability": 0.10},
    ],
    "embedding": {"public_facing": 0.9, "news_heavy": 0.8},
})

# Fix strategy with ranked templates
kb.put("fix.arithmetic_reduction", {
    "branches": [
        {"targets": ["add_result_guard"],
         "probability": 0.50, "embedding_delta": {"invasive": +0.1}},
        {"targets": ["clamp_result"],           "probability": 0.35},
        {"targets": ["add_precondition_check"],
         "probability": 0.15, "embedding_delta": {"invasive": +0.3}},
    ],
    "embedding": {"risky": -0.5, "reversible": 0.8, "localized": 0.7},
})

Embeddings are sparse named dimensions authored by domain experts — not dense neural vectors. Each dimension has explicit meaning ("risky", "reversible", "invasive", …). Similarity is dot product over shared dimensions.

Branch probabilities are priors; embeddings rescore them. At selection time, the parent node's embedding rescores its branch alternatives by alignment with the branch targets' embeddings — a Bayesian update inside each OR-node:

P(branch_i selected | parent) ∝  branch_i.probability          # corpus prior
                                × sim(parent.embedding,          # embedding likelihood
                                      branch_targets.embedding)

This is the mechanism for gradable qualities that resist symbolic encoding. "Which fix is more conservative?" has no symbolic answer, but the dot product over {"risky", "reversible", …} ranks the branches without a hard threshold. The objective function's soft dimension weights modulate the rescoring: raising the weight on the "risky" dimension steers selection toward conservative branches without converting the preference into a hard constraint.
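
The rescoring can be sketched in a few lines. This is an illustration under stated assumptions, not the engine's code: each branch is given a precomputed `embedding` for its targets, negative alignments are clamped to zero so that scores stay non-negative, and soft weights are applied by scaling the parent's dimensions. The embedding values in the usage note are invented.

```python
def sim(a, b):
    """Dot product over the dimensions two sparse embeddings share."""
    return sum(a[d] * b[d] for d in a.keys() & b.keys())


def rescore(parent_embedding, branches, weights=None):
    """Return normalized P(branch | parent) proportional to prior * similarity."""
    weights = weights or {}
    # Soft dimension weights scale the parent's embedding before comparison.
    weighted = {d: v * weights.get(d, 1.0) for d, v in parent_embedding.items()}
    scores = []
    for b in branches:
        alignment = max(sim(weighted, b["embedding"]), 0.0)  # assumption: clamp
        scores.append(b["probability"] * alignment)
    total = sum(scores)
    if total == 0:
        return [b["probability"] for b in branches]  # fall back to the priors
    return [s / total for s in scores]


parent = {"risky": -0.5, "reversible": 0.8}
branches = [
    {"targets": ["add_result_guard"], "probability": 0.50,
     "embedding": {"risky": -0.2, "reversible": 0.9}},
    {"targets": ["clamp_result"], "probability": 0.35,
     "embedding": {"risky": -0.6, "reversible": 0.9}},
]
default_ranking = rescore(parent, branches)
risk_averse_ranking = rescore(parent, branches, weights={"risky": 3.0})
```

With these made-up numbers, `add_result_guard` leads under default weights, while tripling the weight on "risky" moves `clamp_result` to the front. No branch is eliminated in either case; only the ordering changes.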

Embeddings propagate through derivations as attribute grammar equations: authored on each node, modified per branch via embedding_delta, synthesized bottom-up for anonymous/derived nodes, and used both for OR-node rescoring and for cross-entry retrieval ranking by the grounder.

The boundary between causal and structural entries blurs by design — structural decomposition is a form of causation; a causal violation entry is a structural decomposition of an error scenario. The unified shape prevents silos. Context is encoded in the node name: "homepage_with_news" and "homepage" are distinct nodes, independently updatable.

Branch probabilities are statistical estimates, not design constants. They are updated by EM (inside-outside over the AND-OR hypergraph) as corpora are read — the same algorithm used to train probabilistic context-free grammars.


How Reasoning Works

Reasoning is parsing. The planner runs the KB grammar in two directions:

Recognition (bottom-up) — diagnosis and hypothesis formation. When an effect is observed in the domain model, the planner looks up KB entries whose branches produce that effect, inverts the branch distribution, and ranks candidate causes as hypotheses. It then schedules the cheapest diagnostic actions to confirm or disconfirm each hypothesis in probability order, updating posteriors as evidence accumulates. This is exactly chart parsing run bottom-up: observed facts are the yield, KB rules are the grammar, hypotheses are parse trees.
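
Inverting the branch distribution amounts to an application of Bayes' rule: P(cause | effect) ∝ P(effect | cause) × P(cause). The sketch below assumes uniform cause priors unless a prior is supplied; the second KB entry (`cause.sign_flip_in_subtraction`) is invented for the example.

```python
def rank_causes(kb, effect, cause_priors=None):
    """Rank KB entries that can produce `effect` by posterior probability."""
    cause_priors = cause_priors or {}
    scores = {}
    for cause, rule in kb.items():
        # P(effect | cause): total probability of branches containing the effect.
        likelihood = sum(b["probability"] for b in rule["branches"]
                         if effect in b["targets"])
        if likelihood > 0:
            scores[cause] = likelihood * cause_priors.get(cause, 1.0)
    total = sum(scores.values())
    return sorted(((c, s / total) for c, s in scores.items()),
                  key=lambda cs: -cs[1])


KB_FRAGMENT = {
    "cause.discount_exceeds_price": {"branches": [
        {"targets": ["negative_result"], "probability": 0.85},
        {"targets": ["overflow_error"], "probability": 0.15},
    ]},
    # Hypothetical competing cause, not taken from the project.
    "cause.sign_flip_in_subtraction": {"branches": [
        {"targets": ["negative_result"], "probability": 0.40},
        {"targets": ["wrong_magnitude"], "probability": 0.60},
    ]},
}

hypotheses = rank_causes(KB_FRAGMENT, "negative_result")
```

Under uniform priors the posterior for `cause.discount_exceeds_price` is 0.85 / (0.85 + 0.40) = 0.68, so it would be tested first. Each confirmed or refuted diagnostic then changes the priors passed into the next call.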

Derivation (top-down) — planning and generation. When a goal is set, the planner applies production rules in descending probability order from the goal non-terminal, repeatedly expanding the frontier until all leaves are achievable actions or observable facts. Unresolved slots surface as typed residuals, which the planner plans around — scheduling ask_user for human-fillable gaps and LLM tools as a constrained last resort.
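
A greedy version of top-down derivation can be sketched as follows. It is a simplification: it takes only the most probable branch at each OR-node, does not backtrack, and assumes the grammar is acyclic. Residuals are represented here as plain symbol names rather than typed objects.

```python
def derive(kb, goal, achievable):
    """Expand `goal` top-down; return (leaf actions, unresolved symbols)."""
    leaves, residuals = [], []
    frontier = [goal]
    while frontier:
        symbol = frontier.pop(0)
        if symbol in achievable:
            leaves.append(symbol)
        elif symbol in kb:
            # OR-node: choose the branch with the highest probability.
            best = max(kb[symbol]["branches"], key=lambda b: b["probability"])
            # AND-node: every target of the chosen branch joins the frontier.
            frontier = list(best["targets"]) + frontier
        else:
            residuals.append(symbol)  # no rule and not achievable
    return leaves, residuals


KB_FRAGMENT = {
    "component.homepage_with_news": {"branches": [
        {"targets": ["news_list", "hero_section", "auth"], "probability": 0.70},
        {"targets": ["news_list", "hero_section"], "probability": 0.20},
        {"targets": ["news_list"], "probability": 0.10},
    ]},
}

leaves, residuals = derive(KB_FRAGMENT, "component.homepage_with_news",
                           achievable={"news_list", "hero_section"})
```

Here `auth` has neither a rule nor an achievable action, so it comes back as a residual instead of being dropped or guessed. A fuller planner would turn it into an `ask_user` step or try the next branch.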

The objective function (hard constraints, soft scoring dimensions, plausibility thresholds) steers derivation at every branch point. Hard constraints eliminate candidates before scoring; soft dimensions rank the admissible set; plausibility thresholds suspend implausible branches rather than eliminating them, allowing recovery if the leading hypothesis fails.
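
The three stages can be illustrated with a short sketch. The candidate fields (`name`, `touches_public_api`), the linear soft score, and the use of branch probability as the plausibility measure are assumptions made for this example.

```python
def steer(candidates, hard_constraints, soft_weights, plausibility_threshold):
    """Return (active, suspended) candidate names, each in ranked order."""
    # 1. Hard constraints eliminate candidates before any scoring.
    admissible = [c for c in candidates
                  if all(check(c) for check in hard_constraints)]

    # 2. Soft dimensions rank whatever remains.
    def soft_score(c):
        return sum(w * c["embedding"].get(dim, 0.0)
                   for dim, w in soft_weights.items())

    ranked = sorted(admissible, key=soft_score, reverse=True)

    # 3. Implausible branches are suspended, not deleted.
    active = [c["name"] for c in ranked
              if c["probability"] >= plausibility_threshold]
    suspended = [c["name"] for c in ranked
                 if c["probability"] < plausibility_threshold]
    return active, suspended


candidates = [
    {"name": "add_result_guard", "probability": 0.50,
     "embedding": {"invasive": 0.1}},
    {"name": "clamp_result", "probability": 0.35, "embedding": {}},
    {"name": "add_precondition_check", "probability": 0.15,
     "embedding": {"invasive": 0.3}, "touches_public_api": True},
]
no_api_changes = lambda c: not c.get("touches_public_api", False)

active, suspended = steer(candidates, [no_api_changes],
                          soft_weights={"invasive": -1.0},
                          plausibility_threshold=0.4)
```

With these inputs `add_precondition_check` is removed outright, `add_result_guard` is the only active candidate, and `clamp_result` sits in the suspended list, available for replanning if the leading fix fails.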


KB Bootstrapping — Corpus Reading as Rule Extraction

The KB is populated primarily by reading corpora, not by manual authoring. Every corpus — source code, documentation, type stubs, API specs, execution traces — is evidence from which production rules can be extracted. The process has two explicit steps:

  1. Rule extraction — parse the corpus against the current KB grammar (recognition direction); matched subgraphs confirm and refine existing entries; unmatched subgraphs are candidates for new rules.

  2. Probability optimization — run EM (inside-outside over the AND-OR hypergraph) to fit branch probabilities to the observed corpus distribution. This is standard PCFG training; the KB grammar is the model, the corpus parse trees are the training data.
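
Full inside-outside is beyond a short example, but its M-step is simple: once (expected) usage counts per branch are available, each rule's probabilities are re-estimated as normalized counts. The sketch below shows only that step. The `branch_counts` layout and the additive smoothing are assumptions; when every parse is unambiguous the counts are observed directly and this reduces to relative-frequency estimation.

```python
def reestimate(kb, branch_counts, smoothing=1.0):
    """M-step: refit each rule's branch probabilities to usage counts.

    branch_counts maps (rule_key, branch_index) to an observed or expected
    count. Additive smoothing keeps unseen branches above zero.
    """
    updated = {}
    for key, rule in kb.items():
        counts = [branch_counts.get((key, i), 0.0) + smoothing
                  for i in range(len(rule["branches"]))]
        total = sum(counts)
        updated[key] = {
            **rule,
            "branches": [{**b, "probability": c / total}
                         for b, c in zip(rule["branches"], counts)],
        }
    return updated


kb = {"fix.arithmetic_reduction": {"branches": [
    {"targets": ["add_result_guard"], "probability": 0.50},
    {"targets": ["clamp_result"], "probability": 0.35},
    {"targets": ["add_precondition_check"], "probability": 0.15},
]}}
counts = {("fix.arithmetic_reduction", 0): 6, ("fix.arithmetic_reduction", 1): 3}
refit = reestimate(kb, counts)
```

The input KB is left untouched and a new one is returned, which makes it easy to compare grammars before and after a corpus pass.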

The loop is self-improving: a richer KB parses more accurately, yielding higher-quality extraction, which further enriches the KB.

The NLP grammar is the same grammar. Because the parser's vocabulary is compiled from KB content at startup (entity names from KB entry keys, synonyms from optional KB entry fields), every production rule extracted from any corpus — code or documentation — also improves NLP parsing. There is no separate NLP training step. Reading documentation improves code understanding; reading code improves language parsing; both improve each other through the shared KB grammar.
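
A sketch of how such a vocabulary could be compiled at startup. The `synonyms` field name and the convention of stripping the `cause.` / `fix.` style prefix are assumptions; the source only says that entity names come from KB keys and synonyms from optional entry fields.

```python
def compile_vocabulary(kb):
    """Map lowercase surface forms to the set of KB keys they may denote."""
    vocab = {}
    for key, entry in kb.items():
        name = key.split(".", 1)[-1]  # drop a "cause." or "fix." style prefix
        surface_forms = [name.replace("_", " ")]
        surface_forms += list(entry.get("synonyms", []))
        for form in surface_forms:
            vocab.setdefault(form.lower(), set()).add(key)
    return vocab


kb = {
    "fix.arithmetic_reduction": {"synonyms": ["subtraction bug"]},
    "component.homepage_with_news": {},
}
vocab = compile_vocabulary(kb)
```

Because the vocabulary is derived rather than authored, any rule extracted from a corpus becomes visible to the parser the next time the vocabulary is compiled.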

Bootstrapping requires only a minimal seed: language primitives (arithmetic operators, built-in functions) derived once from type stubs and stdlib documentation. Everything above primitives is extracted, not hand-authored.


Partial Computation — The Discipline

Every tool in the system follows one discipline: produce output proportional to available input. If a slot is missing, emit a typed Residual for it. If a value is ambiguous, emit the ambiguity explicitly. Do not guess. Do not use silent defaults.

A confident wrong value propagates through the domain model undetected. An explicit residual is always preferable. Residuals are not errors — they are the primary mechanism by which the system communicates what it needs next, routes to resolution actions, and eventually decides to ask the user or declare an impasse.

Impasse is first-class output: a structured object carrying the full evidence trail — every decision, every heuristic, every KB entry consulted. It is not a crash; it is the debugging artifact that identifies exactly what to extend in the KB or the mapper heuristics.


Reference Domain — Python Bugfixing

The Python bugfixing domain is the first concrete domain built on HarneSkills. It ships alongside the harness as the reference implementation and development testbed.

What it does: given a failing test suite and a natural language symptom description, the system locates the bug, classifies the pattern, selects and applies a fix template, and validates the result — without LLM involvement in the planning or decision loop (LLMs are registered as constrained last-resort tools).

Concrete trace:

Input: "calculate_discount returns wrong values when discount > price"

1. NLP parser       → SyntacticFrame(predicate="symptom", subject="calculate_discount", ...)
2. Grounder         → symptom.type=constraint_violation, witness={discount > price}
3. Planner          → locate_function("calculate_discount")
4. Dispatcher       → location.file="pricing.py", location.line=14
5. Planner          → pattern_classify → pattern=arithmetic_reduction
6. Planner          → apply_fix(add_result_guard)  [KB prior 0.50 > clamp_result 0.35]
7. Dispatcher       → patched source written
8. Planner          → run_mypy → clean; pytest_verify → pass
9. Goal reached. KB updated with confirmed fix.

Every decision is auditable. Every step traces to the KB entry that drove it.

Patterns implemented: arithmetic_reduction, division_by_zero, sequence_index_error — each requiring only a catalog entry, a KB entry, and a rewriter; zero changes to the engine.


Status

| Milestone | Status |
|---|---|
| WorldModel (DM + KB unified, scoped, typed edges, serialization) | Complete |
| Dispatcher + mapper framework | Complete |
| Reasoner (beam search, KB-driven scoring, hard constraints) | Complete |
| Six bugfixing probes end-to-end (S0–S6) | Complete |
| Grounder (embedding-based, no string taxonomy) | Complete |
| Failed-fix recovery (candidate invalidation + replanning) | Complete |
| Standalone unification matcher + hit-rate on real repos | Complete: 62% favorable repo |
| Bottom-up composition (S5-comp) | Complete |
| In-loop recognizer replacing label classifier (S1) | Complete |
| KB schema refactor (ProductionRule, EmbeddingEquations, is_a) | Complete: spec v2.9 |
| Architecture refactor: fetch-execute loop + Reasoner + tools-as-KB-entries | Spec v3.0: tracks A/B/C defined |
| Generation spike: second domain, zero engine changes (S8) | Next |
| EM probability refinement from corpora | Post-S8 |

532 tests, all green (436 fast / 96 slow).


Documentation

| Document | Contents |
|---|---|
| docs/harness_arch_spec.md | Full architectural specification (v3.0), the single design reference |
| docs/implementation_handoff.md | Implementation record: findings, probe results, refactoring tracks A/B/C, next actions |

About

An "intelligent" harness for agentic systems
