Actions don't merely select among outcomes — they generate information about how the environment responds to interventions. This fundamental capacity creates an irreducible tension: the agent must balance exploiting what it already knows against exploring to improve its model. The optimal policy integrates both through a unified objective that jointly maximizes value and causal information yield.
Epistemic status: The structural claim — that the optimal policy jointly maximizes value and causal information — is discussion-grade, well-supported by convergent results in Bayesian RL, active inference, and information-directed sampling. The specific form of the combined objective, in particular the additive trade-off and the coefficient $\lambda$, is illustrative rather than derived.
TF-02 establishes that action-contingent observations carry interventional (Level 2) information that passive observation cannot provide. To quantify this, define the canonical causal information yield of an action $a$ under model state $M$:
[Definition (CIY-canonical)] $$\text{CIY}(a; M) = \mathbb{E}_{a' \sim q(\cdot \mid M)}\!\left[D_{\mathrm{KL}}\!\left(P(o \mid do(a), M) \,\big\|\, P(o \mid do(a'), M)\right)\right]$$
where $q(\cdot \mid M)$ is a reference (comparator) distribution over actions and $P(o \mid do(a), M)$ is the interventional outcome distribution under model state $M$.
Because it is an expectation of KL divergences, $\text{CIY}(a; M) \geq 0$: an action carries causal information exactly to the extent that its interventional outcome distribution differs from those of the comparator actions.
Dependence on the reference distribution. CIY is defined relative to the comparator $q$; values computed under different choices of $q$ are not comparable.
Default convention for empirical work. TFT adopts the policy-induced action distribution, the agent's own policy $\pi(\cdot \mid M)$, as the default $q$ unless otherwise stated.
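To make the definition concrete, here is a minimal sketch computing CIY for a hypothetical two-action, three-outcome environment. The interventional tables, the action names, and the uniform policy-induced comparator are all illustrative assumptions, not part of TFT itself:

```python
import math

# Hypothetical interventional outcome tables P(o | do(a), M) for a
# two-action, three-outcome environment (illustrative numbers).
P_do = {
    "probe_left":  [0.7, 0.2, 0.1],
    "probe_right": [0.1, 0.3, 0.6],
}

# Policy-induced reference distribution q(a' | M), the TFT default.
q = {"probe_left": 0.5, "probe_right": 0.5}

def kl(p, r):
    """D_KL(p || r) in nats for discrete distributions."""
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)

def ciy(a, P_do, q):
    """Canonical CIY: expected KL from P(o | do(a)) to P(o | do(a')), a' ~ q."""
    return sum(q[a2] * kl(P_do[a], P_do[a2]) for a2 in q)

for a in P_do:
    print(a, round(ciy(a, P_do, q), 4))
```

Note that the comparator includes $a' = a$ itself (contributing zero KL), matching the expectation in the definition; CIY is large exactly when an action's outcome profile is distinctive relative to what the policy would otherwise do.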
For diagnostic use with observational statistics, define a proxy:
[Definition — Auxiliary Proxy] $$\text{CIY}_{\text{proxy}}(a_{t-1}) = I(o_t; a_{t-1} \mid M_{t-1}) - I(o_t; a_{t-1} \mid \Omega_t, M_{t-1})$$
This proxy can be useful but is sign-indefinite in general and requires causal assumptions for interpretation. TFT treats it as a diagnostic only, never as a substitute for the canonical definition.
Sign-indefiniteness and the observation model. The second MI term, $I(o_t; a_{t-1} \mid \Omega_t, M_{t-1})$, conditions on the latent environment state $\Omega_t$: it can exceed the first term, driving the proxy negative, and it is estimable only through an observation model linking $\Omega_t$ to $o_t$.
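The sign-indefiniteness is easy to exhibit numerically. The sketch below assumes a hypothetical XOR channel (the outcome is determined jointly by the action and the latent state) and computes both MI terms of the proxy from an explicit joint table; the proxy comes out strictly negative:

```python
import math
from collections import defaultdict

def mi(pairs):
    """Mutual information (bits) of a discrete joint over (x, y) pairs."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in pairs.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in pairs.items() if p > 0)

# Hypothetical joint p(a, omega, o) with o = a XOR omega and a, omega
# independent fair bits: action and latent state jointly fix the outcome.
joint = {(a, w, a ^ w): 0.25 for a in (0, 1) for w in (0, 1)}

# First proxy term, I(o; a): marginalize out omega.
p_ao = defaultdict(float)
for (a, w, o), p in joint.items():
    p_ao[(a, o)] += p
i_oa = mi(p_ao)

# Second proxy term, I(o; a | omega): omega-weighted average of the MI
# of each conditional joint p(a, o | omega).
p_w = defaultdict(float)
for (a, w, o), p in joint.items():
    p_w[w] += p
i_oa_given_w = 0.0
for w0, pw in p_w.items():
    cond = defaultdict(float)
    for (a, w, o), p in joint.items():
        if w == w0:
            cond[(a, o)] += p / pw
    i_oa_given_w += pw * mi(cond)

proxy = i_oa - i_oa_given_w
print(i_oa, i_oa_given_w, proxy)  # 0.0 1.0 -1.0
```

Marginally the action tells you nothing about the outcome, yet conditional on the latent state it tells you everything, so the proxy is $-1$ bit: a clean illustration of why it cannot be read as a causal information measure without further assumptions.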
Why this is non-trivial:
Maximizing causal information yield is exactly what good exploration does — the "investigating" function of Feldbaum's dual control¹: choosing actions whose consequences are maximally informative about the causal structure of the environment. Note that the "environment" may include other agents whose models can be queried. When a query action elicits a response from a knowledgeable source, the CIY can be extremely high — the response is causally downstream of the query and carries pre-compressed information about the environment that would require many direct probes to reconstruct. See the Query Actions section below.
To use CIY rigorously in the unified policy objective (below) and in exploration arguments throughout, we distinguish three regimes:
Regime A — Randomized interventions. When actions are randomized (or otherwise exogenous conditional on the model state $M_{t-1}$), the interventional distributions $P(o \mid do(a), M)$ coincide with the observed action-conditional outcome frequencies, so CIY is directly estimable from the agent's own interaction data without further causal assumptions.
Regime B — Observational with causal assumptions. When the agent cannot freely vary its actions (policy-constrained, nearly passive, or observational), CIY estimation requires additional causal assumptions: a known DAG structure, instrumental variables, or functional form assumptions about confounding. In this regime, CIY estimates carry an identifiability qualifier — they are valid only to the extent that the causal assumptions hold. CIY is conditionally estimable; results that depend on it inherit the causal assumptions.
Regime C — Adversarial communication. When the observation channel includes responses from potentially adversarial sources, the agent's own CIY (from its action of querying) remains non-negative — the query causally generates a response. However, the content of that response may be designed to increase the agent's model-reality mismatch rather than decrease it. This is not "negative CIY" — the mutual information between query and response is still positive — but rather adversarial exploitation of the agent's update channel. The adversary injects disturbance through the content of the response: the channel still transmits information, but updating on that content moves the model away from, rather than toward, the environment.
For TFT, downstream equations treat CIY as regime-qualified: Regime A values are directly estimable, Regime B values inherit their causal assumptions, and Regime C values must be paired with an assessment of source trust.
Before incorporating CIY into policy objectives or empirical analyses, verify:
- Action variation exists. The agent must actually vary its actions across comparable model states. An agent that always takes the same action in a given state has CIY $\approx 0$ for decision purposes — there is no interventional contrast.
- Regime is identified. Regime A (randomized/varied actions) permits direct CIY estimation. Regime B (observational) requires explicit causal assumptions — state these. Regime C (adversarial sources) requires separating the information channel from the disturbance channel.
- Reference distribution $q$ is specified. CIY values are not comparable across different $q$ choices. Use the policy-induced default unless otherwise justified (see above).
- Stationarity holds locally. CIY estimation requires that the model state $M_{t-1}$ and the environment dynamics are approximately stationary over the estimation window. Under rapid drift, CIY estimates are stale.
If any of these conditions fail, CIY-based exploration terms in the policy objective should be dropped or replaced with simpler uncertainty-based heuristics (e.g., UCB-style bonuses based on visit counts or ensemble disagreement).
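A minimal sketch of such a fallback, assuming per-action value estimates and visit counts are available. The constant `c` and the statistics are illustrative tuning choices, not derived from TFT:

```python
import math

def ucb_bonus(visit_count, total_steps, c=1.0):
    """UCB-style exploration bonus from visit counts: the fallback
    exploration term when CIY preconditions fail. c is a tuning constant."""
    return c * math.sqrt(math.log(max(total_steps, 2)) / max(visit_count, 1))

# Hypothetical per-action statistics: action -> (estimated value, visit count).
stats = {"a1": (1.0, 100), "a2": (0.8, 3), "a3": (0.9, 10)}
t = sum(n for _, n in stats.values())

scores = {a: v + ucb_bonus(n, t) for a, (v, n) in stats.items()}
best = max(scores, key=scores.get)
print(best)  # "a2": the rarely tried action wins on its uncertainty bonus
```

Unlike CIY, this bonus needs no interventional distributions and no regime identification; it only assumes that rarely tried actions are worth revisiting.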
The canonical CIY definition uses interventional outcome distributions. This raises the practical question: when can those distributions be estimated from observable quantities? (For the proxy, the additional dependence on the latent state $\Omega_t$ raises further estimation issues; see the sign-indefiniteness note above.)
With interventional data (the typical case for active agents). An agent that varies actions across comparable model states and observes resulting outcomes can estimate $P(o \mid do(a), M)$ directly: each action taken is an intervention, so outcome frequencies conditional on action and model state are estimates of the interventional distributions themselves.
With observational data only (passive or constrained agents). When the agent cannot freely vary actions, interventional distributions are not directly observed. CIY estimation then requires additional assumptions. Specifically, it requires either:
- A causal graph (DAG) with known structure, enabling do-calculus adjustment
- Instrumental variables that affect actions but influence outcomes only through the action pathway
- Assumptions about the functional form of confounding
Without such assumptions, CIY is not identifiable from observational data alone. This is Pearl's fundamental insight restated in our formalism: you cannot learn causal structure from correlations without either experiments or causal assumptions. In this setting, CIY estimates carry the Regime B identifiability qualifier: they are valid only to the extent that the assumed causal structure holds.
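Under the first of these assumptions (a known DAG with a fully observed confounder `z`), the backdoor adjustment recovers the interventional distribution from observational frequencies. The joint table below is illustrative:

```python
from collections import defaultdict

# Observational joint p(z, a, o) with z a known confounder (Regime B:
# identifiability rests on the assumed DAG z -> a, z -> o, a -> o).
joint = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.10, (0, 1, 0): 0.05, (0, 1, 1): 0.15,
    (1, 0, 0): 0.02, (1, 0, 1): 0.08, (1, 1, 0): 0.10, (1, 1, 1): 0.30,
}

def do_adjust(joint, a_val):
    """Backdoor adjustment: P(o | do(a)) = sum_z P(z) * P(o | a, z)."""
    p_z, p_za, p_zao = defaultdict(float), defaultdict(float), defaultdict(float)
    for (z, a, o), p in joint.items():
        p_z[z] += p
        p_za[(z, a)] += p
        p_zao[(z, a, o)] += p
    result = defaultdict(float)
    for (z, a, o), p in p_zao.items():
        if a == a_val and p_za[(z, a)] > 0:
            result[o] += p_z[z] * p / p_za[(z, a)]
    return dict(result)

print(do_adjust(joint, 1))  # interventional outcome distribution
```

The adjusted distribution generally differs from the naive conditional $P(o \mid a)$; the gap between them is exactly the confounding the causal assumption lets you remove.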
Practical implication. Active agents have access to interventional data as a consequence of acting. The quality of CIY estimation depends on action diversity (exploration) and local stationarity of model state during estimation. An agent that always takes the same action in a given state cannot identify action-conditioned effects and effectively has CIY near zero for decision purposes. This provides an information-theoretic argument for exploration that complements TF-05's mismatch argument.
The exploration-exploitation tension suggests a single objective that the optimal policy maximizes:
[Discussion — Normative Objective] $$a^* = \arg\max_{a} \Big( \mathbb{E}\left[V \mid do(a), M\right] + \lambda \cdot \text{CIY}(a; M) \Big)$$
The first term is the exploitation objective — expected value given the current model. The second term is the exploration objective — the causal information yield of the action (defined above as an expectation over the comparator distribution $q$).
- When $U_M$ is high (model uncertain): $\lambda$ is large — exploration is valuable because the model has much to learn.
- When $U_M$ is low (model confident): $\lambda$ is small — exploitation dominates because the model is (probably) already good.
- When the time horizon is long: $\lambda$ should be larger — the information gained now compounds over many future decisions.
- When $\rho$ is high (fast-changing environment): $\lambda$ should be larger — the model is perpetually uncertain because the environment keeps changing (connecting to TF-10).
Note on dimensional consistency and status. The two terms in this objective have different natural units: the first is in value units (reward, utility, cost), the second in information units (bits, nats). The coefficient $\lambda$ therefore carries units of value per unit information: it is an exchange rate that prices information in value terms, and TFT treats its specific form as illustrative rather than derived.
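A toy implementation of the objective, assuming illustrative expected values, interventional tables, and a uniform policy-induced comparator; `lam` plays the role of the information price $\lambda$ (value units per nat):

```python
import math

# Illustrative inputs: "exploit" is high-value but causally bland;
# "probe" is lower-value but has a distinctive interventional signature.
P_do = {"exploit": [0.8, 0.15, 0.05], "probe": [0.4, 0.3, 0.3]}
value = {"exploit": 1.0, "probe": 0.6}   # E[V | do(a), M], value units
q = {"exploit": 0.5, "probe": 0.5}       # policy-induced comparator

def kl(p, r):
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)

def ciy(a):
    return sum(q[a2] * kl(P_do[a], P_do[a2]) for a2 in q)

def best_action(lam):
    """lam is the information price: value units per nat of CIY."""
    return max(P_do, key=lambda a: value[a] + lam * ciy(a))

print(best_action(0.0))   # "exploit": pure exploitation
print(best_action(10.0))  # "probe": information priced highly enough to flip
```

Sweeping `lam` from zero upward reproduces the qualitative behavior above: below some threshold the agent exploits, above it the causally informative action wins.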
| Domain | What plays the role of $\lambda$ | Status |
|---|---|---|
| Bayesian bandits | Gittins index (implicit information price derived from dynamic programming) | Exactly derived |
| Kalman dual control | Probing cost in expected quadratic objective (both terms in cost units) | Exactly derived |
| Active inference | Precision on epistemic affordance (both terms in free-energy units) | Framework-derived |
| Information-directed sampling | Ratio of squared value-of-information to information gain | Exactly derived (Russo & Van Roy) |
| RL with UCB | Confidence-bound scaling | Heuristic (tuned) |
| Human decision-making | Not explicit; manifests as curiosity drive vs. reward seeking | Empirical |
The pattern: every framework that handles exploration well puts a price on information, implicitly or explicitly; $\lambda$ appears under a different name in each row.
Note on CIY estimability. This objective requires the agent to evaluate $\text{CIY}(a; M)$ for candidate actions, so it inherits the regime qualifications above: directly estimable under Regime A, assumption-dependent under Regime B, and trust-dependent under Regime C.
Connection to active inference. The Free Energy Principle's "expected free energy" objective decomposes into an extrinsic value term (pragmatic, goal-directed) and an epistemic value term (information-seeking). The TFT formulation above is structurally isomorphic: expected value ≈ extrinsic value, expected CIY ≈ epistemic value. Whether this convergence is deep (both are instances of the same mathematical principle) or superficial (similar-looking objectives from different foundations) is an open question. The TFT formulation has the advantage of grounding exploration in explicitly causal information rather than in entropy reduction, which may be more precise — not all uncertainty reduction is equally valuable; causal information is specifically what enables better intervention, not merely better prediction.
Action selection faces a fundamental tension:
Exploit: Choose the action with the highest expected value under the current model.
Explore: Choose the action with the highest expected causal information yield.
Exploitation uses the model as-is. Exploration tests the model.
The optimal balance depends on:
- Model uncertainty ($U_M$ high → explore more)
- Time horizon (long → invest in exploration early)
- Cost of exploration (high → explore cautiously)
- Mismatch history (persistent $\delta \neq 0$ → investigate the source)
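These dependencies can be sketched as a toy weighting function. The functional form below is a hypothetical choice made purely for illustration — TFT derives no such formula — but it respects the qualitative directions listed above:

```python
import math

def exploration_weight(u_m, horizon, cost, mismatch, base=1.0):
    """Illustrative weight: grows with model uncertainty u_m in [0, 1],
    with remaining horizon, and with persistent mismatch magnitude;
    shrinks with exploration cost. Hypothetical functional form."""
    return base * (u_m + abs(mismatch)) * math.log(1 + horizon) / (1 + cost)

# confident model, short horizon, costly probing, no mismatch
low = exploration_weight(u_m=0.05, horizon=5, cost=2.0, mismatch=0.0)
# uncertain model, long horizon, cheap probing, persistent mismatch
high = exploration_weight(u_m=0.8, horizon=1000, cost=0.1, mismatch=0.3)
print(low < high)  # True
```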
This connects directly to the zero-mismatch ambiguity (TF-05): an agent that only exploits will tend toward confirmation bias — observing only what its model already explains. Exploration is the mechanism by which the agent actively tests its model, converting the ambiguous case (b) in TF-05 into genuine signal.
Actions don't merely affect the environment — they generate information that passive observation cannot provide. TF-02 establishes three levels of epistemic access grounded in the causal structure of the feedback loop: associational (Level 1), interventional (Level 2), and counterfactual (Level 3).
Action selection is what makes Levels 2 and 3 available. By choosing to act and then observing consequences, the agent generates causal information yield (defined above) — information about how the environment responds to interventions, not merely about correlations. This is why the feedback loop (action → observation → update) is more powerful than passive observation (observation → update) alone.
The exploration-exploitation trade-off (above) is fundamentally about how much causal information yield the agent seeks: exploitation maximizes predicted value at Level 1; exploration maximizes expected causal information yield at Level 2.
The discussion above — and the theory's examples throughout — has implicitly framed information-generating actions as direct environment probes: do something to the world, observe the result, update the model. But there is a qualitatively different class of actions with distinctive properties: querying another agent's model.
When a reliable external model exists — an expert, a database, a reference text, a trustworthy advisor, a well-trained LLM — the action "ask a well-formed question" can yield information equivalent to thousands of probe-observe cycles. The classic illustration: asked to measure a building's height with a barometer, one can drop it from the roof and time the fall, measure shadow ratios, swing it as a pendulum at the top and bottom — all Level 2 probes of the physical environment. Or one can offer the barometer to the building's janitor in exchange for the answer — accessing information that already exists compressed in another agent's model.
Information density. A single well-targeted query to a knowledgeable source can carry CIY orders of magnitude higher than any individual environment probe. The source's model has already performed the compression work (TF-03) — extracting the relevant sufficient statistic from a vast interaction history the querying agent has never had. The response transfers the output of that compression rather than requiring the agent to reconstruct it from scratch.
Trust-dependent gain. The update gain applied to a query response should be modulated by the agent's estimate of source reliability $S$: a response from a high-trust source warrants a large update, while the same response from an unknown or unreliable source warrants little. Unlike direct probes, whose evidential weight the agent controls, query responses are only as valuable as the agent's model of the source is accurate.
Pre-compressed information. Environmental probes return raw observations that must be compressed through the agent's own model. Query responses arrive already compressed in the source's representational framework. This is why they're so information-dense — but it also introduces a translation cost when the source's representation doesn't align with the agent's. An expert's answer may be incomprehensible to a novice not because the information is absent but because the agent lacks the representational capacity (TF-07's model class constraint) to decode it.
Structural adaptation via external models. Query actions can trigger not just parametric updates but structural change. Encountering another agent's model — through conversation, reading, apprenticeship, or consultation — is one of the primary mechanisms by which agents acquire new representational frameworks. This connects to TF-10's "grafting" mechanism: incorporating external structure rather than building it de novo. Boyd's thought experiment specifically illustrates cross-domain recombination — freeing components from their native domains and reassembling them into novel model structures.
Implications for optimal action selection. When high-CIY query channels are available, the unified policy objective (above) will tend to favor query actions over direct probes, particularly when:
- The agent's own $U_M$ is high (it has much to learn)
- A trusted source with high $S$ for the relevant domain is accessible
- The cost of querying (social, monetary, time) is low relative to the cost of direct probing
- The information needed is about structure (requiring many probes to reconstruct) rather than about the agent's specific situation (which only the agent can observe)
Direct probing remains essential when the information needed is situational (no external model has access to the agent's specific environment state), when no trustworthy source exists, or when the agent needs to verify claims rather than accept them. The optimal agent uses both channels, allocating actions to whichever has higher expected CIY per unit cost.
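The allocation rule can be sketched as scoring each channel by trust-discounted expected CIY per unit cost. The channel names, the numbers, and the multiplicative trust discount are all illustrative assumptions:

```python
# Channel menu: name -> (expected CIY in nats, cost, trust S in [0, 1]).
channels = {
    "direct_probe": (0.2, 1.0, 1.0),  # self-generated data: fully trusted
    "ask_expert":   (3.0, 0.5, 0.9),  # dense, costly, high trust
    "ask_forum":    (3.0, 0.5, 0.4),  # equally dense, same cost, lower trust
}

def yield_per_cost(name):
    """Trust-discounted expected information yield per unit cost."""
    ciy, cost, trust = channels[name]
    return trust * ciy / cost

best = max(channels, key=yield_per_cost)
print(best)  # "ask_expert": trust-discounted density beats raw probing
```

The same arithmetic shows when probing wins: drive the source's trust low enough, or the query cost high enough, and the direct channel regains the top score.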
Query actions have a dark mirror in adversarial settings. The same communicative channel that enables cooperative information transfer — where one agent's response improves another's model — can be exploited to degrade the opponent's model. This is the domain of game theory, information warfare, and strategic communication.
Deception as adversarial disturbance injection. A cooperative query yields positive CIY — the response genuinely improves the agent's model. A deceptive response also yields positive CIY in the strict information-theoretic sense (the mutual information between query and response is still non-negative). But the content of the response is designed to increase rather than decrease model-reality mismatch: the agent updates its model in a direction that moves it away from the true environment state. The update gain that serves the agent well under honest communication becomes the attack surface: the more trust-weighted credence the agent gives the source, the larger the error a crafted response can induce.
Active OODA loop interference. A central theme in Boyd's work — and in strategic thought at least since Sun Tzu — is that an adversary can do more than merely adapt faster (TF-10); it can actively interfere with the opponent's feedback loop: generating ambiguous or contradictory signals to degrade the opponent's Orient phase, creating false patterns to induce model errors, manipulating the information environment to increase the opponent's model-reality mismatch faster than the opponent's feedback loop can correct it.
Trust as a meta-model under adversarial pressure. In adversarial settings, the agent's model of source reliability becomes a critical adaptive target in its own right. An adversary who can corrupt this meta-model — making the victim trust unreliable sources or distrust reliable ones — achieves a second-order attack: not just injecting bad information, but compromising the victim's capacity to evaluate information. This connects to the effects spiral (Corollary A.3.1): once an agent's trust calibration degrades, it becomes increasingly vulnerable to further deception, creating a positive-feedback loop in model corruption.
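One minimal sketch of trust calibration, assuming a Beta posterior over source reliability updated by whether past claims later checked out against direct observation. This is an illustrative model, not TFT's own calibration rule:

```python
class TrustModel:
    """Beta posterior over a source's reliability S (illustrative sketch)."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha  # verified-correct claims (plus prior pseudocount)
        self.beta = beta    # verified-wrong claims (plus prior pseudocount)

    @property
    def s(self):
        """Point estimate of source reliability S."""
        return self.alpha / (self.alpha + self.beta)

    def observe(self, claim_checked_out: bool):
        """Update on one claim verified against direct observation."""
        if claim_checked_out:
            self.alpha += 1
        else:
            self.beta += 1

    def update_gain(self, base_gain=1.0):
        """Scale the gain applied to this source's responses by S."""
        return base_gain * self.s

source = TrustModel()
for ok in [True, True, False, True, True]:
    source.observe(ok)
print(round(source.s, 3), round(source.update_gain(), 3))
```

The second-order attack described above targets exactly the `observe` channel: if the adversary can control which claims get "verified", the posterior itself is corrupted, and the gain it gates becomes unreliable.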
Symmetry with cooperative query actions. The same formal structure — communicative actions with trust-dependent gain operating on another agent's model — encompasses both the cooperative case (teaching, consulting, honest signaling) and the adversarial case (deception, disinformation, strategic ambiguity). What differs is alignment: whether the source's response is optimized to improve or to degrade the receiver's model. The game-theoretic literature on cheap talk, signaling games, and mechanism design addresses when honest communication is incentive-compatible — a question TFT does not attempt to answer but whose relevance it makes formally precise.
Multi-agent and game-theoretic extensions. Appendix F extends the query-action framework to settings with multiple interacting agents, where each agent is simultaneously a potential source and a potential querier and communication itself becomes a strategic action.
| Domain | Exploration: direct probing | Exploration: query actions |
|---|---|---|
| Kalman + LQR | Dual control (rare) | — (no external models) |
| RL | ε-greedy, UCB, Thompson | Imitation learning, reward shaping from demonstrations |
| PID | Perturbation testing | Consulting plant specifications |
| Boyd's OODA | Probing, feints, recon | Intelligence gathering, interrogation, liaison with allies |
| Organism | Play, curiosity, foraging | Social learning, asking, observing experts |
| Organization | R&D, experiments, pilots | Hiring consultants, benchmarking, acquiring companies |
| Science | Experimentation | Literature review, peer consultation, conference attendance |
| Immune | Random antibody generation | Maternal antibodies, microbiome signaling |
Footnotes
1. Feldbaum, A. A. (1960). "Dual control theory I–IV." Avtomatika i Telemekhanika, 21(9).