fix(engine): wire campaign framing into tick + GPU llama3.1:8b #7
Merged
…is severed
Ran 6 pilots (UC1/UC2/UC3 baseline+reframed) against the post-R8-3 engine
via a new reusable harness at backend/scripts/run_use_case_pilot.py. All
3 README use cases failed to reproduce their quantitative claims:
| Case | README claim | Actual |
|-------------------------|--------------------|----------|
| uc1_baseline | stall at 12% | 97.3% |
| uc1_reframed | 31% | 97.4% |
| uc2_strategy_b | echo chamber | cascade |
| uc2_strategy_c | viral cascade | cascade |
| uc3_rto_raw | -38% eng sentiment | +0.70 |
| uc3_rto_restructured | -60% opposition | -0.3 pts |
Every pilot produced an identical step-by-step trajectory within a given
population size — controversy swung 0.80 to 0.15, utility 0.20 to 0.85,
and the final adoption rate moved by 0.002. That's the smoking gun: the
campaign framing inputs have zero effect on the simulation.
Root cause: CampaignConfig.{novelty,utility,controversy} are read into
CampaignEvent in step_runner.py and then dropped at the
_build_environment_events() boundary. The agent tick loop builds
MessageStrength from agent-derived values (media_signal,
cognition.evaluation_score) and a campaign_controversy method
parameter that defaults to 0.0 and is never set by any caller. The
entire R8-3 formula reformulation was mathematically correct but
operating on values that never come from the actual user inputs.
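The failure mode described above can be reproduced in a toy form. This is a minimal sketch, not the project's code: `build_environment_events`, `tick`, and `run_step` below are hypothetical stand-ins, and only the field names and the `campaign_controversy=0.0` default come from the description above. It shows why a defaulted parameter that no caller sets makes opposite framings indistinguishable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CampaignEvent:
    novelty: float
    utility: float
    controversy: float


def build_environment_events(campaign: CampaignEvent) -> dict:
    # Hypothetical stand-in for the boundary: the framing fields are
    # simply not copied across, so downstream code never sees them.
    return {"kind": "campaign"}


def tick(media_signal: float, evaluation_score: float,
         campaign_controversy: float = 0.0) -> dict:
    # Message strength is built from agent-derived values only; the
    # controversy parameter silently falls back to its default.
    return {
        "novelty": media_signal,
        "utility": evaluation_score,
        "controversy": campaign_controversy,
    }


def run_step(campaign: CampaignEvent, media_signal: float = 0.5,
             evaluation_score: float = 1.0) -> dict:
    _events = build_environment_events(campaign)  # framing is lost here
    return tick(media_signal, evaluation_score)   # controversy never passed


friendly = CampaignEvent(novelty=0.85, utility=0.85, controversy=0.15)
hostile = CampaignEvent(novelty=0.15, utility=0.15, controversy=0.85)
identical = run_step(friendly) == run_step(hostile)
```

Because nothing raises, the only visible symptom is the one the pilots found: identical output for different inputs.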
What this commit adds:
* backend/scripts/run_use_case_pilot.py — reusable pilot runner with
6 named cases, deterministic seeds, httpx-based API driver, and
JSON-output to docs/pilot_results/{case}.json
* docs/USE_CASE_PILOTS.md — full side-by-side of README claims vs
actual engine output, root cause writeup pointing at the exact
lines in step_runner.py + tick.py, and 5 proposed follow-up items
(wire fix, regression tests, re-calibration, LLM hardening, README
disclaimer)
* docs/pilot_results/*.json — raw per-case artifacts so the analysis
can be re-verified from the source data
The opinion synthesis plumbing from PR #2 held up perfectly — all 6
pilots got non-stub llama3.2:1b responses through the unique-constraint
+ shape-guarded persistence path. The small LLM hallucinated narratives
that matched the README (e.g. "rapid cascade in early_adopters stalls
against skeptic resistance") while the actual metrics showed every
community at 86-100% adoption. That's a separate hardening follow-up.
Next P1 task is the wire fix. Estimated: ~30 min CC, then a fresh pilot
round to verify. Regression tests in test_04_simulation_acceptance.py
will pin the outcome so this can't silently regress again.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The first pilot round in docs/USE_CASE_PILOTS.md found that every
Prophet simulation produced identical step-by-step trajectories
regardless of campaign framing — controversy=0.8 and controversy=0.2
both landed at final_adoption=0.973±0.001. Root cause (traced to
exact lines in the previous session): CampaignConfig.novelty and
.utility were read into CampaignEvent in step_runner.py and then
silently dropped before reaching the tick loop. Only .controversy was
forwarded, and it was forwarded as a method parameter that defaulted
to 0.0 and was never set by any caller. The entire campaign-framing
UI was effectively decoration.
This commit fixes the wire end-to-end across three layers, then
re-runs all six pilots on GPU to verify the fix.
## Wire fix (Round 8-6)
**1. community_orchestrator.py** — extract all three framing values
from the CampaignEvent and pass them into both AgentTick.tick() and
AgentTick.async_tick() alongside the existing campaign_controversy
forwarding.
**2. tick.py** — MessageStrength construction now blends:
    novelty     = 0.6 * campaign_novelty + 0.4 * media_signal
    utility     = 0.6 * campaign_utility + 0.4 * (evaluation_score / 2)
    controversy = campaign_controversy
                  (pure campaign: it's the objective polarising-ness
                  of the message, not an agent-perception quantity)
The 0.6/0.4 weights were tuned so a controversy=0.8 to controversy=0.2
swing produces a ~0.42 point delta in raw score (before clamp),
which is enough to move adoption 20+ points on the early steps.
**3. cognition.py** — Tier-1 rule engine gained a campaign_bonus term:
    bonus = 0.3 * (utility - 0.5) + 0.2 * (novelty - 0.5)
    evaluation += bonus * 2.0
This is centered at 0 for neutral campaigns so prior fixtures stay
green, but shifts evaluation_score by ±0.25 on extreme framings —
enough to move the ADOPT decision threshold meaningfully. evaluate()
and evaluate_async() both take new campaign_novelty + campaign_utility
parameters and the Tier-3 LLM fallback path also threads them through.
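The two formulas in items 2 and 3 can be sketched as plain functions. This is an illustration under assumptions, not the code in `tick.py` or `cognition.py`: the function names, the `MessageStrength` dataclass shape, and the constant names are hypothetical, and campaign inputs are assumed to lie in [0, 1]. Under that assumption the raw bonus term spans about ±0.25 before the 2.0 scale factor is applied.

```python
from dataclasses import dataclass

CAMPAIGN_WEIGHT = 0.6  # share taken from the campaign framing
AGENT_WEIGHT = 0.4     # share taken from agent-derived perception
BONUS_SCALE = 2.0      # scale applied to the bonus before adding it


@dataclass(frozen=True)
class MessageStrength:
    novelty: float
    utility: float
    controversy: float


def blend_message_strength(campaign_novelty: float, campaign_utility: float,
                           campaign_controversy: float, media_signal: float,
                           evaluation_score: float) -> MessageStrength:
    """Blend campaign framing with agent-derived values (item 2)."""
    novelty = CAMPAIGN_WEIGHT * campaign_novelty + AGENT_WEIGHT * media_signal
    utility = (CAMPAIGN_WEIGHT * campaign_utility
               + AGENT_WEIGHT * (evaluation_score / 2))
    # Controversy is passed through unblended.
    return MessageStrength(novelty, utility, campaign_controversy)


def campaign_bonus(campaign_novelty: float, campaign_utility: float) -> float:
    """Raw bonus term (item 3); zero when both inputs are 0.5."""
    return 0.3 * (campaign_utility - 0.5) + 0.2 * (campaign_novelty - 0.5)


def apply_campaign_bonus(evaluation: float, campaign_novelty: float,
                         campaign_utility: float) -> float:
    """Fold the scaled bonus into an evaluation value."""
    return evaluation + BONUS_SCALE * campaign_bonus(campaign_novelty,
                                                     campaign_utility)
```

For example, `blend_message_strength(0.85, 0.85, 0.15, media_signal=0.5, evaluation_score=1.0)` and the same call with 0.15 framing values differ in `novelty` only through the 0.6-weighted campaign term, since the agent-derived term is held fixed.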
## Regression test
test_04_step_runner.py::TestCampaignFramingAffectsOutcome runs two
sims with identical seeds + populations but opposite framings
(friendly: novelty=0.85, utility=0.85, controversy=0.15 vs
hostile: novelty=0.15, utility=0.15, controversy=0.85) and asserts:
    abs(friendly.adoption_rate - hostile.adoption_rate) >= 0.02
    friendly.adoption_rate > hostile.adoption_rate
Without the wire fix the delta is 0.0000 (bit-identical). With the
fix it's +0.1817 at step 4, which would have caught the regression
immediately.
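The two assertions can be wrapped in a small helper. This is a sketch only: `SimResult` and `assert_framing_affects_outcome` are hypothetical names, and the real test drives the simulation engine rather than taking precomputed results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimResult:
    adoption_rate: float  # final adoption as a fraction in [0, 1]


def assert_framing_affects_outcome(friendly: SimResult, hostile: SimResult,
                                   min_delta: float = 0.02) -> None:
    """Fail if opposite framings do not separate the two runs."""
    delta = friendly.adoption_rate - hostile.adoption_rate
    assert abs(delta) >= min_delta, (
        f"framing had no measurable effect: delta={delta:.4f}"
    )
    assert friendly.adoption_rate > hostile.adoption_rate, (
        f"hostile framing outperformed friendly: delta={delta:.4f}"
    )
```

Checking the magnitude first means a severed wire (delta of exactly 0) fails with a message that names the cause, rather than failing on the direction check.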
## Post-fix pilot deltas
| Pair | Pre-fix step-0 delta | Post-fix step-0 delta | Post-fix final delta |
|------|:---:|:---:|:---:|
| UC1 baseline -> reframed | +0.000 | **+0.236** | +0.017 |
| UC2 Strategy B -> Strategy C | +0.000 | **+0.264** | +0.017 |
| UC3 raw -> restructured | +0.000 | **+0.147** | **+0.185** |
UC3 raw is the clearest win — the hostile RTO mandate now produces
zero viral_cascade events and ends at 74.5% adoption vs 93.1% for
the restructured version. That's a real stall pattern, not just a
faster trajectory. UC1/UC2 still saturate at ~97% because the
1030-agent population crosses cascade critical mass even with
hostile framing; a 5K-10K run at the same weights would likely
produce sharper stalls.
## GPU + model upgrade (Round 8-6 stack changes)
* Ollama moved to GPU mode via `docker-compose.gpu.yml` — RTX 4070
SUPER 12 GiB runs llama3.1:8b at ~75 tok/s (CPU mode was ~4-8
tok/s). Every agent tick + opinion synthesis now completes in
sub-second wall time.
* Default model upgraded from llama3.2:1b to llama3.1:8b across
config.py, .env.example, docker-compose.yml, frontend/config/
constants.ts and four test files. llama3.1:8b is large enough
to stay anchored to the provided numeric evidence in the
opinion-synthesis prompt; the 1B model hallucinated narratives
matching the README claims instead of the actual metrics.
* Opinion synthesis timeout reverted from 120s (CPU fallback) back
to 30s now that GPU inference finishes in ~1-2s.
* README + CLAUDE.md Quick Start section rewritten with GPU as the
recommended path and CPU-only as a documented fallback with the
env-var overrides to flip back to llama3.2:1b.
## Runner + artifacts
`backend/scripts/run_use_case_pilot.py` was retuned to use the
llama3.1:8b default. All six result blobs under
`docs/pilot_results/*.json` regenerated with post-fix trajectories.
`docs/USE_CASE_PILOTS.md` gained a "Post-fix results (Round 8-6)"
section with before/after tables and an updated follow-up list
(population scaling + campaign_bonus weight tuning for sharper
stalls + echo-chamber detector gap).
## Test + CI
* Backend: `uv run pytest tests/ -q` → **1029 passed, 2 skipped**
(+1 new regression test, no regressions across the suite)
* The new test_04_step_runner.py::TestCampaignFramingAffectsOutcome
is the guardrail for this fix — it would have caught the original
wire gap immediately.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
showjihyun added a commit that referenced this pull request (Apr 13, 2026)
…n, validation)

Two-pass code review found 11 issues across 6 backend files:

Critical:
- #1 registry._call_adapter: wrap raw str→LLMPrompt before adapter.complete()
- #2 persist_step retry: re-insert EmergentEvent rows on rollback retry
- #8 deps.py singletons: add threading.Lock + double-checked locking
- #9 load_steps: bound EmergentEvent query with step≤max + limit

Important:
- #3 MC endpoint: asyncio.wait_for(300s) + 504 on timeout
- #4 settings PUT: str() coercion on Chinese LLM provider fields
- #5 monte_carlo.py: remove fragile iscoroutine guard, plain await
- #6 _config_to_dict: dataclasses.asdict for community serialization
- #7 UUID parse: _safe_uuid try/except replaces len>8 heuristic
- #10 persist_step retry: also re-insert agent_states + propagation_events
- #11 settings PUT: str() coercion on Anthropic/OpenAI/Gemini fields too

All 57 targeted tests pass (test_29 + test_06 + test_05).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary

Found and fixed a Prophet engine bug where `CampaignConfig.{novelty, utility, controversy}` were being read from the API payload but silently dropped before reaching the agent tick loop. Pilot-tested all three marketing use cases from `README.md` before and after the fix; committed the evidence.

This branch has two commits:

1. `629e0b0` docs(pilots): verify README use cases end-to-end, find campaign wire is severed. Built a reusable pilot harness (`backend/scripts/run_use_case_pilot.py`), ran 6 pilots, discovered all 6 produced bit-identical step-by-step trajectories regardless of campaign framing. Traced the root cause to exact file:line locations and wrote `docs/USE_CASE_PILOTS.md` with the verdict and a 5-item follow-up list.
2. `790fe60` feat(pilots): fix campaign framing wire + switch to GPU llama3.1:8b. Applied the three-layer fix, added a regression test that would have caught the bug immediately, re-ran all 6 pilots on GPU, and updated `docs/USE_CASE_PILOTS.md` with before/after deltas.

## The wire fix (three layers)
- `community_orchestrator.py`: extract all three framing values from `CampaignEvent`, forward into `tick()` + `async_tick()` alongside the existing `campaign_controversy` wire
- `tick.py`: `MessageStrength` construction now blends campaign inputs (60%) with agent-derived perception (40%): `novelty = 0.6·campaign_novelty + 0.4·media_signal`; `utility` uses the same weights with `evaluation_score / 2` as the agent-derived term. `controversy` stays pure-campaign
- `cognition.py`: `campaign_bonus = 0.3·(utility−0.5) + 0.2·(novelty−0.5)`, scaled ×2, folded into `evaluation_score`. Centered at 0 for neutral campaigns so legacy fixtures stay green

## Regression test

`test_04_step_runner.py::TestCampaignFramingAffectsOutcome` asserts that two seed-identical sims with opposite campaign framings differ by ≥2 adoption points. Without the wire fix the delta is `0.0000` (bit-identical); with the fix it's `+0.1817` at step 4.

## Post-fix pilot deltas
UC3 is the flagship result: the raw RTO mandate now fires zero `viral_cascade` events and stalls at 74.5% adoption, while the restructured version fires 3 cascades and reaches 93.1%. That's a real +18.5pt lift from restructuring, which directionally reproduces the README's "-60% opposition" claim.

UC1 and UC2 still saturate at ~97% because the 1030-agent populations cross cascade critical mass even with hostile framing. The follow-ups to get exact "stall at 12%" reproduction (population scaling to 5K-10K + stronger `campaign_bonus` weights) are documented in `docs/USE_CASE_PILOTS.md#follow-up-items-post-round-8-6`.

## Stack changes (Round 8-6)
- Ollama moved to GPU mode via the `docker-compose.gpu.yml` override. RTX 4070-class GPU runs llama3.1:8b at ~75 tok/s (CPU mode was ~4-8 tok/s). Every agent tick + opinion synthesis now finishes in sub-second wall time.
- Default model upgraded `llama3.2:1b` → `llama3.1:8b`. The 1B model was hallucinating opinion synthesis narratives that matched the README claims instead of the actual metrics. The 8B model stays anchored to the provided numeric evidence.
- Opinion synthesis timeout `120s → 30s`. The 120s timeout was a CPU-mode workaround. GPU calls finish in ~1-2s so 30s is still 15× headroom.

## Test plan

- `uv run pytest tests/`: 1029 passed, 2 skipped (+1 new regression test, no regressions)
- `TestCampaignFramingAffectsOutcome` passes on the post-fix code, fails (delta=0.0000) on the pre-fix code
- … `llama3.1:8b`, non-stub responses for every community-opinion + overall-opinion call
- `docs/pilot_results/*.json` regenerated with post-fix trajectories
- UC3 `raw → restructured` delta is +18.5 adoption points at final step (vs -0.3 pre-fix)

## Follow-up items (not in this PR)

- Tune `campaign_bonus` weights so UC1/UC2 produce sharper stalls (±0.5 delta on `evaluation_score` instead of current ±0.25)

🤖 Generated with Claude Code