Four AI models ran the pipeline task spec — here is what happened #2
codifide started this conversation in Show and tell
The experiment
We gave four external AI models (GPT-4o, Gemini 2.5 Pro, Claude, GPT-5.4) the same task: build a content-moderation pipeline in Codifide from scratch, using only the docs. No prior knowledge of the language. Five programs to write, each building on the last.
Results
All four models completed all five programs. The friction points they hit became the v2.0 requirements.
Program 1: Keyword classifier
All models: first-attempt success. The `belief(value, confidence)` pattern and multi-cand dispatch were intuitive.
Program 2: Confidence-gated refusal
All models: first-attempt success. The `believe` block with the `ge(conf(result), 0.70) => result / else => bottom` pattern worked cleanly.
Program 3: Escalation router
Most models: first-attempt success. One model hit the bind-before-when footgun (using a bound name in a `when` guard). Fixed in v2.0 as a parse error.
Program 4: Pipeline with I/O
All models: first-attempt success. The double-print behavior of `io.say` surprised everyone; it is now documented in the cookbook.
Program 5: Content-addressed composition
Universal friction point before v2.0. Required four CLI commands, an index ceremony, and a runtime flag. Fixed by the RPC API in v2.0. After the fix: all models completed it on first attempt.
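For readers new to the idea, here is a minimal Python sketch of what content-addressed composition means in general. It is not Codifide code and does not reflect how the Codifide index or RPC API actually work. The names `STORE`, `put`, `get`, and `compose` are hypothetical and exist only for illustration. The sketch assumes a stage is identified by the SHA-256 hash of its source text, and that a pipeline is just a list of those hashes, itself stored under its own hash.

```python
import hashlib

# Hypothetical in-memory store: maps a content hash to the text it names.
STORE: dict[str, str] = {}


def put(source: str) -> str:
    """Store text under the SHA-256 hex digest of its UTF-8 bytes."""
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    STORE[digest] = source
    return digest


def get(digest: str) -> str:
    """Fetch text by hash and check that it still matches its address."""
    source = STORE[digest]  # raises KeyError if the hash is unknown
    if hashlib.sha256(source.encode("utf-8")).hexdigest() != digest:
        raise ValueError("stored content does not match its address")
    return source


def compose(*digests: str) -> str:
    """Store a pipeline that refers to its stages by hash only."""
    for d in digests:
        get(d)  # fail early if a stage is missing or corrupted
    manifest = "\n".join(digests)
    return put(manifest)


# Placeholder stage bodies; real stages would be program source.
classifier = put("stage: keyword classifier")
router = put("stage: escalation router")
pipeline = compose(classifier, router)
```

Because an address is derived from content, storing the same stage twice yields the same hash, and changing any stage (or the order of stages) yields a different pipeline hash.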
What we learned
The language is legible to fresh agents. The content-addressing story is the hardest part to explain but the most valuable once understood. The cookbook now documents 12 failure modes from these sessions.
Try it yourself
The task spec is at `docs/AGENT_TASK_SPEC.md`. Run it against any model and share your results here.