Four AI models ran the pipeline task spec — here is what happened #2

codifide · 2026-05-15T05:53:13Z

codifide
May 15, 2026
Maintainer

The experiment

We gave four external AI models (GPT-4o, Gemini 2.5 Pro, Claude, GPT-5.4) the same task: build a content-moderation pipeline in Codifide from scratch, using only the docs. The models had no prior knowledge of the language. The task has five programs to write, each building on the last.

Results

All four models completed all five programs. The friction points they hit became the v2.0 requirements.

Program 1 — Keyword classifier

All models: first-attempt success. The belief(value, confidence) pattern and multi-cand dispatch were intuitive.
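For readers who have not seen Codifide, here is a rough Python analogy of the idea, not Codifide syntax. It assumes that belief(value, confidence) pairs a label with a confidence score, and that multi-cand dispatch means several candidate branches are considered and one is selected. The `Belief` class, the keyword lists, and the scoring rule are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Belief:
    """Python stand-in for a value paired with a confidence (assumed shape)."""
    value: str
    confidence: float


# Hypothetical keyword lists; the task spec's real lists are not shown in this post.
KEYWORDS = {
    "spam": ("free money", "click here"),
    "abuse": ("idiot", "hate you"),
}


def classify(text: str) -> Belief:
    """Score each candidate label, then return the most confident one."""
    lowered = text.lower()
    candidates = []
    for label, words in KEYWORDS.items():
        hits = sum(1 for word in words if word in lowered)
        if hits:
            # Illustrative scoring only: more keyword hits means more confidence.
            candidates.append(Belief(label, min(1.0, 0.5 + 0.25 * hits)))
    if not candidates:
        # No keyword matched, so fall back to a default label.
        return Belief("ok", 0.9)
    return max(candidates, key=lambda belief: belief.confidence)
```

Calling `classify("you idiot")` under these assumptions yields a `Belief` labeled `abuse` with confidence 0.75.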

Program 2 — Confidence-gated refusal

All models: first-attempt success. The believe block with ge(conf(result), 0.70) => result / else => bottom pattern worked cleanly.
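The gate can be pictured with a short Python analogy (again, not Codifide syntax). It assumes that conf(result) reads the confidence of a belief and that bottom means "no answer"; `None` stands in for bottom here, and the `Belief` class and `gate` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Belief:
    """Python stand-in for a value paired with a confidence (assumed shape)."""
    value: str
    confidence: float


THRESHOLD = 0.70  # the cutoff quoted in the post


def gate(result: Belief, threshold: float = THRESHOLD) -> Optional[Belief]:
    """Pass the result through if confident enough, otherwise refuse."""
    if result.confidence >= threshold:  # mirrors ge(conf(result), 0.70)
        return result
    return None  # stands in for bottom (assumption)
```

Because the comparison is greater-or-equal, a belief at exactly 0.70 passes the gate in this sketch.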

Program 3 — Escalation router

Most models: first-attempt success. One model hit the bind-before-when footgun (using a bound name in a when guard). Fixed in v2.0 as a parse error.

Program 4 — Pipeline with I/O

All models: first-attempt success. The double-print behavior of io.say surprised everyone — documented in the cookbook.

Program 5 — Content-addressed composition

Before v2.0 this was a universal friction point: every model struggled with it, because it required four CLI commands, an index ceremony, and a runtime flag. The RPC API in v2.0 fixed this. After the fix, all models completed it on first attempt.
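If content addressing is new to you, the general idea fits in a few lines of Python. This is a generic sketch of the technique, not Codifide's implementation: the post does not describe the hash function, the index format, or the RPC API, so SHA-256 and the `ContentStore` class are assumptions made for illustration.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is a hash of the content itself."""

    def __init__(self) -> None:
        self._blobs: dict[str, str] = {}

    def put(self, source: str) -> str:
        """Store a piece of source text and return its address."""
        # SHA-256 is an assumption; Codifide's real scheme is not given in the post.
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        self._blobs[digest] = source
        return digest

    def get(self, digest: str) -> str:
        """Look up content by address; raises KeyError if it is unknown."""
        return self._blobs[digest]
```

The property that matters for composition is that identical content always gets the same address, so one program can refer to another by hash rather than by name or file path.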

What we learned

The language is legible to fresh agents. The content-addressing story is the hardest part to explain but the most valuable once understood. The cookbook now documents 12 failure modes from these sessions.

Try it yourself

The task spec is at docs/AGENT_TASK_SPEC.md. Run it against any model and share your results here.
