Automated Prompt Optimization via Evolutionary Search (External Tool) #41

glitchbunny0 · 2026-05-11T21:54:14Z

glitchbunny0
May 11, 2026

Had some ideas after checking out https://github.com/NousResearch/hermes-agent-self-evolution

The Problem

Orb has several hand-crafted prompt strings that control critical pipeline behavior: the Director preamble, the Editor instructions (patch, rewrite, structural, combined), and the tool descriptions for direct_scene, editor_apply_patch, editor_rewrite, and rewrite_user_prompt. These prompts were written by humans and, as far as I can tell, tuned through trial and error. They probably work reasonably well on some models and less well on others.

The thing is: different models respond very differently to the same prompt wording. An instruction that makes Claude follow patch rules precisely might confuse Gemma. A directive that keeps Qwen concise might make GLM over-edit. There is no single "best" prompt — there is a best prompt per model.

The Idea

An external, standalone tool that uses evolutionary optimization to automatically discover the best prompt text for each target model. The tool operates on Orb (reads its prompts, runs its pipeline, scores the output) but is not part of Orb. It lives in its own repo, produces evolved prompt files as output, and the user copies the winners into tool_defs.py manually.

The engine is GEPA (Genetic-Pareto Prompt Evolution) — a reflective evolutionary optimizer from DSPy (ICLR 2026 Oral). It works by:

1. Taking a prompt text as the starting "organism"
2. Running the pipeline with that prompt against a set of test cases
3. Scoring the results
4. Reading the execution traces to understand why things failed
5. Proposing targeted mutations to the prompt text
6. Generating new candidate variants
7. Repeating until convergence

No weight training, no fine-tuning. All models do inference only: a local model serves as the target and runs the Orb pipeline (using whatever GPU/CPU you have), and an API model handles mutation analysis, prompt writing, and judging. The optimization operates on prompt text strings, not model weights. Estimated cost is ~$2-10 in API fees per optimization pass (local inference is free).

How It Would Work

The Models

The process uses three distinct roles, potentially on different models:

Role	Purpose	Example models
Target	The model whose behavior we are optimizing the prompt for. It runs the Orb pipeline (Director/Writer/Editor) using the candidate prompt.	Gemma 4, GLM-5.1, Claude, Qwen, etc.
Mutator	Analyzes execution traces — reads what went wrong and proposes how to change the prompt (not the new text itself, just the mutation strategy).	API model with strong reasoning
Prompt Writer	Takes the mutation strategy and writes a concrete new candidate prompt. Benefits from strong instruction following.	API model with strong reasoning

The Mutator and Prompt Writer are separate roles. The Mutator says "the editor is over-patching — the instructions should emphasize minimal changes and preserving the author's voice more strongly." The Prompt Writer then produces the actual text. Both roles benefit from strong reasoning, so they share the same API model. The Target model (local) is the one being optimized for — it runs the Orb pipeline with candidate prompts, and we measure how well it performs.
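The two-step split between Mutator and Prompt Writer can be sketched as two plain functions. This is a minimal illustration, not orb-evo's code: the `LLM` callable, the `Trace` fields, and the request wording are all assumptions, and stub models stand in for the API model so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical interface: request text in, completion text out


@dataclass
class Trace:
    case_id: str
    score: float
    notes: str  # what the audit or judge reported for this case


def propose_strategy(mutator: LLM, parent_prompt: str, traces: List[Trace]) -> str:
    """Mutator step: describe what should change, without writing the new prompt."""
    worst = sorted(traces, key=lambda t: t.score)[:3]  # lowest-scoring cases first
    report = "\n".join(f"- {t.case_id} (score {t.score:.2f}): {t.notes}" for t in worst)
    return mutator(
        f"Current prompt:\n{parent_prompt}\n\n"
        f"Lowest-scoring cases:\n{report}\n\n"
        "Describe how the prompt should change. Do not write the new prompt."
    )


def write_candidate(writer: LLM, parent_prompt: str, strategy: str) -> str:
    """Prompt Writer step: turn the strategy into concrete prompt text."""
    return writer(
        f"Parent prompt:\n{parent_prompt}\n\n"
        f"Requested change:\n{strategy}\n\n"
        "Write the full revised prompt and nothing else."
    ).strip()


# Stub models so the sketch runs without any API access.
mutator_requests: List[str] = []


def stub_mutator(request: str) -> str:
    mutator_requests.append(request)
    return "Emphasize minimal changes and preserving the author's voice."


def stub_writer(request: str) -> str:
    return "  Fix only the flagged issues. Change as little as possible.  "


traces = [
    Trace("case-1", 0.9, "all issues fixed"),
    Trace("case-2", 0.2, "editor rewrote untouched paragraphs"),
]
strategy = propose_strategy(stub_mutator, "Fix the flagged issues.", traces)
candidate = write_candidate(stub_writer, "Fix the flagged issues.", strategy)
print(candidate)
```

Keeping the two steps as separate calls means the same API model can serve both roles, or the roles can be pointed at different models later without touching the loop.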

The Process

For each target model:
  1. Select a prompt to evolve (e.g., EDITOR_PATCH_INSTRUCTIONS)
  2. Load or generate an evaluation dataset
  3. Run the initial prompt against all eval cases, collect scores + traces
  4. Loop (N iterations):
     a. Mutator analyzes traces: "why did candidates fail/succeed?"
     b. Mutator proposes mutations: "change X, emphasize Y, remove Z"
     c. Prompt Writer generates new candidate from mutation + parent prompt
     d. Run new candidate against eval cases
     e. Score and rank all candidates
     f. Keep top K, discard the rest
  5. Final winner is the best-scoring candidate on the holdout set
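The loop above can be sketched in a few lines of Python. This is a simplified stand-in, not GEPA itself (there is no Pareto front and no trace reflection here), and `evolve`, `toy_score`, and `toy_mutate` are hypothetical names. The toy scorer and mutator exist only so the sketch runs without any model.

```python
import random
from typing import Callable, List, Tuple

Scorer = Callable[[str, List[str]], float]    # (prompt, eval cases) -> score in [0, 1]
Mutate = Callable[[str, random.Random], str]  # (parent prompt, rng) -> child prompt


def evolve(seed_prompt: str, train_cases: List[str], holdout_cases: List[str],
           score: Scorer, mutate: Mutate,
           iterations: int = 10, population: int = 5, top_k: int = 2,
           seed: int = 0) -> str:
    rng = random.Random(seed)
    # Step 3: score the initial prompt.
    pool: List[Tuple[float, str]] = [(score(seed_prompt, train_cases), seed_prompt)]
    for _ in range(iterations):
        # Steps 4a-4c: derive children from the current survivors.
        parents = [prompt for _, prompt in pool]
        children = [mutate(rng.choice(parents), rng) for _ in range(population)]
        # Step 4d: evaluate each child that is not already in the pool.
        seen = set(parents)
        for child in children:
            if child not in seen:
                seen.add(child)
                pool.append((score(child, train_cases), child))
        # Steps 4e-4f: rank everything, keep the top K.
        pool.sort(key=lambda pair: pair[0], reverse=True)
        pool = pool[:top_k]
    # Step 5: the winner is the survivor that scores best on the holdout set.
    return max((prompt for _, prompt in pool), key=lambda p: score(p, holdout_cases))


# Toy stand-ins: the score is the fraction of desired phrases present in the prompt.
PHRASES = ["Make minimal changes.", "Preserve the author's voice.", "Fix only flagged issues."]


def toy_score(prompt: str, cases: List[str]) -> float:
    return sum(phrase in prompt for phrase in PHRASES) / len(PHRASES)


def toy_mutate(parent: str, rng: random.Random) -> str:
    return parent + " " + rng.choice(PHRASES)


seed_text = "You are an editor."
winner = evolve(seed_text, ["case"], ["holdout case"], toy_score, toy_mutate)
print(winner)
```

In a real run, `score` would execute the target model on each case and `mutate` would call the Mutator and Prompt Writer. Skipping children that are already in the pool avoids paying for duplicate evaluations.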

What Gets Evolved

Starting with the Editor pass (highest value, most measurable):

EDITOR_PREAMBLE — sets the editor's identity
EDITOR_PATCH_INSTRUCTIONS — how to fix audit issues via patches
EDITOR_REWRITE_INSTRUCTIONS — how to rewrite for length
EDITOR_BOTH_INSTRUCTIONS — combined fix
STRUCTURAL_REWRITE_INSTRUCTIONS — fix structural repetition

Later phases would tackle Director prompts and tool descriptions.

The Evaluation Dataset

Two sources:

Mined from Orb sessions — the SQLite conversation database contains real examples of Writer output, Editor audit reports, and the resulting edits. These become (input → expected output) pairs with ground truth.
Synthetically generated — run the pipeline with deliberately flawed outputs (insert banned phrases, repetitive openers, template patterns, length violations) and use the known fixes as ground truth.
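The synthetic route can be illustrated with a small flaw injector. This is a sketch under assumptions: the banned phrases are placeholders (Orb's real list is not shown in this thread), and `EvalCase` and `inject_flaws` are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import List

# Placeholder phrases for illustration only; not Orb's actual banned list.
BANNED_PHRASES = ["a shiver ran down her spine", "little did they know"]


@dataclass
class EvalCase:
    flawed_text: str     # what the Editor receives
    clean_text: str      # known-good ground truth
    injected: List[str]  # flaws the Editor is expected to remove


def inject_flaws(clean_text: str, rng: random.Random, n_flaws: int = 1) -> EvalCase:
    """Insert banned phrases as extra sentences at random positions."""
    sentences = [s.strip() for s in clean_text.split(".") if s.strip()]
    injected = rng.sample(BANNED_PHRASES, k=min(n_flaws, len(BANNED_PHRASES)))
    for phrase in injected:
        position = rng.randrange(len(sentences) + 1)
        sentences.insert(position, phrase.capitalize())
    flawed = ". ".join(sentences) + "."
    return EvalCase(flawed_text=flawed, clean_text=clean_text, injected=injected)


case = inject_flaws("The rain fell. The door was open.", random.Random(0), n_flaws=2)
print(case.flawed_text)
```

Because the clean text is kept next to the flawed one, the known fix is available as ground truth without any manual labeling.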

The Fitness Function

This is where Orb's existing audit system shines. The scoring is deterministic — no LLM judge needed for the core metric:

Issue fix rate: Run the audit pipeline on the edited output. How many flagged issues were actually resolved?
Preservation score: How much of the original text survived? Over-editing is punished.
Length compliance: If a length guard was triggered, did the edit actually hit the target?
No new issues: Did the edit introduce new slop/repetition that wasn't there before?
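One way to combine the four signals into a single score is a weighted sum. This sketch makes several assumptions: the weights are arbitrary, `toy_audit` is a stand-in for Orb's audit pipeline (whose real interface is not shown here), and preservation is approximated with `difflib.SequenceMatcher` from the standard library.

```python
import difflib
from typing import Callable, Optional, Set, Tuple

Audit = Callable[[str], Set[str]]  # text -> set of issue identifiers (stand-in for Orb's audit)


def fitness(original: str, edited: str, audit: Audit,
            target_words: Optional[int] = None, tolerance: float = 0.1,
            weights: Tuple[float, float, float, float] = (0.5, 0.3, 0.1, 0.1)) -> float:
    before, after = audit(original), audit(edited)
    # Issue fix rate: share of originally flagged issues that are gone.
    fix_rate = 1.0 if not before else len(before - after) / len(before)
    # Preservation score: character-level similarity to the original text.
    preservation = difflib.SequenceMatcher(None, original, edited).ratio()
    # Length compliance: only checked when a length target applies.
    if target_words is None:
        length_ok = 1.0
    else:
        off_by = abs(len(edited.split()) - target_words)
        length_ok = 1.0 if off_by <= tolerance * target_words else 0.0
    # No new issues: any issue absent before the edit counts against it.
    no_new = 1.0 if not (after - before) else 0.0
    w_fix, w_keep, w_len, w_new = weights
    return w_fix * fix_rate + w_keep * preservation + w_len * length_ok + w_new * no_new


def toy_audit(text: str) -> Set[str]:
    """Toy audit: flags one placeholder banned phrase."""
    banned = ["little did they know"]
    return {phrase for phrase in banned if phrase in text.lower()}


original = "The rain fell. Little did they know, the door was open."
minimal_fix = fitness(original, "The rain fell. The door was open.", toy_audit)
untouched = fitness(original, original, toy_audit)
over_edit = fitness(original, "Storm.", toy_audit)
print(minimal_fix, untouched, over_edit)
```

With these weights a minimal fix outranks both an untouched text and a wholesale rewrite, which is the ordering the four criteria are meant to enforce.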

For Director prompts (Phase 2), where "good scene direction" is subjective, an LLM-as-judge would be needed.

Model-Specific Results

Each optimization run targets one specific model. You'd evolve separately for each:

output/
  editor_patch_instructions/
    gemma-4-31b/
    glm-5.1/
    claude-sonnet/

Each run produces different prompt text, and the expectation is that each scores better on its target model than a single generalized prompt would. If Orb later wants to ship model-specific prompts, it could use a model→prompt lookup table. But that's downstream: the optimization tool itself just produces text files.
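A model→prompt lookup over the directory layout above could look like the following. The `best.txt` file name, the `normalize` rule, and the fallback behavior are assumptions for illustration. Exact matching after normalization is only a starting point, since the model name a backend reports may not match a directory name.

```python
import tempfile
from pathlib import Path


def normalize(model_name: str) -> str:
    """Lowercase and drop a provider prefix such as 'openai/'."""
    return model_name.strip().lower().split("/")[-1]


def load_prompt(output_root: Path, prompt_name: str, model_name: str, default: str) -> str:
    """Return the evolved prompt for this model, or the default if none exists."""
    # 'best.txt' is a hypothetical file name for the winning candidate.
    candidate = output_root / prompt_name / normalize(model_name) / "best.txt"
    if candidate.is_file():
        return candidate.read_text(encoding="utf-8")
    return default


# Demo against a temporary directory that mimics the output/ layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    model_dir = root / "editor_patch_instructions" / "gemma-4-31b"
    model_dir.mkdir(parents=True)
    (model_dir / "best.txt").write_text("evolved text", encoding="utf-8")

    hit = load_prompt(root, "editor_patch_instructions", "openai/Gemma-4-31B", "DEFAULT")
    miss = load_prompt(root, "editor_patch_instructions", "qwen", "DEFAULT")

print(hit, miss)
```

Falling back to the default prompt keeps behavior unchanged for any model that has not been optimized yet.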

Constraints / Guardrails

Size cap: Evolved prompts stay within a max character limit
Semantic preservation: The prompt must still address the same use case (patch vs rewrite vs structural) — a prompt that optimizes scores by ignoring patch mode and always rewriting is rejected
Safety: Evolved prompts must not introduce new attack vectors
Deterministic: The fitness function uses Orb's own audit pipeline, not subjective LLM scoring (for Phase 1 at least)
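The size cap and a crude version of the semantic-preservation check can be run before any candidate reaches the target model. In this sketch the 2000-character default mirrors the limit that appears in the run log later in this thread, while `MODE_KEYWORDS` and the keyword test are hypothetical stand-ins for a real check of whether a prompt still addresses its use case.

```python
from typing import Dict, List, Tuple

# Hypothetical keyword proxy: a prompt for a mode should still mention that mode.
MODE_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "patch": ("patch",),
    "rewrite": ("rewrite",),
    "structural": ("structur",),
}


def check_constraints(prompt: str, mode: str, max_chars: int = 2000) -> List[str]:
    """Return human-readable violations; an empty list means the candidate may be evaluated."""
    violations: List[str] = []
    if len(prompt) > max_chars:
        violations.append(f"Prompt size {len(prompt)} exceeds max {max_chars}")
    if not any(keyword in prompt.lower() for keyword in MODE_KEYWORDS[mode]):
        violations.append(f"Prompt no longer mentions its mode: {mode}")
    return violations


ok = check_constraints("Apply each patch with minimal changes.", "patch")
too_long = check_constraints("patch " + "x" * 2211, "patch")
drifted = check_constraints("Always rewrite the whole scene.", "patch")
print(ok, too_long, drifted)
```

Rejecting candidates up front saves target-model calls, but if the mutation step keeps producing oversized prompts, every generation is spent on rejected candidates, so the cap should be communicated to the Prompt Writer as well.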

Why External?

No new dependencies in Orb
No risk of breaking Orb during optimization runs
Can evolve against different Orb versions independently
The tool imports Orb's audit functions for scoring but never mutates them
Evolved prompts are output as text files — the user reviews and copies the winners into tool_defs.py

What This Does NOT Do

Does not modify Orb's pipeline architecture
Does not add runtime overhead to Orb
Does not auto-deploy anything — all changes are human-reviewed
Does not train or fine-tune any model — pure prompt text optimization

Open Questions

Would a model-specific prompt lookup in Orb itself be useful, or is manual copy sufficient?
Should the evolved prompts be versioned alongside Orb releases?
Is there interest in a shared eval dataset (real conversation examples) that contributors could use to test prompts?
Should this become a reusable tool for any project with prompt-driven pipelines, or stay Orb-specific?

OrbFrontend · 2026-05-12T04:47:13Z

OrbFrontend
May 12, 2026
Maintainer

This is some sophisticated prompt engineering, and I think it's worth experimenting with. The "user" here will be us devs; the end user should not have to worry about this kind of optimization, which should simplify the design a lot.

The prompts should obviously be model-specific. I'm thinking we can manage them in YAML files: one directory, with one file per model. The tricky part is that there's no reliable way to tell which model the end user is really running, so a heuristic like string matching won't be enough. Manual selection is also okay, but whatever we pick must be seamless: the end user may change models on the fly, so it would be bad practice to force them to edit a YAML file every time.

Would a model-specific prompt lookup in Orb itself be useful, or is manual copy sufficient?

This is an open design question. The concerns are the same as above: the user experience has to be seamless, so manual copy is probably a no-go.

Should the evolved prompts be versioned alongside Orb releases?

No need to have versions for prompts. The latest Orb version gets the latest prompts.

Is there interest in a shared eval dataset (real conversation examples) that contributors could use to test prompts?

Let's save the eval datasets, and maybe even the pipelines, for reproducibility. No need for versioning here either.

Should this become a reusable tool for any project with prompt-driven pipelines, or stay Orb-specific?

If it's gonna be reusable then we can integrate in a reusable way. I'm fine with whatever.

Wdyt?

glitchbunny0 · 2026-05-12T23:30:45Z

glitchbunny0
May 12, 2026
Author

Built it — far from great yet, but it already kinda works.

Repo: https://github.com/glitchbunny0/orb-evo

Two modes: editor evolution (deterministic scoring via Orb's audit) and director evolution (LLM judge scores writer output against real conversation data). The director mode is the more interesting one.

The core loop works: mine conversations from Orb's DB → evaluate baseline → mutate → evaluate → select best. Still rough around the edges but already producing meaningful improvements. Have fun playing with it.

2 replies

OrbFrontend May 13, 2026
Maintainer

Nice, I'll have a look.

OrbFrontend May 13, 2026
Maintainer

I tried to run it with the command below and hit some issues; not sure if I'm doing it right. I had to prepend openai/ to the --mutator-model value, otherwise the remote DeepSeek endpoint would throw an error.

export ORB_REPO_PATH=~/Anonymous/Orb

python -u -m orb_evo.cli evolve-editor \
  --prompt editor_patch \
  --orb-repo ~/Anonymous/Orb \
  --target-base http://localhost:5000/v1 \
  --mutator-model openai/deepseek-v4-pro \
  --mutator-base https://api.deepseek.com \
  --iterations 10 --population 5 --dataset-size 20

Target model never received any prompts from generation 2 onward.

[orb-evo] === Generation 10/10 ===
  [1] Score: 0.5582 (baseline: 0.5582)
  [2] Score: 0.5582 (baseline: 0.5582)
  [3] Constraint violations: Prompt size 2217 exceeds max 2000
  [4] Constraint violations: Prompt size 2217 exceeds max 2000
  [5] Constraint violations: Prompt size 2217 exceeds max 2000
[orb-evo] Gen 10: best=0.5582, mean=0.5582, pop=7

[orb-evo] === Complete ===
  Baseline: 0.5582
  Best:     0.5582
  Improvement: +0.0000

OrbFrontend · 2026-06-02T15:42:53Z

OrbFrontend
Jun 2, 2026
Maintainer

On second thought, let's just optimize for the smallest possible model there is to keep it simple. The bigger ones should benefit from the optimized prompt anyway. Better to avoid bloat and technical debt down the line. Best-effort is good enough.
