Automated Prompt Optimization via Evolutionary Search (External Tool) #41
Replies: 3 comments 2 replies
-
This is some sophisticated prompt engineering and I think it's worth experimenting with. The "user" here will be us devs; the end user should not have to worry about such optimization, which should simplify the design a lot. The prompts should obviously be model-specific. I'm thinking we can manage these prompts in YAML files: one directory, with each file representing one model. The tricky part is that there's no reliable way to tell which model the end user is really using, and a heuristic approach like string matching won't be enough. Manual selection is also okay, but the solution must be seamless: the end user may change models on the fly, so it would be bad practice to force them to edit a YAML file every time.
This is an open design question. The concerns are the same as above: the user experience has to be seamless, so manual copy is probably a no-go.
No need to have versions for prompts. The latest Orb version gets the latest prompts.
Let's save eval datasets and maybe even pipelines for reproduction. No need for versioning here either.
If it's gonna be reusable then we can integrate it in a reusable way. I'm fine with whatever. Wdyt?
-
Built it. It's far from great yet, but it already kinda works. Repo: https://github.com/glitchbunny0/orb-evo
Two modes: editor evolution (deterministic scoring via Orb's audit) and director evolution (an LLM judge scores writer output against real conversation data). The director mode is the more interesting one. The core loop works: mine conversations from Orb's DB → evaluate baseline → mutate → evaluate → select best. Still rough around the edges, but it's already producing meaningful improvements. Have fun playing with it.
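For readers who want the shape of that core loop, here is a minimal sketch. It is not taken from the orb-evo repo: the `evolve` function, the toy mutator, and the toy scorer are all hypothetical stand-ins, and a real run would call an LLM for mutation and Orb's pipeline for evaluation.

```python
import random
from typing import Callable, Tuple


def evolve(
    baseline: str,
    mutate: Callable[[str, random.Random], str],
    evaluate: Callable[[str], float],
    generations: int = 5,
    children_per_gen: int = 4,
    seed: int = 0,
) -> Tuple[str, float]:
    """Greedy evolutionary search that keeps the best-scoring prompt seen so far."""
    rng = random.Random(seed)
    best, best_score = baseline, evaluate(baseline)  # evaluate baseline
    for _ in range(generations):
        # mutate: every child in a generation derives from the current best
        candidates = [mutate(best, rng) for _ in range(children_per_gen)]
        for cand in candidates:
            score = evaluate(cand)  # evaluate
            if score > best_score:  # select best
                best, best_score = cand, score
    return best, best_score


# Toy stand-ins (hypothetical), only here so the sketch runs end to end.
HINTS = ["Be concise.", "Preserve the author's voice.", "Make minimal changes."]


def toy_mutate(prompt: str, rng: random.Random) -> str:
    return prompt + " " + rng.choice(HINTS)


def toy_evaluate(prompt: str) -> float:
    # Rewards each distinct hint once and applies a small length penalty.
    return sum(h in prompt for h in HINTS) - 0.001 * len(prompt)


best, score = evolve("You are an editor.", toy_mutate, toy_evaluate)
```

Because a candidate only replaces the incumbent when it scores strictly higher, the returned score can never be worse than the baseline's.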
-
On second thought, let's just optimize for the smallest possible model there is to keep it simple. The bigger ones should benefit from the optimized prompt anyway. Better to avoid bloat and technical debt down the line. Best-effort is good enough.
-
Had some ideas after checking out https://github.com/NousResearch/hermes-agent-self-evolution
The Problem
Orb has several hand-crafted prompt strings that control critical pipeline behavior: the Director preamble, the Editor instructions (patch, rewrite, structural, combined), and the tool descriptions for direct_scene, editor_apply_patch, editor_rewrite, and rewrite_user_prompt. These prompts were written by humans and tuned through trial and error (probably?). They probably work reasonably well on some models and less well on others.
The thing is: different models respond very differently to the same prompt wording. An instruction that makes Claude follow patch rules precisely might confuse Gemma. A directive that keeps Qwen concise might make GLM over-edit. There is no single "best" prompt; there is a best prompt per model.
The Idea
An external, standalone tool that uses evolutionary optimization to automatically discover the best prompt text for each target model. The tool operates on Orb (reads its prompts, runs its pipeline, scores the output) but is not part of Orb. It lives in its own repo, produces evolved prompt files as output, and the user copies the winners into tool_defs.py manually.
The engine is GEPA (Genetic-Pareto Prompt Evolution), a reflective evolutionary optimizer from DSPy (ICLR 2026 Oral). It works by:
No weight training, no fine-tuning. All models do inference only — a local model runs the prompt writer role (using whatever GPU/CPU you have), and an API model handles mutation analysis and judging. The optimization operates on prompt text strings, not model weights. Runs for ~$2-10 in API costs per optimization pass (local inference is free).
How It Would Work
The Models
The process uses three distinct roles, potentially on different models:
The Mutator and Prompt Writer are separate roles. The Mutator says "the editor is over-patching — the instructions should emphasize minimal changes and preserving the author's voice more strongly." The Prompt Writer then produces the actual text. Both roles benefit from strong reasoning, so they share the same API model. The Target model (local) is the one being optimized for — it runs the Orb pipeline with candidate prompts, and we measure how well it performs.
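The split between the roles can be sketched as three plain callables. This is only an illustration under stated assumptions: the `Roles` dataclass, the `propose_candidate` helper, and the lambda stubs are hypothetical names, and in practice each callable would wrap an API or local model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Roles:
    # Each field is a hypothetical wrapper around a model call.
    mutator: Callable[[str, str], str]        # (prompt, failure report) -> critique
    prompt_writer: Callable[[str, str], str]  # (prompt, critique) -> new prompt text
    target: Callable[[str, str], str]         # (prompt, task input) -> pipeline output


def propose_candidate(roles: Roles, prompt: str, failure_report: str) -> str:
    """The Mutator diagnoses what to change; the Prompt Writer writes the text."""
    critique = roles.mutator(prompt, failure_report)
    return roles.prompt_writer(prompt, critique)


# Stubs so the sketch runs without any model behind it.
roles = Roles(
    mutator=lambda prompt, report: (
        "Emphasize minimal changes." if "over-patching" in report else "No change needed."
    ),
    prompt_writer=lambda prompt, critique: f"{prompt}\n{critique}",
    target=lambda prompt, text: text,
)

candidate = propose_candidate(roles, "You are an editor.", "editor is over-patching")
```

Keeping the roles as separate fields makes it easy to point `mutator` and `prompt_writer` at the same API model while `target` stays local.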
The Process
What Gets Evolved
Starting with the Editor pass (highest value, most measurable):
EDITOR_PREAMBLE: sets the editor's identity
EDITOR_PATCH_INSTRUCTIONS: how to fix audit issues via patches
EDITOR_REWRITE_INSTRUCTIONS: how to rewrite for length
EDITOR_BOTH_INSTRUCTIONS: combined fix
STRUCTURAL_REWRITE_INSTRUCTIONS: fix structural repetition
Later phases would tackle Director prompts and tool descriptions.
The Evaluation Dataset
Two sources:
Mined from Orb sessions — the SQLite conversation database contains real examples of Writer output, Editor audit reports, and the resulting edits. These become (input → expected output) pairs with ground truth.
Synthetically generated — run the pipeline with deliberately flawed outputs (insert banned phrases, repetitive openers, template patterns, length violations) and use the known fixes as ground truth.
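The synthetic source could be built with small flaw injectors that keep the clean text as ground truth. The sketch below is an assumption about how that might look, not Orb code: `EvalPair`, both injector functions, and the sample phrases are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EvalPair:
    flawed: str    # input handed to the Editor
    expected: str  # known-good text used as ground truth
    flaw: str      # label of the injected flaw


def inject_banned_phrase(clean: str, phrase: str) -> EvalPair:
    """Append a banned phrase as an extra sentence; the clean text is the fix."""
    return EvalPair(flawed=f"{clean} {phrase}", expected=clean, flaw="banned_phrase")


def inject_repeated_opener(sentences: List[str], opener: str) -> EvalPair:
    """Prefix every sentence with the same opener to simulate repetitive openers."""
    flawed = " ".join(f"{opener} {s[:1].lower()}{s[1:]}" for s in sentences)
    return EvalPair(flawed=flawed, expected=" ".join(sentences), flaw="repeated_opener")


pair_a = inject_banned_phrase("The rain fell.", "A shiver ran down her spine.")
pair_b = inject_repeated_opener(["The rain fell.", "A door slammed."], "Suddenly,")
```

Since each pair records which flaw was injected, per-flaw scores can be reported separately later.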
The Fitness Function
This is where Orb's existing audit system shines. The scoring is deterministic — no LLM judge needed for the core metric:
For Director prompts (Phase 2), where "good scene direction" is subjective, an LLM-as-judge would be needed.
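To make "deterministic scoring" concrete, here is a rough sketch of an audit-based fitness for the Editor pass. It does not reproduce Orb's actual audit: the `audit` checks, the issue labels, and the scoring formula are assumptions chosen only to show that no LLM judge is involved.

```python
from typing import List, Sequence


def audit(text: str, banned: Sequence[str], max_words: int) -> List[str]:
    """Hypothetical stand-in for an audit: returns a list of issue labels."""
    lowered = text.lower()
    issues = [f"banned:{p}" for p in banned if p.lower() in lowered]
    if len(text.split()) > max_words:
        issues.append("too_long")
    return issues


def fitness(before: str, after: str, banned: Sequence[str], max_words: int) -> float:
    """Share of original issues fixed, penalized by newly introduced ones."""
    issues_before = set(audit(before, banned, max_words))
    issues_after = set(audit(after, banned, max_words))
    if not issues_before:
        return 1.0 if not issues_after else 0.0
    fixed = len(issues_before - issues_after)
    introduced = len(issues_after - issues_before)
    return max(0.0, (fixed - introduced) / len(issues_before))


BANNED = ["a shiver ran down her spine"]
draft = "A shiver ran down her spine. The rain fell."
fixed_score = fitness(draft, "The rain fell.", BANNED, max_words=50)
unchanged_score = fitness(draft, draft, BANNED, max_words=50)
```

The same inputs always give the same score, which is what makes candidate prompts directly comparable across runs.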
Model-Specific Results
Each optimization run targets one specific model. You'd evolve separately for each:
Each produces different prompt text, each measurably better on its target model than any generalized prompt could be. If Orb later wants to ship model-specific prompts, it could use a model→prompt lookup table. But that's downstream — the optimization tool itself just produces text files.
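A model→prompt lookup table of the kind mentioned above might be as small as the sketch below. The model identifiers, the `default` key, and `select_prompt` are all hypothetical, and the lookup deliberately uses exact keys rather than fuzzy string matching on model names.

```python
from typing import Dict, Optional

DEFAULT_KEY = "default"  # hypothetical key for the generalized prompt


def select_prompt(prompts: Dict[str, str], model_id: Optional[str]) -> str:
    """Return the evolved prompt for an exact model id, else the default prompt."""
    if model_id is not None and model_id in prompts:
        return prompts[model_id]
    return prompts[DEFAULT_KEY]


# Illustrative table; the ids and prompt texts are placeholders.
EDITOR_PREAMBLES = {
    "default": "You are an editor.",
    "model-a": "You are an editor. Make minimal changes.",
    "model-b": "You are an editor. Be concise.",
}

known = select_prompt(EDITOR_PREAMBLES, "model-a")
unknown = select_prompt(EDITOR_PREAMBLES, "some-other-model")
```

Falling back to the default keeps an unrecognized or freshly swapped model working without any manual edit.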
Constraints / Guardrails
Why External?
tool_defs.py
What This Does NOT Do
Open Questions