feat(supervisor): proposer-profile optimization + discriminating eval (bottleneck taxonomy) by drewstone · Pull Request #405 · tangle-network/agent-runtime

drewstone · 2026-06-28T17:23:54Z

What

The proposer-profile optimization path for the self-improving supervisor — so the driver can actually beat a single agent given more compute, not just match it.

surface-worker.ts — thread the driver's brief into each worker (it was discarded: every spawn was an identical refine attempt). Steering now reaches the worker.
ablation.ts — the baseline driver prompt is now a real proposer: each worker a distinct, targeted hypothesis (direct fix → upstream cause → edge case → other module), not "try again".
gepa-driver-prompt.ts — GEPA optimizes the real driver/proposer prompt: each candidate runs a full selfImprovingSupervisor rollout, executable-graded by the supervised resolve (no LLM judge). Reports spend to the campaign cost meter so the backend-integrity guard sees a real backend.
hard-coding-env.ts — a mid-difficulty contamination-proof generated task (stack expression evaluator + edge cases) as the cheap optimization substrate (the original generated task is saturated; SWE-bench is too expensive to search prompt-space on). Calibrated reference→100% / stub→0% across 10 seeds.
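
The "reference→100% / stub→0%" calibration can be pictured with a small in-process sketch. Everything here is an assumption made for illustration: `Solver`, `makeTask`, `passRate` and `calibrate` are hypothetical names, the task is a toy postfix evaluator, and grading is a direct equality check rather than the host test run the example file uses.

```typescript
// Illustrative only: all names are hypothetical, and grading happens
// in-process rather than through a test runner.
type Solver = (expr: string) => number;

// Small deterministic PRNG (mulberry32) so a seed always yields the same task.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const applyOp = (op: string, x: number, y: number): number =>
  op === "+" ? x + y : op === "-" ? x - y : x * y;

// Generate one postfix expression ("a b op1 c op2") and its expected value.
function makeTask(seed: number): { expr: string; expected: number } {
  const rand = mulberry32(seed);
  const int = () => 1 + Math.floor(rand() * 9);
  const op = () => "+-*".charAt(Math.floor(rand() * 3));
  const a = int();
  const b = int();
  const c = int();
  const op1 = op();
  const op2 = op();
  return { expr: `${a} ${b} ${op1} ${c} ${op2}`, expected: applyOp(op2, applyOp(op1, a, b), c) };
}

// Known-good solution: a stack-based postfix evaluator.
const referenceSolver: Solver = (expr) => {
  const stack: number[] = [];
  for (const tok of expr.split(" ")) {
    if (tok === "+" || tok === "-" || tok === "*") {
      const y = stack.pop();
      const x = stack.pop();
      if (x === undefined || y === undefined) throw new Error("stack underflow");
      stack.push(applyOp(tok, x, y));
    } else {
      stack.push(Number(tok));
    }
  }
  const result = stack.pop();
  if (result === undefined || stack.length > 0) throw new Error("malformed expression");
  return result;
};

// Unimplemented solution: must never earn credit.
const stubSolver: Solver = () => {
  throw new Error("not implemented");
};

function passRate(solver: Solver, seeds: number[]): number {
  let passed = 0;
  for (const seed of seeds) {
    const task = makeTask(seed);
    try {
      if (solver(task.expr) === task.expected) passed++;
    } catch {
      // A crash counts as a failed task.
    }
  }
  return passed / seeds.length;
}

function calibrate(seedCount = 10): { reference: number; stub: number } {
  const seeds = Array.from({ length: seedCount }, (_, i) => i + 1);
  return { reference: passRate(referenceSolver, seeds), stub: passRate(stubSolver, seeds) };
}
```

In this sketch `calibrate()` returns `{ reference: 1, stub: 0 }`; a grader that gave the stub any credit, or the reference less than full credit, would not discriminate between prompts.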

The finding this PR exists to record (the bottleneck taxonomy)

Cost-aware ablation on real SWE-bench bugs: the driver-steered supervisor ties a single agent at ~10× cost, and so does more single-agent budget — because those bugs are capability-bottlenecked (8 independent shots also fail). Coordination adds search, not capability — it can only pay where the worker can solve the task but a single attempt misses the angle/edge-case (a search-bottlenecked regime). The hard eval is that regime; this PR ships the machinery + substrate to test it cheaply and certify winners on real tasks.
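
One way to read the taxonomy operationally is as a classification over independent attempts on the same task. The sketch below is an assumption, not code from the PR: `classifyRegime` and the `Regime` labels are hypothetical, and it treats the first entry as the single-agent attempt and the rest as extra independent shots.

```typescript
// Illustrative only: the function name and labels are hypothetical.
type Regime = "solved" | "search-bottlenecked" | "capability-bottlenecked";

// shots[0] is the single-agent attempt; the remaining entries are extra
// independent attempts on the same task (true = task resolved).
function classifyRegime(shots: boolean[]): Regime {
  if (shots.length === 0) throw new Error("need at least one attempt");
  if (shots[0] === true) return "solved";
  // The worker can solve it, but one attempt missed the angle: search can pay.
  if (shots.slice(1).some((resolved) => resolved)) return "search-bottlenecked";
  // Every independent shot fails: more coordination adds search, not capability.
  return "capability-bottlenecked";
}
```

With eight failed shots, as on the SWE-bench bugs described above, this returns "capability-bottlenecked". A single run per task is a noisy signal, so a real classification would want repeated samples.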

Verification

Typecheck (examples) clean; biome clean.
Adversarial review caught + fixed a blocker (cost not reported → integrity guard would abort the GEPA arm) + eval-integrity (dialect RNG aliasing) + train/serve mismatches (supervisor model + worker budget). All addressed.
All example/demo code; no changes to src/ (the keystone analyze knob it builds on is already on main via feat(supervise): propagated analyze knob — analyst feeds the driver (additive) #404).

…ifying proposer prompt

The seam discarded the driver's brief (v1: 'worker IGNORES the brief'), so driver-steer was just expensive best-of-N identical workers — no steering. Thread the brief into each attempt (appended to the surface task prompt) so a re-spawn can take a DIFFERENT, targeted angle. Rewrite the baseline driver prompt as a real PROPOSER: each worker a distinct hypothesis (direct fix → upstream cause → edge case → other module). This is the prerequisite for proposer-profile optimization to have any traction.
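
A minimal sketch of the brief-threading idea, under stated assumptions: `SurfaceTask` and `withBrief` are hypothetical names, and the only details taken from this PR are that the brief arrives as `unknown` and is appended to the task's system prompt.

```typescript
// Illustrative only: the interface and helper are hypothetical stand-ins.
interface SurfaceTask {
  systemPrompt: string;
}

// Append the driver's brief to the task prompt so a re-spawn can take a
// different angle; a missing or non-string brief leaves the task untouched.
function withBrief(task: SurfaceTask, brief: unknown): SurfaceTask {
  const text = typeof brief === "string" ? brief.trim() : "";
  if (text === "") return task;
  return {
    ...task,
    systemPrompt: `${task.systemPrompt}\n\nDriver brief for this attempt:\n${text}`,
  };
}
```

Returning the original task when no brief is present keeps the un-steered path behaving like the old identical-retry worker.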

… substrate

- gepa-driver-prompt: GEPA now optimizes the REAL driver/proposer prompt — each candidate runs a full selfImprovingSupervisor rollout, executable-graded by the supervised resolve (no LLM judge). FIX (adversarial review P1, would crash the arm): report the supervised spend to ctx.cost.observe/observeTokens so the backend-integrity guard sees a real backend, not a zero-cost stub.
- hard-coding-env: a mid-difficulty contamination-proof generated task (stack expression evaluator) as the CHEAP optimization substrate — reference→100%/stub→0% across 10 seeds. FIX (P2): per-field salt on the dialect RNG (base⟺rounding were 100% aliased → spurious shortcut).
- ablation: thread the supervisor router + matched worker budget into the optimize call (P3: train/serve regime must match).

Known caveats noted: error-credit floor (~33%, constant across candidates so the relative GEPA signal is unaffected) + syntax-error denominator (0 either way).

Adversarial review caught all of these pre-merge; verify phase: tsc+biome clean, eval calibrated.
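
The RNG aliasing fix (P2) can be illustrated with a sketch. This is an assumption about the general technique, not the PR's code: `hashField`, `seededRng`, `fieldRng` and `aliasedSeeds` are hypothetical names, and each field is modelled as a choice between two options. If two fields each draw from a fresh generator built from the same seed, they always pick the same index; mixing a per-field salt into the seed decorrelates them.

```typescript
// Illustrative only: all names and the two-option fields are hypothetical.

// FNV-1a (32-bit) hash of the field name, used as the per-field salt.
function hashField(name: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < name.length; i++) {
    h ^= name.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Deterministic PRNG (mulberry32 variant) returning floats in [0, 1).
function seededRng(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Salted variant: each field gets its own stream derived from the task seed.
const fieldRng = (seed: number, field: string): (() => number) =>
  seededRng((seed ^ hashField(field)) >>> 0);

const pickIndex = (rand: () => number, optionCount: number): number =>
  Math.floor(rand() * optionCount);

// Count the seeds on which "base" and "rounding" land on the same option index.
function aliasedSeeds(salted: boolean, seedCount = 100): number {
  let same = 0;
  for (let seed = 0; seed < seedCount; seed++) {
    const baseRand = salted ? fieldRng(seed, "base") : seededRng(seed);
    const roundingRand = salted ? fieldRng(seed, "rounding") : seededRng(seed);
    if (pickIndex(baseRand, 2) === pickIndex(roundingRand, 2)) same++;
  }
  return same;
}
```

Without the salt, `aliasedSeeds(false)` counts every seed, which is the kind of perfect correlation a worker could exploit as a shortcut. With the salt the two fields vary independently of each other.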

tangletools · 2026-06-28T17:30:30Z

✅ No Blockers — `9c0fc48d`

Review health 100/100 · Reviewer score 89/100 · Confidence 65/100 · 2 findings (2 low)

deepseek: Correctness 89 · Security 89 · Testing 89 · Architecture 89

Reviewer score is advisory once the run is complete and the verdict has no blockers.

Full multi-shot audit completed 1/1 planned shots over 4 changed files. Global verifier still owns final merge decision.

🟡 LOW GEPA supervisor loop budget formula doesn't scale with arm budget — examples/ablation-suite/gepa-driver-prompt.ts

Lines 124-127: the supervised run inside the GEPA agent caps maxIterations with a fixed formula, (worker.innerTurns ?? 6) * 3 + 16, independent of worker.budget. Meanwhile ablation.ts:217 scales deployment's maxIterations with arm.knobs.budget. The per-worker refine budget IS correctly threaded (worker.budget flows to runAgentic), but the driver's loop-length ceiling differs between GEPA candidate evaluation and deployment. With default budget=2 the difference is minor (34 vs 32); with budget=10 it's significant (34 vs 96). A prompt optimized for a 34-iteration regime may not be optimal in a 96-iteration regime. Intentional tradeof
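
The mismatch is easy to see numerically. In the sketch below, `gepaMaxIterations` restates the fixed formula quoted in the finding, while the deployment ceilings are copied from the finding's own figures rather than computed, since the deployment formula is not shown here. The function and constant names are hypothetical.

```typescript
// Illustrative only: names are hypothetical; the formula is the one quoted
// in the finding, and the deployment numbers are the finding's figures.
function gepaMaxIterations(innerTurns?: number): number {
  return (innerTurns ?? 6) * 3 + 16;
}

// Deployment ceilings as reported by the reviewer, keyed by worker budget.
const reportedDeploymentCeiling: Record<number, number> = { 2: 32, 10: 96 };

// Absolute gap between the GEPA evaluation ceiling and the deployment ceiling.
function iterationGap(budget: number): number {
  const deployed = reportedDeploymentCeiling[budget];
  if (deployed === undefined) throw new Error(`no reported ceiling for budget=${budget}`);
  return Math.abs(deployed - gepaMaxIterations());
}
```

With the default `innerTurns`, `gepaMaxIterations()` is 34, so `iterationGap(2)` is 2 and `iterationGap(10)` is 62: the optimization regime drifts further from the serving regime as the budget grows.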

🟡 LOW No tests for GEPA agent execution path or brief-threading — examples/ablation-suite/surface-worker.ts

The execute(brief: unknown) rewrite and the gepa-driver-prompt.ts agent function (lines 105-143 — real selfImprovingSupervisor called per candidate) have no unit or integration tests. The hard-coding-env.ts calibration self-check is thorough (10 seeds, 0 LLM cost), covering the eval surface. But the supervisor loop wiring (selfImprovingSupervisor → surfaceWorkerSeam → surfaceWorkerExecutor with brief threading) and the GEPA cost-reporting path (ctx.cost.observe/observeTokens) remain unvalidated against a live harness. Low severity because these are example files, not library code.

_{tangletools · 2026-06-28T17:30:27Z · trace}

tangletools

✅ Approved — 2 non-blocking findings — `9c0fc48d`

Full multi-shot audit completed 1/1 planned shots over 4 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary

_{tangletools · 2026-06-28T17:30:27Z · immutable trace}

tangletools

🟡 Value Audit — sound-with-nits


| Field | Value |
| --- | --- |
| Verdict | sound-with-nits |
| Concerns | 4 (2 low, 2 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 339.8s (2 bridge agents) |
| Total | 339.8s |

💰 Value — sound-with-nits

Threads the driver's brief into workers, rewires GEPA to optimize the real supervisor prompt (not a proxy), adds train/serve-match + cost reporting, and ships a mid-difficulty headroom task — each a real correctness fix; minor scaffold duplication vs the existing self-improving-coder example.

What it does: Four targeted changes in examples/ablation-suite/: (1) surface-worker.ts:73-80 now accepts the driver's brief and appends it to the task's systemPrompt, so each spawn can take a different angle (previously every spawn was an identical refine retry — the file's own prior comment marked this as a v1 simplification). (2) ablation.ts:35-51 replaces the terse baseline driver prompt with a proposer-st
Goals it achieves: Make the driver-steered supervisor actually capable of beating a single agent at higher compute. The PR body's bottleneck taxonomy frames it: on capability-bottlenecked bugs the supervisor can only tie; to beat a single agent it needs a SEARCH-bottlenecked regime where diversified attempts pay. The four changes deliver the machinery for that: brief-threading + a proposer baseline make attempts act
Assessment: Coherent and in-grain. Each change fixes a real correctness gap rather than adding surface area: the v1 simplification is retired with the exact signature the Executor interface already supported (surface-worker.ts:73); the train/serve match prevents a classic skew bug; the cost.observe call is necessary (without it the integrity guard aborts on a stub-looking cell); and the proxy→real rewiring in
Better / existing approach: Checked for an existing host-pytest AgenticSurface helper to extend rather than clone (rg 'pytestPassed|export const.*AgenticSurface|hostPytest|pytestSurface' across src/ and examples/). None exists in src/ — the only two implementations of this exact pattern are examples/self-improving-coder/self-improving-coder.ts:95 and the new examples/ablation-suite/hard-coding-env.ts:300. The new file is rig
Model: opencode/zai-coding-plan/glm-5.2
Bridge attempts: 2
Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

🎯 Usefulness — sound-with-nits

Real proposer + GEPA-over-supervisor wiring that lands in the grain of the codebase; the new discriminating substrate ships calibrated but not yet swapped into the runnable ablation.

Integration: The three wiring changes all reach real consumers. (1) surface-worker.ts:73-80 — the driver's spawn brief reaches the worker: spawn_agent's task arg flows scope.spawn → runChild → executor.execute at src/runtime/supervise/scope.ts:585 and src/mcp/tools/coordination.ts:437, so reading it as brief and folding it into the surface task systemPrompt is correct (TS narrower-arity is fine). (2) ablat
Fit with existing patterns: Excellent. hard-coding-env.ts mirrors examples/self-improving-coder/self-improving-coder.ts in shape exactly (same AgenticSurface open/tools/call/score/close, same seed-derived tasks supplier, same reference/stub calibration self-check at lines 561-586 vs coder lines 251-257) — it extends the established pattern rather than inventing one. gepa-driver-prompt.ts now routes through selfImprovingSuper
Real-world viability: Holds up. Each task gets its own mkdtempSync tempdir (hard-coding-env.ts:327) cleaned in close (line 407-412) — the same crash-leak risk self-improving-coder already carries, not new. pytestPassed (line 300) has a 60s timeout and parses partial stdout on failure — realistic. The process-global workspaces Map cannot alias: every open mints a unique tempdir, so concurrent fanout workers grade inde
Model: opencode/zai-coding-plan/glm-5.2
Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/ablation-suite/hard-coding-env.ts

console.log('═══ CALIBRATION ($0) — task solvable + grader discriminates? ═══')

🟡 Cruft: magic number added examples/ablation-suite/hard-coding-env.ts

errStr: `E${(r(9000, 4) + 1000).toString(36).toUpperCase()}`,

💰 Value Audit

🟡 pytest + AgenticSurface scaffold duplicated between two example tasks [duplication]

hard-coding-env.ts:300-413 clones pytestPassed (17 lines, byte-identical), the workspaces Map pattern, and the open/tools/call/score/close AgenticSurface shape verbatim from examples/self-improving-coder/self-improving-coder.ts:95-208. The calibrate() shape is also near-identical. This is intentional (the file header says 'Mirrors ... exactly in shape') and acceptable for a second example, but if a third contamination-proof task is added the scaffold should be lifted into a small shared helper u

🎯 Usefulness Audit

🟡 Discriminating substrate ships calibrated but not wired into the runnable ablation [integration]

hard-coding-env.ts exports hardCodingEnv/hardCodingTasks but no consumer imports them (grep across repo: only self-references). ablation.ts:28,338-339 still wires codingEnv/codingTasks — the saturated task the PR body argues is the whole reason for the new file. So pnpm tsx examples/ablation-suite/ablation.ts runs the new proposer + GEPA machinery against the substrate that cannot show lift, while the substrate that CAN show lift only runs under CALIBRATE=1. The PR's value prop (driver-steer/o

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
| --- | --- |
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

_{value-audit · 20260628T173124Z}

drewstone added 2 commits June 28, 2026 09:21

tangletools approved these changes Jun 28, 2026

tangletools reviewed Jun 28, 2026

drewstone merged commit 2aa6ec0 into main Jun 28, 2026
1 check passed

drewstone merged 2 commits into main from feat/proposer-optimization
