
feat(supervisor): proposer-profile optimization + discriminating eval (bottleneck taxonomy) #405

Merged
drewstone merged 2 commits into main from feat/proposer-optimization
Jun 28, 2026


@drewstone

Contributor

What

The proposer-profile optimization path for the self-improving supervisor — so the driver can actually beat a single agent given more compute, not just match it.

  • surface-worker.ts — thread the driver's brief into each worker (it was discarded: every spawn was an identical refine attempt). Steering now reaches the worker.
  • ablation.ts — the baseline driver prompt is now a real proposer: each worker a distinct, targeted hypothesis (direct fix → upstream cause → edge case → other module), not "try again".
  • gepa-driver-prompt.ts — GEPA optimizes the real driver/proposer prompt: each candidate runs a full selfImprovingSupervisor rollout, executable-graded by the supervised resolve (no LLM judge). Reports spend to the campaign cost meter so the backend-integrity guard sees a real backend.
  • hard-coding-env.ts — a mid-difficulty contamination-proof generated task (stack expression evaluator + edge cases) as the cheap optimization substrate (the original generated task is saturated; SWE-bench is too expensive to search prompt-space on). Calibrated reference→100% / stub→0% across 10 seeds.
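The brief-threading change in the first bullet can be sketched roughly as follows. This is an illustration only, not the code in surface-worker.ts: `SurfaceTask`, `briefToText`, `withBrief`, and the "Driver brief:" header are hypothetical names. Only the idea comes from the description above: append the driver's brief to the worker's task prompt, and leave the prompt untouched when no usable brief arrives.

```typescript
// Hypothetical task shape; the real type in surface-worker.ts may differ.
interface SurfaceTask {
  id: string;
  systemPrompt: string;
}

// The executor receives the brief as `unknown`, so narrow it before use.
function briefToText(brief: unknown): string | undefined {
  if (typeof brief === "string") {
    const trimmed = brief.trim();
    return trimmed.length > 0 ? trimmed : undefined;
  }
  if (brief !== null && typeof brief === "object") {
    // Structured briefs are serialized so the worker still sees them.
    return JSON.stringify(brief);
  }
  return undefined;
}

// Returns a new task whose prompt carries the brief; never mutates the input.
function withBrief(task: SurfaceTask, brief: unknown): SurfaceTask {
  const text = briefToText(brief);
  if (text === undefined) return task; // no brief: identical to the old behavior
  return {
    ...task,
    systemPrompt: `${task.systemPrompt}\n\nDriver brief:\n${text}`,
  };
}
```

With a helper like this, two spawns of the same task given different briefs produce different prompts, which is the precondition for the "distinct hypothesis per worker" baseline in ablation.ts.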

The finding this PR exists to record (the bottleneck taxonomy)

Cost-aware ablation on real SWE-bench bugs: the driver-steered supervisor ties a single agent at ~10× cost, and so does more single-agent budget — because those bugs are capability-bottlenecked (8 independent shots also fail). Coordination adds search, not capability — it can only pay where the worker can solve the task but a single attempt misses the angle/edge-case (a search-bottlenecked regime). The hard eval is that regime; this PR ships the machinery + substrate to test it cheaply and certify winners on real tasks.
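A toy model makes the taxonomy concrete. Assume, as a simplification that is not part of the PR, that each attempt succeeds independently with the same probability p. Then k attempts succeed at least once with probability 1 - (1 - p)^k. The function name `successAtK` is made up for this sketch.

```typescript
// Probability that at least one of k independent attempts succeeds,
// given a per-attempt success probability p. Toy model only.
function successAtK(p: number, k: number): number {
  if (p < 0 || p > 1) throw new RangeError("p must be in [0, 1]");
  if (!Number.isInteger(k) || k < 0) throw new RangeError("k must be a non-negative integer");
  return 1 - Math.pow(1 - p, k);
}

// Capability-bottlenecked: the worker cannot solve the task at all,
// so extra shots (or extra coordination) buy nothing.
const capabilityBound = successAtK(0, 8);

// Search-bottlenecked: each shot has a modest chance, so more
// attempts raise the overall success rate substantially.
const searchBound = successAtK(0.2, 8);
```

Under this model, p = 0 stays at 0 for any k, which matches the observation that 8 independent shots also fail on the SWE-bench bugs. Driver-steered attempts are not independent draws, so the model only shows where search can pay at all, not how much steering adds beyond plain resampling.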

Verification

  • Typecheck (examples) clean; biome clean.
  • Adversarial review caught + fixed a blocker (cost not reported → integrity guard would abort the GEPA arm) + eval-integrity (dialect RNG aliasing) + train/serve mismatches (supervisor model + worker budget). All addressed.
  • All example/demo code; no changes to src/ (the keystone analyze knob it builds on is already on main via feat(supervise): propagated analyze knob — analyst feeds the driver (additive) #404).
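The cost-reporting blocker mentioned above could look something like the sketch below. The names `ctx.cost.observe` and `observeTokens` appear in the commit message, but their signatures do not, so the `CostMeter` interface here is an assumption, as are `RolloutResult`, `reportAndScore`, and `RecordingMeter`. The point it illustrates is only the ordering: report the supervised spend for every candidate, resolved or not, before returning the score.

```typescript
// ASSUMED signatures; the real campaign cost meter may differ.
interface CostMeter {
  observe(usd: number): void;
  observeTokens(input: number, output: number): void;
}

// Hypothetical summary of one supervised rollout.
interface RolloutResult {
  resolved: boolean;
  spendUsd: number;
  tokensIn: number;
  tokensOut: number;
}

// Report spend unconditionally, then grade on the executable outcome.
function reportAndScore(result: RolloutResult, cost: CostMeter): number {
  cost.observe(result.spendUsd);
  cost.observeTokens(result.tokensIn, result.tokensOut);
  return result.resolved ? 1 : 0;
}

// In-memory meter for local checks; a stand-in, not the real guard.
class RecordingMeter implements CostMeter {
  usd = 0;
  tokens = 0;
  observe(usd: number): void {
    this.usd += usd;
  }
  observeTokens(input: number, output: number): void {
    this.tokens += input + output;
  }
}
```

Reporting before scoring means a failed candidate still shows non-zero spend, so a guard that rejects zero-cost cells would not mistake the GEPA arm for a stub backend.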

…ifying proposer prompt

The seam discarded the driver's brief (v1: 'worker IGNORES the brief'), so driver-steer was just
expensive best-of-N identical workers — no steering. Thread the brief into each attempt (appended to
the surface task prompt) so a re-spawn can take a DIFFERENT, targeted angle. Rewrite the baseline driver
prompt as a real PROPOSER: each worker a distinct hypothesis (direct fix → upstream cause → edge case →
other module). This is the prerequisite for proposer-profile optimization to have any traction.
… substrate

- gepa-driver-prompt: GEPA now optimizes the REAL driver/proposer prompt — each candidate runs a full
  selfImprovingSupervisor rollout, executable-graded by the supervised resolve (no LLM judge). FIX
  (adversarial review P1, would crash the arm): report the supervised spend to ctx.cost.observe/
  observeTokens so the backend-integrity guard sees a real backend, not a zero-cost stub.
- hard-coding-env: a mid-difficulty contamination-proof generated task (stack expression evaluator) as
  the CHEAP optimization substrate — reference→100%/stub→0% across 10 seeds. FIX (P2): per-field salt
  on the dialect RNG (base⟺rounding were 100% aliased → spurious shortcut).
- ablation: thread the supervisor router + matched worker budget into the optimize call (P3: train/serve
  regime must match). Known caveats noted: error-credit floor (~33%, constant across candidates so the
  relative GEPA signal is unaffected) + syntax-error denominator (0 either way).
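The per-field salt fix (P2) can be illustrated with a small sketch. The generator in hard-coding-env.ts is not shown in this PR text, so everything here is a stand-in: FNV-1a and mulberry32 are swapped in as a generic hash and PRNG, and `pick`, `makeDialect`, and the option values are hypothetical. Only the field names `base` and `rounding` and the idea of salting the RNG per field come from the commit message.

```typescript
// FNV-1a, 32-bit: hashes "seed:field" into a per-field RNG seed.
function fnv1a(text: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// mulberry32: a small deterministic PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Each field gets its own stream, salted by the field name, so one
// field's value cannot be read off another's.
function pick<T>(seed: number, field: string, options: readonly T[]): T {
  const rng = mulberry32(fnv1a(`${seed}:${field}`));
  return options[Math.floor(rng() * options.length)] as T;
}

// Hypothetical dialect with two independently drawn fields.
function makeDialect(seed: number) {
  return {
    base: pick(seed, "base", [10, 16] as const),
    rounding: pick(seed, "rounding", ["floor", "round"] as const),
  };
}
```

A cheap calibration check for this kind of bug is to enumerate seeds and confirm that every combination of field values occurs. When two fields are fully aliased, only some combinations ever appear, and a solver can infer one field from the other.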

Adversarial review caught all of these pre-merge; verify phase: tsc+biome clean, eval calibrated.
@tangletools

Contributor

✅ No Blockers — 9c0fc48d

Review health 100/100 · Reviewer score 89/100 · Confidence 65/100 · 2 findings (2 low)

deepseek: Correctness 89 · Security 89 · Testing 89 · Architecture 89

Reviewer score is advisory once the run is complete and the verdict has no blockers.

Full multi-shot audit completed 1/1 planned shots over 4 changed files. Global verifier still owns final merge decision.

🟡 LOW GEPA supervisor loop budget formula doesn't scale with arm budget — examples/ablation-suite/gepa-driver-prompt.ts

Lines 124-127: the supervised run inside the GEPA agent sets maxIterations with a fixed formula, (worker.innerTurns ?? 6) * 3 + 16, independent of worker.budget. Meanwhile ablation.ts:217 scales deployment's maxIterations with arm.knobs.budget. The per-worker refine budget IS correctly threaded (worker.budget flows to runAgentic), but the driver's loop-length ceiling differs between GEPA candidate evaluation and deployment. With default budget=2 the difference is minor (34 vs 32); with budget=10 it's significant (34 vs 96). A prompt optimized for a 34-iteration regime may not be optimal in a 96-iteration regime. Intentional tradeof
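The arithmetic behind this finding can be checked with a short sketch. The GEPA-side formula is quoted from the finding; the deployment-side formula in ablation.ts is not shown, so the sketch only tabulates the two deployment ceilings the finding reports (32 at budget=2, 96 at budget=10) instead of guessing at it. `gepaMaxIterations`, `reportedDeploymentCeiling`, and `ceilingGap` are names invented here.

```typescript
// Formula quoted in the finding: fixed, independent of worker.budget.
function gepaMaxIterations(innerTurns?: number): number {
  return (innerTurns ?? 6) * 3 + 16;
}

// Deployment ceilings as reported in the finding, keyed by budget.
// The underlying formula in ablation.ts is deliberately not reproduced.
const reportedDeploymentCeiling = new Map<number, number>([
  [2, 32],
  [10, 96],
]);

// Gap between the deployed ceiling and the one GEPA optimized under;
// undefined when the finding gives no figure for that budget.
function ceilingGap(budget: number, innerTurns?: number): number | undefined {
  const deployed = reportedDeploymentCeiling.get(budget);
  return deployed === undefined ? undefined : deployed - gepaMaxIterations(innerTurns);
}
```

With the default `innerTurns`, the gap is small at budget=2 and large at budget=10, which is the train/serve difference the finding flags.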

🟡 LOW No tests for GEPA agent execution path or brief-threading — examples/ablation-suite/surface-worker.ts

The execute(brief: unknown) rewrite and the gepa-driver-prompt.ts agent function (lines 105-143 — real selfImprovingSupervisor called per candidate) have no unit or integration tests. The hard-coding-env.ts calibration self-check is thorough (10 seeds, 0 LLM cost), covering the eval surface. But the supervisor loop wiring (selfImprovingSupervisor → surfaceWorkerSeam → surfaceWorkerExecutor with brief threading) and the GEPA cost-reporting path (ctx.cost.observe/observeTokens) remain unvalidated against a live harness. Low severity because these are example files, not library code.


tangletools · 2026-06-28T17:30:27Z · trace

@tangletools tangletools left a comment

Contributor

✅ Approved — 2 non-blocking findings — 9c0fc48d

Full multi-shot audit completed 1/1 planned shots over 4 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-28T17:30:27Z · immutable trace

@tangletools tangletools left a comment

Contributor

🟡 Value Audit — sound-with-nits

Verdict: sound-with-nits
Concerns: 4 (2 low, 2 weak-concern)
Heuristic: 0.0s
Duplication: 0.0s
Interrogation: 339.8s (2 bridge agents)
Total: 339.8s

💰 Value — sound-with-nits

Threads the driver's brief into workers, rewires GEPA to optimize the real supervisor prompt (not a proxy), adds train/serve-match + cost reporting, and ships a mid-difficulty headroom task — each a real correctness fix; minor scaffold duplication vs the existing self-improving-coder example.

  • What it does: Four targeted changes in examples/ablation-suite/: (1) surface-worker.ts:73-80 now accepts the driver's brief and appends it to the task's systemPrompt, so each spawn can take a different angle (previously every spawn was an identical refine retry — the file's own prior comment marked this as a v1 simplification). (2) ablation.ts:35-51 replaces the terse baseline driver prompt with a proposer-st
  • Goals it achieves: Make the driver-steered supervisor actually capable of beating a single agent at higher compute. The PR body's bottleneck taxonomy frames it: on capability-bottlenecked bugs the supervisor can only tie; to beat a single agent it needs a SEARCH-bottlenecked regime where diversified attempts pay. The four changes deliver the machinery for that: brief-threading + a proposer baseline make attempts act
  • Assessment: Coherent and in-grain. Each change fixes a real correctness gap rather than adding surface area: the v1 simplification is retired with the exact signature the Executor interface already supported (surface-worker.ts:73); the train/serve match prevents a classic skew bug; the cost.observe call is necessary (without it the integrity guard aborts on a stub-looking cell); and the proxy→real rewiring in
  • Better / existing approach: Checked for an existing host-pytest AgenticSurface helper to extend rather than clone (rg 'pytestPassed|export const.*AgenticSurface|hostPytest|pytestSurface' across src/ and examples/). None exists in src/ — the only two implementations of this exact pattern are examples/self-improving-coder/self-improving-coder.ts:95 and the new examples/ablation-suite/hard-coding-env.ts:300. The new file is rig
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 2
  • Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

🎯 Usefulness — sound-with-nits

Real proposer + GEPA-over-supervisor wiring that lands in the grain of the codebase; the new discriminating substrate ships calibrated but not yet swapped into the runnable ablation.

  • Integration: The three wiring changes all reach real consumers. (1) surface-worker.ts:73-80 — the driver's spawn brief reaches the worker: spawn_agent's task arg flows scope.spawn → runChild → executor.execute at src/runtime/supervise/scope.ts:585 and src/mcp/tools/coordination.ts:437, so reading it as brief and folding it into the surface task systemPrompt is correct (TS narrower-arity is fine). (2) ablat
  • Fit with existing patterns: Excellent. hard-coding-env.ts mirrors examples/self-improving-coder/self-improving-coder.ts in shape exactly (same AgenticSurface open/tools/call/score/close, same seed-derived tasks supplier, same reference/stub calibration self-check at lines 561-586 vs coder lines 251-257) — it extends the established pattern rather than inventing one. gepa-driver-prompt.ts now routes through selfImprovingSuper
  • Real-world viability: Holds up. Each task gets its own mkdtempSync tempdir (hard-coding-env.ts:327) cleaned in close (line 407-412) — the same crash-leak risk self-improving-coder already carries, not new. pytestPassed (line 300) has a 60s timeout and parses partial stdout on failure — realistic. The process-global workspaces Map cannot alias: every open mints a unique tempdir, so concurrent fanout workers grade inde
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/ablation-suite/hard-coding-env.ts

  • console.log('═══ CALIBRATION ($0) — task solvable + grader discriminates? ═══')

🟡 Cruft: magic number added examples/ablation-suite/hard-coding-env.ts

  • errStr: `E${(r(9000, 4) + 1000).toString(36).toUpperCase()}`,

💰 Value Audit

🟡 pytest + AgenticSurface scaffold duplicated between two example tasks [duplication]

hard-coding-env.ts:300-413 clones pytestPassed (17 lines, byte-identical), the workspaces Map pattern, and the open/tools/call/score/close AgenticSurface shape verbatim from examples/self-improving-coder/self-improving-coder.ts:95-208. The calibrate() shape is also near-identical. This is intentional (the file header says 'Mirrors ... exactly in shape') and acceptable for a second example, but if a third contamination-proof task is added the scaffold should be lifted into a small shared helper u

🎯 Usefulness Audit

🟡 Discriminating substrate ships calibrated but not wired into the runnable ablation [integration]

hard-coding-env.ts exports hardCodingEnv/hardCodingTasks but no consumer imports them (grep across repo: only self-references). ablation.ts:28,338-339 still wires codingEnv/codingTasks — the saturated task the PR body argues is the whole reason for the new file. So pnpm tsx examples/ablation-suite/ablation.ts runs the new proposer + GEPA machinery against the substrate that cannot show lift, while the substrate that CAN show lift only runs under CALIBRATE=1. The PR's value prop (driver-steer/o


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
| --- | --- |
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260628T173124Z

@drewstone drewstone merged commit 2aa6ec0 into main Jun 28, 2026
1 check passed

2 participants