
feat(examples): self-improving-coder — the RSI spine, composed cleanly, contamination-proof #402

Merged: drewstone merged 6 commits into main from examples/self-improving-coder on Jun 28, 2026.
@drewstone drewstone commented Jun 28, 2026

Contributor

What — the self-improvement spine, composed cleanly, two ways

1. examples/self-improving-coder/ — the substrate's RSI loop with NOTHING hand-rolled: an AgentProfile worker over an AgenticSurface, gated by runStrategyEvolution (authors strategies from TRAIN losses → ONE promotion decision on a FRESH holdout via promotionGate). Adaptive data analysis is structurally impossible (disjoint offsets, holdout read once). Bundled task is a contamination-proof generated library (constants per-seed → no memorization), graded by real pytest. $0 calibration gate (reference→100%, stub→0%). The toy task is deliberately simple — a capable model aces it, so the gate correctly returns no-promotion (you can't show self-improvement without headroom; the harness refuses to fake it).

2. bench/src/swe-bench-env.ts + bench/swe-self-improve.mts — the proper, no-cheating frontier run: SWE-bench Verified as an AgenticSurface. The agent clones the repo at base_commit, explores + makes surgical edits via tools (source only — test files path-jailed), and score() grades the git diff with the official swebench Docker harness. The substrate drives the agentic loop (no hand-rolled tool-loop, no seeing the hidden tests or gold patch). runStrategyEvolution enforces a disjoint train→freeze→holdout split — adaptive reuse impossible. Frontier worker (gemini-2.5-pro). CALIBRATE mode runs the baseline on a few bugs first (cost gate).
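
The disjoint train→freeze→holdout split that both items rely on can be pictured with a small sketch. This is only an illustration under stated assumptions: `Slice`, `planSlices`, and `overlaps` are hypothetical names, not the substrate's API, and the real task supplier behind runStrategyEvolution may be shaped differently.

```typescript
// Hypothetical sketch: train and holdout are carved from non-overlapping offset ranges.
interface Slice {
  offset: number; // index of the first task in the dataset
  n: number; // number of tasks in the slice
}

// Place the holdout slice immediately after the train slice so the ranges cannot overlap.
function planSlices(
  trainN: number,
  holdoutN: number,
  start = 0,
): { train: Slice; holdout: Slice } {
  if (trainN <= 0 || holdoutN <= 0) throw new Error("slice sizes must be positive");
  const train: Slice = { offset: start, n: trainN };
  const holdout: Slice = { offset: start + trainN, n: holdoutN };
  return { train, holdout };
}

// True when two half-open ranges [offset, offset + n) share at least one index.
function overlaps(a: Slice, b: Slice): boolean {
  return a.offset < b.offset + b.n && b.offset < a.offset + a.n;
}

const plan = planSlices(6, 4);
console.log(plan, "overlap:", overlaps(plan.train, plan.holdout));
```

Keying slices by offset keeps the guarantee structural: a strategy authored on the train range cannot have seen a holdout task, whatever the strategy does.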

The discipline (why this is trustworthy)

  • No reinvention — agent, loop, gate are all substrate primitives (the build-with-agent-runtime do-NOT-reinvent list). Only the Environments are new.
  • No cheating — the held-out gate is enforced by the substrate, not by me; the harness caught a saturated task and refused to report a win.
  • No silent contamination — the SWE-bench arena's caveat (public fixes may be memorized) is documented in code; never claim a clean frontier number from it alone.

Verified: swebench gold-patch grading resolves a real astropy bug in Docker (97s); the harness runs end-to-end and the gate fires correctly. SWE-bench Environment calibration in progress.


UPDATE — the proper frontier run is PROVEN to work (calibration, gemini-2.5-pro)

gemini-2.5-pro over the SWE-bench Environment, baseline refine, on real Verified bugs:
astropy-13236 RESOLVED (17 tool-calls, 7 min of real exploration + surgical edits, confirmed by the official swebench Docker harness) — astropy-12907/-13033 unresolved (the agent gave up early at 1-2 turns). Baseline = 1/3 = 33% — the correctable middle band (real headroom, no cheating: it never saw the hidden tests or gold). The full runStrategyEvolution run (train→freeze→holdout) is firing — can a learned strategy make the flaky agent persist and resolve MORE?

…y, on a contamination-proof task

The pristine self-improvement loop with NOTHING hand-rolled: an AgentProfile-shaped worker over an
AgenticSurface (the task, real tools), gated by runStrategyEvolution — which authors strategies from
TRAIN losses then makes ONE promotion decision on a FRESH holdout slice via promotionGate. Adaptive
data analysis is structurally impossible (disjoint task offsets, holdout read once). The only new code
is the Environment: a contamination-proof generated coding task (constants derived per-seed, so no
model could have memorized it), graded by real pytest. $0 calibration self-check (reference->100%,
stub->0%) gates spend. The bundled task is deliberately simple — a capable model aces it, so the gate
correctly returns no-promotion; swap a harder Environment (or SWE-bench) for a discriminating run.
…-cheating frontier run

createSweBenchEnvironment: the agent clones the repo at base_commit, explores + makes SURGICAL
edits via tools (edit_file, source-only, test files path-jailed), and score() grades the git diff
with the official swebench Docker harness. The substrate drives the agentic loop (runAgentic /
runStrategyEvolution) — no hand-rolled tool-loop. Never sees the hidden tests or the gold patch.

swe-self-improve.mts wires it into runStrategyEvolution with a disjoint train/holdout split (the
substrate enforces freeze + one holdout decision — no adaptive reuse). CALIBRATE mode runs the
baseline on a few bugs first (cost gate). CONTAMINATION CAVEAT documented: public fixes may be
memorized; report it, never claim a clean frontier number from this arena alone.
…sist-and-edit prompt

Calibration showed gemini-2.5-pro returning empty (no tool calls) without a maxTokens cap, then stopping
after ~3 turns without editing. Set worker maxTokens=8000 and a prompt that forces broad exploration +
at least one edit_file attempt. Log completions/shots in CALIBRATE mode for headroom diagnosis.

@tangletools tangletools left a comment

Contributor


🟡 Value Audit — sound-with-nits

Verdict sound-with-nits
Concerns 5 (1 medium-concern, 2 low, 2 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 174.2s (2 bridge agents)
Total 174.2s

💰 Value — sound-with-nits

Adds a tool-editing SWE-bench AgenticSurface + an evolution runner (sound, substrate-composed) plus a third self-improvement example whose spine duplicates the canonical one and whose task is by-design saturated.

  • What it does: Three files. (1) bench/src/swe-bench-env.ts: wraps the existing createSweBenchAdapter judge behind a NEW AgenticSurface where the agent clones the repo at base_commit, explores/edits SOURCE via list_files/read_file/edit_file (test files path-jailed), and score() runs git diff through the official swebench Docker harness; a disjoint-slice task supplier keys tasks by dataset offset so train/holdou
  • Goals it achieves: (a) Give SWE-bench Verified a no-cheating agentic surface — the agent makes surgical edits like a real engineer and is graded by the official harness, never seeing the gold patch or hidden tests. (b) Provide the frontier self-improvement run (evolve strategies on real bugs, promote on a frozen holdout). (c) Demonstrate the substrate's RSI spine on a coding-flavored task that is structurally imposs
  • Assessment: Good change on the bench side. The SWE-bench env does NOT duplicate the existing swe-bench adapter: the existing path emits a diff as text (bench/src/benchmarks/swe-bench.ts:34 swePatchOutput); the new path edits files via tools and scores git diff — a materially different agentic surface, and there is no existing swe-bench evolution runner to extend. CALIBRATE is a sensible cost gate and the ru
  • Better / existing approach: Looked at examples/strategy-evolution/, examples/strategy-suite/, examples/self-improving-loop/, bench/src/commit0-env.ts, bench/src/commit0-env-run.mts, and bench/src/benchmarks/swe-bench.ts. For the bench files: no better approach — the tool-editing SWE-bench surface is net-new and the runner is the right primitive. For the example: a materially cleaner home exists. examples/strategy-suite/count
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 2
  • Bridge warning: opencode/kimi-for-coding/k2p7: opencode: opencode error

🎯 Usefulness — sound-with-nits

The SWE-bench-as-edit-via-tools AgenticSurface is a genuinely new, well-fitting capability; the companion example is a 4th self-improvement composition that overlaps the existing strategy-evolution/ example and isn't wired into the README catalog.

  • Integration: Both new files are reachable by direct tsx invocation, matching the established bench-runner pattern (createCommit0Environment/createSweBenchEnvironment are consumed by their sibling *-run.mts, not exported from bench/src/index.ts — see commit0-env.ts:80 vs swe-bench-env.ts:36). Every API surface checks out against the source: AgenticSurface (strategy.ts:76-83), runAgentic returning {resolved,comp
  • Fit with existing patterns: The AgenticSurface env is exactly the codebase grain — bench/src/commit0-env.ts is the template (clone at base_commit → path-jailed file tools → run the real test suite in score()), and swe-bench-env.ts mirrors it faithfully including the edit-source-not-tests jail. The runner (swe-self-improve.mts) follows commit0-env-run.mts's shape. The example, however, is the repo's 4th self-improvement compo
  • Real-world viability: Adequate beyond the happy path. swe-bench-env uses partial clone (--filter=blob:none --no-checkout) with generous timeouts for large repos; per-task tmpdirs key the module-global workspaces Map so concurrent tasks don't collide; score() degrades gracefully (empty diff → 0/1, judge throw → errored). The example's pytest runs on host (documented as a Docker swap for untrusted code) which is fine for
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added bench/swe-self-improve.mts

  • console.log(`═══ SWE-bench CALIBRATION — ${workerModel}, baseline=refine, ${n} real bugs ═══`)

🟡 Cruft: magic number added bench/swe-self-improve.mts

  •  console.log(`  ${t.id.padEnd(32)} resolved=${r.resolved} completions=${r.completions} shots=${r.shots} (${Math.round((Date.now() - t0) / 1000)}s)`)
    

💰 Value Audit

🟠 self-improving-coder example re-teaches the runStrategyEvolution spine already in examples/strategy-evolution/ [duplication]

examples/self-improving-coder/self-improving-coder.ts:118-286 wires runStrategyEvolution + a disjoint (offset,n) task supplier + the holdout gate — structurally the same code as examples/strategy-evolution/strategy-evolution.ts:58-88 (compare the two main()s side by side). The example set already has TWO self-improvement demos (strategy-evolution, self-improving-loop). The PR body concedes the bundled task is deliberately saturated so the gate returns no-promotion, so as a self-improvement demon

🟡 swe-bench-env.ts duplicates the git-repo-editing AgenticSurface shell from commit0-env.ts [duplication]

bench/src/swe-bench-env.ts and bench/src/commit0-env.ts share the same skeleton: a module-level workspaces: Map<string, Ws>, open() that mkdtempSyncs + git clones + checks out base_commit (swe-bench-env.ts:48-57 vs commit0-env.ts:88-90), a tools() returning list_files/read_file/edit-or-write_file, a call() with an inline safe()/jail() path normalizer rejecting '..' and absolute paths, and a close() with rmSync. The list_files bounded-recursive walker (swe-bench-env.ts:74-92) is also hand-r

🎯 Usefulness Audit

🟡 self-improving-coder example overlaps strategy-evolution/ and is absent from the README learning path [problem-fit]

examples/README.md Tier 5 already catalogs three self-improvement examples (#16 strategy-evolution, #17 improve, #18 self-improving-loop). The new examples/self-improving-coder/self-improving-coder.ts is a near-structural twin of #16 (same runStrategyEvolution config shape, same disjoint-slice tasks() supplier at :172, same [sample, refine] baselines, same saturated-toy-returns-no-promotion outcome the README itself describes at strategy-evolution/README.md:32). Its genuine contributions — the c


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
| --- | --- |
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260628T023004Z

@tangletools

Contributor

✅ No Blockers — ed844764

Readiness 48/100 · Confidence 75/100 · 22 findings (2 medium, 20 low)

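| Metric | glm | deepseek | aggregate |
| --- | --- | --- | --- |
| Readiness | 62 | 48 | 48 |
| Confidence | 75 | 75 | 75 |
| Correctness | 62 | 48 | 48 |
| Security | 62 | 48 | 48 |
| Testing | 62 | 48 | 48 |
| Architecture | 62 | 48 | 48 |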

Full multi-shot audit completed 3/3 planned shots over 3 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Error handler drops stack trace, inconsistent with other bench runners — bench/swe-self-improve.mts

Line 82: console.error(e) prints only e.toString(), discarding the stack trace. Every other bench runner in this repo prints the full stack: commit0-env-run.mts:59 uses e.stack ?? e.message, trata-gate.mts:241 uses err.stack ?? err.message. This makes CI/debugging failures significantly harder — a TypeError inside runStrategyEvolution would show only 'TypeError: ...' with no line numbers. Fix: main().catch((e) => { console.error(e instanceof Error ? (e.stack ?? e.message) : String(e)); process.exit(1); })
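
The suggested handler can be factored into a small helper so every runner formats failures the same way. A minimal sketch, assuming a hypothetical helper name `formatError` (it is not something this PR or the repo defines):

```typescript
// Render an unknown thrown value, preferring the stack trace when one exists.
function formatError(e: unknown): string {
  if (e instanceof Error) return e.stack ?? e.message;
  return String(e); // non-Error throws (strings, numbers, plain objects)
}

// Example: the first line of the rendered error names the type and message;
// the remaining lines carry the stack frames.
const rendered = formatError(new TypeError("example failure"));
console.log(rendered.split("\n")[0]);
```

In a runner this would slot in as `main().catch((e) => { console.error(formatError(e)); process.exit(1) })`, which is the shape the finding's fix proposes.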

🟠 MEDIUM read_file lacks path traversal protection that write_file has — examples/self-improving-coder/self-improving-coder.ts

Line 135: readFileSync(join(ws.dir, String(args.path ?? '')), 'utf8') has no validation. Line 142: write_file correctly guards with p.endsWith('lib.py') || p.includes('..') || p.startsWith('/'). The agent can read arbitrary files via read_file('../../etc/passwd'). In this no-Docker example, Python code executed via execFileSync already has filesystem access, so this is not an escalation — but the asymmetry is misleading for users who copy this pattern assuming both file ops a
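
One way to remove the asymmetry is to route every file tool through the same guard. The sketch below is an assumption about how that could look, not the example's actual code: `jail` is a hypothetical helper, and it checks lexical containment only (it does not resolve symlinks).

```typescript
import { resolve, sep } from "node:path";

// Resolve a requested path against the workspace root and refuse anything that escapes it.
// Covers "..", absolute paths, and mixed forms, because the check runs on the resolved path.
function jail(root: string, requested: string): string {
  const base = resolve(root);
  const target = resolve(base, requested);
  if (target !== base && !target.startsWith(base + sep)) {
    throw new Error(`path escapes workspace: ${requested}`);
  }
  return target;
}

// Both read_file and write_file would call jail() before touching the filesystem.
console.log(jail("/tmp/ws", "lib.py"));
```

Checking the resolved path rather than pattern-matching the raw string means inputs such as `a/../../etc/passwd` are rejected without listing each spelling.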

🟡 LOW No unit tests for the new env's integrity guards — bench/src/swe-bench-env.ts

No *.test.* file covers swe-bench-env.ts (grep for swe-bench-env/createSweBenchEnvironment across test specs: empty). The path-jail safe(), the isTestPath gate, the edit_file single-occurrence rule, and the empty-diff vs errored score() branches are pure-ish, cheat-prevention-flavored logic that a regression test would pin cheaply. Sibling commit0-env.ts also lacks unit tests, so this is repo-conventional, but the SWE env leans harder on these guards for its 'no cheating' claim, raising the value of a test. Fix: add a small test exercising safe()/isTestPath/score()-on-empty-diff (the functions would need to be exported or tested via the surface contract).

🟡 LOW REJECTED responses not counted as toolErrors in runShot — bench/src/swe-bench-env.ts

Line 114: edit_file returns REJECTED: editing test files is forbidden... when the path matches isTestPath(). The runShot function (src/runtime/strategy.ts:168) only increments toolErrors for responses starting with ERROR:, so REJECTED: responses are silently counted as successful tool calls. This understates the tool-error count in benchmark statistics. Impact: minor observability gap; the agent still receives clear rejection feedback, so decision-making is not affected.
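
A hedged sketch of how the counter could treat both prefixes as failures. The names `FAILURE_PREFIXES`, `isToolFailure`, and `countToolFailures` are hypothetical; the real runShot logic in src/runtime/strategy.ts is not reproduced here.

```typescript
// Prefixes a tool response may start with to signal that the call did not succeed.
const FAILURE_PREFIXES = ["ERROR:", "REJECTED:"] as const;

// True when the response begins with any known failure prefix.
function isToolFailure(response: string): boolean {
  return FAILURE_PREFIXES.some((prefix) => response.startsWith(prefix));
}

// Count failures across a shot's tool responses.
function countToolFailures(responses: readonly string[]): number {
  return responses.filter(isToolFailure).length;
}

const responses = [
  "ok: wrote 12 lines",
  "ERROR: file not found",
  "REJECTED: editing test files is forbidden",
];
console.log(countToolFailures(responses));
```

An alternative with the same effect is to have the environment return `ERROR: REJECTED ...`, which needs no change on the counting side.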

🟡 LOW Tmpdir leaks when git clone/checkout fails in open() — bench/src/swe-bench-env.ts

Lines 50-52: mkdtempSync runs BEFORE the git clone (420s timeout) and checkout (300s timeout). If either throws (network blip, unreachable base_commit, GitHub rate-limit, timeout), the created tmpdir is never removed — no try/catch cleanup around the post-mkdtemp work. The sibling commit0-env.ts:94-99 shows the intended pattern (clean up dir + container on setup failure). Impact: orphaned dirs under os.tmpdir() accumulate across flaky benchmark runs; not a correctness issue but a resource leak on the long-running multi-task loop. Fix: wrap the clone+checkout in try/catch and rmSync(dir,{recursive:true,force:true}) on rethrow.
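
The cleanup-on-rethrow fix might look like the following sketch. `openWorkspace` and its `setup` callback are hypothetical stand-ins for the env's open() and its clone + checkout steps; the prefix `swe-ws-` is likewise an assumption.

```typescript
import { existsSync, mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Create a per-task tmpdir, run setup inside it, and remove the dir if setup throws.
function openWorkspace(setup: (dir: string) => void): string {
  const dir = mkdtempSync(join(tmpdir(), "swe-ws-"));
  try {
    setup(dir); // in the real env: git clone, then checkout of base_commit
    return dir;
  } catch (err) {
    rmSync(dir, { recursive: true, force: true }); // no orphaned dir on failure
    throw err; // preserve the original error for the caller
  }
}

// Success path: the directory survives until the caller closes the workspace.
const ws = openWorkspace(() => {});
console.log("workspace exists:", existsSync(ws));
rmSync(ws, { recursive: true, force: true });
```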

🟡 LOW adapter.judge() in score() has no timeout — Docker harness can hang indefinitely — bench/src/swe-bench-env.ts

Lines 143-145: adapter.judge(ws.task, patch) is awaited without a timeout. The adapter delegates to runStagedJudge() (swe-bench.ts:122) which passes no timeoutMs, so execFileAsync (node:child_process) runs with timeout: undefined, i.e. no timeout. The SWE-bench Docker harness can pull images, build containers, and run test suites; any hang (Docker daemon stall, network timeout pulling images, test infinite loop) blocks the entire eval run. The git diff call at line 137 correctly has a 60s timeout, making this inconsistency mo
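
Until a timeoutMs is threaded through to the judge, a caller-side bound is one possible stopgap. This is a sketch under assumptions: `withTimeout` is a hypothetical helper, and it only stops the caller from waiting; it does not kill the underlying Docker process, so passing a real timeout to the child process remains the more complete fix.

```typescript
// Reject if `work` has not settled within `ms`; always clear the timer afterwards.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer); // avoid keeping the process alive
  });
}

// Usage sketch: a fast task resolves normally, well inside the bound.
withTimeout(Promise.resolve("resolved"), 1_000, "judge").then((v) => console.log(v));
```

In score(), the call would read roughly `await withTimeout(adapter.judge(ws.task, patch), limitMs, "judge")`, with a timeout mapped to the existing errored branch.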

🟡 LOW adapter.preflight() is never invoked — misconfigured env surfaces as silent all-errored run — bench/src/swe-bench-env.ts

createSweBenchEnvironment constructs the adapter (line 47) but never calls adapter.preflight(), and the sole caller bench/swe-self-improve.mts doesn't either. The adapter defines preflight() (benchmarks/swe-bench.ts:62-70) specifically to emit actionable guidance ('pip install swebench; ensure Docker is running'). Without it, a missing venv or stopped Docker daemon makes every score() call throw inside adapter.judge, caught at line ~140, returning {passes:0,total:1,errored:1} per task — the whole run silently scores zero with no 'install swebench / start Docker' message. Impact: confusing first-run experience; operator must infer infra absence from universal 'erro

🟡 LOW isTestPath anti-cheat is a heuristic; doc claims 'by construction' — bench/src/swe-bench-env.ts

Line 16 regex: /(^|/)(tests?)// + /test_..py$|test.py$|conftest.py$/ blocks the common Python conventions (test/, tests/, test.py, *_test.py, conftest.py). The header doc comment (lines 1-12) asserts 'No cheating by construction: ... edit_file refuses test files.' The regex misses edge layouts: a file literally named 'test.py' at root (no dir slash, no underscore), a 'testing/' dir, or non-Python test files. Scoring integrity still holds because the official SWE-bench harness re-applies test_patch on top of the model patch (go
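
For illustration, a broader heuristic could look like the sketch below. It is not the regex from swe-bench-env.ts (which is garbled on this page) and the directory and file conventions it covers are assumptions; it is still a heuristic, so the "by construction" wording would remain too strong even with it.

```typescript
// Hypothetical broader test-path heuristic for Python repos.
function isTestPath(path: string): boolean {
  const p = path.replace(/\\/g, "/").toLowerCase();
  // Any path segment named test/, tests/, or testing/.
  const inTestDir = /(^|\/)(tests?|testing)\//.test(p);
  // File names: test_*.py, *_test.py, test.py, conftest.py.
  const base = p.slice(p.lastIndexOf("/") + 1);
  const isTestFile = /^(test_.*|.*_test|test|conftest)\.py$/.test(base);
  return inTestDir || isTestFile;
}

for (const p of ["test.py", "astropy/table/tests/test_table.py", "testing/helpers.py", "astropy/table/table.py"]) {
  console.log(p, isTestPath(p));
}
```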

🟡 LOW list_files follows symlinks via statSync, bypassing workspace boundary — bench/src/swe-bench-env.ts

Line 90: statSync(p).isDirectory() follows symlinks. If a cloned repo contains a symlink to a directory outside the workspace (e.g., /etc), walk() would recursively traverse into it, exposing files beyond the workspace. The safe() path check at line 68-71 only guards against .. traversal and absolute paths, not symlink-following. Fix: use lstatSync(p).isDirectory() to avoid following symlinks, or add a realpath-bounded check. Impact: low — SWE-bench repos are trusted public repos with no known malicious symlinks.

🟡 LOW CALIBRATE env var uses truthy check instead of explicit comparison — bench/swe-self-improve.mts

Line 28: if (process.env.CALIBRATE) is truthy for any non-empty string including '0', 'false', 'no'. The header comment at line 7 documents CALIBRATE=1 as the intended activation value. Someone setting CALIBRATE=0 or CALIBRATE=false expecting to disable calibration would get a surprise. Fix: compare explicitly: process.env.CALIBRATE === '1'
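
A small parser makes the intent explicit. In this sketch `envFlag` is a hypothetical helper; it accepts a few common "on" spellings, which is slightly more permissive than the strict `=== '1'` comparison the finding proposes, and that choice is an assumption.

```typescript
// Hypothetical helper: treat only explicit "on" spellings as enabled.
function envFlag(value: string | undefined): boolean {
  if (value === undefined) return false;
  return ["1", "true", "yes", "on"].includes(value.trim().toLowerCase());
}

// "0", "false", and the empty string no longer enable the flag.
for (const v of [undefined, "", "0", "false", "1", "true"]) {
  console.log(JSON.stringify(v), "->", envFlag(v));
}
```

The runner would then gate on `envFlag(process.env.CALIBRATE)` instead of the bare truthiness check.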

🟡 LOW Champion scores omitted from final report output — bench/swe-self-improve.mts

Lines 72-73 print gen0 champion: and final champion: with only names, while the EvolutionReport carries the scores (gen0Champion.score, finalChampion.score on scale 0–1). The held-out lift is reported (line 79), but without the gen0 champion's holdout score the reader cannot assess the baseline vs final gap. The commit0-env-run.mts analog uses printBenchmarkReport which includes per-strategy scores. Adding the scores here would make the output self-contained for a reader who didn't run the calibration step. Fix: include `report.ge

🟡 LOW Script lives outside CI typecheck scope — bench/swe-self-improve.mts

bench/tsconfig.json include: ["src/**/*.ts", "src/**/*.mts"] excludes top-level bench/*.mts; root CI runs tsc --noEmit (src only) + tsc -p tsconfig.examples.json (examples only); bench/package.json has no typecheck script. This 79-line file is never statically checked — type drift (e.g., a renamed field on StrategyEvolutionConfig) would only surface at runtime via tsx, which skips typecheck. Manually verified all types match as of this commit, so not blocking, but adding bench/swe-self-improve.mts (and any sibling top-level scripts) to bench/tsconfig.json include would prevent silent breakage on future substrate renames.

🟡 LOW .swe-run-* missing from bench/.gitignore — bench/swe-self-improve.mts

mkdtempSync(join(process.cwd(), '.swe-run-')) writes into the bench/ cwd, but bench/.gitignore (11 lines) has no .swe-run-*/ entry (it ignores .tmp-e3-*/, run-artifacts/, etc.). Combined with finding 1, a failed run leaves untracked dirs that surface in git status and risk being accidentally committed. Fix: add .swe-run-*/ to bench/.gitignore.

🟡 LOW outDir cleanup skipped on throw — leaks .swe-run-* on failure — bench/swe-self-improve.mts

Lines 46-58: rmSync(outDir, { recursive: true, force: true }) runs ONLY on the success path after runStrategyEvolution returns. If the multi-hour runStrategyEvolution throws (e.g., holdout phase, author failure, OOM), control jumps to main().catch() which calls process.exit(1) — the rmSync never runs and the .swe-run-XXXX/ dir full of authored strategy .ts files persists on disk. Fix: move rmSync into a finally block around the runStrategyEvolution call, OR document that the leak-on-failure is intentional for post-mortem inspection. Not blocking — bench-only, single user.

🟡 LOW DUMP and CALIBRATE env checks use truthiness, not specific values — examples/self-improving-coder/self-improving-coder.ts

Line 227: if (process.env.CALIBRATE) is a truthiness check, so CALIBRATE=0 or CALIBRATE=false would still trigger calibration (non-empty strings are truthy). Line 261: same pattern for DUMP. These are minor for an example script but could surprise users. Fix: compare against a specific value, e.g. if (process.env.CALIBRATE === '1').

🟡 LOW Host pytest on model-written lib.py (no sandbox) — examples/self-improving-coder/self-improving-coder.ts

execFileSync('python3', [...], { cwd: dir }) at line 94 imports and runs the agent-authored lib.py directly on the host (line 145 writes whatever the model produced). The comment at line 84 acknowledges this (Docker is a swap for untrusted code) — appropriate for a deliberate example, but anyone copy-pasting the canonical exam

🟡 LOW Orphaned .sic-run-* directories on evolution failure — examples/self-improving-coder/self-improving-coder.ts

Line 239: outDir = mkdtempSync(join(process.cwd(), '.sic-run-')) — created in project root. Line 258: rmSync(outDir, ...) only runs if runStrategyEvolution succeeds. If it throws, the temp directory persists in the project tree with no OS cleanup (unlike /tmp). Since the process exits on error (line 284-285), the actual

🟡 LOW Number(process.env.X ?? default) silently accepts empty string — examples/self-improving-coder/self-improving-coder.ts

Lines 243-255 parse TRAIN_N/HOLDOUT_N/BUDGET/GENERATIONS/POP/INNER_TURNS via Number(process.env.X ?? default). If a user exports TRAIN_N= (empty), Number('') returns 0, not the default — silently producing a zero-task train slice. Fix: Number(process.env.X || default) or Number(process.env.X ?? default) || default. Trivial footgun, example-only.
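
A sketch of a stricter parse that distinguishes "unset or empty" from "explicitly 0". `envNumber` is a hypothetical helper name; unlike `Number(x) || default`, it keeps a deliberate `0` and fails loudly on non-numeric input.

```typescript
// Parse a numeric env value; fall back only when the variable is unset or blank.
function envNumber(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  if (!Number.isFinite(n)) throw new Error(`expected a number, got '${raw}'`);
  return n;
}

console.log(envNumber(undefined, 8), envNumber("", 8), envNumber("0", 8), envNumber("12", 8));
```

Usage would be along the lines of `envNumber(process.env.TRAIN_N, defaultTrainN)`, where `defaultTrainN` stands for whatever default the example already uses.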

🟡 LOW outDir leaks on runStrategyEvolution throw — examples/self-improving-coder/self-improving-coder.ts

Line 239 creates outDir = mkdtempSync(join(process.cwd(), '.sic-run-')) and line 258 cleans it with rmSync(outDir, { recursive: true, force: true }). If runStrategyEvolution (lines 240-257) throws — router outage, author syntax error, benchmark crash — the cleanup never runs and .sic-run-* accumulates under the projec

🟡 LOW run_tests handler exists despite the NO run_tests comment — examples/self-improving-coder/self-improving-coder.ts

Lines 125-126 comment says NO run_tests: the agent cannot iterate-until-green, and tools() (lines 120-128) correctly omits it from the manifest. But call() at lines 151-154 still handles name === 'run_tests' and returns live pytest results. The AgenticSurface contract (src/runtime/strategy.ts:76-83) does

🟡 LOW codingEnv.tools() takes no params but interface expects (task, handle) — examples/self-improving-coder/self-improving-coder.ts

Line 120: async tools() takes no parameters. The AgenticSurface interface at strategy.ts:79 declares tools(task: AgenticTask, handle: ArtifactHandle): Promise<AgenticTool[]>. TypeScript allows fewer params to satisfy the interface (structural compatibility), and the current tools are static (no task/handle needed). But if tools() ever becomes dynamic per-handle, this silently breaks. Fix: accept the params (even if unused): async tools(_task: AgenticTask, _handle: ArtifactHandle) for forward-compatibility.

🟡 LOW run_tests handler exists but tool is deliberately not exposed — examples/self-improving-coder/self-improving-coder.ts

Lines 125-126: comment says 'NO run_tests: the agent cannot iterate-until-green.' Lines 151-153: call() handles run_tests and returns real pytest results. If a generated strategy or hallucinating model invokes run_tests, the no-iteration constraint is silently defeated. The handler returning real results contradicts the stated intent. Fix: either remove the run_tests handler entirely (return error), or explicitly document it as a debug-only escape hatch.
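
One way to keep the manifest and the handler table from drifting apart is to dispatch only through the advertised names. The sketch below is an assumption about shape, not the example's code: `ToolHandler` and `makeDispatcher` are hypothetical, and the string responses are placeholders.

```typescript
type ToolHandler = (args: Record<string, unknown>) => string;

// Build a dispatcher that serves only tools present in the advertised manifest.
function makeDispatcher(
  manifest: readonly string[],
  handlers: Record<string, ToolHandler>,
): (name: string, args?: Record<string, unknown>) => string {
  const exposed = new Set(manifest);
  return (name, args = {}) => {
    const handler = handlers[name];
    if (!exposed.has(name) || handler === undefined) {
      return `ERROR: unknown tool '${name}'`; // fail loudly instead of serving a hidden handler
    }
    return handler(args);
  };
}

const call = makeDispatcher(["read_file", "write_file"], {
  read_file: () => "contents",
  write_file: () => "ok",
  run_tests: () => "3 passed", // defined but not in the manifest, so unreachable
});
console.log(call("read_file"), "|", call("run_tests"));
```

With this shape a debug-only handler can stay in the table, and exposing it becomes an explicit manifest change rather than something a hallucinated tool name can trigger.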


tangletools · 2026-06-28T02:39:10Z · trace

tangletools
tangletools previously approved these changes Jun 28, 2026

@tangletools tangletools left a comment

Contributor


✅ Approved — 22 non-blocking findings — ed844764

Full multi-shot audit completed 3/3 planned shots over 3 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-28T02:39:10Z · immutable trace

The instrument for 'what actually helps': a configurable agent where each self-improvement technique is
a knob (topology/trace-analysis/steering/GEPA-skillopt/persistent-artifact), swept one-knob-at-a-time
(O(N) not 2^N) at equal compute, with a full autopsy — resolve AND token/$/latency per arm — so we see
what helps vs what just burns tokens. WIRED: topology (refine/sample/sampleThenRefine) + budget. The
rest are DECLARED knobs that FAIL LOUD if set (no silent no-op — names the substrate primitive to wire).
Exports codingEnv/codingTasks from self-improving-coder (guarded main) for the cheap validation fixture.
tangletools
tangletools previously approved these changes Jun 28, 2026

@tangletools tangletools left a comment

Contributor


✅ Refreshed approval after new commits — fb6f682a

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-28T03:31:34Z

@tangletools tangletools left a comment

Contributor


🟡 Value Audit — sound-with-nits

Verdict sound-with-nits
Concerns 4 (1 medium-concern, 2 low, 1 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 1083.5s (2 bridge agents)
Total 1083.5s

💰 Value — sound-with-nits

Adds the RSI spine as a contamination-proof coding example + a proper SWE-bench AgenticSurface + a frontier self-improvement runner — 3 of 4 files are clean composition over substrate primitives; the ablation file reimplements runBenchmark's wired subset and should be rebuilt on top of it.

  • What it does: Four additions. (1) examples/self-improving-coder: an AgentProfile worker over a generated wire-protocol coding Environment, gated by runStrategyEvolution; the bundled task is contamination-proof (constants derived per-seed so no model memorized the fix), with a $0 calibration self-check (reference→all-pass, stub→0). (2) bench/src/swe-bench-env.ts: wraps the existing createSweBenchAdapter (bench/s
  • Goals it achieves: (a) Give the substrate a self-improvement example on a task a frontier model provably hasn't memorized (the generated-constants idea), so the honest no-promotion result is trustworthy rather than an artifact of contamination. (b) Provide the real frontier path — SWE-bench Verified driven as an AgenticSurface with surgical source-only edits, scored by the official harness, with the contamination ca
  • Assessment: Mostly in the grain. swe-bench-env.ts follows the exact commit0-env.ts precedent (same file shape, same AgenticSurface hooks, reuses the existing createSweBenchAdapter for judging rather than forking it — clean). swe-self-improve.mts mirrors bench/src/examples/strategy-demo.mts's entry-point shape. self-improving-coder is structurally a domain-swap of examples/strategy-evolution/strategy-evolution
  • Better / existing approach: For 3 of 4 files: none — right approach, reuses createSweBenchAdapter and the substrate primitives correctly. For the ablation file: a materially better approach exists (build on runBenchmark, below). Searched: bench/src/-env.ts (found commit0-env.ts precedent), bench/src/examples/.mts (found strategy-demo.mts), src/runtime/run-benchmark.ts (found runBenchmark already does cost-aware strategy co
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 2
  • Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

🎯 Usefulness — sound

All four files implement real capabilities composed from substrate primitives following the exact patterns of existing siblings (commit0-env, strategy-evolution), with no reinvention and no dead surface.

  • Integration: All new code wires into the substrate correctly and is reachable. swe-bench-env.ts implements AgenticSurface identically to the sibling commit0-env.ts:80 and is called by swe-self-improve.mts:14 — the standard CLI-entrypoint pattern used by commit0-env-run.mts. self-improving-coder.ts exports codingEnv + codingTasks consumed by ablation-suite/ablation.ts:24 and is a self-containe
  • Fit with existing patterns: Every new surface follows established patterns. swe-bench-env.ts mirrors commit0-env.ts exactly (5-method AgenticSurface + module-level workspaces Map + path-jailed tools + external harness scoring). self-improving-coder.ts mirrors strategy-evolution/strategy-evolution.ts (AgenticSurface + disjoint task supplier + runStrategyEvolution call). The ablation suite's cost-aware comparison (
  • Real-world viability: All three AgenticSurface implementations handle edge inputs: path traversal rejection (..//), file truncation (24k/8k), missing-workspace guards, empty-patch scoring, fail-loud errors on unknown tools/tasks. The swe-bench-env mirrors commit0-env's workspace lifecycle but omits Docker root-ownership cleanup (commit0-env.ts:167-169) — this is correct because swe-bench-env runs the judg
  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 3
  • Bridge warning: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

🔎 Heuristic Signals

🟡 Cruft: console debug added bench/swe-self-improve.mts

  • console.log(`═══ SWE-bench CALIBRATION — ${workerModel}, baseline=refine, ${n} real bugs ═══`)

🟡 Cruft: magic number added bench/swe-self-improve.mts

  •  console.log(`  ${t.id.padEnd(32)} resolved=${r.resolved} completions=${r.completions} shots=${r.shots} (${Math.round((Date.now() - t0) / 1000)}s)`)
    

💰 Value Audit

🟠 ablation-suite reimplements runBenchmark's wired subset by hand [duplication]

examples/ablation-suite/ablation.ts:101 calls runAgentic in a manual per-arm × per-task loop, accumulating score + tokens + $ + latency (the 'cost-aware autopsy' claimed at lines 2-6 and 139-165). But src/runtime/run-benchmark.ts:132 (runBenchmark) ALREADY does this — and more: it takes strategies (which IS the topology knob — sample/refine/sampleThenRefine), runs them at equal budget, tracks per-strategy score/resolved/usd/ms (BenchmarkStrategySummary, line 85-93), produces a Pareto frontie

🟡 new bench entries not registered in bench/HARNESS.md (the repo's anti-rediscovery map) [maintenance]

bench/HARNESS.md:101 documents commit0-env.ts + its companion runner (commit0-env-run.mts) as the canonical 'HARD domain through runBenchmark' entry — and CLAUDE.md:42 frames a stale map as a defect ('if a map disagrees with the code, the code wins — fix the map in the same turn'). The new swe-bench-env.ts + swe-self-improve.mts are the same kind of entry (a benchmark-as-AgenticSurface + a runner) but are absent from HARNESS.md's command map. The next agent orienting in bench/ will not discover


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
| --- | --- |
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260628T035129Z

…teering knob at the driver-steers-worker loop

Adds task-aligned per-task resolve vectors + pairedBootstrap 95% CI on every arm's Δresolve (✓ = CI
excludes 0 = real lift) — no more point lifts. Reframes the rich knobs to the RIGHT primitives: the
steering knob is the supervise() driver-steers-worker loop (driver composes the next prompt from the
analyst's analyzeOnSettle finding — a driver brain in the loop, not the inline analyst-steerer); the
optimize knob is selfImprove() with an executable JudgeConfig optimizing the driver's compose-prompt
on TRAIN, frozen. Both fail loud until wired.

@tangletools tangletools left a comment

✅ Refreshed approval after new commits — bd127783

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-28T04:15:12Z

@tangletools tangletools left a comment

🟢 Value Audit — sound

Verdict sound
Concerns 2 (2 low)
Heuristic 0.0s
Duplication 0.0s
Interrogation 135.5s (2 bridge agents)
Total 135.5s

💰 Value — error

value agent produced no parseable value-audit JSON.

  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 3
  • Bridge error: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content

🎯 Usefulness — sound

Four files compose a coherent self-improvement spine: a contamination-proof coding task for runStrategyEvolution, a cost-aware one-knob ablation runner, and SWE-bench wired as an AgenticSurface — all built entirely on substrate primitives, following the established strategy-evolution example p

  • Integration: All four files import and compose only substrate primitives from @tangle-network/agent-runtime/loops and @tangle-network/agent-eval. self-improving-coder exports codingEnv/codingTasks, consumed by ablation-suite. swe-bench-env.ts wraps the existing createSweBenchAdapter (bench/src/benchmarks/swe-bench.ts) behind the standard AgenticSurface seam — the same seam `runStrategyEvolu
  • Fit with existing patterns: The pattern — implement AgenticSurface, supply a disjoint-slice (offset, n) => AgenticTask[] supplier, call runStrategyEvolution — is the codebase's own architecture, established in examples/strategy-evolution/strategy-evolution.ts:40-58. self-improving-coder and swe-self-improve are domain-specific instantiations of that pattern (coding + SWE-bench), not competitors. ablation-suite
  • Real-world viability: Cleanup is guaranteed: strategy.ts:464-466 calls surface.close(handle) in a finally block for every open(). Path safety is enforced (swe-bench-env.ts:68-71 rejects .. and absolute paths). Git operations have timeouts (clone 420s, checkout 300s). Score methods catch errors and return {errored:1} rather than crashing. Workspace lifecycle is idempotent (close uses force:true rmSync, o
  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 3
  • Bridge warning: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

🔎 Heuristic Signals

🟡 Cruft: console debug added bench/swe-self-improve.mts

  • console.log(`═══ SWE-bench CALIBRATION — ${workerModel}, baseline=refine, ${n} real bugs ═══`)

🟡 Cruft: magic number added bench/swe-self-improve.mts

  •  console.log(`  ${t.id.padEnd(32)} resolved=${r.resolved} completions=${r.completions} shots=${r.shots} (${Math.round((Date.now() - t0) / 1000)}s)`)
    

value-audit · 20260628T041920Z

@tangletools

❌ Needs Work — bd127783

Readiness 2/100 · Confidence 80/100 · 29 findings (1 high, 9 medium, 19 low)

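| Metric | glm | deepseek | aggregate |
| --- | --- | --- | --- |
| Readiness | 2 | 16 | 2 |
| Confidence | 80 | 80 | 80 |
| Correctness | 2 | 16 | 2 |
| Security | 2 | 16 | 2 |
| Testing | 2 | 16 | 2 |
| Architecture | 2 | 16 | 2 |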

Full multi-shot audit completed 4/4 planned shots over 4 changed files. Global verifier still owns final merge decision.

Blocking

🔴 HIGH Resource leak: temp dirs orphaned on git clone/checkout failure in open() — bench/src/swe-bench-env.ts

Lines 51-56: mkdtempSync creates the temp dir BEFORE the two await exec() calls. If git clone (L52) or git checkout (L53) throws, the temp dir is never cleaned up — the workspace is only registered in the Map on L55, so close() will never see it. On a benchmark run with 80 instances, each failed clone leaks a directory (potentially hundreds of MB). Fix: wrap lines 52-56 in try/catch with rmSync(dir, {recursive:true,force:true}) in the catch before re-throwing.
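
The suggested fix can be sketched as follows. This is a minimal illustration, not the code in swe-bench-env.ts: `openWorkspace`, its `prepare` callback (standing in for the git clone and checkout steps), the `swe-` prefix, and the registry shape are all hypothetical names chosen for the example.

```typescript
import { existsSync, mkdtempSync, rmSync } from 'node:fs'
import { tmpdir } from 'node:os'
import { join } from 'node:path'

// Hypothetical registry: a workspace is recorded only after setup succeeds.
const workspaces = new Map<string, { dir: string }>()

// `prepare` stands in for the clone and checkout steps, either of which may throw.
async function openWorkspace(prepare: (dir: string) => Promise<void>): Promise<string> {
  const dir = mkdtempSync(join(tmpdir(), 'swe-'))
  try {
    await prepare(dir)
  } catch (err) {
    // Setup failed before registration, so close() would never see this
    // directory: remove it here, then re-throw for the caller.
    rmSync(dir, { recursive: true, force: true })
    throw err
  }
  workspaces.set(dir, { dir })
  return dir
}

// Usage: a simulated clone failure leaves no directory behind.
let attempted = ''
openWorkspace(async (dir) => {
  attempted = dir
  throw new Error('simulated clone failure')
}).catch(() => {
  console.log('directory removed:', !existsSync(attempted))
})
```

The directory is removed only on the failure path; on success it stays registered so the normal close() path can clean it up later.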

Other

🟠 MEDIUM Symlink-following defeats the path jail in read_file/edit_file — bench/src/swe-bench-env.ts

Lines 68-71, 105, 119, 127. safe() only inspects the STRING (.. / leading /), but readFileSync(join(ws.dir, p)) and writeFileSync(join(ws.dir, p)) follow symlinks. SWE-bench repos are real GitHub checkouts that can contain symlinks (docs/, bin/, build shims). An agent calling read_file on a repo-relative symlink that targets /etc/passwd or ../../../etc/passwd exfiltrates arbitrary host files; edit_file via writeFileSync through such a symlink WRITES to the symlink target — a host-integrity risk, not just read. The file header ([lines 4-9](https://github.com/tangle-network/agent-runtime/blob/bd127783c37cd478da3b34ccf85ef8c2a1d02f31/bench/src/swe-bench
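
One way to close this gap is to compare real paths instead of strings. The sketch below is an assumption about how such a jail could look, not the fix that landed: `resolveInsideRoot` is a hypothetical helper, and it assumes the workspace root itself is trusted.

```typescript
import { lstatSync, mkdtempSync, realpathSync, rmSync, symlinkSync, writeFileSync } from 'node:fs'
import { tmpdir } from 'node:os'
import { basename, dirname, join, resolve, sep } from 'node:path'

// Hypothetical helper: resolve `rel` under `root` and refuse anything whose
// real location (after following symlinks) is outside `root`.
function resolveInsideRoot(root: string, rel: string): string {
  const realRoot = realpathSync(root)
  const candidate = resolve(realRoot, rel)
  let real: string
  try {
    real = realpathSync(candidate)
  } catch {
    // The path does not resolve. Refuse a dangling symlink outright; treat
    // anything else as a file to be created and check its parent directory.
    if (lstatSync(candidate, { throwIfNoEntry: false })?.isSymbolicLink()) {
      throw new Error(`path is a dangling symlink: ${rel}`)
    }
    real = join(realpathSync(dirname(candidate)), basename(candidate))
  }
  if (real !== realRoot && !real.startsWith(realRoot + sep)) {
    throw new Error(`path escapes workspace: ${rel}`)
  }
  return real
}

// Usage: a symlink inside the workspace that points outside it is refused,
// while a not-yet-created file directly under the root is accepted.
const root = mkdtempSync(join(tmpdir(), 'jail-'))
const outside = mkdtempSync(join(tmpdir(), 'outside-'))
writeFileSync(join(outside, 'secret.txt'), 'host data')
symlinkSync(join(outside, 'secret.txt'), join(root, 'link.txt'))

let refused = false
try {
  resolveInsideRoot(root, 'link.txt')
} catch {
  refused = true
}
const newFile = resolveInsideRoot(root, 'notes.txt')
console.log('symlink escape refused:', refused, 'new file allowed at:', newFile)

rmSync(root, { recursive: true, force: true })
rmSync(outside, { recursive: true, force: true })
```

A check like this still leaves a window between the check and the later read or write, so it reduces the risk rather than eliminating it.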

🟠 MEDIUM Tempdir leak when git clone/checkout fails in open() — bench/src/swe-bench-env.ts

Lines 51-55. mkdtempSync creates the dir first; git clone (420s timeout) and git checkout (300s timeout) run before workspaces.set(dir, ...) on line 55. If either git op throws (network blip, unknown base_commit, transient HF/Docker error), the function throws, the caller never receives a handle, close() is never called, AND the dir was never registered in workspaces — so it leaks permanently. Each clone is a full SWE-bench repo (hundreds of MB). A long run with flaky networking can fill /tmp. Fix: wrap in try/catch that c

🟠 MEDIUM Zero test coverage for the entire surface implementation — bench/src/swe-bench-env.ts

Confirmed via glob + grep: no test files exist for swe-bench-env.ts anywhere in the repo. The file implements path safety (safe()), a test-path regex (isTestPath), tool dispatch (3 tools + unknown guard), edit_file dedup/validation logic (count via split, old_string empty check), workspace lifecycle, and the tasks() supplier boundary check. Every one of these should have at least unit coverage. Without tests, a regression in the safety guard or score() error handling goes undetected.

🟠 MEDIUM New entrypoint is outside every typecheck surface — zero TS coverage — bench/swe-self-improve.mts

bench/tsconfig.json:10 has "include": ["src/**/*.ts", "src/**/*.mts"], but this file lives at bench/swe-self-improve.mts (root, not under src/). It is the ONLY .mts/.ts at bench/ root (confirmed via ls; every other runnable script like commit0-env-run.mts, humaneval-gate.mts lives under bench/src/ and IS covered). Additionally bench/package.json has no typecheck script, and root .github/workflows/ci.yml:35 runs pnpm run typecheck which is root tsc (tsconfig.json include=["src"]) + tsconfig.examples.json (include=["examples"]); grep of .github/workflows for 'bench' returns nothing. So no CI path typechecks this file. Impact: the file wires 8+ substrate APIs (runStrategyEvolution, runAgentic, createChatClient, refine, sample, EvolutionReport.verdict, AgenticRunResult, AgenticOptions); an

🟠 MEDIUM No error resilience: single task failure loses entire arm — examples/ablation-suite/ablation.ts

Lines 108-128: The inner for (const t of tasks) loop has no try/catch around await runAgentic(...). If any single task run throws (network error, API quota, transient failure), the entire arm's accumulated data (prior tasks' resolve, tokens, $, latency) is silently lost. In a cost-aware framework where a full ablation run can cost non-trivial router credits, losing 5-of-6 completed task results on a network hiccup on the last task burns money with zero data returned. Wrap runAgentic in a try/catch, record the failing task, and continue to remaining tasks to salvage a partial-arm result.
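
The salvage behaviour described here could look roughly like the sketch below. The `TaskResult` and `ArmSummary` shapes and the `runTask` callback (a stand-in for the runAgentic call) are hypothetical and much smaller than whatever the real run returns.

```typescript
// Hypothetical shapes for illustration only.
interface TaskResult {
  resolved: boolean
  costUsd: number
}

interface ArmSummary {
  completed: number
  resolved: number
  costUsd: number
  failed: { id: string; error: string }[]
}

async function runArm(
  taskIds: string[],
  runTask: (id: string) => Promise<TaskResult>, // stand-in for the per-task agentic run
): Promise<ArmSummary> {
  const summary: ArmSummary = { completed: 0, resolved: 0, costUsd: 0, failed: [] }
  for (const id of taskIds) {
    try {
      const r = await runTask(id)
      summary.completed += 1
      summary.resolved += r.resolved ? 1 : 0
      summary.costUsd += r.costUsd
    } catch (err) {
      // Record the failure and keep going so earlier results are not lost.
      summary.failed.push({ id, error: err instanceof Error ? err.message : String(err) })
    }
  }
  return summary
}
```

Keeping the failed task ids in the summary lets the autopsy report a partial arm explicitly instead of silently averaging over fewer tasks.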

🟠 MEDIUM biome lint/format CI will fail on this file (2 errors + 1 warning) — examples/ablation-suite/ablation.ts

Ran biome check examples/ablation-suite/ablation.ts against the repo's biome.json (assist.source.organizeImports='on', formatter.enabled=true, style.useTemplate='warn'). Exit=1. Concrete violations: (1) assist/source/organizeImports at line 16 — the @tangle-network/agent-runtime/loops import block lists refine, runAgentic, sample, sampleThenRefine, type Strategy out of biome's sort order (types before values). (2) formatter — unwiredKnobs: Array<{...}> literal on line 53, the worker: {...} inline type on [line

🟠 MEDIUM read_file has no path-traversal guard (write_file does) — examples/self-improving-coder/self-improving-coder.ts

Line 135: readFileSync(join(ws.dir, String(args.path ?? '')), 'utf8').slice(0, 8000). With args.path = '../../../etc/passwd', join('/tmp/sic-X', '../../../etc/passwd') normalizes to /etc/passwd and returns up to 8000 chars to the worker — which then persists in the trajectory/logs. write_file at line 142 DOES guard (p.includes('..') || p.startsWith('/')), making the asymmetry conspicuous. Mitigating factor: host pytest already executes agent-authored Python with no sandbo

🟠 MEDIUM read_file tool allows path traversal out of workspace — examples/self-improving-coder/self-improving-coder.ts

Line 135: readFileSync(join(ws.dir, String(args.path ?? '')), 'utf8'): the read_file tool handler resolves agent-supplied paths through path.join without validating that the result stays within the workspace. join normalizes .. components, so an LLM-supplied path like ../etc/passwd escapes to /tmp/etc/passwd (confirmed via repro: path.join('/tmp/sic-abc', '../etc/passwd') → /tmp/etc/passwd). Meanwhile, write_file at line 142 correctly guards with p.includes('..') an
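
A lexical containment check for read_file might be sketched like this. `staysInside` is a hypothetical helper, and the check is purely string-based: it assumes POSIX-style paths in the examples and does not follow symlinks.

```typescript
import { isAbsolute, relative, resolve, sep } from 'node:path'

// Hypothetical guard: true only when `requested` names something strictly
// under `root` once both are normalized. Symlinks are not considered.
function staysInside(root: string, requested: string): boolean {
  const rel = relative(resolve(root), resolve(root, requested))
  if (rel === '' || isAbsolute(rel)) return false
  return rel !== '..' && !rel.startsWith('..' + sep)
}

// Usage with the paths mentioned in the findings above.
console.log(staysInside('/tmp/sic-abc', 'lib.py'))
console.log(staysInside('/tmp/sic-abc', '../etc/passwd'))
console.log(staysInside('/tmp/sic-abc', '/etc/passwd'))
```

Checking the normalized result rather than the raw string also accepts harmless inputs such as `a/../lib.py`, which a plain `includes('..')` test would reject.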

🟠 MEDIUM run_tests handler contradicts the no-iterate-until-green design invariant — examples/self-improving-coder/self-improving-coder.ts

Lines 125-126 comment: 'NO run_tests: the agent cannot iterate-until-green. It must implement correctly from READING the tests — which creates real headroom and makes the STRATEGY (planning, multiple attempts) matter.' But call() at lines 151-154 STILL handles run_tests, executing pytest and returning pytest: K/N passed. The tool is hidden from tools() so a compliant worker won't see it, but LLM workers routinely hallucinate plausible tool names — and 'run_tests' is
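
One way to keep the advertised and dispatchable tool sets from drifting apart is to derive both from a single table. The sketch below assumes string-returning handlers and uses placeholder bodies; `handlers`, `tools`, and `call` are hypothetical and only mirror the method names used in this finding.

```typescript
type Handler = (args: Record<string, unknown>) => string

// Hypothetical single source of truth: a tool that is not in this table is
// neither advertised nor callable.
const handlers = new Map<string, Handler>([
  ['read_file', (args) => `read ${String(args.path ?? '')}`],
  ['write_file', (args) => `wrote ${String(args.path ?? '')}`],
])

function tools(): string[] {
  return [...handlers.keys()]
}

function call(name: string, args: Record<string, unknown>): string {
  const handler = handlers.get(name)
  // Fail loudly on anything not advertised, including plausible guesses.
  if (!handler) throw new Error(`unknown tool: ${name}`)
  return handler(args)
}
```

With this shape, a guessed `run_tests` call is rejected the same way as any other unknown tool name.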

🟡 LOW Absolute-path check in safe() is dead code — bench/src/swe-bench-env.ts

Lines 68-71. The regex ^\.?\/ on line 69 strips a leading optional-dot-then-slash, so by line 70 n can never start with /: /etc/passwd becomes etc/passwd (silently treated as repo-relative). The n.startsWith('/') guard is unreachable. The result is fail-closed by accident (resolves under ws.dir which doesn't exist), but the security check is misleading. Fix: drop the dead check and document that abs

🟡 LOW Module-global workspaces Map couples concurrent environments — bench/src/swe-bench-env.ts

Line 31. const workspaces = new Map<string, Ws>() lives at module scope, so two createSweBenchEnvironment() calls in the same process (e.g., a test matrix or parallel bench runs) share one registry. Handle.id is the tempdir path (mkdtempSync-unique) so collisions are effectively impossible and this is not a correctness bug today, but it defeats environment isolation and makes the surface stateful in a surprising place. Fix: move the Map inside createSweBenchEnvironment so each environment owns its own registry (closes over it).
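
The closure-based layout suggested here can be sketched in a few lines. `createEnvironment` and its three methods are hypothetical and far smaller than the real surface; the point is only where the Map lives.

```typescript
interface Ws {
  dir: string
}

// Hypothetical factory: each environment closes over its own registry, so two
// environments in one process never see each other's workspaces.
function createEnvironment() {
  const workspaces = new Map<string, Ws>()
  return {
    open(dir: string): string {
      workspaces.set(dir, { dir })
      return dir
    },
    close(id: string): boolean {
      return workspaces.delete(id)
    },
    size(): number {
      return workspaces.size
    },
  }
}

// Usage: a workspace opened in one environment is invisible to the other.
const envA = createEnvironment()
const envB = createEnvironment()
envA.open('/tmp/example-ws')
console.log(envA.size(), envB.size())
```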

🟡 LOW hardcoded split:'test' — no parameter to select train/dev/Verified splits — bench/src/swe-bench-env.ts

Line 42: adapter.loadTasks({ limit: poolN, split: 'test' }) — the split is hardcoded. SWE-bench_Verified exposes a 'train' split (unverified instances, suitable for development/calibration without touching the evaluation set). For the calibration mode in swe-self-improve.mts, using 'test' instances for calibration burns evaluation data. The comment acknowledges this is 'the Verified split' by design, but exposing a parameter (or a separate 'calibrate on train, evolve on test' path) would improve eval hygiene. Currently documented, so low severity.

🟡 LOW isTestPath is a soft hint, not a hard boundary; doc overstates protection — bench/src/swe-bench-env.ts

Line 25 + header lines 8-9. The regex /(^|\/)(tests?)\// + test_.*\.py$|_test\.py$|conftest\.py$/ misses real test-adjacent paths: testing/ dirs (numpy/pandas use them), pytest.ini, setup.cfg, tox.ini, pytest_*.py. The header says 'edit_file refuses test files' as if it were a hard anti-cheat; in reality the real protection is that the official harness re-applies the gold test_patch on top of the agent's patch (swe-bench.ts:120-150), so editing tests is self-defeating anyway. Fix: soften the comment to 'best-effort hint'

🟡 LOW isTestPath regex false positive on non-test directories — bench/src/swe-bench-env.ts

Line 25: /(^|/)(tests?)// matches any path containing a 'tests' or 'test' directory component. This blocks legitimate source directories like 'testsupport/', 'tests-fixtures/', 'tests-common/', 'test_helpers/' (if structured as a package). SWE-bench repos are diverse; a repo like pytest or Django may have source in such paths. The regex should be /(^|/)(tests?)// and NOT match if the component has additional characters beyond 'test'/'tests' — i.e., /(^|/)(tests?)$/|(^|/)tests?// combined with checking the full path segment. In practice this is unlikely to matter for most SWE-bench instances, but it's a correctness risk worth tightening.

🟡 LOW read_file loads entire file into memory before truncating — bench/src/swe-bench-env.ts

Line 105. readFileSync(join(ws.dir, p), 'utf8') reads the whole file, THEN truncates to 24_000 chars on line 106. A pathological multi-GB file in a repo could OOM the bench process before the truncation runs. Low probability for SWE-bench source trees but cheap to harden: statSync first and reject above a size cap, or stream+truncate. Same applies to edit_file reading content on line 119.
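
A bounded read along these lines might look like the sketch below. `readHead` is a hypothetical helper; note that it caps bytes rather than characters, so a multi-byte UTF-8 character at the boundary can be cut and decoded as a replacement character.

```typescript
import { closeSync, mkdtempSync, openSync, readSync, rmSync, writeFileSync } from 'node:fs'
import { tmpdir } from 'node:os'
import { join } from 'node:path'

// Read at most `maxBytes` from the start of a file without loading the rest.
function readHead(path: string, maxBytes: number): string {
  const fd = openSync(path, 'r')
  try {
    const buf = Buffer.alloc(maxBytes)
    const bytesRead = readSync(fd, buf, 0, maxBytes, 0)
    return buf.subarray(0, bytesRead).toString('utf8')
  } finally {
    closeSync(fd)
  }
}

// Usage: only the first 24,000 bytes of a larger file are read.
const dir = mkdtempSync(join(tmpdir(), 'head-'))
const file = join(dir, 'big.txt')
writeFileSync(file, 'x'.repeat(100_000))
const head = readHead(file, 24_000)
console.log(head.length)
rmSync(dir, { recursive: true, force: true })
```

For edit_file, which needs the whole content, a size check with statSync before reading is the closer fit.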

🟡 LOW .swe-run-* temp dirs not in .gitignore — bench/swe-self-improve.mts

Line 41: mkdtempSync(join(process.cwd(), '.swe-run-')) creates temp dirs under the repo root with prefix .swe-run-. Neither the root nor bench .gitignore contains a pattern for .swe-run-*. If the process crashes before line 60's rmSync, the directory remains as untracked in the working tree. Fix: add .swe-run-* to .gitignore.

🟡 LOW CALIBRATE env-flag and numeric env parsing are not validated — bench/swe-self-improve.mts

Line 25 if (process.env.CALIBRATE) is truthy for ANY non-empty string including '0', 'false', 'no' — a user exporting CALIBRATE=0 to disable it still gets the calibrate path. Lines 22/26/45-46/55-57 use Number(process.env.X ?? default) with no NaN/integer guard: INNER_TURNS='' → Number('')=0 (agent gets zero inner turns, silently no-ops); INNER_TURNS='abc' → NaN propagates into the budget math. Conventional for env-gated scripts and tsx will still run, but a one-line assertion (e.g. Number.isFinite(innerTurns) && innerTurns>0) would fai
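
The kind of validation hinted at here could be sketched as two small helpers. `envFlag` and `envPositiveInt` are hypothetical names, and the list of strings treated as "off" is an assumption, not a convention the repo defines.

```typescript
// Hypothetical flag parser: unset, empty, and common "off" spellings are false.
function envFlag(raw: string | undefined): boolean {
  if (raw === undefined) return false
  return !['', '0', 'false', 'no', 'off'].includes(raw.trim().toLowerCase())
}

// Hypothetical numeric parser: fall back when unset, fail loudly on anything
// that is not a positive integer (including the empty string and NaN).
function envPositiveInt(name: string, raw: string | undefined, fallback: number): number {
  if (raw === undefined) return fallback
  const n = Number(raw)
  if (raw.trim() === '' || !Number.isInteger(n) || n <= 0) {
    throw new Error(`${name} must be a positive integer, got ${JSON.stringify(raw)}`)
  }
  return n
}

// Usage: read from the environment, with 8 as an illustrative default.
const calibrate = envFlag(process.env.CALIBRATE)
const innerTurns = envPositiveInt('INNER_TURNS', process.env.INNER_TURNS, 8)
console.log({ calibrate, innerTurns })
```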

🟡 LOW outDir leaked on runStrategyEvolution failure — bench/swe-self-improve.mts

Line 41: outDir = mkdtempSync(...) creates a temp directory. Line 42-59: runStrategyEvolution runs inside await. If it throws, rmSync(outDir, ...) on line 60 never executes — the temp dir is never cleaned up. The .catch() on line 76 only logs the error, no cleanup. Fix: wra

🟡 LOW rmSync cleanup skipped on throw; temp dir leaks under CWD — bench/swe-self-improve.mts

Line 41 creates outDir via mkdtempSync(join(process.cwd(), '.swe-run-')) and line 60 rmSync's it after runStrategyEvolution resolves. If runStrategyEvolution rejects (e.g. pool exhausted, router auth failure, OOM), control jumps straight to the main().catch on line 76 and rmSync never runs — the .swe-run-XXXXXX dir (with authored strategy .ts modules) is left in the user's CWD. Not a correctness bug and outDir is
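
A try/finally wrapper is one way to guarantee this cleanup. The sketch below uses a hypothetical `withTempDir` helper and creates the directory under the OS temp dir for the example, whereas the script described here creates it under the current working directory.

```typescript
import { existsSync, mkdtempSync, rmSync } from 'node:fs'
import { tmpdir } from 'node:os'
import { join } from 'node:path'

// Hypothetical helper: the directory is removed whether `body` resolves or rejects.
async function withTempDir<T>(prefix: string, body: (dir: string) => Promise<T>): Promise<T> {
  const dir = mkdtempSync(join(tmpdir(), prefix))
  try {
    return await body(dir)
  } finally {
    rmSync(dir, { recursive: true, force: true })
  }
}

// Usage: the directory is gone even though the body rejects.
let runDir = ''
withTempDir('run-', async (dir) => {
  runDir = dir
  throw new Error('simulated evolution failure')
}).catch(() => {
  console.log('cleaned up:', !existsSync(runDir))
})
```

This covers rejections and thrown errors but not a hard process crash, which is why the .gitignore entry suggested in the earlier finding is still useful.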

🟡 LOW Equal-compute claim across topologies is loose — budget means structurally different things per strategy — examples/ablation-suite/ablation.ts

Comment at the topology knob (line 23) calls budget the 'equal-compute unit (refine: max shots; fanout: rollout width)'. Verified against strategy.ts: refine(budget=N)→depthStrategy with maxShots=N (analyst feedback between each); sample(budget=N)→breadthStrategy with width=N (N independent rollouts, no analyst); sampleThenRefine(budget=N)→⌈N/2⌉ explore + (N-⌈N/2⌉) refine-with-critique. So budget=2 across arms = {2 depth shots w/ analyst, 2 parallel samples no analyst, 1 sample+1 refine}. These are shots-comparable but NOT tokens-comparable (analyst adds completions). NOT a code defect — the autopsy prints real costUsd/tokensIn/tokensOut per arm and Δ$ vs
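
The split described for sampleThenRefine is simple arithmetic, sketched below. `sampleThenRefineSplit` is a hypothetical helper written from the description in this finding, not taken from strategy.ts.

```typescript
// Split as described above: ceil(N/2) explore rollouts, the remainder refine shots.
function sampleThenRefineSplit(budget: number): { explore: number; refine: number } {
  const explore = Math.ceil(budget / 2)
  return { explore, refine: budget - explore }
}

// Usage: the same budget maps to different shapes of work.
for (const budget of [2, 3, 5]) {
  console.log(budget, sampleThenRefineSplit(budget))
}
```

An odd budget gives the extra shot to the explore phase, so budget=3 is two samples followed by a single refine.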

🟡 LOW No test coverage for exported functions — examples/ablation-suite/ablation.ts

runAblation and printAutopsy are exported public functions with logic-bearing code (knob validation via unwiredKnobs gate, result aggregation, pairedBootstrap display). Zero test coverage. While this is in examples/, other example modules (e.g. coding-benchmark/) have tests. The unwired-knob validation in particular would benefit from a unit test: confirm that setting each declared-but-unwired knob throws with the expected primitive name.

🟡 LOW pairedBootstrap statistic/field mismatch in printAutopsy — examples/ablation-suite/ablation.ts

Line 162 passes statistic: 'mean' to pairedBootstrap but line 164 reads b.median (not b.mean) as the point estimate. pairedBootstrap returns both fields regardless of the statistic parameter, so this does not crash. However, the 95% CI bounds (b.low, b.high) are computed for the mean-difference bootstrap distribution, while the reported point estimate is the median of that distribution. The two can differ materially for small-n skewed data (this file defaults to HOLDOUT_N=6). The table header says 'Δresolve
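
To make the mismatch concrete, the sketch below computes a percentile bootstrap of the mean paired difference and reports the mean as the point estimate, so the estimate and the interval describe the same statistic. It is a hypothetical stand-in, not the library's pairedBootstrap; the function names, the seeded PRNG, and the resample count are all assumptions.

```typescript
// Small deterministic PRNG (mulberry32) so the sketch is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

interface BootstrapSummary {
  mean: number
  low: number
  high: number
}

// Hypothetical helper: `deltas` holds one paired difference per task.
function bootstrapMeanDelta(deltas: number[], resamples = 2000, seed = 1): BootstrapSummary {
  if (deltas.length === 0) throw new Error('need at least one paired difference')
  const rand = mulberry32(seed)
  const n = deltas.length
  const means: number[] = []
  for (let r = 0; r < resamples; r++) {
    let sum = 0
    for (let i = 0; i < n; i++) sum += deltas[Math.floor(rand() * n)] ?? 0
    means.push(sum / n)
  }
  means.sort((x, y) => x - y)
  const at = (q: number) => means[Math.min(means.length - 1, Math.floor(q * means.length))] ?? 0
  // Point estimate and interval both refer to the mean difference.
  const mean = deltas.reduce((s, x) => s + x, 0) / n
  return { mean, low: at(0.025), high: at(0.975) }
}

// Usage with six per-task resolve deltas (1 = arm resolved, baseline did not).
console.log(bootstrapMeanDelta([0, 1, 0, 1, 1, 0]))
```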

🟡 LOW Redundant mkdirSync before write — examples/self-improving-coder/self-improving-coder.ts

Line 144: mkdirSync(ws.dir, { recursive: true }). ws.dir was already created by mkdtempSync(join(tmpdir(), 'sic-')) in open() (line 113) and is registered in the workspaces Map. Calling mkdirSync on it is a no-op. Dead call — remove.

🟡 LOW Workspace temp dirs leak on any non-close exit path — examples/self-improving-coder/self-improving-coder.ts

Workspaces live in a module-level Map<string, Ws> (line 89) and are deleted only when close(handle) runs (line 163-168). On process crash, uncaught throw in runStrategyEvolution, or a worker-timeout path that doesn't reach close, the /tmp/sic-* dirs (containing agent-authored lib.py) persist indefinitely. The main() catch at [lines 284-286](https://github.com/tangle-network/agent-runtime/blob/bd127783c37cd478da3b34ccf85ef8c2a1d02f31/examples/self-improving-coder/self-

🟡 LOW examples/README.md does not list self-improving-coder — examples/self-improving-coder/self-improving-coder.ts

examples/README.md table goes through entry #18 (self-improving-loop) and does not mention self-improving-coder. The README is explicitly 'A learning path. Read the examples in order' — a new example not in the path is documentation drift. Add a row to the table and a Quickstart line if desired (this is the RSI spine composed on the substrate, a meaningful showcase).

🟡 LOW outDir temp directory leaks on runStrategyEvolution failure — examples/self-improving-coder/self-improving-coder.ts

Line 239: outDir = mkdtempSync(join(process.cwd(), '.sic-run-')) creates the directory. Line 258: rmSync(outDir, ...) cleans up on success. But if runStrategyEvolution throws (line 240), control jumps to the catch at [line 284](https://github.com/tangle-network/agent-runtime/blob/bd127783c37cd478da3b34ccf85ef8c2a1d02f31/exam

🟡 LOW run_tests handled in call but not advertised in tools — inconsistent surface — examples/self-improving-coder/self-improving-coder.ts

Lines 151-153 handle run_tests in call(), returning pytest results. But tools() (lines 120-127) does not include run_tests in the returned tool definitions — intentionally, per the comment at line 125-126. The system prompt at [line 178-179](https://github.com/tangle-network/agent-runtime/blob/bd127783c37cd4

🟡 LOW write_file path validation is vestigial — the validated path is never used — examples/self-improving-coder/self-improving-coder.ts

Lines 140-146: const p = String(args.path ?? ''); if (!p.endsWith('lib.py') || p.includes('..') || p.startsWith('/')) return 'ERROR: only lib.py is writable'; ... writeFileSync(join(ws.dir, 'lib.py'), String(args.content ?? '')). The validation runs on p (the agent-supplied path) but the write at line 145 hardcodes 'lib.py' and ignores p entirely. So a request with path: 'subdir/lib.py' passes validation and silently writes to the workspace root, while path: 'foo.py'
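
A stricter variant would validate the requested path and then write to exactly the path that was validated. The sketch below assumes lib.py at the workspace root is the only writable file, as the error message quoted above suggests; `writableTarget` is a hypothetical helper.

```typescript
// Hypothetical check: accept only the single writable file and return the
// exact relative path the caller should then write to.
function writableTarget(requested: string): string | null {
  const normalized = requested.replace(/^\.\//, '')
  return normalized === 'lib.py' ? normalized : null
}

// Usage: nested or traversing paths are refused instead of being rewritten.
for (const p of ['lib.py', './lib.py', 'subdir/lib.py', '../lib.py', 'foo.py']) {
  console.log(p, writableTarget(p))
}
```

Returning the validated path, rather than a boolean, makes it harder for the write call to drift away from what was checked.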


tangletools · 2026-06-28T04:50:16Z · trace

@tangletools tangletools left a comment

❌ 1 Blocking Finding — bd127783

Full multi-shot audit completed 4/4 planned shots over 4 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-28T04:50:16Z · immutable trace

@tangletools tangletools left a comment

✅ Refreshed approval after new commits — 6e747231

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-28T05:18:28Z

@tangletools tangletools left a comment

🟢 Value Audit — sound

Verdict sound
Concerns 2 (2 low)
Heuristic 0.0s
Duplication 0.0s
Interrogation 148.4s (2 bridge agents)
Total 148.4s

💰 Value — sound

Adds contamination-proof coding AgenticSurfaces (generated-lib + SWE-bench Verified) and wires them into the existing runStrategyEvolution spine — all substrate-native, no reinvention, coherent with the codebase's grain.

  • What it does: Adds four new files across two layers: (1) examples/self-improving-coder/self-improving-coder.ts — a contamination-proof coding AgenticSurface (seed-derived wire-protocol library, graded by real pytest, with a $0 calibration gate) wired into runStrategyEvolution; (2) examples/ablation-suite/ablation.ts — a cost-aware one-knob-delta runner that sweeps self-improvement knobs (topology, budget) over
  • Goals it achieves: 1. Demonstrate the RSI (recursive self-improvement) spine on a real, contamination-proof coding domain — not a toy counter — so the substrate's self-improvement primitives are exercised on tasks a model cannot memorize. 2. Provide a calibration gate that proves the task is solvable AND the grader discriminates (reference→100%, stub→0%) — enforcing calibrate-before-measure before spending compute.
  • Assessment: The change is well-built and follows the codebase's grain perfectly. Every new component uses existing substrate primitives: runStrategyEvolution (src/runtime/strategy-evolution.ts), AgenticSurface (src/runtime/strategy.ts:76), the existing createSweBenchAdapter (bench/src/benchmarks/swe-bench.ts), and createChatClient from agent-eval. The SWE-bench Environment delegates scoring to the existing ad
  • Better / existing approach: none — this is the right approach. Checked: examples/strategy-evolution/ (toy counter domain, different purpose), examples/self-improving-loop/ (offline scripted demo, different paradigm), examples/coding-benchmark/ (runProfileMatrix + firewall paradigm, not AgenticSurface), bench/src/benchmarks/swe-bench.ts (adapter-only, not an AgenticSurface). No existing equivalent for any of these four new fi
  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 3
  • Bridge warning: opencode/kimi-for-coding/k2p7: opencode produced no stream output: ; opencode/zai-coding-plan/glm-5.2: opencode produced no stream output:

🎯 Usefulness — error

usefulness agent produced no parseable value-audit JSON.

  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 3
  • Bridge error: opencode/zai-coding-plan/glm-5.2: opencode produced no stream output: ; opencode/kimi-for-coding/k2p7: opencode produced no stream output: ; opencode/deepseek/deepseek-v4-pro: opencode produced no stream output:

🔎 Heuristic Signals

🟡 Cruft: console debug added bench/swe-self-improve.mts

  • console.log(`═══ SWE-bench CALIBRATION — ${workerModel}, baseline=refine, ${n} real bugs ═══`)

🟡 Cruft: magic number added bench/swe-self-improve.mts

  •  console.log(`  ${t.id.padEnd(32)} resolved=${r.resolved} completions=${r.completions} shots=${r.shots} (${Math.round((Date.now() - t0) / 1000)}s)`)
    

value-audit · 20260628T052247Z

@drewstone drewstone merged commit ee60796 into main Jun 28, 2026
1 check passed
drewstone added a commit that referenced this pull request Jun 28, 2026
LOOP LAYERS (on the analyze keystone):
- surface-worker.ts: graded-worker seam — a makeWorkerAgent running runAgentic over an AgenticSurface,
  settling with the surface score as the deliverable verdict (driver spawns/steers graded workers).
- gepa-driver-prompt.ts: optimize the driver compose-prompt on TRAIN with an EXECUTABLE JudgeConfig
  (selfImprove + gepaProposer, reading the surface score — not an LLM judge), frozen, certified.
- self-improving-supervisor.ts: one-call DX composing supervise(+analyze) over the graded seam.
- ablation.ts: driverSteer + optimize knobs WIRED (blind/steered/self-improving) over the same
  supervise(); per-task try/catch resilience; cost+significance autopsy intact.
- index.ts: re-export AnalystRegistry + MakeWorkerAgent from the loops barrel (host authors its seam).

#402 HARDENING:
- swe-bench-env: temp-dir leak fix (try/catch+rmSync on clone/checkout fail), symlink realpath-jail
  (read+edit), workspaces Map into the closure, exported jailPath/isTestPath + unit tests.
- self-improving-coder: read_file path guard (symmetric with write_file), dead run_tests handler gone.
- swe-self-improve.mts moved to bench/src/ (now under typecheck).

Typecheck clean (examples + core); biome clean; 82 supervise tests green on the keystone.
@drewstone

Folded into #404 (the comprehensive self-improving-supervisor PR) — all the examples + instrument here, plus the review findings from this PR addressed there (temp-dir leak, symlink realpath-jail, read_file guard, run_tests removal, per-task resilience, biome, typecheck coverage, swe-bench-env unit tests). Closing in favor of #404.
