feat(examples): self-improving-coder — the RSI spine, composed cleanly, contamination-proof #402
…y, on a contamination-proof task
The pristine self-improvement loop with NOTHING hand-rolled: an AgentProfile-shaped worker over an AgenticSurface (the task, real tools), gated by runStrategyEvolution — which authors strategies from TRAIN losses then makes ONE promotion decision on a FRESH holdout slice via promotionGate. Adaptive data analysis is structurally impossible (disjoint task offsets, holdout read once). The only new code is the Environment: a contamination-proof generated coding task (constants derived per-seed, so no model could have memorized it), graded by real pytest. $0 calibration self-check (reference->100%, stub->0%) gates spend. The bundled task is deliberately simple — a capable model aces it, so the gate correctly returns no-promotion; swap a harder Environment (or SWE-bench) for a discriminating run.
…-cheating frontier run
createSweBenchEnvironment: the agent clones the repo at base_commit, explores + makes SURGICAL edits via tools (edit_file, source-only, test files path-jailed), and score() grades the git diff with the official swebench Docker harness. The substrate drives the agentic loop (runAgentic / runStrategyEvolution) — no hand-rolled tool-loop. Never sees the hidden tests or the gold patch. swe-self-improve.mts wires it into runStrategyEvolution with a disjoint train/holdout split (the substrate enforces freeze + one holdout decision — no adaptive reuse). CALIBRATE mode runs the baseline on a few bugs first (cost gate). CONTAMINATION CAVEAT documented: public fixes may be memorized; report it, never claim a clean frontier number from this arena alone.
…sist-and-edit prompt
Calibration showed gemini-2.5-pro returning empty (no tool calls) without a maxTokens cap, then stopping after ~3 turns without editing. Set worker maxTokens=8000 and a prompt that forces broad exploration + at least one edit_file attempt. Log completions/shots in CALIBRATE mode for headroom diagnosis.
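The disjoint-offset split mentioned in the first commit message can be sketched as follows. This is a minimal illustration, not the PR's code: `Task`, `tasks`, `TRAIN_N`, and `HOLDOUT_N` are hypothetical names, and the per-seed constant derivation is only a stand-in for whatever the real Environment generates.

```typescript
// Hypothetical task shape: each task is fully determined by its dataset offset.
interface Task {
  id: string
  seed: number
  constant: number // stand-in for the per-seed constants a generated task would embed
}

// Supplier keyed by (offset, n): the same offset always yields the same task.
function tasks(offset: number, n: number): Task[] {
  return Array.from({ length: n }, (_, i) => {
    const seed = offset + i
    // Simple deterministic derivation (an assumption, for illustration only).
    const constant = (seed * 2654435761) % 1000
    return { id: `task-${seed}`, seed, constant }
  })
}

const TRAIN_N = 4
const HOLDOUT_N = 2
const train = tasks(0, TRAIN_N) // offsets 0..3
const holdout = tasks(TRAIN_N, HOLDOUT_N) // offsets 4..5, never touched during training

const trainIds = new Set(train.map((t) => t.id))
const overlap = holdout.filter((t) => trainIds.has(t.id))
console.log(overlap.length) // 0
```

Because the holdout slice starts where the train slice ends, the two sets cannot share a task regardless of how many generations run on the train side.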
tangletools
left a comment
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 5 (1 medium-concern, 2 low, 2 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 174.2s (2 bridge agents) |
| Total | 174.2s |
💰 Value — sound-with-nits
Adds a tool-editing SWE-bench AgenticSurface + an evolution runner (sound, substrate-composed) plus a third self-improvement example whose spine duplicates the canonical one and whose task is by-design saturated.
- What it does: Three files. (1) bench/src/swe-bench-env.ts: wraps the existing createSweBenchAdapter judge behind a NEW AgenticSurface where the agent clones the repo at base_commit, explores/edits SOURCE via list_files/read_file/edit_file (test files path-jailed), and score() runs `git diff` through the official swebench Docker harness; a disjoint-slice task supplier keys tasks by dataset offset so train/holdou
- Goals it achieves: (a) Give SWE-bench Verified a no-cheating agentic surface — the agent makes surgical edits like a real engineer and is graded by the official harness, never seeing the gold patch or hidden tests. (b) Provide the frontier self-improvement run (evolve strategies on real bugs, promote on a frozen holdout). (c) Demonstrate the substrate's RSI spine on a coding-flavored task that is structurally imposs
- Assessment: Good change on the bench side. The SWE-bench env does NOT duplicate the existing swe-bench adapter: the existing path emits a diff as text (bench/src/benchmarks/swe-bench.ts:34 swePatchOutput); the new path edits files via tools and scores `git diff` — a materially different agentic surface, and there is no existing swe-bench evolution runner to extend. CALIBRATE is a sensible cost gate and the ru
- Better / existing approach: Looked at examples/strategy-evolution/, examples/strategy-suite/, examples/self-improving-loop/, bench/src/commit0-env.ts, bench/src/commit0-env-run.mts, and bench/src/benchmarks/swe-bench.ts. For the bench files: no better approach — the tool-editing SWE-bench surface is net-new and the runner is the right primitive. For the example: a materially cleaner home exists. examples/strategy-suite/count
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 2
- Bridge warning: opencode/kimi-for-coding/k2p7: opencode: opencode error
🎯 Usefulness — sound-with-nits
The SWE-bench-as-edit-via-tools AgenticSurface is a genuinely new, well-fitting capability; the companion example is a 4th self-improvement composition that overlaps the existing strategy-evolution/ example and isn't wired into the README catalog.
- Integration: Both new files are reachable by direct tsx invocation, matching the established bench-runner pattern (createCommit0Environment/createSweBenchEnvironment are consumed by their sibling *-run.mts, not exported from bench/src/index.ts — see commit0-env.ts:80 vs swe-bench-env.ts:36). Every API surface checks out against the source: AgenticSurface (strategy.ts:76-83), runAgentic returning {resolved,comp
- Fit with existing patterns: The AgenticSurface env is exactly the codebase grain — bench/src/commit0-env.ts is the template (clone at base_commit → path-jailed file tools → run the real test suite in score()), and swe-bench-env.ts mirrors it faithfully including the edit-source-not-tests jail. The runner (swe-self-improve.mts) follows commit0-env-run.mts's shape. The example, however, is the repo's 4th self-improvement compo
- Real-world viability: Adequate beyond the happy path. swe-bench-env uses partial clone (--filter=blob:none --no-checkout) with generous timeouts for large repos; per-task tmpdirs key the module-global workspaces Map so concurrent tasks don't collide; score() degrades gracefully (empty diff → 0/1, judge throw → errored). The example's pytest runs on host (documented as a Docker swap for untrusted code) which is fine for
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: console debug added bench/swe-self-improve.mts
- console.log(`═══ SWE-bench CALIBRATION — ${workerModel}, baseline=refine, ${n} real bugs ═══`)
🟡 Cruft: magic number added bench/swe-self-improve.mts
console.log(` ${t.id.padEnd(32)} resolved=${r.resolved} completions=${r.completions} shots=${r.shots} (${Math.round((Date.now() - t0) / 1000)}s)`)
💰 Value Audit
🟠 self-improving-coder example re-teaches the runStrategyEvolution spine already in examples/strategy-evolution/ [duplication]
examples/self-improving-coder/self-improving-coder.ts:118-286 wires runStrategyEvolution + a disjoint (offset,n) task supplier + the holdout gate — structurally the same code as examples/strategy-evolution/strategy-evolution.ts:58-88 (compare the two main()s side by side). The example set already has TWO self-improvement demos (strategy-evolution, self-improving-loop). The PR body concedes the bundled task is deliberately saturated so the gate returns no-promotion, so as a self-improvement demon
🟡 swe-bench-env.ts duplicates the git-repo-editing AgenticSurface shell from commit0-env.ts [duplication]
bench/src/swe-bench-env.ts and bench/src/commit0-env.ts share the same skeleton: a module-level `workspaces: Map<string, Ws>`, open() that mkdtempSyncs + git clones + checks out base_commit (swe-bench-env.ts:48-57 vs commit0-env.ts:88-90), a tools() returning list_files/read_file/edit-or-write_file, a call() with an inline `safe()`/`jail()` path normalizer rejecting '..' and absolute paths, and a close() with rmSync. The list_files bounded-recursive walker (swe-bench-env.ts:74-92) is also hand-r
🎯 Usefulness Audit
🟡 self-improving-coder example overlaps strategy-evolution/ and is absent from the README learning path [problem-fit]
examples/README.md Tier 5 already catalogs three self-improvement examples (#16 strategy-evolution, #17 improve, #18 self-improving-loop). The new examples/self-improving-coder/self-improving-coder.ts is a near-structural twin of #16 (same runStrategyEvolution config shape, same disjoint-slice tasks() supplier at :172, same [sample, refine] baselines, same saturated-toy-returns-no-promotion outcome the README itself describes at strategy-evolution/README.md:32). Its genuine contributions — the c
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
✅ No Blockers —
| | glm | deepseek | aggregate |
|---|---|---|---|
| Readiness | 62 | 48 | 48 |
| Confidence | 75 | 75 | 75 |
| Correctness | 62 | 48 | 48 |
| Security | 62 | 48 | 48 |
| Testing | 62 | 48 | 48 |
| Architecture | 62 | 48 | 48 |
Full multi-shot audit completed 3/3 planned shots over 3 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM Error handler drops stack trace, inconsistent with other bench runners — bench/swe-self-improve.mts
Line 82: `console.error(e)` prints only `e.toString()`, discarding the stack trace. Every other bench runner in this repo prints the full stack: commit0-env-run.mts:59 uses `e.stack ?? e.message`, trata-gate.mts:241 uses `err.stack ?? err.message`. This makes CI/debugging failures significantly harder — a TypeError inside runStrategyEvolution would show only 'TypeError: ...' with no line numbers. Fix: `main().catch((e) => { console.error(e instanceof Error ? (e.stack ?? e.message) : String(e)); process.exit(1); })`
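The suggested handler can be factored into a small helper so every runner formats failures the same way. A minimal sketch that assumes nothing about the runner beyond the fix quoted above; `formatError` is a hypothetical name, not something the repo is known to export.

```typescript
// Hypothetical helper: render any thrown value, preferring the stack when one exists.
function formatError(e: unknown): string {
  if (e instanceof Error) return e.stack ?? e.message
  // Non-Error throws (strings, numbers, plain objects) have no stack to print.
  return String(e)
}

// Intended use at the bottom of an entry-point script (shown as a comment so the
// sketch has no side effects):
//   main().catch((e) => { console.error(formatError(e)); process.exit(1) })

console.log(formatError('plain string failure')) // plain string failure
```

Taking `unknown` rather than `Error` keeps the helper safe when a dependency throws a non-Error value.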
🟠 MEDIUM read_file lacks path traversal protection that write_file has — examples/self-improving-coder/self-improving-coder.ts
Line 135: `readFileSync(join(ws.dir, String(args.path ?? '')), 'utf8')` has no validation. Line 142: `write_file` correctly guards with `p.endsWith('lib.py') || p.includes('..') || p.startsWith('/')`. The agent can read arbitrary files via `read_file('../../etc/passwd')`. In this no-Docker example, Python code executed via execFileSync already has filesystem access, so this is not an escalation — but the asymmetry is misleading for users who copy this pattern assuming both file ops a
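One way to give read and write handlers the same guard is a shared path-jail function. The sketch below is an illustration under stated assumptions: `jailPath` is a hypothetical name, the check is purely lexical (it does not resolve symlinks), and it is not the example's actual code.

```typescript
import { isAbsolute, resolve, sep } from 'node:path'

// Hypothetical helper: resolve `requested` under `root`, or throw if it escapes.
function jailPath(root: string, requested: string): string {
  if (requested === '' || isAbsolute(requested)) {
    throw new Error(`REJECTED: path must be relative and non-empty: ${requested}`)
  }
  const base = resolve(root)
  const full = resolve(base, requested) // collapses any '..' segments
  // Lexical containment check only; symlinks inside the workspace are not followed here.
  if (!full.startsWith(base + sep)) {
    throw new Error(`REJECTED: path escapes the workspace: ${requested}`)
  }
  return full
}

// Both handlers would then call the same guard before touching the filesystem, e.g.
//   readFileSync(jailPath(ws.dir, String(args.path ?? '')), 'utf8')
```

Routing every file tool through one function removes the read/write asymmetry, since a new tool cannot forget the check without also skipping the helper.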
🟡 LOW No unit tests for the new env's integrity guards — bench/src/swe-bench-env.ts
No .test. covers swe-bench-env.ts (grep for swe-bench-env/createSweBenchEnvironment across test specs: empty). The path-jail safe(), the isTestPath gate, the edit_file single-occurrence rule, and the empty-diff vs errored score() branches are pure-ish, cheat-prevention-flavored logic that a regression test would pin cheaply. Sibling commit0-env.ts also lacks unit tests, so this is repo-conventional — but the SWE env leans harder on these guards for its 'no cheating' claim, raising the value of a test. Fix: add a small test exercising safe()/isTestPath/score()-on-empty-diff (the functions would need to be exported or tested via the surface contract).
🟡 LOW REJECTED responses not counted as toolErrors in runShot — bench/src/swe-bench-env.ts
Line 114: `edit_file` returns `REJECTED: editing test files is forbidden...` when the path matches `isTestPath()`. The `runShot` function (src/runtime/strategy.ts:168) only increments `toolErrors` for responses starting with `ERROR:` — `REJECTED:` responses are silently counted as successful tool calls. This understates the tool-error count in benchmark statistics. Impact: minor observability gap; the agent still receives clear rejection feedback, so decision-making is not affected.
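The counting gap can be illustrated with a prefix-based classifier. This is a sketch under the assumption that tool responses are plain strings carrying an `ERROR:` or `REJECTED:` prefix, as the finding describes; `isToolFailure` and `countToolErrors` are hypothetical names and do not reflect how `runShot` is actually written.

```typescript
// Prefixes treated as failed tool calls (assumption: responses are plain strings).
const FAILURE_PREFIXES = ['ERROR:', 'REJECTED:'] as const

function isToolFailure(response: string): boolean {
  return FAILURE_PREFIXES.some((prefix) => response.startsWith(prefix))
}

function countToolErrors(responses: string[]): number {
  return responses.filter(isToolFailure).length
}

const sample = [
  'ok: wrote 12 lines',
  'ERROR: file not found',
  'REJECTED: editing test files is forbidden',
]
console.log(countToolErrors(sample)) // 2
```

Whether rejections belong in the same counter as errors or in a separate one is a design choice; the sketch only shows that a single prefix list keeps the two from drifting apart.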
🟡 LOW Tmpdir leaks when git clone/checkout fails in open() — bench/src/swe-bench-env.ts
Lines 50-52: mkdtempSync runs BEFORE the git clone (420s timeout) and checkout (300s timeout). If either throws (network blip, unreachable base_commit, GitHub rate-limit, timeout), the created tmpdir is never removed — no try/catch cleanup around the post-mkdtemp work. The sibling commit0-env.ts:94-99 shows the intended pattern (clean up dir + container on setup failure). Impact: orphaned dirs under os.tmpdir() accumulate across flaky benchmark runs; not a correctness issue but a resource leak on the long-running multi-task loop. Fix: wrap the clone+checkout in try/catch and rmSync(dir,{recursive:true,force:true}) on rethrow.
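The cleanup-on-failure pattern described in the fix can be sketched like this. `openWorkspace` and its `setup` callback are hypothetical; in the real environment the callback body would be the git clone and checkout steps.

```typescript
import { existsSync, mkdtempSync, rmSync } from 'node:fs'
import { tmpdir } from 'node:os'
import { join } from 'node:path'

// Hypothetical helper: create a temp workspace and remove it again if setup throws.
function openWorkspace(prefix: string, setup: (dir: string) => void): string {
  const dir = mkdtempSync(join(tmpdir(), prefix))
  try {
    setup(dir) // stand-in for clone + checkout, which may time out or fail
    return dir
  } catch (err) {
    rmSync(dir, { recursive: true, force: true }) // do not leave the orphan behind
    throw err
  }
}

// Demonstration: a failing setup leaves nothing on disk.
let attempted = ''
try {
  openWorkspace('swe-demo-', (dir) => {
    attempted = dir
    throw new Error('simulated clone failure')
  })
} catch {
  // expected: the original error is rethrown after cleanup
}
console.log(existsSync(attempted)) // false
```

Rethrowing after cleanup keeps the caller's error handling unchanged while bounding the disk footprint of a flaky run.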
🟡 LOW adapter.judge() in score() has no timeout — Docker harness can hang indefinitely — bench/src/swe-bench-env.ts
Lines 143-145: `adapter.judge(ws.task, patch)` is awaited without a timeout. The adapter delegates to `runStagedJudge()` (swe-bench.ts:122) which passes no `timeoutMs`, so `execFileAsync` (node:child_process) runs with `timeout: undefined` = no timeout. The SWE-bench Docker harness can pull images, build containers, and run test suites — any hang (Docker daemon stall, network timeout pulling images, test infinite loop) blocks the entire eval run. The git diff call at line 137 correctly has a 60s timeout, making this inconsistency mo
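A caller-side bound on an awaited judge call could look like the sketch below. `withTimeout` is a hypothetical helper, and the approach is an assumption rather than the repo's mechanism: it only stops waiting, it does not kill the underlying child process, so passing a timeout down to the process spawn would still be the more complete fix.

```typescript
// Hypothetical helper: reject if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms)
  })
  // Whichever settles first wins; the timer is always cleared so the process can exit.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer))
}

// Intended use inside score() (shown as a comment; the 30-minute bound is illustrative):
//   const verdict = await withTimeout(adapter.judge(ws.task, patch), 30 * 60_000, 'judge')
```

A timeout rejection would then flow into the same catch branch that already maps judge failures to an errored score.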
🟡 LOW adapter.preflight() is never invoked — misconfigured env surfaces as silent all-errored run — bench/src/swe-bench-env.ts
createSweBenchEnvironment constructs the adapter (line 47) but never calls adapter.preflight(), and the sole caller bench/swe-self-improve.mts doesn't either. The adapter defines preflight() (benchmarks/swe-bench.ts:62-70) specifically to emit actionable guidance ('pip install swebench; ensure Docker is running'). Without it, a missing venv or stopped Docker daemon makes every score() call throw inside adapter.judge, caught at line ~140, returning {passes:0,total:1,errored:1} per task — the whole run silently scores zero with no 'install swebench / start Docker' message. Impact: confusing first-run experience; operator must infer infra absence from universal 'erro
🟡 LOW isTestPath anti-cheat is a heuristic; doc claims 'by construction' — bench/src/swe-bench-env.ts
Line 16 regex: /(^|/)(tests?)// + /test_..py$|test.py$|conftest.py$/ blocks the common Python conventions (test/, tests/, test.py, *_test.py, conftest.py). The header doc comment (lines 1-12) asserts 'No cheating by construction: ... edit_file refuses test files.' The regex misses edge layouts: a file literally named 'test.py' at root (no dir slash, no underscore), a 'testing/' dir, or non-Python test files. Scoring integrity still holds because the official SWE-bench harness re-applies test_patch on top of the model patch (go
🟡 LOW list_files follows symlinks via statSync, bypassing workspace boundary — bench/src/swe-bench-env.ts
Line 90: `statSync(p).isDirectory()` follows symlinks. If a cloned repo contains a symlink to a directory outside the workspace (e.g., `/etc`), `walk()` would recursively traverse into it, exposing files beyond the workspace. The `safe()` path check at line 68-71 only guards against `..` traversal and absolute paths, not symlink-following. Fix: use `lstatSync(p).isDirectory()` to avoid following symlinks, or add a realpath-bounded check. Impact: low — SWE-bench repos are trusted public repos with no known malicious symlinks.
🟡 LOW CALIBRATE env var uses truthy check instead of explicit comparison — bench/swe-self-improve.mts
Line 28: `if (process.env.CALIBRATE)` is truthy for any non-empty string including '0', 'false', 'no'. The header comment at line 7 documents `CALIBRATE=1` as the intended activation value. Someone setting `CALIBRATE=0` or `CALIBRATE=false` expecting to disable calibration would get a surprise. Fix: compare explicitly: `process.env.CALIBRATE === '1'`
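The difference between the truthy check and the explicit comparison is easy to see in isolation. A small sketch; `envFlag` is a hypothetical helper name.

```typescript
// Hypothetical helper: a flag is on only when it is exactly '1'.
function envFlag(value: string | undefined): boolean {
  return value === '1'
}

// Any non-empty string is truthy in JavaScript, which is the surprise being described.
console.log(Boolean('0'), envFlag('0')) // true false
console.log(Boolean('false'), envFlag('false')) // true false
console.log(Boolean('1'), envFlag('1')) // true true
```

In a script this would be called as `envFlag(process.env.CALIBRATE)`, matching the `CALIBRATE=1` convention documented in the header comment.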
🟡 LOW Champion scores omitted from final report output — bench/swe-self-improve.mts
Lines 72-73 print `gen0 champion:` and `final champion:` with only names, while the EvolutionReport carries the scores (gen0Champion.score, finalChampion.score on scale 0–1). The held-out lift is reported (line 79), but without the gen0 champion's holdout score the reader cannot assess the baseline vs final gap. The commit0-env-run.mts analog uses printBenchmarkReport which includes per-strategy scores. Adding the scores here would make the output self-contained for a reader who didn't run the calibration step. Fix: include `report.ge
🟡 LOW Script lives outside CI typecheck scope — bench/swe-self-improve.mts
bench/tsconfig.json `include: ["src/**/*.ts", "src/**/*.mts"]` excludes top-level bench/*.mts; root CI runs `tsc --noEmit` (src only) + `tsc -p tsconfig.examples.json` (examples only); bench/package.json has no typecheck script. This 79-line file is never statically checked — type drift (e.g., a renamed field on StrategyEvolutionConfig) would only surface at runtime via tsx, which skips typecheck. Manually verified all types match as of this commit, so not blocking, but adding `bench/swe-self-improve.mts` (and any sibling top-level scripts) to bench/tsconfig.json include would prevent silent breakage on future substrate renames.
🟡 LOW .swe-run-* missing from bench/.gitignore — bench/swe-self-improve.mts
`mkdtempSync(join(process.cwd(), '.swe-run-'))` writes into the bench/ cwd, but bench/.gitignore (11 lines) has no `.swe-run-*/` entry (it ignores `.tmp-e3-*/`, `run-artifacts/`, etc.). Combined with finding 1, a failed run leaves untracked dirs that surface in `git status` and risk being accidentally committed. Fix: add `.swe-run-*/` to bench/.gitignore.
🟡 LOW outDir cleanup skipped on throw — leaks .swe-run-* on failure — bench/swe-self-improve.mts
Lines 46-58: `rmSync(outDir, { recursive: true, force: true })` runs ONLY on the success path after `runStrategyEvolution` returns. If the multi-hour `runStrategyEvolution` throws (e.g., holdout phase, author failure, OOM), control jumps to `main().catch()` which calls `process.exit(1)` — the rmSync never runs and the `.swe-run-XXXX/` dir full of authored strategy .ts files persists on disk. Fix: move rmSync into a `finally` block around the runStrategyEvolution call, OR document that the leak-on-failure is intentional for post-mortem inspection. Not blocking — bench-only, single user.
🟡 LOW DUMP and CALIBRATE env checks use truthiness, not specific values — examples/self-improving-coder/self-improving-coder.ts
Line 227: `if (process.env.CALIBRATE)` — `CALIBRATE=0` or `CALIBRATE=false` would still trigger calibration (non-empty strings are truthy). Line 261: same pattern for DUMP. These are minor for an example script but could surprise users. Fix: compare against specific values: `if (process.env.CALIBRATE === '1')`.
🟡 LOW Host pytest on model-written lib.py (no sandbox) — examples/self-improving-coder/self-improving-coder.ts
`execFileSync('python3', [...], { cwd: dir })` at line 94 imports and runs the agent-authored `lib.py` directly on the host (line 145 writes whatever the model produced). The comment at line 84 acknowledges this (Docker is a swap for untrusted code) — appropriate for a deliberate example, but anyone copy-pasting the canonical exam
🟡 LOW Orphaned .sic-run-* directories on evolution failure — examples/self-improving-coder/self-improving-coder.ts
Line 239: `outDir = mkdtempSync(join(process.cwd(), '.sic-run-'))` — created in project root. Line 258: `rmSync(outDir, ...)` only runs if `runStrategyEvolution` succeeds. If it throws, the temp directory persists in the project tree with no OS cleanup (unlike /tmp). Since the process exits on error (line 284-285), the actual
🟡 LOW Number(process.env.X ?? default) silently accepts empty string — examples/self-improving-coder/self-improving-coder.ts
Lines 243-255 parse `TRAIN_N/HOLDOUT_N/BUDGET/GENERATIONS/POP/INNER_TURNS` via `Number(process.env.X ?? default)`. If a user exports `TRAIN_N=` (empty), `Number('')` returns `0`, not the default — silently producing a zero-task train slice. Fix: `Number(process.env.X || default)` or `Number(process.env.X ?? default) || default`. Trivial footgun, example-only.
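A slightly stricter variant of the suggested fix is sketched below. `envNumber` is a hypothetical helper; unlike the `Number(...) || default` form, it keeps an explicit `0` and falls back only on missing, blank, or non-numeric input.

```typescript
// Hypothetical helper: parse a numeric env value, falling back on empty or invalid input.
function envNumber(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === '') return fallback // unset or `X=` (empty)
  const parsed = Number(raw)
  return Number.isFinite(parsed) ? parsed : fallback // 'abc' -> NaN -> fallback
}

console.log(Number('')) // 0, the silent zero the finding describes
console.log(envNumber('', 8)) // 8
console.log(envNumber('3', 8)) // 3
console.log(envNumber('0', 8)) // 0
```

A call site would read `envNumber(process.env.TRAIN_N, 8)`, where the default `8` is only an illustrative value.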
🟡 LOW outDir leaks on runStrategyEvolution throw — examples/self-improving-coder/self-improving-coder.ts
Line 239 creates `outDir = mkdtempSync(join(process.cwd(), '.sic-run-'))` and line 258 cleans it with `rmSync(outDir, { recursive: true, force: true })`. If `runStrategyEvolution` (lines 240-257) throws — router outage, author syntax error, benchmark crash — the cleanup never runs and `.sic-run-*` accumulates under the projec
🟡 LOW run_tests handler exists despite the NO run_tests comment — examples/self-improving-coder/self-improving-coder.ts
Lines 125-126 comment says `NO run_tests: the agent cannot iterate-until-green`, and `tools()` (lines 120-128) correctly omits it from the manifest. But `call()` at lines 151-154 still handles `name === 'run_tests'` and returns live pytest results. The `AgenticSurface` contract (src/runtime/strategy.ts:76-83) does
🟡 LOW codingEnv.tools() takes no params but interface expects (task, handle) — examples/self-improving-coder/self-improving-coder.ts
Line 120: `async tools()` takes no parameters. The `AgenticSurface` interface at strategy.ts:79 declares `tools(task: AgenticTask, handle: ArtifactHandle): Promise<AgenticTool[]>`. TypeScript allows fewer params to satisfy the interface (structural compatibility), and the current tools are static (no task/handle needed). But if `tools()` ever becomes dynamic per-handle, this silently breaks. Fix: accept the params (even if unused): `async tools(_task: AgenticTask, _handle: ArtifactHandle)` for forward-compatibility.
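The structural-compatibility point can be reproduced with stand-in types. Everything below is hypothetical: the interfaces are simplified placeholders for illustration and are not the substrate's real `AgenticSurface`, `AgenticTask`, `ArtifactHandle`, or `AgenticTool` definitions.

```typescript
// Simplified stand-in types (assumptions, not the real substrate definitions).
interface AgenticTask { id: string }
interface ArtifactHandle { dir: string }
interface AgenticTool { name: string }
interface ToolSurface {
  tools(task: AgenticTask, handle: ArtifactHandle): Promise<AgenticTool[]>
}

// Compiles: a method with fewer parameters is assignable to the interface.
const staticEnv: ToolSurface = {
  async tools() {
    return [{ name: 'read_file' }]
  },
}

// Same behavior today, but the signature already names what a dynamic version would use.
const explicitEnv: ToolSurface = {
  async tools(_task, _handle) {
    return [{ name: 'read_file' }]
  },
}
```

Both objects type-check, which is why the compiler gives no signal if a later refactor expects per-task tools from `staticEnv`.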
🟡 LOW run_tests handler exists but tool is deliberately not exposed — examples/self-improving-coder/self-improving-coder.ts
Lines 125-126: comment says 'NO run_tests: the agent cannot iterate-until-green.' Lines 151-153: `call()` handles `run_tests` and returns real pytest results. If a generated strategy or hallucinating model invokes `run_tests`, the no-iteration constraint is silently defeated. The handler returning real results contradicts the stated intent. Fix: either remove the run_tests handler entirely (return error), or explicitly document it as a debug-only escape hatch.
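One way to keep the manifest and the dispatcher from disagreeing is to derive the allow-list from the advertised tool names. A minimal sketch with hypothetical names (`makeCall`, `Handler`); it is not the example's actual `call()` implementation.

```typescript
type Handler = (args: Record<string, unknown>) => string

// Hypothetical factory: only tools present in the advertised manifest can be invoked,
// even if a handler for some other name still exists in the table.
function makeCall(advertised: string[], handlers: Record<string, Handler>) {
  const allowed = new Set(advertised)
  return (name: string, args: Record<string, unknown> = {}): string => {
    const handler = handlers[name]
    if (!allowed.has(name) || handler === undefined) {
      return `ERROR: unknown tool ${name}`
    }
    return handler(args)
  }
}

const call = makeCall(['read_file', 'write_file'], {
  read_file: () => 'ok: contents',
  write_file: () => 'ok: written',
  run_tests: () => 'ok: 5 passed', // leftover handler, deliberately not advertised
})

console.log(call('run_tests')) // ERROR: unknown tool run_tests
console.log(call('read_file')) // ok: contents
```

With this shape, removing a tool from the manifest is enough to disable it; a stale handler can no longer be reached by a model that guesses the name.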
tangletools · 2026-06-28T02:39:10Z · trace
tangletools
left a comment
✅ Approved — 22 non-blocking findings — ed844764
Full multi-shot audit completed 3/3 planned shots over 3 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-28T02:39:10Z · immutable trace
The instrument for 'what actually helps': a configurable agent where each self-improvement technique is a knob (topology/trace-analysis/steering/GEPA-skillopt/persistent-artifact), swept one-knob-at-a-time (O(N) not 2^N) at equal compute, with a full autopsy — resolve AND token/$/latency per arm — so we see what helps vs what just burns tokens. WIRED: topology (refine/sample/sampleThenRefine) + budget. The rest are DECLARED knobs that FAIL LOUD if set (no silent no-op — names the substrate primitive to wire). Exports codingEnv/codingTasks from self-improving-coder (guarded main) for the cheap validation fixture.
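The one-knob-at-a-time sweep can be sketched as an arm generator. This is an illustration only: `oneKnobArms` and `Config` are hypothetical names, and the budget values are made up; only the topology values (refine/sample/sampleThenRefine) come from the commit message.

```typescript
type KnobValue = string | number
type Config = Record<string, KnobValue>

// Hypothetical generator: the baseline plus one arm per non-baseline value of each knob.
function oneKnobArms(baseline: Config, knobs: Record<string, KnobValue[]>): Config[] {
  const arms: Config[] = [{ ...baseline }]
  for (const [knob, values] of Object.entries(knobs)) {
    for (const value of values) {
      if (value === baseline[knob]) continue // the baseline arm already covers this value
      arms.push({ ...baseline, [knob]: value }) // change exactly one knob
    }
  }
  return arms
}

const arms = oneKnobArms(
  { topology: 'refine', budget: 4 },
  { topology: ['refine', 'sample', 'sampleThenRefine'], budget: [4, 8] },
)
console.log(arms.length) // 4
```

Here the sweep yields 4 arms (baseline, two topology changes, one budget change), whereas the full grid over the same values would need 3 × 2 = 6; the gap widens quickly as knobs are added.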
tangletools
left a comment
✅ Refreshed approval after new commits — fb6f682a
A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-28T03:31:34Z
tangletools
left a comment
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 4 (1 medium-concern, 2 low, 1 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 1083.5s (2 bridge agents) |
| Total | 1083.5s |
💰 Value — sound-with-nits
Adds the RSI spine as a contamination-proof coding example + a proper SWE-bench AgenticSurface + a frontier self-improvement runner — 3 of 4 files are clean composition over substrate primitives; the ablation file reimplements runBenchmark's wired subset and should be rebuilt on top of it.
- What it does: Four additions. (1) examples/self-improving-coder: an AgentProfile worker over a generated wire-protocol coding Environment, gated by runStrategyEvolution; the bundled task is contamination-proof (constants derived per-seed so no model memorized the fix), with a $0 calibration self-check (reference→all-pass, stub→0). (2) bench/src/swe-bench-env.ts: wraps the existing createSweBenchAdapter (bench/s
- Goals it achieves: (a) Give the substrate a self-improvement example on a task a frontier model provably hasn't memorized (the generated-constants idea), so the honest no-promotion result is trustworthy rather than an artifact of contamination. (b) Provide the real frontier path — SWE-bench Verified driven as an AgenticSurface with surgical source-only edits, scored by the official harness, with the contamination ca
- Assessment: Mostly in the grain. swe-bench-env.ts follows the exact commit0-env.ts precedent (same file shape, same AgenticSurface hooks, reuses the existing createSweBenchAdapter for judging rather than forking it — clean). swe-self-improve.mts mirrors bench/src/examples/strategy-demo.mts's entry-point shape. self-improving-coder is structurally a domain-swap of examples/strategy-evolution/strategy-evolution
- Better / existing approach: For 3 of 4 files: none — right approach, reuses createSweBenchAdapter and the substrate primitives correctly. For the ablation file: a materially better approach exists (build on runBenchmark, below). Searched: bench/src/-env.ts (found commit0-env.ts precedent), bench/src/examples/.mts (found strategy-demo.mts), src/runtime/run-benchmark.ts (found runBenchmark already does cost-aware strategy co
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 2
- Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
🎯 Usefulness — sound
All four files implement real capabilities composed from substrate primitives following the exact patterns of existing siblings (commit0-env, strategy-evolution), with no reinvention and no dead surface.
- Integration: All new code wires into the substrate correctly and is reachable. `swe-bench-env.ts` implements `AgenticSurface` identically to the sibling `commit0-env.ts:80` and is called by `swe-self-improve.mts:14` — the standard CLI-entrypoint pattern used by `commit0-env-run.mts`. `self-improving-coder.ts` exports `codingEnv` + `codingTasks` consumed by `ablation-suite/ablation.ts:24` and is a self-containe
- Fit with existing patterns: Every new surface follows established patterns. `swe-bench-env.ts` mirrors `commit0-env.ts` exactly (5-method AgenticSurface + module-level `workspaces` Map + path-jailed tools + external harness scoring). `self-improving-coder.ts` mirrors `strategy-evolution/strategy-evolution.ts` (AgenticSurface + disjoint task supplier + `runStrategyEvolution` call). The ablation suite's cost-aware comparison (
- Real-world viability: All three `AgenticSurface` implementations handle edge inputs: path traversal rejection (`..`/`/`), file truncation (24k/8k), missing-workspace guards, empty-patch scoring, fail-loud errors on unknown tools/tasks. The `swe-bench-env` mirrors `commit0-env`'s workspace lifecycle but omits Docker root-ownership cleanup (commit0-env.ts:167-169) — this is correct because `swe-bench-env` runs the judg
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 3
- Bridge warning: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
🔎 Heuristic Signals
🟡 Cruft: console debug added bench/swe-self-improve.mts
- console.log(`═══ SWE-bench CALIBRATION — ${workerModel}, baseline=refine, ${n} real bugs ═══`)
🟡 Cruft: magic number added bench/swe-self-improve.mts
console.log(` ${t.id.padEnd(32)} resolved=${r.resolved} completions=${r.completions} shots=${r.shots} (${Math.round((Date.now() - t0) / 1000)}s)`)
💰 Value Audit
🟠 ablation-suite reimplements runBenchmark's wired subset by hand [duplication]
examples/ablation-suite/ablation.ts:101 calls runAgentic in a manual per-arm × per-task loop, accumulating score + tokens + $ + latency (the 'cost-aware autopsy' claimed at lines 2-6 and 139-165). But src/runtime/run-benchmark.ts:132 (runBenchmark) ALREADY does this — and more: it takes `strategies` (which IS the topology knob — sample/refine/sampleThenRefine), runs them at equal `budget`, tracks per-strategy score/resolved/usd/ms (BenchmarkStrategySummary, line 85-93), produces a Pareto frontie
🟡 new bench entries not registered in bench/HARNESS.md (the repo's anti-rediscovery map) [maintenance]
bench/HARNESS.md:101 documents commit0-env.ts + its companion runner (commit0-env-run.mts) as the canonical 'HARD domain through runBenchmark' entry — and CLAUDE.md:42 frames a stale map as a defect ('if a map disagrees with the code, the code wins — fix the map in the same turn'). The new swe-bench-env.ts + swe-self-improve.mts are the same kind of entry (a benchmark-as-AgenticSurface + a runner) but are absent from HARNESS.md's command map. The next agent orienting in bench/ will not discover
…teering knob at the driver-steers-worker loop
Adds task-aligned per-task resolve vectors + pairedBootstrap 95% CI on every arm's Δresolve (✓ = CI excludes 0 = real lift) — no more point lifts. Reframes the rich knobs to the RIGHT primitives: the steering knob is the supervise() driver-steers-worker loop (driver composes the next prompt from the analyst's analyzeOnSettle finding — a driver brain in the loop, not the inline analyst-steerer); the optimize knob is selfImprove() with an executable JudgeConfig optimizing the driver's compose-prompt on TRAIN, frozen. Both fail loud until wired.
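The paired-bootstrap interval on Δresolve can be sketched as below. This is a generic percentile bootstrap over task-aligned differences, written as an illustration under stated assumptions: `pairedBootstrapCI` and `mulberry32` are hypothetical local names, and nothing here is claimed to match the signature or internals of the substrate's `pairedBootstrap`.

```typescript
// Small seeded PRNG so the sketch is reproducible (an assumption, not part of the PR).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

interface BootstrapCI { delta: number; lo: number; hi: number; excludesZero: boolean }

// baseline[i] and arm[i] must be the resolve result (0 or 1) for the SAME task i.
function pairedBootstrapCI(baseline: number[], arm: number[], resamples = 2000, seed = 1): BootstrapCI {
  if (baseline.length === 0 || baseline.length !== arm.length) {
    throw new Error('resolve vectors must be non-empty and task-aligned')
  }
  const diffs = arm.map((v, i) => v - (baseline[i] ?? 0))
  const n = diffs.length
  const rand = mulberry32(seed)
  const means: number[] = []
  for (let b = 0; b < resamples; b++) {
    let sum = 0
    // Resample task indices with replacement, keeping each pair together.
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)] ?? 0
    means.push(sum / n)
  }
  means.sort((x, y) => x - y)
  const lo = means[Math.floor(0.025 * (resamples - 1))] ?? 0
  const hi = means[Math.ceil(0.975 * (resamples - 1))] ?? 0
  const delta = diffs.reduce((s, d) => s + d, 0) / n
  return { delta, lo, hi, excludesZero: lo > 0 || hi < 0 }
}

const same = pairedBootstrapCI([1, 0, 1, 0], [1, 0, 1, 0])
console.log(same.excludesZero) // false
```

Resampling the per-task differences, rather than each arm independently, is what makes the interval paired: task difficulty cancels out within each pair, so the interval reflects the arm's effect rather than task-to-task variance.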
tangletools
left a comment
✅ Refreshed approval after new commits — bd127783
A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-28T04:15:12Z
tangletools
left a comment
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 2 (2 low) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 135.5s (2 bridge agents) |
| Total | 135.5s |
💰 Value — error
value agent produced no parseable value-audit JSON.
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 3
- Bridge error: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content
🎯 Usefulness — sound
Four files compose a coherent self-improvement spine: a contamination-proof coding task for runStrategyEvolution, a cost-aware one-knob ablation runner, and SWE-bench wired as an AgenticSurface — all built entirely on substrate primitives, following the established strategy-evolution example p
- Integration: All four files import and compose only substrate primitives from `@tangle-network/agent-runtime/loops` and `@tangle-network/agent-eval`. `self-improving-coder` exports `codingEnv`/`codingTasks`, consumed by `ablation-suite`. `swe-bench-env.ts` wraps the existing `createSweBenchAdapter` (bench/src/benchmarks/swe-bench.ts) behind the standard `AgenticSurface` seam — the same seam `runStrategyEvolu
- Fit with existing patterns: The pattern — implement `AgenticSurface`, supply a disjoint-slice `(offset, n) => AgenticTask[]` supplier, call `runStrategyEvolution` — is the codebase's own architecture, established in `examples/strategy-evolution/strategy-evolution.ts:40-58`. `self-improving-coder` and `swe-self-improve` are domain-specific instantiations of that pattern (coding + SWE-bench), not competitors. `ablation-suite`
- Real-world viability: Cleanup is guaranteed: `strategy.ts:464-466` calls `surface.close(handle)` in a `finally` block for every `open()`. Path safety is enforced (`swe-bench-env.ts:68-71` rejects `..` and absolute paths). Git operations have timeouts (clone 420s, checkout 300s). Score methods catch errors and return `{errored:1}` rather than crashing. Workspace lifecycle is idempotent (close uses `force:true` rmSync, o
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 3
- Bridge warning: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
🔎 Heuristic Signals
🟡 Cruft: console debug added bench/swe-self-improve.mts
- console.log(`═══ SWE-bench CALIBRATION — ${workerModel}, baseline=refine, ${n} real bugs ═══`)
🟡 Cruft: magic number added bench/swe-self-improve.mts
console.log(` ${t.id.padEnd(32)} resolved=${r.resolved} completions=${r.completions} shots=${r.shots} (${Math.round((Date.now() - t0) / 1000)}s)`)
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
❌ Needs Work —
| | glm | deepseek | aggregate |
|---|---|---|---|
| Readiness | 2 | 16 | 2 |
| Confidence | 80 | 80 | 80 |
| Correctness | 2 | 16 | 2 |
| Security | 2 | 16 | 2 |
| Testing | 2 | 16 | 2 |
| Architecture | 2 | 16 | 2 |
Full multi-shot audit completed 4/4 planned shots over 4 changed files. Global verifier still owns final merge decision.
Blocking
🔴 HIGH Resource leak: temp dirs orphaned on git clone/checkout failure in open() — bench/src/swe-bench-env.ts
Lines 51-56: mkdtempSync creates the temp dir BEFORE the two await exec() calls. If git clone (L52) or git checkout (L53) throws, the temp dir is never cleaned up — the workspace is only registered in the Map on L55, so close() will never see it. On a benchmark run with 80 instances, each failed clone leaks a directory (potentially hundreds of MB). Fix: wrap lines 52-56 in try/catch with rmSync(dir, {recursive:true,force:true}) in the catch before re-throwing.
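The fix described above could be sketched as follows. This is a minimal illustration, not the code in `swe-bench-env.ts`: `openWorkspace` and its `setup` callback are hypothetical names standing in for the clone and checkout steps.

```typescript
import { existsSync, mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: create the temp dir, run the async setup
// (e.g. git clone + git checkout) inside it, and delete the dir
// if setup throws, so a failed setup cannot orphan the directory.
async function openWorkspace<T>(
  prefix: string,
  setup: (dir: string) => Promise<T>,
): Promise<{ dir: string; value: T }> {
  const dir = mkdtempSync(join(tmpdir(), prefix));
  try {
    const value = await setup(dir);
    return { dir, value };
  } catch (err) {
    rmSync(dir, { recursive: true, force: true });
    throw err; // re-throw so the caller still sees the original failure
  }
}

// Usage sketch: simulate a failing clone and report whether the dir survived.
async function demoFailedSetup(): Promise<boolean> {
  let created = "";
  try {
    await openWorkspace("swe-demo-", async (dir) => {
      created = dir;
      throw new Error("simulated git clone failure");
    });
  } catch {
    // expected: the setup error is re-thrown after cleanup
  }
  return existsSync(created); // false when cleanup ran
}
```

Registering the workspace only after `setup` succeeds keeps the existing `close()` path unchanged; the catch block covers exactly the window in which no handle exists yet.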
Other
🟠 MEDIUM Symlink-following defeats the path jail in read_file/edit_file — bench/src/swe-bench-env.ts
Lines 68-71, 105, 119, 127. `safe()` only inspects the STRING (`..` / leading `/`), but `readFileSync(join(ws.dir, p))` and `writeFileSync(join(ws.dir, p))` follow symlinks. SWE-bench repos are real GitHub checkouts that can contain symlinks (docs/, bin/, build shims). An agent calling read_file on a repo-relative symlink that targets /etc/passwd or ../../../etc/passwd exfiltrates arbitrary host files; edit_file via writeFileSync through such a symlink WRITES to the symlink target, which is a host-integrity risk, not just a read. The file header ([lines 4-9](https://github.com/tangle-network/agent-runtime/blob/bd127783c37cd478da3b34ccf85ef8c2a1d02f31/bench/src/swe-bench
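One way to close the symlink gap is to resolve the real path before checking containment. The sketch below is an assumption about how such a jail might look, not the repository's implementation; `jailPath` here is an illustrative function and `demoSymlinkEscape` exists only to exercise it.

```typescript
import { mkdtempSync, realpathSync, rmSync, symlinkSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join, resolve, sep } from "node:path";

// Illustrative jail: resolve symlinks first, then require the real path
// to stay under the real workspace root. realpathSync throws for missing
// paths, so a tool that creates new files would resolve the parent instead.
function jailPath(root: string, rel: string): string {
  const realRoot = realpathSync(root);
  const real = realpathSync(resolve(realRoot, rel));
  if (real !== realRoot && !real.startsWith(realRoot + sep)) {
    throw new Error(`path escapes workspace: ${rel}`);
  }
  return real;
}

// Demo: a symlink inside the workspace that points outside is rejected,
// while an ordinary file inside the workspace is accepted.
function demoSymlinkEscape(): { insideOk: boolean; escapeBlocked: boolean } {
  const outside = mkdtempSync(join(tmpdir(), "jail-outside-"));
  const ws = mkdtempSync(join(tmpdir(), "jail-ws-"));
  try {
    writeFileSync(join(outside, "secret.txt"), "host data");
    writeFileSync(join(ws, "src.py"), "x = 1\n");
    symlinkSync(join(outside, "secret.txt"), join(ws, "link.txt"));
    const insideOk = jailPath(ws, "src.py").endsWith("src.py");
    let escapeBlocked = false;
    try {
      jailPath(ws, "link.txt");
    } catch {
      escapeBlocked = true;
    }
    return { insideOk, escapeBlocked };
  } finally {
    rmSync(ws, { recursive: true, force: true });
    rmSync(outside, { recursive: true, force: true });
  }
}
```

A string check such as rejecting `..` would pass `link.txt` unchanged, which is why the comparison is made on the resolved path.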
🟠 MEDIUM Tempdir leak when git clone/checkout fails in open() — bench/src/swe-bench-env.ts
Lines 51-55. `mkdtempSync` creates the dir first; `git clone` (420s timeout) and `git checkout` (300s timeout) run before `workspaces.set(dir, ...)` on line 55. If either git op throws (network blip, unknown base_commit, transient HF/Docker error), the function throws, the caller never receives a handle, `close()` is never called, AND the dir was never registered in `workspaces`, so it leaks permanently. Each clone is a full SWE-bench repo (hundreds of MB). A long run with flaky networking can fill /tmp. Fix: wrap in try/catch that c
🟠 MEDIUM Zero test coverage for the entire surface implementation — bench/src/swe-bench-env.ts
Confirmed via glob + grep: no test files exist for swe-bench-env.ts anywhere in the repo. The file implements path safety (safe()), a test-path regex (isTestPath), tool dispatch (3 tools + unknown guard), edit_file dedup/validation logic (count via split, old_string empty check), workspace lifecycle, and the tasks() supplier boundary check. Every one of these should have at least unit coverage. Without tests, a regression in the safety guard or score() error handling goes undetected.
🟠 MEDIUM New entrypoint is outside every typecheck surface — zero TS coverage — bench/swe-self-improve.mts
bench/tsconfig.json:10 has `"include": ["src/**/*.ts", "src/**/*.mts"]`, but this file lives at bench/swe-self-improve.mts (root, not under src/). It is the ONLY .mts/.ts at bench/ root (confirmed via ls; every other runnable script like commit0-env-run.mts, humaneval-gate.mts lives under bench/src/ and IS covered). Additionally bench/package.json has no typecheck script, and root .github/workflows/ci.yml:35 runs `pnpm run typecheck`, which is root tsc (tsconfig.json include=["src"]) + tsconfig.examples.json (include=["examples"]); grep of .github/workflows for 'bench' returns nothing. So no CI path typechecks this file. Impact: the file wires 8+ substrate APIs (runStrategyEvolution, runAgentic, createChatClient, refine, sample, EvolutionReport.verdict, AgenticRunResult, AgenticOptions); an
🟠 MEDIUM No error resilience: single task failure loses entire arm — examples/ablation-suite/ablation.ts
Lines 108-128: The inner `for (const t of tasks)` loop has no try/catch around `await runAgentic(...)`. If any single task run throws (network error, API quota, transient failure), the entire arm's accumulated data (prior tasks' resolve, tokens, $, latency) is silently lost. In a cost-aware framework where a full ablation run can cost non-trivial router credits, losing 5-of-6 completed task results on a network hiccup on the last task burns money with zero data returned. Wrap `runAgentic` in a try/catch, record the failing task, and continue to remaining tasks to salvage a partial-arm result.
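The suggested per-task resilience might look like the sketch below. It is a generic illustration under stated assumptions: `runArmResilient`, `TaskOutcome`, and the `runOne` callback are hypothetical names, with `runOne` standing in for the real `runAgentic` call whose signature is not shown here.

```typescript
// Hypothetical outcome record: either a result or the error message.
interface TaskOutcome<R> {
  id: string;
  result?: R;
  error?: string;
}

// Run each task independently so that one failure does not discard
// the results already collected for the arm.
async function runArmResilient<T extends { id: string }, R>(
  tasks: T[],
  runOne: (task: T) => Promise<R>,
): Promise<TaskOutcome<R>[]> {
  const outcomes: TaskOutcome<R>[] = [];
  for (const t of tasks) {
    try {
      outcomes.push({ id: t.id, result: await runOne(t) });
    } catch (err) {
      // Record the failing task and keep going with the remaining ones.
      const message = err instanceof Error ? err.message : String(err);
      outcomes.push({ id: t.id, error: message });
    }
  }
  return outcomes;
}
```

Keeping failed tasks in the outcome list (rather than dropping them) lets the autopsy report how many tasks errored, so a partial arm is not mistaken for a complete one.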
🟠 MEDIUM biome lint/format CI will fail on this file (2 errors + 1 warning) — examples/ablation-suite/ablation.ts
Ran `biome check examples/ablation-suite/ablation.ts` against the repo's biome.json (assist.source.organizeImports='on', formatter.enabled=true, style.useTemplate='warn'). Exit=1. Concrete violations: (1) assist/source/organizeImports at line 16: the `@tangle-network/agent-runtime/loops` import block lists `refine, runAgentic, sample, sampleThenRefine, type Strategy` out of biome's sort order (types before values). (2) formatter: `unwiredKnobs: Array<{...}>` literal on line 53, the `worker: {...}` inline type on [line
🟠 MEDIUM read_file has no path-traversal guard (write_file does) — examples/self-improving-coder/self-improving-coder.ts
Line 135: `readFileSync(join(ws.dir, String(args.path ?? '')), 'utf8').slice(0, 8000)`. With `args.path = '../../../etc/passwd'`, `join('/tmp/sic-X', '../../../etc/passwd')` normalizes to `/etc/passwd` and returns up to 8000 chars to the worker, which then persists in the trajectory/logs. `write_file` at line 142 DOES guard (`p.includes('..') || p.startsWith('/')`), making the asymmetry conspicuous. Mitigating factor: host pytest already executes agent-authored Python with no sandbo
🟠 MEDIUM read_file tool allows path traversal out of workspace — examples/self-improving-coder/self-improving-coder.ts
Line 135: `readFileSync(join(ws.dir, String(args.path ?? '')), 'utf8')`. The `read_file` tool handler resolves agent-supplied paths through `path.join` without validating the result stays within the workspace. `join` normalizes `..` components, so an LLM-supplied path like `../etc/passwd` escapes to `/tmp/etc/passwd` (confirmed via repro: `path.join('/tmp/sic-abc', '../etc/passwd')` → `/tmp/etc/passwd`). Meanwhile, `write_file` at line 142 correctly guards with `p.includes('..')` an
🟠 MEDIUM run_tests handler contradicts the no-iterate-until-green design invariant — examples/self-improving-coder/self-improving-coder.ts
Lines 125-126 comment: 'NO run_tests: the agent cannot iterate-until-green. It must implement correctly from READING the tests — which creates real headroom and makes the STRATEGY (planning, multiple attempts) matter.' But `call()` at lines 151-154 STILL handles `run_tests`, executing pytest and returning `pytest: K/N passed`. The tool is hidden from `tools()` so a compliant worker won't see it, but LLM workers routinely hallucinate plausible tool names, and 'run_tests' is
🟡 LOW Absolute-path check in safe() is dead code — bench/src/swe-bench-env.ts
Lines 68-71. The regex `^\.?\/` on line 69 strips a leading optional-dot-then-slash, so by line 70 `n` can never start with `/`: `/etc/passwd` becomes `etc/passwd` (silently treated as repo-relative). The `n.startsWith('/')` guard is unreachable. Result is fail-closed by accident (resolves under ws.dir which doesn't exist), but the security check is misleading. Fix: drop the dead check and document that abs
🟡 LOW Module-global workspaces Map couples concurrent environments — bench/src/swe-bench-env.ts
Line 31. `const workspaces = new Map<string, Ws>()` lives at module scope, so two `createSweBenchEnvironment()` calls in the same process (e.g., a test matrix or parallel bench runs) share one registry. Handle.id is the tempdir path (mkdtempSync-unique) so collisions are effectively impossible and this is not a correctness bug today, but it defeats environment isolation and makes the surface stateful in a surprising place. Fix: move the Map inside createSweBenchEnvironment so each environment owns its own registry (closes over it).
🟡 LOW hardcoded split:'test' — no parameter to select train/dev/Verified splits — bench/src/swe-bench-env.ts
Line 42: adapter.loadTasks({ limit: poolN, split: 'test' }) — the split is hardcoded. SWE-bench_Verified exposes a 'train' split (unverified instances, suitable for development/calibration without touching the evaluation set). For the calibration mode in swe-self-improve.mts, using 'test' instances for calibration burns evaluation data. The comment acknowledges this is 'the Verified split' by design, but exposing a parameter (or a separate 'calibrate on train, evolve on test' path) would improve eval hygiene. Currently documented, so low severity.
🟡 LOW isTestPath is a soft hint, not a hard boundary; doc overstates protection — bench/src/swe-bench-env.ts
Line 25 + header lines 8-9. The regex `/(^|\/)(tests?)\//+test_.*\.py$|_test\.py$|conftest\.py$/` misses real test-adjacent paths: `testing/` dirs (numpy/pandas use them), `pytest.ini`, `setup.cfg`, `tox.ini`, `pytest_*.py`. The header says 'edit_file refuses test files' as if it were a hard anti-cheat; in reality the real protection is that the official harness re-applies the gold `test_patch` on top of the agent's patch (swe-bench.ts:120-150), so editing tests is self-defeating anyway. Fix: soften the comment to 'best-effort hint'
🟡 LOW isTestPath regex false positive on non-test directories — bench/src/swe-bench-env.ts
Line 25: `/(^|\/)(tests?)\//` matches any path containing a 'tests' or 'test' directory component. This blocks legitimate source directories like 'testsupport/', 'tests-fixtures/', 'tests-common/', 'test_helpers/' (if structured as a package). SWE-bench repos are diverse; a repo like pytest or Django may have source in such paths. The regex should be `/(^|\/)(tests?)\//` and NOT match if the component has additional characters beyond 'test'/'tests', i.e., `/(^|\/)(tests?)$/|(^|\/)tests?\//` combined with checking the full path segment. In practice this is unlikely to matter for most SWE-bench instances, but it's a correctness risk worth tightening.
🟡 LOW read_file loads entire file into memory before truncating — bench/src/swe-bench-env.ts
Line 105. `readFileSync(join(ws.dir, p), 'utf8')` reads the whole file, THEN truncates to 24_000 chars on line 106. A pathological multi-GB file in a repo could OOM the bench process before the truncation runs. Low probability for SWE-bench source trees but cheap to harden: statSync first and reject above a size cap, or stream+truncate. Same applies to edit_file reading content on line 119.
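A bounded read along the lines suggested above could be sketched like this. `readHead` is a hypothetical helper, and it caps by bytes rather than characters, so a multi-byte character at the boundary may be cut; the 24_000 default simply mirrors the truncation limit quoted in the finding.

```typescript
import { closeSync, mkdtempSync, openSync, readSync, rmSync, statSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical bounded read: allocate at most maxBytes, never the whole file.
function readHead(path: string, maxBytes = 24_000): string {
  const len = Math.min(statSync(path).size, maxBytes);
  const buf = Buffer.alloc(len);
  const fd = openSync(path, "r");
  try {
    const n = readSync(fd, buf, 0, len, 0); // read from file offset 0
    return buf.subarray(0, n).toString("utf8");
  } finally {
    closeSync(fd);
  }
}

// Demo: a 100 000-byte file yields only the first 10 bytes.
function demoReadHead(): string {
  const dir = mkdtempSync(join(tmpdir(), "read-head-"));
  try {
    const file = join(dir, "big.txt");
    writeFileSync(file, "a".repeat(100_000));
    return readHead(file, 10);
  } finally {
    rmSync(dir, { recursive: true, force: true });
  }
}
```

For `edit_file`, which needs the full content to apply a replacement, rejecting files above a size cap via `statSync` is likely the simpler option than a partial read.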
🟡 LOW .swe-run-* temp dirs not in .gitignore — bench/swe-self-improve.mts
Line 41: `mkdtempSync(join(process.cwd(), '.swe-run-'))` creates temp dirs under the repo root with prefix `.swe-run-`. Neither the root nor bench `.gitignore` contains a pattern for `.swe-run-*`. If the process crashes before line 60's `rmSync`, the directory remains as untracked in the working tree. Fix: add `.swe-run-*` to `.gitignore`.
🟡 LOW CALIBRATE env-flag and numeric env parsing are not validated — bench/swe-self-improve.mts
Line 25 `if (process.env.CALIBRATE)` is truthy for ANY non-empty string including '0', 'false', 'no', so a user exporting CALIBRATE=0 to disable it still gets the calibrate path. Lines 22/26/45-46/55-57 use Number(process.env.X ?? default) with no NaN/integer guard: INNER_TURNS='' → Number('')=0 (agent gets zero inner turns, silently no-ops); INNER_TURNS='abc' → NaN propagates into the budget math. Conventional for env-gated scripts and tsx will still run, but a one-line assertion (e.g. Number.isFinite(innerTurns) && innerTurns>0) would fai
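The kind of validation this finding asks for can be sketched with two small helpers. Both `envFlag` and `envPositiveInt` are hypothetical names introduced for illustration, and the list of "off" spellings is an assumption, not a convention taken from the script.

```typescript
// Hypothetical flag parser: treat common "off" spellings as false,
// so CALIBRATE=0 or CALIBRATE=false does not enable the mode.
function envFlag(value: string | undefined): boolean {
  if (value === undefined) return false;
  return !["", "0", "false", "no", "off"].includes(value.trim().toLowerCase());
}

// Hypothetical numeric parser: fall back only when the variable is unset,
// and fail loudly on '', 'abc', 0, negatives, or non-integers.
function envPositiveInt(name: string, value: string | undefined, fallback: number): number {
  if (value === undefined) return fallback;
  const n = Number(value);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`${name} must be a positive integer, got ${JSON.stringify(value)}`);
  }
  return n;
}
```

A call such as `envPositiveInt("INNER_TURNS", process.env.INNER_TURNS, 6)` would then reject `INNER_TURNS=''` at startup instead of silently giving the agent zero turns; the fallback value 6 is only a placeholder.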
🟡 LOW outDir leaked on runStrategyEvolution failure — bench/swe-self-improve.mts
Line 41: `outDir = mkdtempSync(...)` creates a temp directory. Line 42-59: `runStrategyEvolution` runs inside `await`. If it throws, `rmSync(outDir, ...)` on line 60 never executes, so the temp dir is never cleaned up. The `.catch()` on line 76 only logs the error, no cleanup. Fix: wra
🟡 LOW rmSync cleanup skipped on throw; temp dir leaks under CWD — bench/swe-self-improve.mts
Line 41 creates outDir via mkdtempSync(join(process.cwd(), '.swe-run-')) and line 60 rmSync's it after runStrategyEvolution resolves. If runStrategyEvolution rejects (e.g. pool exhausted, router auth failure, OOM), control jumps straight to the main().catch on line 76 and rmSync never runs — the .swe-run-XXXXXX dir (with authored strategy .ts modules) is left in the user's CWD. Not a correctness bug and outDir is
🟡 LOW Equal-compute claim across topologies is loose — budget means structurally different things per strategy — examples/ablation-suite/ablation.ts
Comment at the topology knob (line 23) calls budget the 'equal-compute unit (refine: max shots; fanout: rollout width)'. Verified against strategy.ts: refine(budget=N)→depthStrategy with maxShots=N (analyst feedback between each); sample(budget=N)→breadthStrategy with width=N (N independent rollouts, no analyst); sampleThenRefine(budget=N)→⌈N/2⌉ explore + (N-⌈N/2⌉) refine-with-critique. So budget=2 across arms = {2 depth shots w/ analyst, 2 parallel samples no analyst, 1 sample+1 refine}. These are shots-comparable but NOT tokens-comparable (analyst adds completions). NOT a code defect — the autopsy prints real costUsd/tokensIn/tokensOut per arm and Δ$ vs
🟡 LOW No test coverage for exported functions — examples/ablation-suite/ablation.ts
`runAblation` and `printAutopsy` are exported public functions with logic-bearing code (knob validation via unwiredKnobs gate, result aggregation, pairedBootstrap display). Zero test coverage. While this is in `examples/`, other example modules (e.g. `coding-benchmark/`) have tests. The unwired-knob validation in particular would benefit from a unit test: confirm that setting each declared-but-unwired knob throws with the expected primitive name.
🟡 LOW pairedBootstrap statistic/field mismatch in printAutopsy — examples/ablation-suite/ablation.ts
Line 162 passes `statistic: 'mean'` to pairedBootstrap but line 164 reads `b.median` (not `b.mean`) as the point estimate. pairedBootstrap returns both fields regardless of the statistic parameter, so this does not crash. However, the 95% CI bounds (`b.low`, `b.high`) are computed for the mean-difference bootstrap distribution, while the reported point estimate is the median of that distribution. The two can differ materially for small-n skewed data (this file defaults to `HOLDOUT_N=6`). The table header says 'Δresolve
🟡 LOW Redundant mkdirSync before write — examples/self-improving-coder/self-improving-coder.ts
Line 144: `mkdirSync(ws.dir, { recursive: true })`. `ws.dir` was already created by `mkdtempSync(join(tmpdir(), 'sic-'))` in `open()` (line 113) and is registered in the `workspaces` Map. Calling mkdirSync on it is a no-op. Dead call; remove.
🟡 LOW Workspace temp dirs leak on any non-close exit path — examples/self-improving-coder/self-improving-coder.ts
Workspaces live in a module-level `Map<string, Ws>` (line 89) and are deleted only when `close(handle)` runs (line 163-168). On process crash, uncaught throw in `runStrategyEvolution`, or a worker-timeout path that doesn't reach `close`, the `/tmp/sic-*` dirs (containing agent-authored lib.py) persist indefinitely. The `main()` catch at [lines 284-286](https://github.com/tangle-network/agent-runtime/blob/bd127783c37cd478da3b34ccf85ef8c2a1d02f31/examples/self-improving-coder/self-
🟡 LOW examples/README.md does not list self-improving-coder — examples/self-improving-coder/self-improving-coder.ts
examples/README.md table goes through entry #18 (self-improving-loop) and does not mention self-improving-coder. The README is explicitly 'A learning path. Read the examples in order' — a new example not in the path is documentation drift. Add a row to the table and a Quickstart line if desired (this is the RSI spine composed on the substrate, a meaningful showcase).
🟡 LOW outDir temp directory leaks on runStrategyEvolution failure — examples/self-improving-coder/self-improving-coder.ts
Line 239: `outDir = mkdtempSync(join(process.cwd(), '.sic-run-'))` creates the directory. Line 258: `rmSync(outDir, ...)` cleans up on success. But if `runStrategyEvolution` throws (line 240), control jumps to the catch at [line 284](https://github.com/tangle-network/agent-runtime/blob/bd127783c37cd478da3b34ccf85ef8c2a1d02f31/exam
🟡 LOW run_tests handled in call but not advertised in tools — inconsistent surface — examples/self-improving-coder/self-improving-coder.ts
Lines 151-153 handle `run_tests` in `call()`, returning pytest results. But `tools()` (lines 120-127) does not include `run_tests` in the returned tool definitions, intentionally, per the comment at line 125-126. The system prompt at [line 178-179](https://github.com/tangle-network/agent-runtime/blob/bd127783c37cd4
🟡 LOW write_file path validation is vestigial — the validated path is never used — examples/self-improving-coder/self-improving-coder.ts
Lines 140-146: `const p = String(args.path ?? ''); if (!p.endsWith('lib.py') || p.includes('..') || p.startsWith('/')) return 'ERROR: only lib.py is writable'; ... writeFileSync(join(ws.dir, 'lib.py'), String(args.content ?? ''))`. The validation runs on `p` (the agent-supplied path) but the write at line 145 hardcodes `'lib.py'` and ignores `p` entirely. So a request with `path: 'subdir/lib.py'` passes validation and silently writes to the workspace root, while `path: 'foo.py'`
tangletools · 2026-06-28T04:50:16Z · trace
tangletools
left a comment
❌ 1 Blocking Finding — bd127783
Full multi-shot audit completed 4/4 planned shots over 4 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-28T04:50:16Z · immutable trace
tangletools
left a comment
✅ Refreshed approval after new commits — 6e747231
A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-28T05:18:28Z
tangletools
left a comment
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 2 (2 low) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 148.4s (2 bridge agents) |
| Total | 148.4s |
💰 Value — sound
Adds contamination-proof coding AgenticSurfaces (generated-lib + SWE-bench Verified) and wires them into the existing runStrategyEvolution spine — all substrate-native, no reinvention, coherent with the codebase's grain.
- What it does: Adds four new files across two layers: (1) examples/self-improving-coder/self-improving-coder.ts — a contamination-proof coding AgenticSurface (seed-derived wire-protocol library, graded by real pytest, with a $0 calibration gate) wired into runStrategyEvolution; (2) examples/ablation-suite/ablation.ts — a cost-aware one-knob-delta runner that sweeps self-improvement knobs (topology, budget) over
- Goals it achieves: 1. Demonstrate the RSI (recursive self-improvement) spine on a real, contamination-proof coding domain — not a toy counter — so the substrate's self-improvement primitives are exercised on tasks a model cannot memorize. 2. Provide a calibration gate that proves the task is solvable AND the grader discriminates (reference→100%, stub→0%) — enforcing calibrate-before-measure before spending compute.
- Assessment: The change is well-built and follows the codebase's grain perfectly. Every new component uses existing substrate primitives: runStrategyEvolution (src/runtime/strategy-evolution.ts), AgenticSurface (src/runtime/strategy.ts:76), the existing createSweBenchAdapter (bench/src/benchmarks/swe-bench.ts), and createChatClient from agent-eval. The SWE-bench Environment delegates scoring to the existing ad
- Better / existing approach: none — this is the right approach. Checked: examples/strategy-evolution/ (toy counter domain, different purpose), examples/self-improving-loop/ (offline scripted demo, different paradigm), examples/coding-benchmark/ (runProfileMatrix + firewall paradigm, not AgenticSurface), bench/src/benchmarks/swe-bench.ts (adapter-only, not an AgenticSurface). No existing equivalent for any of these four new fi
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 3
- Bridge warning: opencode/kimi-for-coding/k2p7: opencode produced no stream output: ; opencode/zai-coding-plan/glm-5.2: opencode produced no stream output:
🎯 Usefulness — error
usefulness agent produced no parseable value-audit JSON.
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 3
- Bridge error: opencode/zai-coding-plan/glm-5.2: opencode produced no stream output: ; opencode/kimi-for-coding/k2p7: opencode produced no stream output: ; opencode/deepseek/deepseek-v4-pro: opencode produced no stream output:
🔎 Heuristic Signals
🟡 Cruft: console debug added bench/swe-self-improve.mts
- console.log(`═══ SWE-bench CALIBRATION — ${workerModel}, baseline=refine, ${n} real bugs ═══`)
🟡 Cruft: magic number added bench/swe-self-improve.mts
console.log(` ${t.id.padEnd(32)} resolved=${r.resolved} completions=${r.completions} shots=${r.shots} (${Math.round((Date.now() - t0) / 1000)}s)`)
LOOP LAYERS (on the analyze keystone):
- surface-worker.ts: graded-worker seam, a makeWorkerAgent running runAgentic over an AgenticSurface, settling with the surface score as the deliverable verdict (driver spawns/steers graded workers).
- gepa-driver-prompt.ts: optimize the driver compose-prompt on TRAIN with an EXECUTABLE JudgeConfig (selfImprove + gepaProposer, reading the surface score, not an LLM judge), frozen, certified.
- self-improving-supervisor.ts: one-call DX composing supervise(+analyze) over the graded seam.
- ablation.ts: driverSteer + optimize knobs WIRED (blind/steered/self-improving) over the same supervise(); per-task try/catch resilience; cost+significance autopsy intact.
- index.ts: re-export AnalystRegistry + MakeWorkerAgent from the loops barrel (host authors its seam).
#402 HARDENING:
- swe-bench-env: temp-dir leak fix (try/catch+rmSync on clone/checkout fail), symlink realpath-jail (read+edit), workspaces Map into the closure, exported jailPath/isTestPath + unit tests.
- self-improving-coder: read_file path guard (symmetric with write_file), dead run_tests handler gone.
- swe-self-improve.mts moved to bench/src/ (now under typecheck).
Typecheck clean (examples + core); biome clean; 82 supervise tests green on the keystone.
Folded into #404 (the comprehensive self-improving-supervisor PR): all the examples + instrument here, plus the review findings from this PR addressed there (temp-dir leak, symlink realpath-jail, read_file guard, run_tests removal, per-task resilience, biome, typecheck coverage, swe-bench-env unit tests). Closing in favor of #404.
What — the self-improvement spine, composed cleanly, two ways
1. `examples/self-improving-coder/` is the substrate's RSI loop with NOTHING hand-rolled: an `AgentProfile` worker over an `AgenticSurface`, gated by `runStrategyEvolution` (authors strategies from TRAIN losses → ONE promotion decision on a FRESH holdout via `promotionGate`). Adaptive data analysis is structurally impossible (disjoint offsets, holdout read once). Bundled task is a contamination-proof generated library (constants per-seed → no memorization), graded by real pytest. $0 calibration gate (reference→100%, stub→0%). The toy task is deliberately simple: a capable model aces it, so the gate correctly returns no-promotion (you can't show self-improvement without headroom; the harness refuses to fake it).
2. `bench/src/swe-bench-env.ts` + `bench/swe-self-improve.mts` are the proper, no-cheating frontier run: SWE-bench Verified as an `AgenticSurface`. The agent clones the repo at base_commit, explores + makes surgical edits via tools (source only; test files path-jailed), and `score()` grades the `git diff` with the official swebench Docker harness. The substrate drives the agentic loop (no hand-rolled tool-loop, no seeing the hidden tests or gold patch). `runStrategyEvolution` enforces a disjoint train→freeze→holdout split, so adaptive reuse is impossible. Frontier worker (gemini-2.5-pro). `CALIBRATE` mode runs the baseline on a few bugs first (cost gate).
The discipline (why this is trustworthy)
`build-with-agent-runtime` do-NOT-reinvent list). Only the `Environment`s are new.
Verified: `swebench` gold-patch grading resolves a real astropy bug in Docker (97s); the harness runs end-to-end and the gate fires correctly. SWE-bench Environment calibration in progress.
UPDATE: the proper frontier run is PROVEN to work (calibration, gemini-2.5-pro)
`gemini-2.5-pro` over the SWE-bench `Environment`, baseline `refine`, on real Verified bugs: `astropy-13236` RESOLVED (17 tool-calls, 7 min of real exploration + surgical edits, confirmed by the official swebench Docker harness); `astropy-12907`/`-13033` unresolved (the agent gave up early at 1-2 turns). Baseline = 1/3 = 33%, the correctable middle band (real headroom, no cheating: it never saw the hidden tests or gold). The full `runStrategyEvolution` run (train→freeze→holdout) is firing: can a learned strategy make the flaky agent persist and resolve MORE?
gemini-2.5-proover the SWE-benchEnvironment, baselinerefine, on real Verified bugs:astropy-13236RESOLVED (17 tool-calls, 7 min of real exploration + surgical edits, confirmed by the official swebench Docker harness) —astropy-12907/-13033unresolved (the agent gave up early at 1-2 turns). Baseline = 1/3 = 33% — the correctable middle band (real headroom, no cheating: it never saw the hidden tests or gold). The fullrunStrategyEvolutionrun (train→freeze→holdout) is firing — can a learned strategy make the flaky agent persist and resolve MORE?