feat(bench): add DABStep adapter and SDK compat#409
tangletools
left a comment
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 1 (1 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 279.2s (2 bridge agents) |
| Total | 279.2s |
💰 Value — sound-with-nits
Adds a 19th benchmark adapter (DABStep) that mirrors the established commit0/programbench pattern exactly; ships clean but bundles unrelated runtime/sandbox compat edits under a bench-titled PR.
- What it does: Adds a DABStep adapter to bench/src/benchmarks/dabstep.ts (and registers it in adapters.ts). Live mode reads DABSTEP_DIR's released dataset.csv/splits/files/grade.py via an inline Python loader; fixture mode (DABSTEP_FIXTURES=1) reads bench/fixtures/dabstep.json. Scoring is delegated to the official grade.py through a 42-line bench/scripts/dabstep_judge.py driver that imports grade() via importlib
- Goals it achieves: Let agent-runtime agents be scored on DABStep's data-analysis tasks (EnvCommons/DABStep) using its official deterministic grade.py — no LLM judge. The benchmark suite grows from 18 to 19 adapters behind the single resolveAdapter registry, so any profile/prompt change can be A/B'd against one more real deterministic judge. Secondary goal (per the PR body): refresh bench deps and bring the runtime u
- Assessment: The adapter is a textbook application of the existing BenchmarkAdapter contract — it reuses _harness.ts (benchRoot/runVenvPython/runVenvScriptStdin), follows the fixture/live/preflight/judge/goldArtifact/output shape used by commit0.ts:218 and programbench.ts:165 verbatim, delegates scoring to the benchmark's own harness, and fails loud rather than fabricating a score. No self-authored judge, no L
- Better / existing approach: For the adapter itself: none — this is the right approach and matches commit0/programbench line-for-line. Searched for any pre-existing DABStep wiring (git log + grep across .ts/.mts/.py/.md/.json); there is none, so no duplication. For the bundle: the runtime/sandbox 0.9.5 compat edits (environment-provider.ts status field, index.ts type re-exports, trata-gepa.mts rename, decoder-live.mts cast) w
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 2
- Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
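The judge bridge described above loads the official grade.py through importlib rather than reimplementing the scoring. A minimal sketch of that pattern follows. It is not the contents of bench/scripts/dabstep_judge.py: the names `load_module_from_path` and `judge`, the `grade(prediction, answer)` signature, and the returned dict shape are all assumptions made for illustration.

```python
import importlib.util
from pathlib import Path
from types import ModuleType


def load_module_from_path(path: Path, name: str = "dabstep_grade") -> ModuleType:
    """Load a Python source file as a module without editing sys.path."""
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build an import spec for {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return module


def judge(grade_path: Path, prediction: str, answer: str) -> dict:
    """Score one prediction and report failures instead of inventing a score."""
    try:
        # Assumption: grade.py exposes grade(prediction, answer) -> number.
        grade = load_module_from_path(grade_path).grade
        return {"score": float(grade(prediction, answer)), "error": None}
    except Exception as exc:
        # Any load or grading failure is surfaced in "error" with score 0.0.
        return {"score": 0.0, "error": f"{type(exc).__name__}: {exc}"}
```

Loading by file path keeps the benchmark's own grader as the single source of truth, and the caller can tell a real zero from a bridge failure by checking the `error` field.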
🎯 Usefulness — sound
DABStep adapter is wired into the existing adapter registry and follows the exact commit0 pattern (fixtures-mode plumbing + fail-loud preflight + judge delegates to official grade.py); immediately reachable by every gate/replay runner via resolveAdapter('dabstep').
- Integration: Fully wired. createDabstepAdapter is registered at bench/src/adapters.ts:36, so resolveAdapter('dabstep') returns it — and resolveAdapter/ADAPTERS is consumed by gate-cli.mts:39, trata-gate.mts:150, aec-gate.mts:180, research-gate.mts:40, corpus-replay.mts:217, and re-exported as public API in bench/src/index.ts:13. HARNESS.md:205 documents the setup, and package.json description bumps 18→19 adapt
- Fit with existing patterns: Textbook fit. It mirrors commit0 (bench/src/benchmarks/commit0.ts) field-for-field: fixtures-mode flag pattern, readMeta shape, selectRows, loadFixtures vs official loader, runVenvScriptStdin delegation to a scripts/*_judge.py bridge, fail-loud preflight. Uses the shared _harness.ts helpers (benchRoot, runVenvPython, runVenvScriptStdin) exactly as designed. No competing dabstep implementation exis
- Real-world viability: Robust on error paths. preflight (dabstep.ts:148-159) fails loud with actionable guidance for every missing piece: DABSTEP_DIR unset, missing dataset.csv, missing split file, missing grade.py, missing files dir, and even does a 1-row loadOfficialTasks probe to validate the CSV parses. judge (dabstep.ts:182-198) surfaces the bridge's `error` field and defaults score to 0 rather than crashing on mal
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
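The fail-loud preflight described above can be illustrated with a short sketch. The adapter itself is TypeScript (dabstep.ts); this is a Python rendering of the same idea, not a port of it. The function name `preflight`, the `splits/<split>.txt` layout, the default split name, and the message wording are hypothetical.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def preflight(env: Optional[Mapping[str, str]] = None, split: str = "dev") -> List[str]:
    """Return every problem found; an empty list means the run may proceed."""
    env = os.environ if env is None else env
    root_value = env.get("DABSTEP_DIR")
    if not root_value:
        return ["DABSTEP_DIR is unset; point it at the released DABStep data"]
    root = Path(root_value)
    # Assumed layout: the real adapter may name these paths differently.
    required = {
        "dataset.csv": root / "dataset.csv",
        f"split file for '{split}'": root / "splits" / f"{split}.txt",
        "grade.py": root / "grade.py",
        "files directory": root / "files",
    }
    return [
        f"missing {label}: {path}"
        for label, path in required.items()
        if not path.exists()
    ]
```

A caller would raise or exit when the returned list is non-empty, so a misconfigured environment stops the run before any task is scored.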
💰 Value Audit
🟡 PR title advertises only the DABStep adapter but bundles a runtime/sandbox compat pass [proportion]
src/runtime/environment-provider.ts:509,912 adds a required `status` field to PromptResult returns (sandbox 0.9.5 compat); src/runtime/index.ts:459-476 adds three new type re-exports; bench/src/trata-gepa.mts:370 renames driverTarget→proposerTarget (agent-eval API rename); bench/src/decoder-live.mts:50 tightens a TS cast. None of these are required by dabstep.ts itself, which imports only OutputAdapter (bench/src/benchmarks/dabstep.ts:14) — a long-stable symbol. These are the compile/run tax of
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
Summary
Scope note
The runtime edits here are compatibility/type-surface fixes required by the bench package build against current sandbox and agent-eval versions. This PR does not change the runtime loop behavior.
Verification