feat(bench): add DABStep adapter and SDK compat by drewstone · Pull Request #409 · tangle-network/agent-runtime

drewstone · 2026-06-29T04:05:34Z

Summary

- add a DABStep adapter to @tangle-network/agent-bench using the existing BenchmarkAdapter contract
- delegate scoring to official DABStep grade.py and fail loud when DABSTEP_DIR / released dataset.csv are missing
- package bench support files needed by existing adapters, refresh bench deps, and align sandbox 0.9.5 / agent-eval 0.100.0 compatibility
- expose existing runtime executor types publicly so bench can typecheck against the local runtime source without private imports or runtime-loop changes

Scope note

The runtime edits here are compatibility/type-surface fixes required by the bench package build against current sandbox and agent-eval versions. This PR does not change the runtime loop behavior.

Verification

DABSTEP_FIXTURES=1 pnpm exec tsx --test bench/src/benchmarks/dabstep.test.mts
node --import tsx --input-type=module -e "import('./bench/src/adapters.ts').then(({resolveAdapter})=>{ const a=resolveAdapter('dabstep'); console.log(a.name); })"
pnpm typecheck
pnpm --dir bench exec tsc --noEmit -p tsconfig.json
pnpm build
pnpm install --frozen-lockfile
pnpm --dir bench install --frozen-lockfile
pnpm --dir bench pack --dry-run
pnpm run docs:check
git diff --check

tangletools

🟡 Value Audit — sound-with-nits


Verdict	sound-with-nits
Concerns	1 (1 weak-concern)
Heuristic	0.0s
Duplication	0.0s
Interrogation	279.2s (2 bridge agents)
Total	279.2s

💰 Value — sound-with-nits

Adds a 19th benchmark adapter (DABStep) that mirrors the established commit0/programbench pattern exactly; ships clean but bundles unrelated runtime/sandbox compat edits under a bench-titled PR.

What it does: Adds a DABStep adapter to bench/src/benchmarks/dabstep.ts (and registers it in adapters.ts). Live mode reads DABSTEP_DIR's released dataset.csv/splits/files/grade.py via an inline Python loader; fixture mode (DABSTEP_FIXTURES=1) reads bench/fixtures/dabstep.json. Scoring is delegated to the official grade.py through a 42-line bench/scripts/dabstep_judge.py driver that imports grade() via importlib
Goals it achieves: Let agent-runtime agents be scored on DABStep's data-analysis tasks (EnvCommons/DABStep) using its official deterministic grade.py — no LLM judge. The benchmark suite grows from 18 to 19 adapters behind the single resolveAdapter registry, so any profile/prompt change can be A/B'd against one more real deterministic judge. Secondary goal (per the PR body): refresh bench deps and bring the runtime u
Assessment: The adapter is a textbook application of the existing BenchmarkAdapter contract — it reuses _harness.ts (benchRoot/runVenvPython/runVenvScriptStdin), follows the fixture/live/preflight/judge/goldArtifact/output shape used by commit0.ts:218 and programbench.ts:165 verbatim, delegates scoring to the benchmark's own harness, and fails loud rather than fabricating a score. No self-authored judge, no L
Better / existing approach: For the adapter itself: none — this is the right approach and matches commit0/programbench line-for-line. Searched for any pre-existing DABStep wiring (git log + grep across .ts/.mts/.py/.md/.json); there is none, so no duplication. For the bundle: the runtime/sandbox 0.9.5 compat edits (environment-provider.ts status field, index.ts type re-exports, trata-gepa.mts rename, decoder-live.mts cast) w
Model: opencode/zai-coding-plan/glm-5.2
Bridge attempts: 2
Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
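
The importlib-based delegation described in "What it does" can be illustrated with a minimal sketch. This is not the contents of bench/scripts/dabstep_judge.py: the helper name `load_grade`, the module alias `dabstep_grade`, and the two-argument stand-in grader in the demo are assumptions made for illustration, and the official grade() may take different arguments.

```python
import importlib.util
import tempfile
from pathlib import Path
from typing import Callable


def load_grade(dabstep_dir: Path) -> Callable:
    """Load grade() from <dabstep_dir>/grade.py, raising instead of guessing."""
    grade_path = dabstep_dir / "grade.py"
    if not grade_path.is_file():
        raise FileNotFoundError(f"grade.py not found at {grade_path}; check DABSTEP_DIR")

    # Build a module from an explicit file path so nothing has to be on sys.path.
    spec = importlib.util.spec_from_file_location("dabstep_grade", grade_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build an import spec for {grade_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    grade = getattr(module, "grade", None)
    if not callable(grade):
        raise AttributeError(f"{grade_path} does not define a callable grade()")
    return grade


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical stand-in grader, written only so the demo is runnable.
        (root / "grade.py").write_text(
            "def grade(prediction, answer):\n"
            "    return prediction.strip() == answer.strip()\n"
        )
        grade = load_grade(root)
        print(grade(" 42 ", "42"))
```

Loading by file path keeps the benchmark's own grader as the single source of truth for scoring, and every failure mode raises rather than returning a fabricated score.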

🎯 Usefulness — sound

DABStep adapter is wired into the existing adapter registry and follows the exact commit0 pattern (fixtures-mode plumbing + fail-loud preflight + judge delegates to official grade.py); immediately reachable by every gate/replay runner via resolveAdapter('dabstep').

Integration: Fully wired. createDabstepAdapter is registered at bench/src/adapters.ts:36, so resolveAdapter('dabstep') returns it — and resolveAdapter/ADAPTERS is consumed by gate-cli.mts:39, trata-gate.mts:150, aec-gate.mts:180, research-gate.mts:40, corpus-replay.mts:217, and re-exported as public API in bench/src/index.ts:13. HARNESS.md:205 documents the setup, and package.json description bumps 18→19 adapt
Fit with existing patterns: Textbook fit. It mirrors commit0 (bench/src/benchmarks/commit0.ts) field-for-field: fixtures-mode flag pattern, readMeta shape, selectRows, loadFixtures vs official loader, runVenvScriptStdin delegation to a scripts/*_judge.py bridge, fail-loud preflight. Uses the shared _harness.ts helpers (benchRoot, runVenvPython, runVenvScriptStdin) exactly as designed. No competing dabstep implementation exis
Real-world viability: Robust on error paths. preflight (dabstep.ts:148-159) fails loud with actionable guidance for every missing piece: DABSTEP_DIR unset, missing dataset.csv, missing split file, missing grade.py, missing files dir, and even does a 1-row loadOfficialTasks probe to validate the CSV parses. judge (dabstep.ts:182-198) surfaces the bridge's error field and defaults score to 0 rather than crashing on mal
Model: opencode/zai-coding-plan/glm-5.2
Bridge attempts: 1
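
The judge behaviour noted under "Real-world viability" (surface the bridge's error field, default the score to 0) might look roughly like the sketch below. The real logic lives in TypeScript in dabstep.ts; this Python version only illustrates the idea. The `JudgeResult` type, the function name `parse_bridge_output`, and the JSON keys `score` and `error` are assumptions, as is the choice to treat unparseable output as a zero score.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeResult:
    score: float
    error: Optional[str] = None


def parse_bridge_output(stdout: str) -> JudgeResult:
    """Turn the judge bridge's stdout into a result without ever raising."""
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError as exc:
        # Assumption: unparseable output counts as a zero score plus an error.
        return JudgeResult(0.0, f"unparseable bridge output: {exc}")

    if not isinstance(payload, dict):
        return JudgeResult(0.0, "bridge output was not a JSON object")

    raw = payload.get("score")
    # bool is a subclass of int, so exclude it explicitly.
    is_number = isinstance(raw, (int, float)) and not isinstance(raw, bool)
    score = float(raw) if is_number else 0.0

    error = payload.get("error")
    return JudgeResult(score, str(error) if error else None)
```

Returning a zero score together with the error text keeps a broken bridge visible in the results instead of crashing the whole run.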

💰 Value Audit

🟡 PR title advertises only the DABStep adapter but bundles a runtime/sandbox compat pass [proportion]

src/runtime/environment-provider.ts:509,912 adds a required status field to PromptResult returns (sandbox 0.9.5 compat); src/runtime/index.ts:459-476 adds three new type re-exports; bench/src/trata-gepa.mts:370 renames driverTarget→proposerTarget (agent-eval API rename); bench/src/decoder-live.mts:50 tightens a TS cast. None of these are required by dabstep.ts itself, which imports only OutputAdapter (bench/src/benchmarks/dabstep.ts:14) — a long-stable symbol. These are the compile/run tax of

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass	What it asks
Heuristic	Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication	Do added function/class names already exist elsewhere in the repo?
Value Audit	What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit	Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260629T041952Z

drewstone added 2 commits June 28, 2026 22:04

feat(bench): add DABStep adapter

233af21

docs(api): refresh runtime docs

3571141

tangletools reviewed Jun 29, 2026


docs(api): stabilize runtime summary

e150acc

drewstone changed the title ~~feat(bench): add DABStep adapter~~ feat(bench): add DABStep adapter and SDK compat Jun 29, 2026

drewstone merged commit 5d610e7 into main Jun 29, 2026
1 check passed

drewstone merged 3 commits into main from feat/bench-dabstep-adapter
