diff --git a/docs/results/autodata-live.md b/docs/results/autodata-live.md index 3746b36..0847d25 100644 --- a/docs/results/autodata-live.md +++ b/docs/results/autodata-live.md @@ -1,58 +1,123 @@ -# Autodata live result: a false null, autopsied, then a real (clean) null +# Autodata live result: the causal challenger widens the gap (reproduced) — but clearing the accept bar is noisy at this n/tier (NOT robust) Running the agentic data-creation loop (`src/autodata/`) on a real arXiv doc with real two-tier -solver models, to manufacture training examples that separate a strong solver from a weak one -(the discriminative reward). The headline is a null — but the path to it is the result. - -## What happened, in order - -1. **First runs looked like a null with a *negative* gap.** Across two tier pairs — - `glm-4.5-air` vs `glm-5.2`, then `groq/llama-3.1-8b-instant` vs `gemini-2.5-pro` — every run - reported 0 accepted and a strong−weak gap *below zero* (plain −0.47, then −1.00). A frontier - model scoring *below* an 8B on reasoning questions is not credible. - -2. **Autopsy (a direct probe on the real judge) found an artifact, not a finding.** At the solver's - `maxTokens: 1024`, the strong **reasoning** model (`gemini-2.5-pro`, and `glm-5.2` before it) - spent its whole budget on hidden reasoning and returned **empty visible content** on hard - prompts — which the judge scored 0. So "strong" was being scored 0 for *answering nothing*, - manufacturing a false negative gap. The trivial cost-gate smoke ("reply ok") didn't trigger it, - so it slipped through. (Confirmed: the same prompt at `maxTokens: 8000` → gemini answers in - 956 chars and scores 1.00.) - -3. **Fix (this PR).** The solver now uses a reasoning-safe `maxTokens` (8000) **and fails loud on - empty content** — an empty answer is a measurement failure, never a silent 0 that corrupts the - gap (the repo's no-silent-zeros rule). 
The model tier is now an env knob - (`AUTODATA_WEAK_MODEL` / `AUTODATA_STRONG_MODEL` / `…_CHALLENGER_MODEL` / `…_JUDGE_MODEL`), and - the price table covers the wide tier. - -4. **The clean result.** Re-run with the fix, `llama-3.1-8b` vs `gemini-2.5-pro`: - - | metric | value | - |---|---| - | accepted (discriminating) examples | **0 / 3** | - | plain gap (n=1) | 0.000 | - | refined best-gap per slot (n=3) | 0.006 | - | Δ (refined − plain) | **+0.006 — no meaningful widening** | - | spend | $0.09 | - - The gap is now **~0, not negative** — `gemini-2.5-pro` and `llama-3.1-8b` score about **equally**. +solvers, to manufacture training examples that separate a strong solver from a weak one (the +discriminative reward of the Autodata / Agentic-Self-Instruct method). + +**Honest headline (two independent runs):** the non-extractive causal challenger + the refine fold +**reliably widen the strong/weak gap by ~+0.20 vs plain generation** (reproduced in both runs — the +method's Table-1 *direction* holds). BUT **clearing the hard accept bar** (weak < 0.5 ∧ strong ≥ 0.65 +∧ gap ≥ 0.2) is **noisy and marginal**: one run accepted 1–2 of 3, an **independent re-run accepted +0 of 3**. The reason is in the answers — `llama-3.1-8b` on these MoE questions sometimes flails +(0.24) and sometimes answers *competently* (0.75), straddling the 0.5 "weak must struggle" line. So: +**directionally confirmed, not a robust positive at n=3 / this tier.** This is the same small-n +mirage that bit the earlier two-agent A/B (positive at n=1, washes at power) — flagged, not buried. + +## The two levers that turned the null into a positive + +The earlier null ("a small model performs as well as a frontier one") had TWO compounding causes, +both fixed here: + +1. **The question leaked the answer / asked for recall.** The challenger wrote lookup-style questions + whose answer sat in the provided context, so an 8B read it out as well as a frontier model. 
+ Fix — the **non-extractive causal challenger**: it must author CAUSAL / COMPARATIVE / MECHANISM / + THESIS-CONSISTENCY questions whose answer is DERIVED, the context must hold premises but not state + the conclusion, the solver no longer sees the rubric (the mark scheme), and the judge now sees the + context and scores a dedicated `reasoning` dimension LOW when the answer merely restates it (the + paper's negative criterion). On reject, the fold steers per reason ("too easy" → go non-extractive + and harder; "too hard" → ease; "not discriminative" → sharpen). + +2. **The grounding doc was memorized.** The default was "Attention Is All You Need" — the most + canonical paper in ML, which an 8B has memorized, so even reasoning questions are answerable from + pretraining and capability cannot separate. Fix — **ground on a doc the weak solver has not + memorized**: the new default is the Mixtral-of-Experts paper (arXiv 2401.04088, Jan 2024), which + post-dates `llama-3.1-8b`'s knowledge cutoff, forcing it to reason from the context. + +## Setup (all env-overridable) + +| role | model | why | +|---|---|---| +| weak solver | `groq/llama-3.1-8b-instant` | small; cutoff predates the 2024 doc → must reason, can't recall | +| strong solver | `gemini-2.5-pro` | frontier reasoner; a real wide capability gap | +| challenger + judge | `deepseek-v4-flash` | capable, fast, reliable, a DIFFERENT family from both solvers (no judge-bias) | +| grounding doc | Mixtral-of-Experts (2401.04088) | non-memorized, reasoning-rich (MoE routing / gating) | + +Accept thresholds (the paper's): strong >= 0.65, weak < 0.50, gap >= 0.20. (`glm-5.2`, the brief's +challenger/judge, was returning upstream-capacity 503s during this run; `deepseek-v4-flash` is the +live, neutral substitute. `routerChat` now retries transient 503/429/timeout with bounded backoff.) 
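A minimal sketch of the accept bar stated above (strong >= 0.65, weak < 0.50, gap >= 0.20), for readers who want to check a score pair by hand. This is not the repo's `discriminativeAcceptRule`: the function name `acceptSketch`, the field names `maxWeak` and `minGap`, and the order in which the three checks run are assumptions made for illustration. The reason strings mirror the ones the fold steers on.

```typescript
// Hypothetical sketch only; the real rule lives in src/autodata/data-creation-loop.ts.
interface Thresholds {
  minStrong: number // strong solver must reach at least this mean score
  maxWeak: number // weak solver must stay strictly below this (assumed field name)
  minGap: number // strong minus weak must be at least this (assumed field name)
}

interface Decision {
  accept: boolean
  reason: string
}

// The thresholds quoted in the doc.
const PAPER_THRESHOLDS: Thresholds = { minStrong: 0.65, maxWeak: 0.5, minGap: 0.2 }

// Check order is an assumption: strong first, then weak, then the gap.
function acceptSketch(strong: number, weak: number, t: Thresholds = PAPER_THRESHOLDS): Decision {
  if (strong < t.minStrong) return { accept: false, reason: 'too hard' }
  if (weak >= t.maxWeak) return { accept: false, reason: 'too easy' }
  if (strong - weak < t.minGap) return { accept: false, reason: 'not discriminative' }
  return { accept: true, reason: 'accepted' }
}

// The two weak means reported in this doc, against a strong mean of 1.00.
console.log(acceptSketch(1.0, 0.24)) // the accepted example
console.log(acceptSketch(1.0, 0.75)) // the near-miss from the independent re-run
```

Under these assumptions the 0.75 near-miss is rejected on the weak-score check alone, regardless of how wide the gap is, which is why a competent weak answer blocks acceptance even when the gap widens.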
+ +## The judge is reliable (checked before trusting any gap) + +A controlled probe scored one genuinely-strong vs one genuinely-weak answer to the same question, 3× +each: `deepseek-v4-flash` returned strong `[1.00, 1.00, 1.00]` (mean 1.00) vs weak `[0.23, 0.13, +0.17]` (mean 0.18) — a consistent **0.82** separation, ranking strong above weak every time. So a +measured gap reflects answer quality, not judge noise. (`gemini-2.5-flash` as judge threw parse +errors — `deepseek` is the better grader here.) + +## The result — the gap opens, examples are accepted + +**Memorized doc (Transformer paper), recall challenger — reproduces the null:** mean gap **0.117**, +**0 accepted**; the weak solver scored 0.68–0.78 (it has the content memorized — reading beats +reasoning). + +**Non-memorized doc (Mixtral), non-extractive causal challenger — three runs, NOT consistent:** + +| run | accepted | gap widening (plain → refined) | note | +|---|---|---|---| +| target=3, samples=2, maxRetries=3 | **1 / 3** | 0.306 → 0.508 (Δ +0.202) | fold steered a too-easy draft (weak 0.78) to an accepted one (weak 0.24) | +| target=1, samples=3, maxRetries=4 | **1 / 1** | — | first causal draft already separated | +| **target=3 — independent re-run** | **0 / 3** | 0.052 → 0.246 (Δ +0.194) | gap widened the same, but **no slot cleared the bar**; weak scored **0.75** on a near-miss — a competent, correct answer, not a struggle | + +**What reproduces:** the +0.19–0.20 gap-widening from the fold (both runs). **What does not:** the +accepted count (0 to 2 of 3). The accept bar requires the weak model to *struggle* (< 0.5), and on +these MoE-reasoning questions `llama-3.1-8b` is too often competent (0.75) to fall below it — so +acceptance is close to a coin-flip at n=3. Total live spend ≈ **$0.25** across all runs. + +## An autopsied accepted example (real discrimination, both answers read) + +> **Q:** Walk through how the MoE layer processes a single token. 
If the router's gating network were +> broken and always output uniform weights (G(x)_i = 1/8 for all 8 experts), how would the layer's +> output differ from the intended behavior, and why is this failure mode problematic? + +- **strong (`gemini-2.5-pro`): [1.00, 1.00, 1.00]** — walks through top-2 routing, then derives that + uniform weights make the layer average ALL 8 experts (dense, no specialization/sparsity), losing + the point of the MoE. Correct. +- **weak (`llama-3.1-8b`): [0.21, 0.27], mean 0.24** — restates the routing steps but does NOT derive + the failure consequence; it never reaches "all experts averaged → specialization lost." + +When the gap *does* open, it is real discrimination — not a judge artifact (judge verified above) or +leakage (the answer is not in the context). **But it does not open reliably.** In the independent +re-run, the analogous near-miss question drew a *competent* weak answer (0.75): `llama-3.1-8b` +correctly explained that high positional locality routes consecutive tokens to the same expert → +over-subscription, and that uniform routing would balance the load. On that draw the 8B reasoned +fine, so weak ≮ 0.5 and nothing was accepted. The weak model's competence on these questions is the +variance that makes acceptance a coin-flip. ## The finding -On these auto-generated, doc-grounded questions a small model performs as well as a frontier one, -because **the answer is extractable from the provided context** — reading beats reasoning, so model -capability does not separate and no example clears the discriminative bar. This is *not* a -model-tier problem (we used a genuine 8B-vs-frontier gap); it is a **question-difficulty** problem. +The two levers are **directionally confirmed and necessary**: a non-extractive causal challenger +(no leakage) AND a grounding doc the weak solver hasn't memorized — drop either and it nulls hard +(recall challenger leaks; the memorized Transformer paper lets the 8B recall). 
With both, the fold +**reliably widens the strong/weak gap by ~+0.20** (reproduced in both runs). -The lever is therefore the **challenger**, not the model tier: to open a real gap the challenger must -generate **non-extractive, reasoning-heavy** questions (multi-step derivations, numerical claims that -require following the paper's argument) — which is exactly the move the Autodata paper relies on -("the agent's initial attempt was usually a high-level summary question… subsequent rounds moved the -questions toward specific algorithmic steps the paper's actual argument required"). Our challenger, -on a single section, mostly produces extractable questions. Making it harder is the next experiment. +But "the discriminative reward works" is **NOT** established. Clearing the accept bar (weak must +*struggle*, < 0.5) is noisy: 0–2 accepted of 3 across runs, because `llama-3.1-8b` answers these +MoE-reasoning questions competently (0.75) about as often as it flails (0.24). At n=3 that is a +coin-flip, not a result. Honest verdict: **promising, directionally right, under-powered** — the +exact small-n shape that has repeatedly looked positive here and washed out at power. ## Status -Mechanism: proven end-to-end on real frontier models, cost-tracked, fail-loud. Empirical -discrimination: a clean null on extractive questions. The harness is now trustworthy (no empty-→0 -artifact); the open lever is challenger difficulty. +Mechanism + observability: solid (gap-widening reproduced, judge reliability checked, every attempt +dumped to a JSONL autopsy trail via `AUTODATA_ATTEMPTS` — which is how the over-claim was caught). +Empirical positive: **not yet** — acceptance is too noisy at n=3. To actually settle it: raise +`samples` (stabilize the weak mean per question), raise the slot count to n≥24, and report the +*accepted-rate* with a confidence interval — not a single lucky run. Until then this is a confirmed +direction, not a confirmed win. 
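One way to report the accepted-rate with a confidence interval, as the Status paragraph asks, is a Wilson score interval over accepted slots. The sketch below is an illustration, not something the PR ships: the helper name `wilsonInterval`, the choice of the Wilson method, and the 95% level (z = 1.96) are all assumptions.

```typescript
// Hypothetical helper: Wilson score interval for an accepted-rate of `accepted` out of `slots`.
interface RateInterval {
  rate: number
  lo: number
  hi: number
}

function wilsonInterval(accepted: number, slots: number, z = 1.96): RateInterval {
  if (!Number.isInteger(slots) || slots <= 0) throw new Error('slots must be a positive integer')
  if (!Number.isInteger(accepted) || accepted < 0 || accepted > slots)
    throw new Error('accepted must be an integer in [0, slots]')
  const p = accepted / slots
  const z2 = z * z
  const denom = 1 + z2 / slots
  // Centre is pulled toward 0.5 for small n; half-width shrinks as n grows.
  const center = (p + z2 / (2 * slots)) / denom
  const half = (z * Math.sqrt((p * (1 - p)) / slots + z2 / (4 * slots * slots))) / denom
  // Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
  return { rate: p, lo: Math.max(0, center - half), hi: Math.min(1, center + half) }
}

// The accepted counts seen at n=3, plus an illustrative (not measured) 12 of 24.
for (const [k, n] of [[0, 3], [1, 3], [12, 24]] as const) {
  const ci = wilsonInterval(k, n)
  console.log(`${k}/${n}: rate=${ci.rate.toFixed(2)} CI=[${ci.lo.toFixed(2)}, ${ci.hi.toFixed(2)}]`)
}
```

With this formula, 0 accepted of 3 still leaves an upper bound of roughly 0.56, so a single n=3 run cannot distinguish "never accepts" from "accepts about half the time". That is the sense in which the doc calls the current evidence under-powered.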
+ +## Reproduce + +``` +dotenvx run -f .env -- pnpm tsx src/autodata/run.ts # causal, default Mixtral doc +dotenvx run -f .env -- pnpm tsx src/autodata/calibrate.ts # recall-vs-causal A/B, same doc +``` diff --git a/src/autodata/build-dataset.ts b/src/autodata/build-dataset.ts index a2e1853..9f6cd68 100644 --- a/src/autodata/build-dataset.ts +++ b/src/autodata/build-dataset.ts @@ -7,16 +7,17 @@ * AND for the challenger's FIRST drafts (plain), plus the cost ledger split by role. */ -import { mkdir, writeFile } from 'node:fs/promises' +import { appendFile, mkdir, rm, writeFile } from 'node:fs/promises' import { dirname } from 'node:path' import { CostLedger } from '@tangle-network/agent-eval' import { + type AttemptRecord, createDataCreationLoop, discriminativeAcceptRule, type ExampleEvaluation, } from './data-creation-loop' import { type GroundedDoc, groundDoc } from './grounding' -import { buildAutodataRoles, type RouterCallRecord } from './router-roles' +import { buildAutodataRoles, type ChallengerStyle, type RouterCallRecord } from './router-roles' export interface DiscriminativeThresholds { minStrong?: number @@ -36,6 +37,10 @@ export interface AutodataDatasetConfig { maxRetries?: number thresholds?: DiscriminativeThresholds models?: { challenger?: string; weak?: string; strong?: string; judge?: string } + /** Challenger prompt: 'causal' (non-extractive, default) or 'recall' (the calibration baseline). */ + style?: ChallengerStyle + /** Where to write the per-attempt autopsy JSONL (every candidate, accepted or rejected). */ + attemptsPath?: string signal?: AbortSignal } @@ -65,11 +70,15 @@ export interface AutodataDatasetResult { plainGaps: number[] agenticGaps: number[] refinedGaps: number[] + /** Every evaluated candidate (accepted or rejected) with both solvers' answers — the autopsy trail. */ + attempts: AttemptRecord[] cost: CostLedger costPerExampleUsd: number | null /** How many router calls were priced by the router vs rate-estimated. 
*/ callProvenance: { router: number; estimated: number } outPath: string + /** Where the per-attempt autopsy JSONL was written (null if not requested). */ + attemptsPath: string | null } function mean(xs: number[]): number | null { @@ -80,16 +89,31 @@ function isGrounded(s: AutodataDatasetConfig['source']): s is GroundedDoc { return typeof (s as GroundedDoc).doc === 'string' } -function challengerInstruction(doc: string): string { +/** The causal (default) user instruction — pairs with the non-extractive challenger system prompt. */ +function causalInstruction(doc: string): string { return ( `SOURCE DOCUMENT EXCERPT:\n\n${doc}\n\n` + - `Write ONE hard exam question grounded in this excerpt. It must require multi-step reasoning ` + - `over the excerpt (a small model should get it wrong, a strong model right), never a verbatim ` + - `lookup. Return STRICT JSON: {"context": string, "question": string, "reference": string, ` + - `"rubric": string[] }.` + `Write ONE hard CAUSAL / COMPARATIVE / MECHANISM / THESIS-CONSISTENCY question grounded in this ` + + `excerpt — never a recall / lookup / definition. The CONTEXT must give the solver the premises ` + + `but MUST NOT state the answer; the answer has to be DERIVED. Return STRICT JSON: ` + + `{"context": string, "question": string, "reference": string, "rubric": string[] }.` ) } +/** The recall (baseline) user instruction — pairs with the extractive challenger; for calibration. */ +function recallInstruction(doc: string): string { + return ( + `SOURCE DOCUMENT EXCERPT:\n\n${doc}\n\n` + + `Write ONE exam question grounded in this excerpt, with a short context excerpt the question is ` + + `answerable from, a reference answer, and a 2-3 item rubric. Return STRICT JSON: ` + + `{"context": string, "question": string, "reference": string, "rubric": string[] }.` + ) +} + +function instructionFor(style: ChallengerStyle): (doc: string) => string { + return style === 'recall' ? 
recallInstruction : causalInstruction +} + /** Run the full pipeline: ground → loop → JSONL. Returns the calibration numbers + cost. */ export async function buildAutodataDataset( config: AutodataDatasetConfig, @@ -110,6 +134,7 @@ export async function buildAutodataDataset( } const ledger = new CostLedger() + const style: ChallengerStyle = config.style ?? 'causal' const roles = buildAutodataRoles({ apiKey: config.apiKey, @@ -118,13 +143,27 @@ export async function buildAutodataDataset( weakModel: config.models?.weak, strongModel: config.models?.strong, judgeModel: config.models?.judge, + challengerStyle: style, ledger, onCall, }) + // Per-attempt autopsy trail: every candidate (accepted or rejected) is appended as one JSONL row + // with both solvers' answer text + scores, so a null is diagnosable from the raw answers. + const attempts: AttemptRecord[] = [] + const attemptsPath = config.attemptsPath ?? null + if (attemptsPath) { + await mkdir(dirname(attemptsPath), { recursive: true }) + await rm(attemptsPath, { force: true }) + } + const onAttempt = async (rec: AttemptRecord): Promise<void> => { + attempts.push(rec) + if (attemptsPath) await appendFile(attemptsPath, `${JSON.stringify({ ...rec, style })}\n`) + } + const result = await createDataCreationLoop({ doc: source.doc, - baseInstruction: challengerInstruction, + baseInstruction: instructionFor(style), challenger: roles.challenger, weakSolver: roles.weakSolver, strongSolver: roles.strongSolver, @@ -134,6 +173,7 @@ export async function buildAutodataDataset( samples: config.samples ?? 3, maxRetries: config.maxRetries ?? 
4, cost: ledger, + onAttempt, signal: config.signal, }) @@ -164,9 +204,11 @@ export async function buildAutodataDataset( plainGaps: result.plainGaps, agenticGaps: result.agenticGaps, refinedGaps: result.refinedGaps, + attempts, cost: result.cost, costPerExampleUsd: result.cost.costPerCompletedTask(), callProvenance: provenance, outPath: config.outPath, + attemptsPath, } } diff --git a/src/autodata/calibrate.ts b/src/autodata/calibrate.ts new file mode 100644 index 0000000..b024aa7 --- /dev/null +++ b/src/autodata/calibrate.ts @@ -0,0 +1,127 @@ +/** + * Autodata calibration: does the non-extractive CAUSAL challenger widen the strong/weak gap vs the + * old RECALL challenger, on the SAME grounded document? This is the lever's proof — the prior null + * was "recall question → answer leaks into context → an 8B reads it out → gap ~0". If the fix works, + * the causal arm's mean gap (and accepted count) clears the recall arm's by a clear margin. + * + * Run (key never printed): + * dotenvx run -f /home/drew/company/devops/secrets/agent-state.env -- \ + * pnpm tsx src/autodata/calibrate.ts + * + * Env knobs: AUTODATA_URL, AUTODATA_FOCUS, AUTODATA_TARGET, AUTODATA_SAMPLES, AUTODATA_MAXRETRIES, + * AUTODATA_{WEAK,STRONG,CHALLENGER,JUDGE}_MODEL, TANGLE_API_KEY. + */ + +import { type AutodataDatasetResult, buildAutodataDataset } from './build-dataset' +import type { AttemptRecord } from './data-creation-loop' +import { DEFAULT_SOURCE_URL, groundDoc } from './grounding' +import { + CHALLENGER_MODEL, + type ChallengerStyle, + JUDGE_MODEL, + STRONG_SOLVER_MODEL, + smokeTestModels, + WEAK_SOLVER_MODEL, +} from './router-roles' + +function envInt(name: string, fallback: number): number { + const raw = process.env[name] + if (!raw) return fallback + const n = Number.parseInt(raw, 10) + if (!Number.isFinite(n) || n <= 0) throw new Error(`${name}='${raw}' is not a positive integer`) + return n +} + +function mean(xs: number[]): number | null { + return xs.length === 0 ? 
null : xs.reduce((a, b) => a + b, 0) / xs.length +} + +function fmt(x: number | null, d = 3): string { + return x === null ? 'n/a' : x.toFixed(d) +} + +/** Mean strong/weak gap over the quality-clean attempts of one arm — the discriminating power. */ +function armGap(attempts: AttemptRecord[]): number | null { + return mean(attempts.filter((a) => a.qualityOk).map((a) => a.gap)) +} + +async function runArm(args: { + apiKey: string + source: Awaited<ReturnType<typeof groundDoc>> + style: ChallengerStyle + target: number + samples: number + maxRetries: number +}): Promise<AutodataDatasetResult> { + return buildAutodataDataset({ + apiKey: args.apiKey, + source: args.source, + outPath: `data/autodata-calib-${args.style}.jsonl`, + attemptsPath: `data/autodata-calib-${args.style}-attempts.jsonl`, + style: args.style, + target: args.target, + samples: args.samples, + maxRetries: args.maxRetries, + }) +} + +async function main(): Promise<void> { + const apiKey = process.env.TANGLE_API_KEY ?? process.env.TANGLE_ROUTER_KEY + if (!apiKey) throw new Error('no TANGLE_API_KEY in env — run under dotenvx so the key is set') + + const url = process.env.AUTODATA_URL ?? DEFAULT_SOURCE_URL + const focus = process.env.AUTODATA_FOCUS ?? 
'attention' + const target = envInt('AUTODATA_TARGET', 2) + const samples = envInt('AUTODATA_SAMPLES', 2) + const maxRetries = envInt('AUTODATA_MAXRETRIES', 2) + + console.log('Autodata calibration · recall vs causal challenger (same doc)\n') + console.log( + ` challenger/judge=${CHALLENGER_MODEL}/${JUDGE_MODEL} weak=${WEAK_SOLVER_MODEL} strong=${STRONG_SOLVER_MODEL}`, + ) + + const smoke = await smokeTestModels({ + apiKey, + models: [CHALLENGER_MODEL, WEAK_SOLVER_MODEL, STRONG_SOLVER_MODEL], + }) + const dead = smoke.filter((s) => !s.ok) + if (dead.length > 0) + throw new Error(`cost gate failed — empty from: ${dead.map((d) => d.model).join(', ')}`) + console.log(' cost gate ok (all models returned content)\n') + + const source = await groundDoc({ url, focus }) + console.log(`Grounded on ${source.url} section='${source.headingPath}'\n`) + + // Same doc, same budget, only the challenger prompt changes — a clean A/B on the lever. + const recall = await runArm({ apiKey, source, style: 'recall', target, samples, maxRetries }) + const causal = await runArm({ apiKey, source, style: 'causal', target, samples, maxRetries }) + + const recallGap = armGap(recall.attempts) + const causalGap = armGap(causal.attempts) + + console.log('— Calibration result —') + console.log( + ` RECALL attempts=${recall.attempts.length} mean gap=${fmt(recallGap)} accepted=${recall.accepted.length}`, + ) + console.log( + ` CAUSAL attempts=${causal.attempts.length} mean gap=${fmt(causalGap)} accepted=${causal.accepted.length}`, + ) + if (recallGap !== null && causalGap !== null) { + const delta = causalGap - recallGap + console.log( + ` Δ (causal − recall) = ${delta >= 0 ? '+' : ''}${delta.toFixed(3)} ` + + (delta >= 0.1 + ? 
'→ the causal challenger WIDENS the gap (the lever works)' + : '→ no meaningful widening from the causal challenger (honest null)'), + ) + } + + const totalUsd = recall.cost.summary().totalCostUsd + causal.cost.summary().totalCostUsd + console.log(`\n spend: $${totalUsd.toFixed(4)} (both arms)`) + console.log(` trails: ${recall.attemptsPath} ${causal.attemptsPath}`) +} + +main().catch((err) => { + console.error(err) + process.exit(1) +}) diff --git a/src/autodata/data-creation-loop.test.ts b/src/autodata/data-creation-loop.test.ts index df917cc..ec42325 100644 --- a/src/autodata/data-creation-loop.test.ts +++ b/src/autodata/data-creation-loop.test.ts @@ -1,5 +1,6 @@ import { describe, expect, it } from 'vitest' import { + type AttemptRecord, createDataCreationLoop, discriminativeAcceptRule, qualityCheck, @@ -127,4 +128,45 @@ describe('createDataCreationLoop (offline)', () => { expect(stored).toHaveLength(2) expect(result.cost.summary().totalCostUsd).toBeGreaterThan(0) }) + + it('emits a per-attempt record (with both solvers’ answers) for every candidate, accepted or rejected', async () => { + const attempts: AttemptRecord[] = [] + const result = await createDataCreationLoop({ + doc: groundingDoc, + baseInstruction, + challenger: challengerClient(), + weakSolver: solverClient('weak'), + strongSolver: solverClient('strong'), + judge: buildRubricJudge(), + target: 1, + samples: 3, + maxRetries: 4, + onAttempt: (rec) => { + attempts.push(rec) + }, + }) + + // The first slot's first draft is the EASY example (rejected "too easy"), then the fold steers + // to a HARD example (accepted) — so we observe at least one reject AND one accept. 
+ expect(attempts.length).toBeGreaterThanOrEqual(2) + const rejected = attempts.filter((a) => !a.decision.accept) + const accepted = attempts.filter((a) => a.decision.accept) + expect(rejected.length).toBeGreaterThanOrEqual(1) + expect(accepted.length).toBeGreaterThanOrEqual(1) + expect(result.accepted).toHaveLength(1) + + // Every emitted attempt carries the raw answers + scores the autopsy needs. + for (const a of attempts) { + expect(a.weak.samples).toHaveLength(3) + expect(a.strong.samples).toHaveLength(3) + expect(typeof a.weak.samples[0]?.answer).toBe('string') + expect(a.gap).toBeCloseTo(a.strong.mean - a.weak.mean, 6) + } + + // The first attempt (iteration 0) is the plain draft and should NOT discriminate; a later + // attempt should — the fold widened the gap. + const plain = attempts.find((a) => a.iteration === 0) + expect(plain?.decision.accept).toBe(false) + expect(Math.max(...attempts.map((a) => a.gap))).toBeGreaterThan(plain?.gap ?? 0) + }) }) diff --git a/src/autodata/data-creation-loop.ts b/src/autodata/data-creation-loop.ts index 9022cbb..4d026a7 100644 --- a/src/autodata/data-creation-loop.ts +++ b/src/autodata/data-creation-loop.ts @@ -168,12 +168,26 @@ const solverOutput: OutputAdapter<{ answer: string }> = { }, } +/** One solver attempt's recorded answer + judge score — the unit of the per-example autopsy dump. */ +export interface SolverSample { + readonly answer: string + readonly score: number + readonly notes?: string +} + +/** A solver's N× sampled result: the variance-reduced mean the accept rule compares + the raw samples. 
*/ +export interface SolverEval { + readonly mean: number + readonly samples: readonly SolverSample[] +} + // ── N× solver sampling = an inline FANOUT driver over runLoop ──────────────────────────────── // // A "round" returns N independent solver tasks (no fold between them) → the kernel runs all N, // the `llmJudge`-as-validator scores each against the rubric, and we AVERAGE the N scores (the // variance-reduced estimate the accept rule compares — not argmax). runLoop already aggregated -// the N calls' cost, so we roll its total into the ledger under this solver's channel. +// the N calls' cost, so we roll its total into the ledger under this solver's channel. Each sample's +// ANSWER TEXT and score are captured (not just the mean) so a null is autopsy-able per example. async function sampleSolverScore(args: { solver: SandboxClient solverSpec: AgentRunSpec @@ -183,9 +197,10 @@ async function sampleSolverScore(args: { channel: string ledger: CostLedger signal?: AbortSignal -}): Promise<number> { +}): Promise<SolverEval> { const { solver, solverSpec, example, judge, samples, channel, ledger } = args + const collected: SolverSample[] = [] const validator: Validator<{ answer: string }> = { async validate(out, ctx) { const score = await judge.score({ @@ -193,6 +208,7 @@ async function sampleSolverScore(args: { scenario: solveScenario, signal: ctx.signal, }) + collected.push({ answer: out.answer, score: score.composite, notes: score.notes }) return { valid: !score.failed, score: score.composite, notes: score.notes } }, } @@ -225,16 +241,52 @@ async function sampleSolverScore(args: { tags: { role: channel }, }) - const scored = result.iterations.filter((it) => it.verdict).map((it) => it.verdict?.score ?? 
0) - if (scored.length === 0) + if (collected.length === 0) throw new Error(`${channel}: every solver sample errored — no score to average`) - return scored.reduce((a, b) => a + b, 0) / scored.length + const mean = collected.reduce((a, b) => a + b.score, 0) / collected.length + return { mean, samples: collected } } // ── The challenger refine driver — the FOLD ────────────────────────────────────────────────── type ChallengerDecision = 'refine' | 'accept' | 'reject' +/** + * The steer the fold appends per reject reason. The accept rule and this map are a matched pair: + * "too easy" almost always means the answer leaked into the context (recall), so the steer is to go + * non-extractive; "too hard" eases the derivation; "not discriminative" sharpens the contrast. + */ +function foldGuidance(why: string): string { + if (/leaked/i.test(why)) { + return ( + 'The reference answer leaked into the context. Rewrite so the CONTEXT holds ONLY the premises ' + + 'and the answer must be DERIVED, never quoted.' + ) + } + if (/too easy/i.test(why)) { + return ( + 'The weak solver scored too high — the answer is extractable from the context (recall / ' + + 'leakage). Ask a CAUSAL or COMPARATIVE question whose answer is NOT stated in the context and ' + + 'must be derived, and delete any sentence from the context that states the conclusion.' + ) + } + if (/too hard/i.test(why)) { + return ( + 'Even the strong solver missed it — ease it. Keep it causal, but shorten the required ' + + 'reasoning chain and make sure EVERY premise needed to derive the answer is present in the ' + + 'context (without stating the conclusion).' + ) + } + if (/not discriminative/i.test(why)) { + return ( + 'Both solvers scored similarly — sharpen the contrast: the question must need a multi-step ' + + 'derivation a small model gets wrong, while keeping every premise a strong model needs in the ' + + 'context.' + ) + } + return 'Write a new example that fixes exactly the stated problem.' 
+} + function challengerDriver( maxRetries: number, baseInstruction: (doc: string) => string, @@ -247,9 +299,9 @@ function challengerDriver( if (last?.verdict?.valid) return [] // accepted → stop if (history.length >= maxRetries) return [] // out of budget → stop // THE FOLD: read WHY the last example was rejected and rewrite the instruction to target it. - // "too easy" → make it harder; "too hard" → ease it; "leaked" → keep the answer out of context. + // Each accept-rule reason maps to a specific steer toward "just right". const why = last?.verdict?.notes ?? 'rejected' - const prompt = `${baseInstruction(task.doc)}\n\nYour previous example was REJECTED: ${why}. Write a new example that fixes exactly that.` + const prompt = `${baseInstruction(task.doc)}\n\nYour previous example was REJECTED: ${why}.\n${foldGuidance(why)}` return [{ ...task, prompt }] }, decide(history) { @@ -269,6 +321,26 @@ export interface ExampleEvaluation { readonly decision: AcceptDecision } +/** + * One fully-evaluated challenger attempt — emitted to `onAttempt` for EVERY candidate, accepted or + * rejected. The whole point is diagnosability: a null is only a finding if you can read the actual + * answers and see WHY the gap didn't open (weak read it out of the context, judge couldn't separate, + * strong erred, …). This carries the answer text both solvers produced, not just the scalar scores. + */ +export interface AttemptRecord { + /** Which target slot (outer loop index). */ + readonly slotIndex: number + /** Which refine iteration within the slot (0 = first draft / plain). */ + readonly iteration: number + readonly example: DataExample + readonly weak: SolverEval + readonly strong: SolverEval + readonly gap: number + readonly decision: AcceptDecision + /** Whether the deterministic quality gate (no leak, real rubric) passed before solving. 
*/ + readonly qualityOk: boolean +} + // ── The loop ──────────────────────────────────────────────────────────────────────────────── export interface DataCreationConfig { @@ -301,16 +373,27 @@ export interface DataCreationConfig { readonly corpus?: Corpus /** Cost ledger to record into. Default a fresh `CostLedger`. */ readonly cost?: CostLedger + /** + * Observability hook: fired once per evaluated candidate (accepted OR rejected) with both solvers' + * answer text + scores. Wire a JSONL writer here so every null is autopsy-able. Errors are not + * swallowed — a throwing observer fails the run loud. + */ + readonly onAttempt?: (rec: AttemptRecord) => void | Promise<void> readonly signal?: AbortSignal } -/** Default solver prompt: ground the answer in the context, score against the numbered rubric. */ +/** + * Default solver prompt: the premises + the question — never the rubric. The rubric is the grading + * key; showing it to the solver hands the weak model the mark scheme and closes the very gap the + * discriminative reward is trying to open. The answer is not stated in the context (the challenger + * withholds the conclusion), so the prompt tells the solver to DERIVE it. + */ function defaultRenderSolverPrompt(example: DataExample, sampleIndex: number): string { return ( - `Answer the QUESTION using only the CONTEXT.\n\n` + + `Answer the QUESTION by reasoning from the CONTEXT. The answer is NOT stated verbatim in the ` + + `context — you must derive it.\n\n` + `CONTEXT:\n${example.context}\n\n` + - `QUESTION:\n${example.question}\n\n` + - `RUBRIC (you are graded on each):\n${example.rubric.map((r, i) => `${i + 1}. ${r}`).join('\n')}\n` + + `QUESTION:\n${example.question}\n` + `[sample ${sampleIndex}]` ) } @@ -370,10 +453,11 @@ export async function createDataCreationLoop( // apply the accept rule. It stashes each iteration's evaluation so the loop can read back the // ACCEPTED one (the agentic arm) and the FIRST draft (the plain calibration baseline). 
const evaluations = new Map() + const emptyEval: SolverEval = { mean: 0, samples: [] } const validator: Validator = { async validate(example, ctx) { const quality = qualityCheck(example) - const weakScore = quality.ok + const weak = quality.ok ? await sampleSolverScore({ solver: config.weakSolver, solverSpec: weakSolverSpec, @@ -384,8 +468,8 @@ export async function createDataCreationLoop( ledger: cost, signal: ctx.signal, }) - : 0 - const strongScore = quality.ok + : emptyEval + const strong = quality.ok ? await sampleSolverScore({ solver: config.strongSolver, solverSpec: strongSolverSpec, @@ -396,12 +480,27 @@ export async function createDataCreationLoop( ledger: cost, signal: ctx.signal, }) - : 0 + : emptyEval + const weakScore = weak.mean + const strongScore = strong.mean const decision = quality.ok ? accept({ strongScore, weakScore }) : { accept: false, reason: quality.reason } const gap = strongScore - weakScore evaluations.set(ctx.iteration, { example, weakScore, strongScore, gap, decision }) + // Emit the full attempt (accept OR reject) so the null is diagnosable from the raw answers. + if (config.onAttempt) { + await config.onAttempt({ + slotIndex: i, + iteration: ctx.iteration, + example, + weak, + strong, + gap, + decision, + qualityOk: quality.ok, + }) + } return { valid: decision.accept, score: gap, notes: decision.reason } }, } @@ -424,7 +523,12 @@ export async function createDataCreationLoop( tags: { role: 'challenger' }, }) - const plain = evaluations.get(0) + // The "plain" (un-refined) baseline is the FIRST candidate that was actually evaluated — the + // earliest recorded iteration. Reading a hardcoded index 0 silently drops the baseline whenever + // the first challenger draft errored (e.g. unparseable JSON), which is exactly when the slot's + // first SUCCESSFUL draft sits at a later index. Take the min recorded iteration instead. 
+ const firstIteration = [...evaluations.keys()].sort((a, b) => a - b)[0] + const plain = firstIteration === undefined ? undefined : evaluations.get(firstIteration) if (plain) plainGaps.push(plain.gap) const slotGaps = [...evaluations.values()].map((e) => e.gap) diff --git a/src/autodata/grounding.ts b/src/autodata/grounding.ts index 130a177..29a7e3d 100644 --- a/src/autodata/grounding.ts +++ b/src/autodata/grounding.ts @@ -3,17 +3,23 @@ * (`politeFetch` → `htmlToText` → `chunkMarkdown`). Fetches the page, strips it to text, chunks it, * and selects ONE content-rich chunk as the grounding excerpt the challenger writes questions from. * - * The default source is the "Attention Is All You Need" paper via ar5iv (arXiv's LaTeX→HTML service), - * a stable real paper with multi-step-reasoning content that affords genuinely discriminating - * questions. Any arXiv / ar5iv URL works; pass a `focus` term to bias chunk selection toward a section. + * The default source is the Mixtral-of-Experts paper (arXiv 2401.04088) via ar5iv. The doc CHOICE is + * load-bearing: a hard question only separates a small solver from a frontier one if the small solver + * cannot just RECALL the answer from pretraining. The canonical "Attention Is All You Need" paper is + * the worst case — an 8B has memorized it, so even reasoning questions are answerable from memory and + * the strong/weak gap collapses (an empirically-verified null). Mixtral (Jan 2024) post-dates the 8B + * weak solver's knowledge cutoff, so it must reason from the provided context — which is where a + * non-extractive causal question opens a real gap. Any arXiv / ar5iv URL works; pass `focus` to bias + * chunk selection toward a section. */ import { chunkMarkdown } from '../chunking' import { htmlToText } from '../sources/html' import { politeFetch } from '../sources/http' -/** A stable real arXiv paper (Transformer / "Attention Is All You Need") rendered to HTML by ar5iv. 
*/ -export const DEFAULT_SOURCE_URL = 'https://ar5iv.labs.arxiv.org/html/1706.03762' +/** A stable real arXiv paper (Mixtral of Experts) rendered to HTML by ar5iv — see the note above on + * why a NON-memorized doc is required for the strong/weak gap to open. */ +export const DEFAULT_SOURCE_URL = 'https://ar5iv.labs.arxiv.org/html/2401.04088' export interface GroundDocOptions { url: string diff --git a/src/autodata/index.ts b/src/autodata/index.ts index e211f33..c1372a7 100644 --- a/src/autodata/index.ts +++ b/src/autodata/index.ts @@ -18,6 +18,7 @@ export { } from './build-dataset' export { type AcceptDecision, + type AttemptRecord, createDataCreationLoop, type DataCreationConfig, type DataCreationResult, @@ -26,6 +27,8 @@ export { type ExampleEvaluation, qualityCheck, type SolverArtifact, + type SolverEval, + type SolverSample, } from './data-creation-loop' export { DEFAULT_SOURCE_URL, @@ -37,6 +40,7 @@ export { type AutodataRoles, buildAutodataRoles, CHALLENGER_MODEL, + type ChallengerStyle, DEFAULT_BASE_URL, JUDGE_MODEL, parseDataExample, diff --git a/src/autodata/router-roles.ts b/src/autodata/router-roles.ts index 39801ef..30b5769 100644 --- a/src/autodata/router-roles.ts +++ b/src/autodata/router-roles.ts @@ -3,16 +3,20 @@ * * One transport seam — `routerChat` — POSTs `/chat/completions` and returns content + exact token * usage + a per-call USD cost (the router's own cost when it returns one, else a documented - * rate-table estimate over the exact token counts; the source is flagged, never silently faked). 
- * The four roles are materialized on top of it: - * • challenger (glm-5.2) → an `inProcessSandboxClient` that asks for ONE JSON example and parses it - * • weak solver (qwen-2.5-7b) / strong solver (qwen3-235b) → `inProcessSandboxClient` answer workers - * • judge (glm-5.2) → an `llmJudge` `JudgeConfig` whose transport is a `sandbox-sdk` ChatClient - * wrapping `routerChat`; the judge's own spend is recorded into the same `CostLedger` (the loop - * only aggregates challenger + solver spend, so the judge channel would otherwise be invisible). + * rate-table estimate over the exact token counts; the source is flagged, never silently faked). It + * retries only TRANSIENT failures (the router's "upstream capacity, retry shortly" 503s, 429/502/504, + * network blips, per-request timeouts) with bounded backoff; a non-transient non-2xx fails loud. + * The four roles are materialized on top of it (all models env-overridable — see the constants below): + * • challenger (`deepseek-v4-flash`) → an `inProcessSandboxClient` that authors ONE NON-EXTRACTIVE + * causal/comparative/mechanism/thesis-consistency JSON example and parses it. + * • weak solver (`groq/llama-3.1-8b-instant`) / strong solver (`gemini-2.5-pro`) → answer workers. + * • judge (`deepseek-v4-flash`) → an `llmJudge` `JudgeConfig` whose transport is a `sandbox-sdk` + * ChatClient wrapping `routerChat`; the judge's own spend is recorded into the same `CostLedger` + * (the loop only aggregates challenger + solver spend, so the judge channel is recorded here). * - * glm-5.2 returns empty content unless `max_tokens` is generous, so every glm call is floored and the - * judge is built with an explicit `maxTokens`. + * A reasoning model spends its budget on hidden reasoning and returns EMPTY visible content when + * `max_tokens` is too low (a glm/gemini footgun), so every call is floored and solvers fail loud on + * empty content rather than scoring a non-answer as 0 (which would corrupt the gap). 
*/ import { @@ -30,20 +34,20 @@ import type { DataExample, SolverArtifact } from './data-creation-loop' export const DEFAULT_BASE_URL = 'https://router.tangle.tools/v1' -// A genuine small-vs-large tier in one model family. The brief specified the Qwen tier -// (`qwen/qwen-2.5-7b-instruct` weak, `qwen/qwen3-235b-a22b` strong), but on the live Tangle router -// EVERY Qwen id 401s `No API key configured for model` for this key — the Qwen upstream is not -// provisioned (verified by probing `/v1/chat/completions` across the `/v1/models` catalog). The -// GLM family IS served, so the real tier here is the smallest GLM (`glm-4.5-air`) as the weak solver -// vs the latest (`glm-5.2`) as the strong solver. Same family, a real generational/size gap; swap -// these constants back to the Qwen ids once the router provisions that upstream. +// The proven-working tier on the live Tangle router, every id env-overridable: +// • weak solver `groq/llama-3.1-8b-instant` — an 8B whose knowledge cutoff predates the default +// grounding doc, so on non-memorized content it must REASON from the context (it can't recall), +// which is what lets a hard causal question separate it from a frontier solver. +// • strong solver `gemini-2.5-pro` — a frontier reasoner (a real wide capability gap vs the 8B). +// • challenger + judge `deepseek-v4-flash` — a capable, fast, RELIABLE author/grader that is a +// DIFFERENT family from both solvers (so the judge does not favour either solver's style). The +// brief's `glm-5.2` works too when the router has GLM capacity; swap it back via env when it is up. // The solver tier is the experiment's load-bearing knob — a real strong>weak capability gap is -// required for any example to clear the discriminative bar. Overridable by env so the tier can be -// swept without a code change (e.g. AUTODATA_STRONG_MODEL=gemini-2.5-pro AUTODATA_WEAK_MODEL=groq/llama-3.1-8b-instant). -export const WEAK_SOLVER_MODEL = process.env.AUTODATA_WEAK_MODEL ?? 
'glm-4.5-air' -export const STRONG_SOLVER_MODEL = process.env.AUTODATA_STRONG_MODEL ?? 'glm-5.2' -export const CHALLENGER_MODEL = process.env.AUTODATA_CHALLENGER_MODEL ?? 'glm-5.2' -export const JUDGE_MODEL = process.env.AUTODATA_JUDGE_MODEL ?? 'glm-5.2' +// required for any example to clear the discriminative bar. +export const WEAK_SOLVER_MODEL = process.env.AUTODATA_WEAK_MODEL ?? 'groq/llama-3.1-8b-instant' +export const STRONG_SOLVER_MODEL = process.env.AUTODATA_STRONG_MODEL ?? 'gemini-2.5-pro' +export const CHALLENGER_MODEL = process.env.AUTODATA_CHALLENGER_MODEL ?? 'deepseek-v4-flash' +export const JUDGE_MODEL = process.env.AUTODATA_JUDGE_MODEL ?? 'deepseek-v4-flash' interface ModelPrice { /** USD per 1M input tokens. */ @@ -60,7 +64,9 @@ interface ModelPrice { */ const PRICE_TABLE: Record<string, ModelPrice> = { 'glm-4.5-air': { inputPerM: 0.2, outputPerM: 0.6 }, + 'glm-4.6': { inputPerM: 0.6, outputPerM: 2.2 }, 'glm-5.2': { inputPerM: 0.95, outputPerM: 3.0 }, + 'deepseek-v4-flash': { inputPerM: 0.27, outputPerM: 0.41 }, // Wide-tier solver pair (a genuine small-vs-frontier capability gap). Approximate router rates. 'groq/llama-3.1-8b-instant': { inputPerM: 0.05, outputPerM: 0.08 }, 'gemini-2.5-pro': { inputPerM: 1.25, outputPerM: 10.0 }, @@ -87,6 +93,10 @@ export interface RouterChatInput { jsonMode?: boolean signal?: AbortSignal onCall?: (rec: RouterCallRecord) => void + /** Per-request deadline so a stalled upstream can't hang the loop. Default 60s. */ + timeoutMs?: number + /** Bounded retries on TRANSIENT failures (503/429/502/504, network, timeout). Default 4. */ + maxRetries?: number } export interface RouterChatResult { @@ -128,27 +138,78 @@ function estimateCostUsd(model: string, promptTokens: number, completionTokens: * exact prompt/completion token counts, and a USD cost (router-reported when present, else * rate-estimated over the real token counts) with its source flagged. 
*/ +/** Transient upstream statuses the router itself tells us to "retry shortly" — safe to re-issue. */ +const transientStatuses = new Set([429, 502, 503, 504]) + +function sleep(ms: number): Promise<void> { + return new Promise((resolve) => setTimeout(resolve, ms)) +} + +/** Exponential backoff with jitter: ~1s, 2s, 4s, 8s (capped 10s) — bounded by maxRetries. */ +function backoffMs(attempt: number): number { + return Math.min(10_000, 2 ** attempt * 1000) + Math.floor(Math.random() * 250) +} + export async function routerChat(input: RouterChatInput): Promise<RouterChatResult> { const baseUrl = (input.baseUrl ?? DEFAULT_BASE_URL).replace(/\/$/, '') const max_tokens = Math.max(input.maxTokens, maxTokensFloor(input.model)) - const res = await fetch(`${baseUrl}/chat/completions`, { - method: 'POST', - headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${input.apiKey}` }, - signal: input.signal, - body: JSON.stringify({ - model: input.model, - messages: input.messages, - max_tokens, - temperature: input.temperature ?? 0.2, - stream: false, - ...(input.jsonMode ? { response_format: { type: 'json_object' } } : {}), - }), + const timeoutMs = input.timeoutMs ?? 60_000 + const maxRetries = input.maxRetries ?? 4 + const payload = JSON.stringify({ + model: input.model, + messages: input.messages, + max_tokens, + temperature: input.temperature ?? 0.2, + stream: false, + ...(input.jsonMode ? { response_format: { type: 'json_object' } } : {}), }) - if (!res.ok) { - const detail = await res.text().catch(() => res.statusText) - throw new Error(`router ${res.status} for ${input.model}: ${detail.slice(0, 400)}`) + + // One non-streaming chat call, retried only on TRANSIENT failures (the router's own + // "upstream capacity, retry shortly" 503s, plus 429/502/504, network errors, and per-request + // timeouts). A non-transient non-2xx (401/400/404) fails loud immediately — never silently. 
+ let body: Record | undefined + let lastTransient = '' + for (let attempt = 0; attempt <= maxRetries; attempt++) { + // Combine the caller's abort with a per-request deadline so a stalled upstream can't hang us. + const deadline = AbortSignal.timeout(timeoutMs) + const signal = input.signal ? AbortSignal.any([input.signal, deadline]) : deadline + let res: Response + try { + res = await fetch(`${baseUrl}/chat/completions`, { + method: 'POST', + headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${input.apiKey}` }, + signal, + body: payload, + }) + } catch (err) { + // The caller's own abort is final; a timeout/network blip is transient and retryable. + if (input.signal?.aborted) throw err + lastTransient = `network/timeout: ${String(err).slice(0, 120)}` + if (attempt < maxRetries) { + await sleep(backoffMs(attempt)) + continue + } + throw new Error( + `router call for ${input.model} failed after ${attempt + 1} tries — ${lastTransient}`, + ) + } + if (!res.ok) { + const detail = await res.text().catch(() => res.statusText) + if (transientStatuses.has(res.status) && attempt < maxRetries) { + lastTransient = `${res.status}: ${detail.slice(0, 120)}` + await sleep(backoffMs(attempt)) + continue + } + throw new Error(`router ${res.status} for ${input.model}: ${detail.slice(0, 400)}`) + } + body = (await res.json()) as Record + break + } + if (!body) { + throw new Error( + `router call for ${input.model} exhausted ${maxRetries + 1} tries — ${lastTransient}`, + ) } - const body = (await res.json()) as Record const choice = (body.choices as { message?: { content?: string }; finish_reason?: string }[])?.[0] const usage = (body.usage ?? {}) as { prompt_tokens?: number; completion_tokens?: number } const promptTokens = usage.prompt_tokens ?? 0 @@ -232,20 +293,68 @@ export function parseDataExample(text: string): DataExample { // ── The roles ───────────────────────────────────────────────────────────────────────────────── +// The non-extractive challenger. 
The prior loop nulled because the context LEAKED the answer: +// the question was recall/lookup, so an 8B read it out as well as a frontier model and the gap +// collapsed to ~0. The fix is the paper's: ask CAUSAL / COMPARATIVE / MECHANISM / THESIS-CONSISTENCY +// questions whose answer is an INFERENCE, and withhold the conclusion from the context the solver +// sees ("problems only, no solution") so the answer must be DERIVED, never quoted. const challengerSystem = - 'You write ONE hard exam question from a source document. The question must require multi-step ' + - 'reasoning a small model would get wrong but a strong model would get right — never a verbatim ' + - 'lookup. Return STRICT JSON and nothing else: ' + - '{"context": string, "question": string, "reference": string, "rubric": string[] }. ' + - 'The context is a short excerpt from the document; the question must NOT be answerable by copying ' + - 'a sentence; the reference is the correct answer; the rubric is 2-3 scoring criteria. ' + - 'Do NOT put the reference answer verbatim inside the context.' + 'You are an exam author. From a source excerpt, write ONE hard question that tests REASONING, ' + + 'not recall.\n\n' + + 'The question MUST be exactly one of these kinds:\n' + + ' • CAUSAL — "why does X fail / what breaks if Y is omitted".\n' + + ' • COMPARATIVE — "how does the tradeoff of X differ from Y, and why".\n' + + ' • MECHANISM — "walk through how X produces Y; what fails if a step is skipped".\n' + + ' • THESIS-CONSISTENCY — "which of two explanations the text offers is more consistent with its ' + + 'overall conclusion, and how would the other undermine it".\n' + + 'It must NEVER be recall / lookup / definition / enumeration ("what is X", "which header", ' + + '"list the steps", "name the ...").\n\n' + + 'ANTI-LEAKAGE (mandatory):\n' + + ' • The CONTEXT must contain ONLY the premises/evidence the solver needs. 
It MUST NOT contain a ' + + 'sentence that states the answer or the conclusion — the answer has to be DERIVED from the ' + + 'premises, not quotable from the context.\n' + + ' • The REFERENCE is the correct DERIVED conclusion plus its reasoning chain. It must NOT appear ' + + 'verbatim in the context.\n\n' + + 'The RUBRIC is 2-3 criteria that reward the REASONING STEPS (e.g. "identifies that X depends on ' + + 'Y", "explains why removing Y causes the failure", "ties it to the stated conclusion") — never ' + + '"mentions the keyword".\n\n' + + 'Return STRICT JSON and nothing else: ' + + '{"context": string, "question": string, "reference": string, "rubric": string[] }.' + +// The recall / extractive challenger — the prior (nulling) behavior, kept ONLY as the calibration +// baseline. It writes a normal question answerable straight from the excerpt, so the answer is in +// the context and a small model reads it out as well as a frontier one. The causal vs recall gap is +// the calibration that proves the lever. +const recallChallengerSystem = + 'You write ONE exam question from a source excerpt. Provide a short CONTEXT excerpt the question ' + + 'is answerable from, the QUESTION, the REFERENCE answer, and a 2-3 item RUBRIC. ' + + 'Return STRICT JSON and nothing else: ' + + '{"context": string, "question": string, "reference": string, "rubric": string[] }.' + +/** Which challenger prompt to materialize: the non-extractive causal author, or the recall baseline. */ +export type ChallengerStyle = 'causal' | 'recall' +function challengerSystemFor(style: ChallengerStyle): string { + return style === 'recall' ? recallChallengerSystem : challengerSystem +} + +// The reasoning judge. It sees the CONTEXT the solver saw, so it can tell a derived answer from one +// that merely restates the context. 
The `reasoning` dimension IS the negative criterion: an answer +// that paraphrases the context without deriving the conclusion scores near 0 there — which is what +// pulls a recall-style weak answer below the strong model's derivation and opens the gap. const judgeSystem = - 'You are grading a candidate ANSWER to a question against a RUBRIC and a REFERENCE answer. ' + - 'Return JSON {"dimensions":{"rubric_coverage":N,"correctness":N},"notes":"..."} with each score ' + - 'in [0,1]. rubric_coverage = the fraction of rubric criteria the answer satisfies; correctness = ' + - 'how well the answer agrees with the reference. Be strict: a vague or partial answer scores low.' + 'You grade a candidate ANSWER to a REASONING question. You are given the CONTEXT the solver was ' + + 'shown, the REFERENCE answer, and the RUBRIC.\n' + + 'Return JSON {"dimensions":{"rubric_coverage":N,"correctness":N,"reasoning":N},"notes":"..."} ' + + 'with each score in [0,1].\n' + + ' • rubric_coverage = fraction of the rubric criteria the answer genuinely satisfies.\n' + + ' • correctness = how well the DERIVED conclusion agrees with the reference.\n' + + ' • reasoning = quality of the DERIVATION. Score HIGH only if the answer works through WHY/HOW ' + + 'from the premises. Score near 0 if it merely RESTATES or QUOTES the context, asserts the ' + + 'conclusion without justifying it, or is vague.\n' + + 'NEGATIVE CRITERION: an answer that just paraphrases the context without deriving the conclusion ' + + 'is a recall answer — it must score LOW on reasoning AND correctness, no matter how many keywords ' + + 'it echoes. Be strict.' export interface RouterRolesConfig { apiKey: string @@ -254,6 +363,8 @@ export interface RouterRolesConfig { weakModel?: string strongModel?: string judgeModel?: string + /** Challenger prompt: 'causal' (non-extractive, default) or 'recall' (the calibration baseline). 
*/ + challengerStyle?: ChallengerStyle /** Judge spend is recorded here directly (the loop captures only challenger + solver spend). */ ledger: CostLedger /** Optional sink for every router call's cost provenance. */ @@ -309,6 +420,7 @@ function solverClient(cfg: RouterRolesConfig, model: string): SandboxClient { function challengerClient(cfg: RouterRolesConfig): SandboxClient { const model = cfg.challengerModel ?? CHALLENGER_MODEL + const system = challengerSystemFor(cfg.challengerStyle ?? 'causal') return inProcessSandboxClient({ onPrompt: async (prompt, ctx): Promise => { const r = await routerChat({ @@ -316,7 +428,7 @@ function challengerClient(cfg: RouterRolesConfig): SandboxClient { baseUrl: cfg.baseUrl, model, messages: [ - { role: 'system', content: challengerSystem }, + { role: 'system', content: system }, { role: 'user', content: prompt }, ], maxTokens: 1500, @@ -399,9 +511,18 @@ function rubricJudge(cfg: RouterRolesConfig): JudgeConfig { description: 'fraction of the rubric criteria the answer satisfies', }, { key: 'correctness', description: 'agreement with the reference answer' }, + { + key: 'reasoning', + description: + 'quality of the derivation; near 0 if the answer merely restates/quotes the context', + }, ], scale: 'unit', + // The judge sees the CONTEXT so it can distinguish a derived answer from a restated one (the + // negative criterion). Without it, a paraphrase of the context is indistinguishable from real + // reasoning and the gap stays closed. renderUser: ({ artifact }) => + `CONTEXT THE SOLVER WAS GIVEN:\n${artifact.example.context}\n\n` + `REFERENCE ANSWER:\n${artifact.example.reference}\n\n` + `RUBRIC:\n${artifact.example.rubric.map((r, i) => `${i + 1}. 
${r}`).join('\n')}\n\n` + `CANDIDATE ANSWER:\n${artifact.answer}`, @@ -412,6 +533,7 @@ function rubricJudge(cfg: RouterRolesConfig): JudgeConfig { export function buildAutodataRoles(cfg: RouterRolesConfig): AutodataRoles { return { challenger: challengerClient(cfg), + // weak/strong solvers + judge are style-independent; only the challenger prompt changes. weakSolver: solverClient(cfg, cfg.weakModel ?? WEAK_SOLVER_MODEL), strongSolver: solverClient(cfg, cfg.strongModel ?? STRONG_SOLVER_MODEL), judge: rubricJudge(cfg), diff --git a/src/autodata/run.ts b/src/autodata/run.ts index cca767f..928b69c 100644 --- a/src/autodata/run.ts +++ b/src/autodata/run.ts @@ -8,7 +8,8 @@ * pnpm tsx src/autodata/run.ts * * Env knobs: AUTODATA_URL, AUTODATA_FOCUS, AUTODATA_TARGET, AUTODATA_SAMPLES, AUTODATA_MAXRETRIES, - * AUTODATA_OUT, TANGLE_API_KEY (or TANGLE_ROUTER_KEY). + * AUTODATA_OUT, AUTODATA_ATTEMPTS (per-attempt autopsy JSONL), + * AUTODATA_{WEAK,STRONG,CHALLENGER,JUDGE}_MODEL, TANGLE_API_KEY (or TANGLE_ROUTER_KEY). */ import { buildAutodataDataset } from './build-dataset' @@ -42,6 +43,7 @@ async function main(): Promise { const samples = envInt('AUTODATA_SAMPLES', 3) const maxRetries = envInt('AUTODATA_MAXRETRIES', 4) const outPath = process.env.AUTODATA_OUT ?? 'data/autodata-dataset.jsonl' + const attemptsPath = process.env.AUTODATA_ATTEMPTS ?? 'data/autodata-attempts.jsonl' // ── 1. COST GATE: one cheap call per model, all must return non-empty content before the burn ── console.log('Autodata · cost gate (one call per model)\n') @@ -77,6 +79,7 @@ async function main(): Promise { apiKey, source: grounded, outPath, + attemptsPath, target, samples, maxRetries, @@ -92,6 +95,27 @@ async function main(): Promise { console.log(` ${ex.decision.reason}`) } + // ── 4b. Autopsy: the single widest-gap attempt, with BOTH solvers' actual answers ── + // A gap number is only a finding if you can read why it opened. 
Show the strongest discrimination + // we saw (highest gap, accepted or not) so a human can confirm it is real reasoning, not an + // artifact: the weak model should genuinely fail the reasoning and the strong model get it. + const best = result.attempts.filter((a) => a.qualityOk).sort((a, b) => b.gap - a.gap)[0] + if (best) { + const oneLine = (s: string): string => s.replace(/\s+/g, ' ').trim() + console.log('\n— Autopsy: widest-gap attempt (read the answers, confirm real discrimination) —') + console.log(` Q: ${oneLine(best.example.question)}`) + console.log(` reference: ${oneLine(best.example.reference).slice(0, 240)}`) + console.log( + ` gap=${best.gap.toFixed(2)} (${best.decision.accept ? 'ACCEPTED' : 'rejected'}: ${best.decision.reason})`, + ) + console.log(` WEAK mean=${best.weak.mean.toFixed(2)}`) + for (const [i, s] of best.weak.samples.entries()) + console.log(` [w${i} score=${s.score.toFixed(2)}] ${oneLine(s.answer).slice(0, 220)}`) + console.log(` STRONG mean=${best.strong.mean.toFixed(2)}`) + for (const [i, s] of best.strong.samples.entries()) + console.log(` [s${i} score=${s.score.toFixed(2)}] ${oneLine(s.answer).slice(0, 220)}`) + } + // ── 5. 
The empirical calibration (paper Table 1) ── console.log('\n— Calibration: plain first-draft gap vs agentic loop-accepted gap —') console.log( @@ -118,7 +142,8 @@ async function main(): Promise { } if (result.accepted.length === 0) { console.log( - ' NOTE: 0 examples cleared the discriminative accept bar — the two GLM tiers did not separate.', + ` NOTE: 0 examples cleared the discriminative accept bar — ${WEAK_SOLVER_MODEL} and ` + + `${STRONG_SOLVER_MODEL} did not separate on these questions (see the autopsy trail).`, ) } @@ -142,6 +167,11 @@ async function main(): Promise { ) console.log(`\n— Dataset — ${result.rows.length} row(s) written to ${result.outPath}`) + if (result.attemptsPath) { + console.log( + `— Autopsy trail — ${result.attempts.length} attempt(s) (accepted + rejected) at ${result.attemptsPath}`, + ) + } } main().catch((err) => {