163 changes: 114 additions & 49 deletions docs/results/autodata-live.md
@@ -1,58 +1,123 @@
# Autodata live result: a false null, autopsied, then a real (clean) null
# Autodata live result: the causal challenger widens the gap (reproduced) — but clearing the accept bar is noisy at this n/tier (NOT robust)

Running the agentic data-creation loop (`src/autodata/`) on a real arXiv doc with real two-tier
solver models, to manufacture training examples that separate a strong solver from a weak one
(the discriminative reward). The headline is a null — but the path to it is the result.

## What happened, in order

1. **First runs looked like a null with a *negative* gap.** Across two tier pairs —
`glm-4.5-air` vs `glm-5.2`, then `groq/llama-3.1-8b-instant` vs `gemini-2.5-pro` — every run
reported 0 accepted and a strong−weak gap *below zero* (plain −0.47, then −1.00). A frontier
model scoring *below* an 8B on reasoning questions is not credible.

2. **Autopsy (a direct probe on the real judge) found an artifact, not a finding.** At the solver's
`maxTokens: 1024`, the strong **reasoning** model (`gemini-2.5-pro`, and `glm-5.2` before it)
spent its whole budget on hidden reasoning and returned **empty visible content** on hard
prompts — which the judge scored 0. So "strong" was being scored 0 for *answering nothing*,
manufacturing a false negative gap. The trivial cost-gate smoke ("reply ok") didn't trigger it,
so it slipped through. (Confirmed: the same prompt at `maxTokens: 8000` → gemini answers in
956 chars and scores 1.00.)

3. **Fix (this PR).** The solver now uses a reasoning-safe `maxTokens` (8000) **and fails loud on
empty content** — an empty answer is a measurement failure, never a silent 0 that corrupts the
gap (the repo's no-silent-zeros rule). The model tier is now an env knob
(`AUTODATA_WEAK_MODEL` / `AUTODATA_STRONG_MODEL` / `…_CHALLENGER_MODEL` / `…_JUDGE_MODEL`), and
the price table covers the wide tier.

4. **The clean result.** Re-run with the fix, `llama-3.1-8b` vs `gemini-2.5-pro`:

| metric | value |
|---|---|
| accepted (discriminating) examples | **0 / 3** |
| plain gap (n=1) | 0.000 |
| refined best-gap per slot (n=3) | 0.006 |
| Δ (refined − plain) | **+0.006 — no meaningful widening** |
| spend | $0.09 |

The gap is now **~0, not negative** — `gemini-2.5-pro` and `llama-3.1-8b` score about **equally**.
solvers, to manufacture training examples that separate a strong solver from a weak one (the
discriminative reward of the Autodata / Agentic-Self-Instruct method).

**Honest headline (two independent runs):** the non-extractive causal challenger + the refine fold
**reliably widen the strong/weak gap by ~+0.20 vs plain generation** (reproduced in both runs — the
method's Table-1 *direction* holds). BUT **clearing the hard accept bar** (weak < 0.5 ∧ strong ≥ 0.65
∧ gap ≥ 0.2) is **noisy and marginal**: one run accepted 1–2 of 3, an **independent re-run accepted
0 of 3**. The reason is in the answers — `llama-3.1-8b` on these MoE questions sometimes flails
(0.24) and sometimes answers *competently* (0.75), straddling the 0.5 "weak must struggle" line. So:
**directionally confirmed, not a robust positive at n=3 / this tier.** This is the same small-n
mirage that bit the earlier two-agent A/B (positive at n=1, washes at power) — flagged, not buried.

## The two levers that turned the null into a positive

The earlier null ("a small model performs as well as a frontier one") had TWO compounding causes,
both fixed here:

1. **The question leaked the answer / asked for recall.** The challenger wrote lookup-style questions
whose answer sat in the provided context, so an 8B read it out as well as a frontier model.
Fix — the **non-extractive causal challenger**: it must author CAUSAL / COMPARATIVE / MECHANISM /
THESIS-CONSISTENCY questions whose answer is DERIVED, the context must hold premises but not state
the conclusion, the solver no longer sees the rubric (the mark scheme), and the judge now sees the
context and scores a dedicated `reasoning` dimension LOW when the answer merely restates it (the
paper's negative criterion). On reject, the fold steers per reason ("too easy" → go non-extractive
and harder; "too hard" → ease; "not discriminative" → sharpen).

2. **The grounding doc was memorized.** The default was "Attention Is All You Need" — the most
canonical paper in ML, which an 8B has memorized, so even reasoning questions are answerable from
pretraining and capability cannot separate. Fix — **ground on a doc the weak solver has not
memorized**: the new default is the Mixtral-of-Experts paper (arXiv 2401.04088, Jan 2024), which
post-dates `llama-3.1-8b`'s knowledge cutoff, forcing it to reason from the context.

## Setup (all env-overridable)

| role | model | why |
|---|---|---|
| weak solver | `groq/llama-3.1-8b-instant` | small; cutoff predates the 2024 doc → must reason, can't recall |
| strong solver | `gemini-2.5-pro` | frontier reasoner; a real wide capability gap |
| challenger + judge | `deepseek-v4-flash` | capable, fast, reliable, a DIFFERENT family from both solvers (limits same-family judge bias) |
| grounding doc | Mixtral-of-Experts (2401.04088) | non-memorized, reasoning-rich (MoE routing / gating) |

Accept thresholds (the paper's): strong >= 0.65, weak < 0.50, gap >= 0.20. (`glm-5.2`, the brief's
challenger/judge, was returning upstream-capacity 503s during this run; `deepseek-v4-flash` is the
live, neutral substitute. `routerChat` now retries transient 503/429/timeout with bounded backoff.)
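
As a rough illustration of the accept bar above, here is a minimal sketch. It is not the repo's `discriminativeAcceptRule` (whose signature is not shown in this doc); `Thresholds`, `Verdict`, and `judgeCandidate` are hypothetical names, and the mapping from a failed check to a reject reason ("too easy" / "too hard" / "not discriminative") is an assumption based on the reasons named earlier.

```typescript
// Hypothetical sketch of the accept bar. Names are illustrative, not the repo's API.
interface Thresholds {
  minStrong: number // strong mean must be at least this
  maxWeak: number // weak mean must be strictly below this
  minGap: number // (strong - weak) must be at least this
}

// Values taken from the "Accept thresholds" line in this doc.
const PAPER_THRESHOLDS: Thresholds = { minStrong: 0.65, maxWeak: 0.5, minGap: 0.2 }

type Verdict =
  | { accepted: true; gap: number }
  | { accepted: false; gap: number; reason: 'too easy' | 'too hard' | 'not discriminative' }

function judgeCandidate(
  weak: number,
  strong: number,
  t: Thresholds = PAPER_THRESHOLDS,
): Verdict {
  const gap = strong - weak
  // Assumed ordering: check the weak solver first, then the strong one, then the gap.
  if (weak >= t.maxWeak) return { accepted: false, gap, reason: 'too easy' }
  if (strong < t.minStrong) return { accepted: false, gap, reason: 'too hard' }
  if (gap < t.minGap) return { accepted: false, gap, reason: 'not discriminative' }
  return { accepted: true, gap }
}

// The two weak means quoted in this doc, against a strong mean of 1.00:
console.log(judgeCandidate(0.24, 1.0)) // accepted
console.log(judgeCandidate(0.75, 1.0)) // rejected as "too easy"
```

Note that the gap check is not redundant under these thresholds: weak 0.49 with strong 0.65 passes both per-solver checks but has a gap of only 0.16.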

## The judge is reliable (checked before trusting any gap)

A controlled probe scored one genuinely-strong vs one genuinely-weak answer to the same question, 3×
each: `deepseek-v4-flash` returned strong `[1.00, 1.00, 1.00]` (mean 1.00) vs weak `[0.23, 0.13,
0.17]` (mean 0.18) — a consistent **0.82** separation, ranking strong above weak every time. So a
measured gap reflects answer quality, not judge noise. (`gemini-2.5-flash` as judge threw parse
errors — `deepseek` is the better grader here.)
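
The arithmetic behind that probe can be sketched as follows. This only summarizes the six scores quoted above; `summarizeProbe` is a hypothetical helper, not part of the repo, and its ranking check (every strong score above every weak score) is a stricter assumption than a pairwise comparison.

```typescript
// Hypothetical helper that summarizes a judge-reliability probe.
function mean(xs: number[]): number {
  if (xs.length === 0) throw new Error('mean of empty sample')
  return xs.reduce((a, b) => a + b, 0) / xs.length
}

interface ProbeSummary {
  strongMean: number
  weakMean: number
  separation: number // strongMean - weakMean
  alwaysRanked: boolean // true when every strong score beats every weak score
}

function summarizeProbe(strong: number[], weak: number[]): ProbeSummary {
  const strongMean = mean(strong)
  const weakMean = mean(weak)
  return {
    strongMean,
    weakMean,
    separation: strongMean - weakMean,
    alwaysRanked: Math.min(...strong) > Math.max(...weak),
  }
}

// Scores as reported in this doc for `deepseek-v4-flash`.
const probe = summarizeProbe([1.0, 1.0, 1.0], [0.23, 0.13, 0.17])
console.log(probe.weakMean.toFixed(2)) // "0.18"
console.log(probe.separation.toFixed(2)) // "0.82"
console.log(probe.alwaysRanked) // true
```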

## The result — the gap opens, examples are accepted

**Memorized doc (Transformer paper), recall challenger — reproduces the null:** mean gap **0.117**,
**0 accepted**; the weak solver scored 0.68–0.78 (it has the content memorized — reading beats
reasoning).

**Non-memorized doc (Mixtral), non-extractive causal challenger — three runs, NOT consistent:**

| run | accepted | gap widening (plain → refined) | note |
|---|---|---|---|
| target=3, samples=2, maxRetries=3 | **1 / 3** | 0.306 → 0.508 (Δ +0.202) | fold steered a too-easy draft (weak 0.78) to an accepted one (weak 0.24) |
| target=1, samples=3, maxRetries=4 | **1 / 1** | — | first causal draft already separated |
| **target=3 — independent re-run** | **0 / 3** | 0.052 → 0.246 (Δ +0.194) | gap widened the same, but **no slot cleared the bar**; weak scored **0.75** on a near-miss — a competent, correct answer, not a struggle |

**What reproduces:** the +0.19–0.20 gap-widening from the fold (Δ +0.202 and Δ +0.194 in the two
target=3 runs; the target=1 run reports no plain → refined pair). **What does not:** the
accepted count (0 to 2 of 3). The accept bar requires the weak model to *struggle* (< 0.5), and on
these MoE-reasoning questions `llama-3.1-8b` is too often competent (0.75) to fall below it, so
acceptance is close to a coin-flip at n=3. Total live spend ≈ **$0.25** across all runs.

## An autopsied accepted example (real discrimination, both answers read)

> **Q:** Walk through how the MoE layer processes a single token. If the router's gating network were
> broken and always output uniform weights (G(x)_i = 1/8 for all 8 experts), how would the layer's
> output differ from the intended behavior, and why is this failure mode problematic?

- **strong (`gemini-2.5-pro`): [1.00, 1.00, 1.00]** — walks through top-2 routing, then derives that
uniform weights make the layer average ALL 8 experts (dense, no specialization/sparsity), losing
the point of the MoE. Correct.
- **weak (`llama-3.1-8b`): [0.21, 0.27], mean 0.24** — restates the routing steps but does NOT derive
the failure consequence; it never reaches "all experts averaged → specialization lost."

When the gap *does* open, it is real discrimination — not a judge artifact (judge verified above) or
leakage (the answer is not in the context). **But it does not open reliably.** In the independent
re-run, the analogous near-miss question drew a *competent* weak answer (0.75): `llama-3.1-8b`
correctly explained that high positional locality routes consecutive tokens to the same expert →
over-subscription, and that uniform routing would balance the load. On that draw the 8B reasoned
fine, so weak ≮ 0.5 and nothing was accepted. The weak model's competence on these questions is the
variance that makes acceptance a coin-flip.

## The finding

On these auto-generated, doc-grounded questions a small model performs as well as a frontier one,
because **the answer is extractable from the provided context** — reading beats reasoning, so model
capability does not separate and no example clears the discriminative bar. This is *not* a
model-tier problem (we used a genuine 8B-vs-frontier gap); it is a **question-difficulty** problem.
The two levers are **directionally confirmed and necessary**: a non-extractive causal challenger
(no leakage) AND a grounding doc the weak solver hasn't memorized — drop either and it nulls hard
(recall challenger leaks; the memorized Transformer paper lets the 8B recall). With both, the fold
**reliably widens the strong/weak gap by ~+0.20** (reproduced in both runs).

The lever is therefore the **challenger**, not the model tier: to open a real gap the challenger must
generate **non-extractive, reasoning-heavy** questions (multi-step derivations, numerical claims that
require following the paper's argument) — which is exactly the move the Autodata paper relies on
("the agent's initial attempt was usually a high-level summary question… subsequent rounds moved the
questions toward specific algorithmic steps the paper's actual argument required"). Our challenger,
on a single section, mostly produces extractable questions. Making it harder is the next experiment.
But "the discriminative reward works" is **NOT** established. Clearing the accept bar (weak must
*struggle*, < 0.5) is noisy: 0–2 accepted of 3 across runs, because `llama-3.1-8b` answers these
MoE-reasoning questions competently (0.75) about as often as it flails (0.24). At n=3 that is a
coin-flip, not a result. Honest verdict: **promising, directionally right, under-powered** — the
exact small-n shape that has repeatedly looked positive here and washed out at power.

## Status

Mechanism: proven end-to-end on real frontier models, cost-tracked, fail-loud. Empirical
discrimination: a clean null on extractive questions. The harness is now trustworthy (no empty-→0
artifact); the open lever is challenger difficulty.
Mechanism + observability: solid (gap-widening reproduced, judge reliability checked, every attempt
dumped to a JSONL autopsy trail via `AUTODATA_ATTEMPTS` — which is how the over-claim was caught).
Empirical positive: **not yet** — acceptance is too noisy at n=3. To actually settle it: raise
`samples` (stabilize the weak mean per question), raise the slot count to n≥24, and report the
*accepted-rate* with a confidence interval — not a single lucky run. Until then this is a confirmed
direction, not a confirmed win.
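
One way to report the accepted-rate with a confidence interval is a Wilson score interval, which behaves sensibly at small n and at 0 accepted. The sketch below is an assumption about how that reporting could look; `wilsonInterval` is a hypothetical helper, and the choice of the Wilson method (with z = 1.96 for roughly 95% coverage) is ours, not something this PR implements.

```typescript
// Hypothetical helper: Wilson score interval for an accepted-rate.
interface Interval {
  rate: number // observed accepted / n
  lower: number
  upper: number
}

function wilsonInterval(accepted: number, n: number, z = 1.96): Interval {
  if (!Number.isInteger(accepted) || !Number.isInteger(n) || n <= 0 || accepted < 0 || accepted > n) {
    throw new RangeError('need integers with 0 <= accepted <= n and n > 0')
  }
  const p = accepted / n
  const z2 = z * z
  const denom = 1 + z2 / n
  // Wilson centre and half-width, both shrunk toward 0.5 at small n.
  const center = (p + z2 / (2 * n)) / denom
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom
  return {
    rate: p,
    lower: Math.max(0, center - half),
    upper: Math.min(1, center + half),
  }
}

// Compare the interval at the current slot count with the proposed n = 24.
for (const [accepted, n] of [[0, 3], [1, 3], [8, 24]] as const) {
  const { rate, lower, upper } = wilsonInterval(accepted, n)
  console.log(`${accepted}/${n}: rate=${rate.toFixed(2)} CI=[${lower.toFixed(2)}, ${upper.toFixed(2)}]`)
}
```

Under these assumptions, 0 of 3 still leaves an upper bound above one half, which is why a single n=3 run cannot distinguish "never accepts" from "accepts about half the time".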

## Reproduce

```
dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/run.ts # causal, default Mixtral doc
dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/calibrate.ts # recall-vs-causal A/B, same doc
```
58 changes: 50 additions & 8 deletions src/autodata/build-dataset.ts
@@ -7,16 +7,17 @@
* AND for the challenger's FIRST drafts (plain), plus the cost ledger split by role.
*/

import { mkdir, writeFile } from 'node:fs/promises'
import { appendFile, mkdir, rm, writeFile } from 'node:fs/promises'
import { dirname } from 'node:path'
import { CostLedger } from '@tangle-network/agent-eval'
import {
type AttemptRecord,
createDataCreationLoop,
discriminativeAcceptRule,
type ExampleEvaluation,
} from './data-creation-loop'
import { type GroundedDoc, groundDoc } from './grounding'
import { buildAutodataRoles, type RouterCallRecord } from './router-roles'
import { buildAutodataRoles, type ChallengerStyle, type RouterCallRecord } from './router-roles'

export interface DiscriminativeThresholds {
minStrong?: number
@@ -36,6 +37,10 @@ export interface AutodataDatasetConfig {
maxRetries?: number
thresholds?: DiscriminativeThresholds
models?: { challenger?: string; weak?: string; strong?: string; judge?: string }
/** Challenger prompt: 'causal' (non-extractive, default) or 'recall' (the calibration baseline). */
style?: ChallengerStyle
/** Where to write the per-attempt autopsy JSONL (every candidate, accepted or rejected). */
attemptsPath?: string
signal?: AbortSignal
}

@@ -65,11 +70,15 @@ export interface AutodataDatasetResult {
plainGaps: number[]
agenticGaps: number[]
refinedGaps: number[]
/** Every evaluated candidate (accepted or rejected) with both solvers' answers — the autopsy trail. */
attempts: AttemptRecord[]
cost: CostLedger
costPerExampleUsd: number | null
/** How many router calls were priced by the router vs rate-estimated. */
callProvenance: { router: number; estimated: number }
outPath: string
/** Where the per-attempt autopsy JSONL was written (null if not requested). */
attemptsPath: string | null
}

function mean(xs: number[]): number | null {
@@ -80,16 +89,31 @@ function isGrounded(s: AutodataDatasetConfig['source']): s is GroundedDoc {
return typeof (s as GroundedDoc).doc === 'string'
}

function challengerInstruction(doc: string): string {
/** The causal (default) user instruction — pairs with the non-extractive challenger system prompt. */
function causalInstruction(doc: string): string {
return (
`SOURCE DOCUMENT EXCERPT:\n\n${doc}\n\n` +
`Write ONE hard exam question grounded in this excerpt. It must require multi-step reasoning ` +
`over the excerpt (a small model should get it wrong, a strong model right), never a verbatim ` +
`lookup. Return STRICT JSON: {"context": string, "question": string, "reference": string, ` +
`"rubric": string[] }.`
`Write ONE hard CAUSAL / COMPARATIVE / MECHANISM / THESIS-CONSISTENCY question grounded in this ` +
`excerpt — never a recall / lookup / definition. The CONTEXT must give the solver the premises ` +
`but MUST NOT state the answer; the answer has to be DERIVED. Return STRICT JSON: ` +
`{"context": string, "question": string, "reference": string, "rubric": string[] }.`
)
}

/** The recall (baseline) user instruction — pairs with the extractive challenger; for calibration. */
function recallInstruction(doc: string): string {
return (
`SOURCE DOCUMENT EXCERPT:\n\n${doc}\n\n` +
`Write ONE exam question grounded in this excerpt, with a short context excerpt the question is ` +
`answerable from, a reference answer, and a 2-3 item rubric. Return STRICT JSON: ` +
`{"context": string, "question": string, "reference": string, "rubric": string[] }.`
)
}

function instructionFor(style: ChallengerStyle): (doc: string) => string {
return style === 'recall' ? recallInstruction : causalInstruction
}

/** Run the full pipeline: ground → loop → JSONL. Returns the calibration numbers + cost. */
export async function buildAutodataDataset(
config: AutodataDatasetConfig,
@@ -110,6 +134,7 @@ export async function buildAutodataDataset(
}

const ledger = new CostLedger()
const style: ChallengerStyle = config.style ?? 'causal'

const roles = buildAutodataRoles({
apiKey: config.apiKey,
@@ -118,13 +143,27 @@ export async function buildAutodataDataset(
weakModel: config.models?.weak,
strongModel: config.models?.strong,
judgeModel: config.models?.judge,
challengerStyle: style,
ledger,
onCall,
})

// Per-attempt autopsy trail: every candidate (accepted or rejected) is appended as one JSONL row
// with both solvers' answer text + scores, so a null is diagnosable from the raw answers.
const attempts: AttemptRecord[] = []
const attemptsPath = config.attemptsPath ?? null
if (attemptsPath) {
await mkdir(dirname(attemptsPath), { recursive: true })
await rm(attemptsPath, { force: true })
}
const onAttempt = async (rec: AttemptRecord): Promise<void> => {
attempts.push(rec)
if (attemptsPath) await appendFile(attemptsPath, `${JSON.stringify({ ...rec, style })}\n`)
}

const result = await createDataCreationLoop({
doc: source.doc,
baseInstruction: challengerInstruction,
baseInstruction: instructionFor(style),
challenger: roles.challenger,
weakSolver: roles.weakSolver,
strongSolver: roles.strongSolver,
@@ -134,6 +173,7 @@ export async function buildAutodataDataset(
samples: config.samples ?? 3,
maxRetries: config.maxRetries ?? 4,
cost: ledger,
onAttempt,
signal: config.signal,
})

@@ -164,9 +204,11 @@ export async function buildAutodataDataset(
plainGaps: result.plainGaps,
agenticGaps: result.agenticGaps,
refinedGaps: result.refinedGaps,
attempts,
cost: result.cost,
costPerExampleUsd: result.cost.costPerCompletedTask(),
callProvenance: provenance,
outPath: config.outPath,
attemptsPath,
}
}