tangle-network · drewstone · Jun 26, 2026
diff --git a/docs/results/autodata-live.md b/docs/results/autodata-live.md
@@ -1,58 +1,123 @@
-# Autodata live result: a false null, autopsied, then a real (clean) null
+# Autodata live result: the causal challenger widens the gap (reproduced) — but clearing the accept bar is noisy at this n/tier (NOT robust)
 
 Running the agentic data-creation loop (`src/autodata/`) on a real arXiv doc with real two-tier
-solver models, to manufacture training examples that separate a strong solver from a weak one
-(the discriminative reward). The headline is a null — but the path to it is the result.
-
-## What happened, in order
-
-1. **First runs looked like a null with a *negative* gap.** Across two tier pairs —
-   `glm-4.5-air` vs `glm-5.2`, then `groq/llama-3.1-8b-instant` vs `gemini-2.5-pro` — every run
-   reported 0 accepted and a strong−weak gap *below zero* (plain −0.47, then −1.00). A frontier
-   model scoring *below* an 8B on reasoning questions is not credible.
-
-2. **Autopsy (a direct probe on the real judge) found an artifact, not a finding.** At the solver's
-   `maxTokens: 1024`, the strong **reasoning** model (`gemini-2.5-pro`, and `glm-5.2` before it)
-   spent its whole budget on hidden reasoning and returned **empty visible content** on hard
-   prompts — which the judge scored 0. So "strong" was being scored 0 for *answering nothing*,
-   manufacturing a false negative gap. The trivial cost-gate smoke ("reply ok") didn't trigger it,
-   so it slipped through. (Confirmed: the same prompt at `maxTokens: 8000` → gemini answers in
-   956 chars and scores 1.00.)
-
-3. **Fix (this PR).** The solver now uses a reasoning-safe `maxTokens` (8000) **and fails loud on
-   empty content** — an empty answer is a measurement failure, never a silent 0 that corrupts the
-   gap (the repo's no-silent-zeros rule). The model tier is now an env knob
-   (`AUTODATA_WEAK_MODEL` / `AUTODATA_STRONG_MODEL` / `…_CHALLENGER_MODEL` / `…_JUDGE_MODEL`), and
-   the price table covers the wide tier.
-
-4. **The clean result.** Re-run with the fix, `llama-3.1-8b` vs `gemini-2.5-pro`:
-
-   | metric | value |
-   |---|---|
-   | accepted (discriminating) examples | **0 / 3** |
-   | plain gap (n=1) | 0.000 |
-   | refined best-gap per slot (n=3) | 0.006 |
-   | Δ (refined − plain) | **+0.006 — no meaningful widening** |
-   | spend | $0.09 |
-
-   The gap is now **~0, not negative** — `gemini-2.5-pro` and `llama-3.1-8b` score about **equally**.
+solvers, to manufacture training examples that separate a strong solver from a weak one (the
+discriminative reward of the Autodata / Agentic-Self-Instruct method).
+
+**Honest headline (two independent runs):** the non-extractive causal challenger + the refine fold
+**widened the strong/weak gap by ~+0.20 vs plain generation in both runs**, so the method's Table-1
+*direction* holds. BUT **clearing the hard accept bar** (weak < 0.5 ∧ strong ≥ 0.65 ∧ gap ≥ 0.2) is
+**noisy and marginal**: one run accepted 1–2 of 3, an **independent re-run accepted 0 of 3**. The
+reason is in the answers: `llama-3.1-8b` on these MoE questions sometimes flails (0.24) and sometimes
+answers *competently* (0.75), straddling the 0.5 "weak must struggle" line. So: **directionally
+confirmed, not a robust positive at n=3 / this tier.** This is the same small-n mirage that bit the
+earlier two-agent A/B (positive at n=1, washes out at power); it is flagged here, not buried.
+
+## The two levers that opened the gap
+
+The earlier null ("a small model performs as well as a frontier one") had TWO compounding causes,
+both fixed here:
+
+1. **The question leaked the answer / asked for recall.** The challenger wrote lookup-style questions
+   whose answer sat in the provided context, so an 8B read it out as well as a frontier model.
+   Fix — the **non-extractive causal challenger**: it must author CAUSAL / COMPARATIVE / MECHANISM /
+   THESIS-CONSISTENCY questions whose answer is DERIVED, the context must hold premises but not state
+   the conclusion, the solver no longer sees the rubric (the mark scheme), and the judge now sees the
+   context and scores a dedicated `reasoning` dimension LOW when the answer merely restates it (the
+   paper's negative criterion). On reject, the fold steers per reason ("too easy" → go non-extractive
+   and harder; "too hard" → ease; "not discriminative" → sharpen).
+
+2. **The grounding doc was memorized.** The default was "Attention Is All You Need" — the most
+   canonical paper in ML, which an 8B has memorized, so even reasoning questions are answerable from
+   pretraining and capability cannot separate. Fix — **ground on a doc the weak solver has not
+   memorized**: the new default is the Mixtral-of-Experts paper (arXiv 2401.04088, Jan 2024), which
+   post-dates `llama-3.1-8b`'s knowledge cutoff, forcing it to reason from the context.
+
+## Setup (all env-overridable)
+
+| role | model | why |
+|---|---|---|
+| weak solver | `groq/llama-3.1-8b-instant` | small; cutoff predates the 2024 doc → must reason, can't recall |
+| strong solver | `gemini-2.5-pro` | frontier reasoner; a real wide capability gap |
+| challenger + judge | `deepseek-v4-flash` | capable, fast, reliable, a DIFFERENT family from both solvers (avoids same-family judge bias) |
+| grounding doc | Mixtral-of-Experts (2401.04088) | non-memorized, reasoning-rich (MoE routing / gating) |
+
+Accept thresholds (the paper's): strong >= 0.65, weak < 0.50, gap >= 0.20. (`glm-5.2`, the brief's
+challenger/judge, was returning upstream-capacity 503s during this run; `deepseek-v4-flash` is the
+live, neutral substitute. `routerChat` now retries transient 503/429/timeout with bounded backoff.)
+
+## The judge is reliable (checked before trusting any gap)
+
+A controlled probe scored one genuinely-strong vs one genuinely-weak answer to the same question, 3×
+each: `deepseek-v4-flash` returned strong `[1.00, 1.00, 1.00]` (mean 1.00) vs weak `[0.23, 0.13,
+0.17]` (mean 0.18), a consistent **0.82** separation, ranking strong above weak every time. On this
+single-question probe, then, a measured gap reflects answer quality rather than judge noise.
+(`gemini-2.5-flash` as judge threw parse errors; `deepseek` is the better grader here.)
+
+## The result: the gap widens, acceptance does not reproduce
+
+**Memorized doc (Transformer paper), recall challenger — reproduces the null:** mean gap **0.117**,
+**0 accepted**; the weak solver scored 0.68–0.78 (it has the content memorized — reading beats
+reasoning).
+
+**Non-memorized doc (Mixtral), non-extractive causal challenger — three runs, NOT consistent:**
+
+| run | accepted | gap widening (plain → refined) | note |
+|---|---|---|---|
+| target=3, samples=2, maxRetries=3 | **1 / 3** | 0.306 → 0.508 (Δ +0.202) | fold steered a too-easy draft (weak 0.78) to an accepted one (weak 0.24) |
+| target=1, samples=3, maxRetries=4 | **1 / 1** | — | first causal draft already separated |
+| **target=3 — independent re-run** | **0 / 3** | 0.052 → 0.246 (Δ +0.194) | gap widened the same, but **no slot cleared the bar**; weak scored **0.75** on a near-miss — a competent, correct answer, not a struggle |
+
+**What reproduces:** the +0.19 to +0.20 gap widening from the fold, in both runs that measured it.
+**What does not:** the accepted count (0 to 2 of 3). The accept bar requires the weak model to
+*struggle* (< 0.5), and on these MoE-reasoning questions `llama-3.1-8b` is too often competent (0.75)
+to fall below it, so acceptance is close to a coin-flip at n=3. Total live spend ≈ **$0.25** across all runs.
+
+## An autopsied accepted example (real discrimination, both answers read)
+
+> **Q:** Walk through how the MoE layer processes a single token. If the router's gating network were
+> broken and always output uniform weights (G(x)_i = 1/8 for all 8 experts), how would the layer's
+> output differ from the intended behavior, and why is this failure mode problematic?
+
+- **strong (`gemini-2.5-pro`): [1.00, 1.00, 1.00]** — walks through top-2 routing, then derives that
+  uniform weights make the layer average ALL 8 experts (dense, no specialization/sparsity), losing
+  the point of the MoE. Correct.
+- **weak (`llama-3.1-8b`): [0.21, 0.27], mean 0.24** — restates the routing steps but does NOT derive
+  the failure consequence; it never reaches "all experts averaged → specialization lost."
+
+When the gap *does* open, it is real discrimination — not a judge artifact (judge verified above) or
+leakage (the answer is not in the context). **But it does not open reliably.** In the independent
+re-run, the analogous near-miss question drew a *competent* weak answer (0.75): `llama-3.1-8b`
+correctly explained that high positional locality routes consecutive tokens to the same expert →
+over-subscription, and that uniform routing would balance the load. On that draw the 8B reasoned
+fine, so weak ≮ 0.5 and nothing was accepted. The weak model's competence on these questions is the
+variance that makes acceptance a coin-flip.
 
 ## The finding
 
-On these auto-generated, doc-grounded questions a small model performs as well as a frontier one,
-because **the answer is extractable from the provided context** — reading beats reasoning, so model
-capability does not separate and no example clears the discriminative bar. This is *not* a
-model-tier problem (we used a genuine 8B-vs-frontier gap); it is a **question-difficulty** problem.
+The two levers are **directionally confirmed and necessary**: a non-extractive causal challenger
+(no leakage) AND a grounding doc the weak solver hasn't memorized. Drop either and the gap closes
+(the recall challenger leaks the answer; the memorized Transformer paper lets the 8B recall it). With
+both, the fold **widened the strong/weak gap by ~+0.20** in both runs that measured it.
 
-The lever is therefore the **challenger**, not the model tier: to open a real gap the challenger must
-generate **non-extractive, reasoning-heavy** questions (multi-step derivations, numerical claims that
-require following the paper's argument) — which is exactly the move the Autodata paper relies on
-("the agent's initial attempt was usually a high-level summary question… subsequent rounds moved the
-questions toward specific algorithmic steps the paper's actual argument required"). Our challenger,
-on a single section, mostly produces extractable questions. Making it harder is the next experiment.
+But "the discriminative reward works" is **NOT** established. Clearing the accept bar (weak must
+*struggle*, < 0.5) is noisy: 0–2 accepted of 3 across runs, because `llama-3.1-8b` answers these
+MoE-reasoning questions competently (0.75) about as often as it flails (0.24). At n=3 that is a
+coin-flip, not a result. Honest verdict: **promising, directionally right, under-powered** — the
+exact small-n shape that has repeatedly looked positive here and washed out at power.
 
 ## Status
 
-Mechanism: proven end-to-end on real frontier models, cost-tracked, fail-loud. Empirical
-discrimination: a clean null on extractive questions. The harness is now trustworthy (no empty-→0
-artifact); the open lever is challenger difficulty.
+Mechanism + observability: solid (gap-widening reproduced, judge reliability checked, every attempt
+dumped to a JSONL autopsy trail via `AUTODATA_ATTEMPTS` — which is how the over-claim was caught).
+Empirical positive: **not yet** — acceptance is too noisy at n=3. To actually settle it: raise
+`samples` (stabilize the weak mean per question), raise the slot count to n≥24, and report the
+*accepted-rate* with a confidence interval — not a single lucky run. Until then this is a confirmed
+direction, not a confirmed win.
+
+## Reproduce
+
+```
+dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/run.ts        # causal, default Mixtral doc
+dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/calibrate.ts  # recall-vs-causal A/B, same doc
+```
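Reviewer sketch (not part of the patch): the Status section asks for an accepted-rate with a confidence interval rather than a single run. The TypeScript below is one possible way to compute that report. It assumes a 95% Wilson score interval, since the doc does not name an interval method, and it uses the accept thresholds quoted in the doc. `clearsAcceptBar` and `wilsonInterval` are hypothetical helper names; they are not the repo's `discriminativeAcceptRule`, whose implementation is not shown in this diff.

```typescript
// Hypothetical helpers for illustration only; not the repo's discriminativeAcceptRule.
interface AcceptThresholds {
  minStrong: number
  maxWeak: number
  minGap: number
}

// Thresholds as quoted in the doc: strong >= 0.65, weak < 0.50, gap >= 0.20.
const DOC_THRESHOLDS: AcceptThresholds = { minStrong: 0.65, maxWeak: 0.5, minGap: 0.2 }

/** True when a candidate's mean scores clear all three accept conditions. */
function clearsAcceptBar(
  weak: number,
  strong: number,
  t: AcceptThresholds = DOC_THRESHOLDS,
): boolean {
  return weak < t.maxWeak && strong >= t.minStrong && strong - weak >= t.minGap
}

/** Wilson score interval for an accepted-rate of `accepted` out of `slots` (z = 1.96 is about 95%). */
function wilsonInterval(accepted: number, slots: number, z = 1.96): { low: number; high: number } {
  if (slots <= 0) throw new Error('slots must be positive')
  const p = accepted / slots
  const z2 = z * z
  const denom = 1 + z2 / slots
  const center = (p + z2 / (2 * slots)) / denom
  const half = (z * Math.sqrt((p * (1 - p)) / slots + z2 / (4 * slots * slots))) / denom
  // Clamp to [0, 1] so floating-point error cannot push a bound outside the valid range.
  return { low: Math.max(0, center - half), high: Math.min(1, center + half) }
}

// Scores from the doc's autopsied accepted example: weak mean 0.24, strong 1.00.
const acceptedExample = clearsAcceptBar(0.24, 1.0)
// Near-miss from the re-run: weak 0.75. The strong score of 1.0 is an assumed placeholder;
// the doc does not report it, and the weak condition fails regardless.
const nearMiss = clearsAcceptBar(0.75, 1.0)

const oneOfThree = wilsonInterval(1, 3)
const zeroOfThree = wilsonInterval(0, 3)
console.log({ acceptedExample, nearMiss, oneOfThree, zeroOfThree })
```

Under these assumptions, 1 accepted of 3 gives an interval of roughly 0.06 to 0.79 and 0 of 3 gives roughly 0 to 0.56. The two intervals overlap almost entirely, which is consistent with the doc's verdict that n=3 cannot separate the runs.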
diff --git a/src/autodata/build-dataset.ts b/src/autodata/build-dataset.ts
@@ -7,16 +7,17 @@
  * AND for the challenger's FIRST drafts (plain), plus the cost ledger split by role.
  */
 
-import { mkdir, writeFile } from 'node:fs/promises'
+import { appendFile, mkdir, rm, writeFile } from 'node:fs/promises'
 import { dirname } from 'node:path'
 import { CostLedger } from '@tangle-network/agent-eval'
 import {
+  type AttemptRecord,
   createDataCreationLoop,
   discriminativeAcceptRule,
   type ExampleEvaluation,
 } from './data-creation-loop'
 import { type GroundedDoc, groundDoc } from './grounding'
-import { buildAutodataRoles, type RouterCallRecord } from './router-roles'
+import { buildAutodataRoles, type ChallengerStyle, type RouterCallRecord } from './router-roles'
 
 export interface DiscriminativeThresholds {
   minStrong?: number
@@ -36,6 +37,10 @@ export interface AutodataDatasetConfig {
   maxRetries?: number
   thresholds?: DiscriminativeThresholds
   models?: { challenger?: string; weak?: string; strong?: string; judge?: string }
+  /** Challenger prompt: 'causal' (non-extractive, default) or 'recall' (the calibration baseline). */
+  style?: ChallengerStyle
+  /** Where to write the per-attempt autopsy JSONL (every candidate, accepted or rejected). */
+  attemptsPath?: string
   signal?: AbortSignal
 }
 
@@ -65,11 +70,15 @@ export interface AutodataDatasetResult {
   plainGaps: number[]
   agenticGaps: number[]
   refinedGaps: number[]
+  /** Every evaluated candidate (accepted or rejected) with both solvers' answers — the autopsy trail. */
+  attempts: AttemptRecord[]
   cost: CostLedger
   costPerExampleUsd: number | null
   /** How many router calls were priced by the router vs rate-estimated. */
   callProvenance: { router: number; estimated: number }
   outPath: string
+  /** Where the per-attempt autopsy JSONL was written (null if not requested). */
+  attemptsPath: string | null
 }
 
 function mean(xs: number[]): number | null {
@@ -80,16 +89,31 @@ function isGrounded(s: AutodataDatasetConfig['source']): s is GroundedDoc {
   return typeof (s as GroundedDoc).doc === 'string'
 }
 
-function challengerInstruction(doc: string): string {
+/** The causal (default) user instruction — pairs with the non-extractive challenger system prompt. */
+function causalInstruction(doc: string): string {
   return (
     `SOURCE DOCUMENT EXCERPT:\n\n${doc}\n\n` +
-    `Write ONE hard exam question grounded in this excerpt. It must require multi-step reasoning ` +
-    `over the excerpt (a small model should get it wrong, a strong model right), never a verbatim ` +
-    `lookup. Return STRICT JSON: {"context": string, "question": string, "reference": string, ` +
-    `"rubric": string[] }.`
+    `Write ONE hard CAUSAL / COMPARATIVE / MECHANISM / THESIS-CONSISTENCY question grounded in this ` +
+    `excerpt — never a recall / lookup / definition. The CONTEXT must give the solver the premises ` +
+    `but MUST NOT state the answer; the answer has to be DERIVED. Return STRICT JSON: ` +
+    `{"context": string, "question": string, "reference": string, "rubric": string[] }.`
   )
 }
 
+/** The recall (baseline) user instruction — pairs with the extractive challenger; for calibration. */
+function recallInstruction(doc: string): string {
+  return (
+    `SOURCE DOCUMENT EXCERPT:\n\n${doc}\n\n` +
+    `Write ONE exam question grounded in this excerpt, with a short context excerpt the question is ` +
+    `answerable from, a reference answer, and a 2-3 item rubric. Return STRICT JSON: ` +
+    `{"context": string, "question": string, "reference": string, "rubric": string[] }.`
+  )
+}
+
+function instructionFor(style: ChallengerStyle): (doc: string) => string {
+  return style === 'recall' ? recallInstruction : causalInstruction
+}
+
 /** Run the full pipeline: ground → loop → JSONL. Returns the calibration numbers + cost. */
 export async function buildAutodataDataset(
   config: AutodataDatasetConfig,
@@ -110,6 +134,7 @@ export async function buildAutodataDataset(
   }
 
   const ledger = new CostLedger()
+  const style: ChallengerStyle = config.style ?? 'causal'
 
   const roles = buildAutodataRoles({
     apiKey: config.apiKey,
@@ -118,13 +143,27 @@ export async function buildAutodataDataset(
     weakModel: config.models?.weak,
     strongModel: config.models?.strong,
     judgeModel: config.models?.judge,
+    challengerStyle: style,
     ledger,
     onCall,
   })
 
+  // Per-attempt autopsy trail: every candidate (accepted or rejected) is appended as one JSONL row
+  // with both solvers' answer text + scores, so a null is diagnosable from the raw answers.
+  const attempts: AttemptRecord[] = []
+  const attemptsPath = config.attemptsPath ?? null
+  if (attemptsPath) {
+    await mkdir(dirname(attemptsPath), { recursive: true })
+    await rm(attemptsPath, { force: true })
+  }
+  const onAttempt = async (rec: AttemptRecord): Promise<void> => {
+    attempts.push(rec)
+    if (attemptsPath) await appendFile(attemptsPath, `${JSON.stringify({ ...rec, style })}\n`)
+  }
+
   const result = await createDataCreationLoop({
     doc: source.doc,
-    baseInstruction: challengerInstruction,
+    baseInstruction: instructionFor(style),
     challenger: roles.challenger,
     weakSolver: roles.weakSolver,
     strongSolver: roles.strongSolver,
@@ -134,6 +173,7 @@ export async function buildAutodataDataset(
     samples: config.samples ?? 3,
     maxRetries: config.maxRetries ?? 4,
     cost: ledger,
+    onAttempt,
     signal: config.signal,
   })
 
@@ -164,9 +204,11 @@ export async function buildAutodataDataset(
     plainGaps: result.plainGaps,
     agenticGaps: result.agenticGaps,
     refinedGaps: result.refinedGaps,
+    attempts,
     cost: result.cost,
     costPerExampleUsd: result.cost.costPerCompletedTask(),
     callProvenance: provenance,
     outPath: config.outPath,
+    attemptsPath,
   }
 }
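Reviewer sketch (not part of the patch): the autopsy trail in `buildAutodataDataset` follows a truncate-once, append-per-record pattern. The standalone TypeScript below illustrates that pattern under stated assumptions. It uses the synchronous `node:fs` calls for brevity, whereas the PR uses the promise versions. `DemoAttempt` is a hypothetical record shape, because the real `AttemptRecord` lives in `./data-creation-loop` and is not shown in this diff. The temp-directory path and the helper names are also made up for the demo.

```typescript
import { appendFileSync, mkdirSync, readFileSync, rmSync } from 'node:fs'
import { tmpdir } from 'node:os'
import { dirname, join } from 'node:path'

// Hypothetical record shape; the real AttemptRecord is not shown in this diff.
interface DemoAttempt {
  question: string
  weakMean: number
  strongMean: number
  accepted: boolean
}

/** Start a fresh trail: ensure the parent directory exists and drop any file from an earlier run. */
function openTrail(path: string): void {
  mkdirSync(dirname(path), { recursive: true })
  rmSync(path, { force: true }) // force: a missing file is not an error
}

/** Append one attempt as a single JSONL row, tagged with the challenger style. */
function appendAttempt(path: string, rec: DemoAttempt, style: string): void {
  appendFileSync(path, `${JSON.stringify({ ...rec, style })}\n`)
}

/** Read the trail back, one parsed object per non-empty line. */
function readTrail(path: string): Array<DemoAttempt & { style: string }> {
  return readFileSync(path, 'utf8')
    .split('\n')
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line))
}

// Demo path under the OS temp directory (an assumption for this sketch).
const demoPath = join(tmpdir(), 'autodata-demo', 'attempts.jsonl')

openTrail(demoPath)
appendAttempt(demoPath, { question: 'q1', weakMean: 0.78, strongMean: 1.0, accepted: false }, 'causal')
appendAttempt(demoPath, { question: 'q2', weakMean: 0.24, strongMean: 1.0, accepted: true }, 'causal')
const firstRun = readTrail(demoPath)

// A second run truncates first, so rows from the earlier run do not leak into the new trail.
openTrail(demoPath)
appendAttempt(demoPath, { question: 'q3', weakMean: 0.75, strongMean: 1.0, accepted: false }, 'recall')
const secondRun = readTrail(demoPath)
console.log(firstRun.length, secondRun.length)
```

Appending per record means a run that crashes midway still leaves every completed attempt on disk, and removing the file up front keeps rows from different runs from mixing in one trail.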