diff --git a/docs/results/autodata-live.md b/docs/results/autodata-live.md
index 3746b36..0847d25 100644
--- a/docs/results/autodata-live.md
+++ b/docs/results/autodata-live.md
@@ -1,58 +1,123 @@
-# Autodata live result: a false null, autopsied, then a real (clean) null
+# Autodata live result: the causal challenger widens the gap (reproduced) — but clearing the accept bar is noisy at this n/tier (NOT robust)
 
 Running the agentic data-creation loop (`src/autodata/`) on a real arXiv doc with real two-tier
-solver models, to manufacture training examples that separate a strong solver from a weak one
-(the discriminative reward). The headline is a null — but the path to it is the result.
-
-## What happened, in order
-
-1. **First runs looked like a null with a *negative* gap.** Across two tier pairs —
-   `glm-4.5-air` vs `glm-5.2`, then `groq/llama-3.1-8b-instant` vs `gemini-2.5-pro` — every run
-   reported 0 accepted and a strong−weak gap *below zero* (plain −0.47, then −1.00). A frontier
-   model scoring *below* an 8B on reasoning questions is not credible.
-
-2. **Autopsy (a direct probe on the real judge) found an artifact, not a finding.** At the solver's
-   `maxTokens: 1024`, the strong **reasoning** model (`gemini-2.5-pro`, and `glm-5.2` before it)
-   spent its whole budget on hidden reasoning and returned **empty visible content** on hard
-   prompts — which the judge scored 0. So "strong" was being scored 0 for *answering nothing*,
-   manufacturing a false negative gap. The trivial cost-gate smoke ("reply ok") didn't trigger it,
-   so it slipped through. (Confirmed: the same prompt at `maxTokens: 8000` → gemini answers in
-   956 chars and scores 1.00.)
-
-3. **Fix (this PR).** The solver now uses a reasoning-safe `maxTokens` (8000) **and fails loud on
-   empty content** — an empty answer is a measurement failure, never a silent 0 that corrupts the
-   gap (the repo's no-silent-zeros rule). The model tier is now an env knob
-   (`AUTODATA_WEAK_MODEL` / `AUTODATA_STRONG_MODEL` / `…_CHALLENGER_MODEL` / `…_JUDGE_MODEL`), and
-   the price table covers the wide tier.
-
-4. **The clean result.** Re-run with the fix, `llama-3.1-8b` vs `gemini-2.5-pro`:
-
-   | metric | value |
-   |---|---|
-   | accepted (discriminating) examples | **0 / 3** |
-   | plain gap (n=1) | 0.000 |
-   | refined best-gap per slot (n=3) | 0.006 |
-   | Δ (refined − plain) | **+0.006 — no meaningful widening** |
-   | spend | $0.09 |
-
-   The gap is now **~0, not negative** — `gemini-2.5-pro` and `llama-3.1-8b` score about **equally**.
+solvers, to manufacture training examples that separate a strong solver from a weak one (the
+discriminative reward of the Autodata / Agentic-Self-Instruct method).
+
+**Honest headline (two independent runs):** the non-extractive causal challenger + the refine fold
+**reliably widen the strong/weak gap by ~+0.20 vs plain generation** (reproduced in both runs — the
+method's Table-1 *direction* holds). BUT **clearing the hard accept bar** (weak < 0.5 ∧ strong ≥ 0.65
+∧ gap ≥ 0.2) is **noisy and marginal**: one run accepted 1–2 of 3, an **independent re-run accepted
+0 of 3**. The reason is in the answers — `llama-3.1-8b` on these MoE questions sometimes flails
+(0.24) and sometimes answers *competently* (0.75), straddling the 0.5 "weak must struggle" line. So:
+**directionally confirmed, not a robust positive at n=3 / this tier.** This is the same small-n
+mirage that bit the earlier two-agent A/B (positive at n=1, washes out at power): flagged, not buried.
+
+## The two levers that opened the gap
+
+The earlier null ("a small model performs as well as a frontier one") had TWO compounding causes,
+both fixed here:
+
+1. **The question leaked the answer / asked for recall.** The challenger wrote lookup-style questions
+   whose answer sat in the provided context, so an 8B read it out as well as a frontier model.
+   Fix — the **non-extractive causal challenger**: it must author CAUSAL / COMPARATIVE / MECHANISM /
+   THESIS-CONSISTENCY questions whose answer is DERIVED, the context must hold premises but not state
+   the conclusion, the solver no longer sees the rubric (the mark scheme), and the judge now sees the
+   context and scores a dedicated `reasoning` dimension LOW when the answer merely restates it (the
+   paper's negative criterion). On reject, the fold steers per reason ("too easy" → go non-extractive
+   and harder; "too hard" → ease; "not discriminative" → sharpen).
+
+2. **The grounding doc was memorized.** The default was "Attention Is All You Need" — the most
+   canonical paper in ML, which an 8B has memorized, so even reasoning questions are answerable from
+   pretraining and capability cannot separate. Fix — **ground on a doc the weak solver has not
+   memorized**: the new default is the Mixtral-of-Experts paper (arXiv 2401.04088, Jan 2024), which
+   post-dates `llama-3.1-8b`'s knowledge cutoff, forcing it to reason from the context.
+
+## Setup (all env-overridable)
+
+| role | model | why |
+|---|---|---|
+| weak solver | `groq/llama-3.1-8b-instant` | small; cutoff predates the 2024 doc → must reason, can't recall |
+| strong solver | `gemini-2.5-pro` | frontier reasoner; a real wide capability gap |
+| challenger + judge | `deepseek-v4-flash` | capable, fast, reliable, a DIFFERENT family from both solvers (limits same-family judge bias) |
+| grounding doc | Mixtral-of-Experts (2401.04088) | non-memorized, reasoning-rich (MoE routing / gating) |
+
+Accept thresholds (the paper's): strong >= 0.65, weak < 0.50, gap >= 0.20. (`glm-5.2`, the brief's
+challenger/judge, was returning upstream-capacity 503s during this run; `deepseek-v4-flash` is the
+live, neutral substitute. `routerChat` now retries transient 503/429/timeout with bounded backoff.)
+
+## The judge is reliable (checked before trusting any gap)
+
+A controlled probe scored one genuinely-strong vs one genuinely-weak answer to the same question, 3×
+each: `deepseek-v4-flash` returned strong `[1.00, 1.00, 1.00]` (mean 1.00) vs weak `[0.23, 0.13,
+0.17]` (mean 0.18) — a consistent **0.82** separation, ranking strong above weak every time. So a
+measured gap reflects answer quality, not judge noise. (`gemini-2.5-flash` as judge threw parse
+errors, so `deepseek-v4-flash` is the better grader here.)
+
+## The result: the gap opens, but acceptance does not reproduce
+
+**Memorized doc (Transformer paper), recall challenger — reproduces the null:** mean gap **0.117**,
+**0 accepted**; the weak solver scored 0.68–0.78 (it has the content memorized — reading beats
+reasoning).
+
+**Non-memorized doc (Mixtral), non-extractive causal challenger — three runs, NOT consistent:**
+
+| run | accepted | gap widening (plain → refined) | note |
+|---|---|---|---|
+| target=3, samples=2, maxRetries=3 | **1 / 3** | 0.306 → 0.508 (Δ +0.202) | fold steered a too-easy draft (weak 0.78) to an accepted one (weak 0.24) |
+| target=1, samples=3, maxRetries=4 | **1 / 1** | — | first causal draft already separated |
+| **target=3 — independent re-run** | **0 / 3** | 0.052 → 0.246 (Δ +0.194) | gap widened the same, but **no slot cleared the bar**; weak scored **0.75** on a near-miss — a competent, correct answer, not a struggle |
+
+**What reproduces:** the +0.19–0.20 gap-widening from the fold (both target=3 runs). **What does not:** the
+accepted count (0 to 2 of 3). The accept bar requires the weak model to *struggle* (< 0.5), and on
+these MoE-reasoning questions `llama-3.1-8b` is too often competent (0.75) to fall below it — so
+acceptance is close to a coin-flip at n=3. Total live spend ≈ **$0.25** across all runs.
+
+## An autopsied accepted example (real discrimination, both answers read)
+
+> **Q:** Walk through how the MoE layer processes a single token. If the router's gating network were
+> broken and always output uniform weights (G(x)_i = 1/8 for all 8 experts), how would the layer's
+> output differ from the intended behavior, and why is this failure mode problematic?
+
+- **strong (`gemini-2.5-pro`): [1.00, 1.00, 1.00]** — walks through top-2 routing, then derives that
+  uniform weights make the layer average ALL 8 experts (dense, no specialization/sparsity), losing
+  the point of the MoE. Correct.
+- **weak (`llama-3.1-8b`): [0.21, 0.27], mean 0.24** — restates the routing steps but does NOT derive
+  the failure consequence; it never reaches "all experts averaged → specialization lost."
+
+When the gap *does* open, it is real discrimination — not a judge artifact (judge verified above) or
+leakage (the answer is not in the context). **But it does not open reliably.** In the independent
+re-run, the analogous near-miss question drew a *competent* weak answer (0.75): `llama-3.1-8b`
+correctly explained that high positional locality routes consecutive tokens to the same expert →
+over-subscription, and that uniform routing would balance the load. On that draw the 8B reasoned
+fine, so weak ≮ 0.5 and nothing was accepted. The weak model's competence on these questions is the
+variance that makes acceptance a coin-flip.
 
 ## The finding
 
-On these auto-generated, doc-grounded questions a small model performs as well as a frontier one,
-because **the answer is extractable from the provided context** — reading beats reasoning, so model
-capability does not separate and no example clears the discriminative bar. This is *not* a
-model-tier problem (we used a genuine 8B-vs-frontier gap); it is a **question-difficulty** problem.
+The two levers are **directionally confirmed and necessary**: a non-extractive causal challenger
+(no leakage) AND a grounding doc the weak solver hasn't memorized — drop either and it nulls hard
+(recall challenger leaks; the memorized Transformer paper lets the 8B recall). With both, the fold
+**reliably widens the strong/weak gap by ~+0.20** (reproduced in both runs).
 
-The lever is therefore the **challenger**, not the model tier: to open a real gap the challenger must
-generate **non-extractive, reasoning-heavy** questions (multi-step derivations, numerical claims that
-require following the paper's argument) — which is exactly the move the Autodata paper relies on
-("the agent's initial attempt was usually a high-level summary question… subsequent rounds moved the
-questions toward specific algorithmic steps the paper's actual argument required"). Our challenger,
-on a single section, mostly produces extractable questions. Making it harder is the next experiment.
+But "the discriminative reward works" is **NOT** established. Clearing the accept bar (weak must
+*struggle*, < 0.5) is noisy: 0–2 accepted of 3 across runs, because `llama-3.1-8b` answers these
+MoE-reasoning questions competently (0.75) about as often as it flails (0.24). At n=3 that is a
+coin-flip, not a result. Honest verdict: **promising, directionally right, under-powered** — the
+exact small-n shape that has repeatedly looked positive here and washed out at power.
 
 ## Status
 
-Mechanism: proven end-to-end on real frontier models, cost-tracked, fail-loud. Empirical
-discrimination: a clean null on extractive questions. The harness is now trustworthy (no empty-→0
-artifact); the open lever is challenger difficulty.
+Mechanism + observability: solid (gap-widening reproduced, judge reliability checked, every attempt
+dumped to a JSONL autopsy trail via `AUTODATA_ATTEMPTS` — which is how the over-claim was caught).
+Empirical positive: **not yet** — acceptance is too noisy at n=3. To actually settle it: raise
+`samples` (stabilize the weak mean per question), raise the slot count to n≥24, and report the
+*accepted-rate* with a confidence interval — not a single lucky run. Until then this is a confirmed
+direction, not a confirmed win.
+
+## Reproduce
+
+```
+dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/run.ts        # causal, default Mixtral doc
+dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/calibrate.ts  # recall-vs-causal A/B, same doc
+```
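
Note on the Status recommendation above (report the accepted-rate with a confidence interval): a minimal sketch of one way to compute it, assuming a Wilson score interval is an acceptable choice for small n. The helper name `wilsonInterval` and the `RateInterval` shape are hypothetical and not part of this repo.

```typescript
// Hypothetical helper (NOT in the repo): Wilson score interval for an accepted-rate.
// Assumption: each slot is treated as an independent accept/reject trial.
interface RateInterval {
  rate: number // observed accepted / slots
  low: number // lower bound of the interval
  high: number // upper bound of the interval
}

function wilsonInterval(accepted: number, slots: number, z = 1.96): RateInterval {
  const countsValid =
    Number.isInteger(accepted) && Number.isInteger(slots) && slots > 0 && accepted >= 0 && accepted <= slots
  // Fail loud on impossible counts rather than returning a silent 0.
  if (!countsValid) throw new Error(`invalid counts: accepted=${accepted} slots=${slots}`)

  const p = accepted / slots
  const z2 = z * z
  const denom = 1 + z2 / slots
  const centre = (p + z2 / (2 * slots)) / denom
  const half = (z / denom) * Math.sqrt((p * (1 - p)) / slots + z2 / (4 * slots * slots))
  return { rate: p, low: Math.max(0, centre - half), high: Math.min(1, centre + half) }
}

// The 1-of-3 run reported in the table above.
console.log(wilsonInterval(1, 3))
```

Under these assumptions, 1 accepted of 3 gives an interval of roughly 0.06 to 0.79, which is why a single run at n=3 cannot separate a working reward from a coin-flip; the interval narrows as the slot count grows.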
diff --git a/src/autodata/build-dataset.ts b/src/autodata/build-dataset.ts
index a2e1853..9f6cd68 100644
--- a/src/autodata/build-dataset.ts
+++ b/src/autodata/build-dataset.ts
@@ -7,16 +7,17 @@
  * AND for the challenger's FIRST drafts (plain), plus the cost ledger split by role.
  */
 
-import { mkdir, writeFile } from 'node:fs/promises'
+import { appendFile, mkdir, rm, writeFile } from 'node:fs/promises'
 import { dirname } from 'node:path'
 import { CostLedger } from '@tangle-network/agent-eval'
 import {
+  type AttemptRecord,
   createDataCreationLoop,
   discriminativeAcceptRule,
   type ExampleEvaluation,
 } from './data-creation-loop'
 import { type GroundedDoc, groundDoc } from './grounding'
-import { buildAutodataRoles, type RouterCallRecord } from './router-roles'
+import { buildAutodataRoles, type ChallengerStyle, type RouterCallRecord } from './router-roles'
 
 export interface DiscriminativeThresholds {
   minStrong?: number
@@ -36,6 +37,10 @@ export interface AutodataDatasetConfig {
   maxRetries?: number
   thresholds?: DiscriminativeThresholds
   models?: { challenger?: string; weak?: string; strong?: string; judge?: string }
+  /** Challenger prompt: 'causal' (non-extractive, default) or 'recall' (the calibration baseline). */
+  style?: ChallengerStyle
+  /** Where to write the per-attempt autopsy JSONL (every candidate, accepted or rejected). */
+  attemptsPath?: string
   signal?: AbortSignal
 }
 
@@ -65,11 +70,15 @@ export interface AutodataDatasetResult {
   plainGaps: number[]
   agenticGaps: number[]
   refinedGaps: number[]
+  /** Every evaluated candidate (accepted or rejected) with both solvers' answers — the autopsy trail. */
+  attempts: AttemptRecord[]
   cost: CostLedger
   costPerExampleUsd: number | null
   /** How many router calls were priced by the router vs rate-estimated. */
   callProvenance: { router: number; estimated: number }
   outPath: string
+  /** Where the per-attempt autopsy JSONL was written (null if not requested). */
+  attemptsPath: string | null
 }
 
 function mean(xs: number[]): number | null {
@@ -80,16 +89,31 @@ function isGrounded(s: AutodataDatasetConfig['source']): s is GroundedDoc {
   return typeof (s as GroundedDoc).doc === 'string'
 }
 
-function challengerInstruction(doc: string): string {
+/** The causal (default) user instruction — pairs with the non-extractive challenger system prompt. */
+function causalInstruction(doc: string): string {
   return (
     `SOURCE DOCUMENT EXCERPT:\n\n${doc}\n\n` +
-    `Write ONE hard exam question grounded in this excerpt. It must require multi-step reasoning ` +
-    `over the excerpt (a small model should get it wrong, a strong model right), never a verbatim ` +
-    `lookup. Return STRICT JSON: {"context": string, "question": string, "reference": string, ` +
-    `"rubric": string[] }.`
+    `Write ONE hard CAUSAL / COMPARATIVE / MECHANISM / THESIS-CONSISTENCY question grounded in this ` +
+    `excerpt — never a recall / lookup / definition. The CONTEXT must give the solver the premises ` +
+    `but MUST NOT state the answer; the answer has to be DERIVED. Return STRICT JSON: ` +
+    `{"context": string, "question": string, "reference": string, "rubric": string[] }.`
   )
 }
 
+/** The recall (baseline) user instruction — pairs with the extractive challenger; for calibration. */
+function recallInstruction(doc: string): string {
+  return (
+    `SOURCE DOCUMENT EXCERPT:\n\n${doc}\n\n` +
+    `Write ONE exam question grounded in this excerpt, with a short context excerpt the question is ` +
+    `answerable from, a reference answer, and a 2-3 item rubric. Return STRICT JSON: ` +
+    `{"context": string, "question": string, "reference": string, "rubric": string[] }.`
+  )
+}
+
+function instructionFor(style: ChallengerStyle): (doc: string) => string {
+  return style === 'recall' ? recallInstruction : causalInstruction
+}
+
 /** Run the full pipeline: ground → loop → JSONL. Returns the calibration numbers + cost. */
 export async function buildAutodataDataset(
   config: AutodataDatasetConfig,
@@ -110,6 +134,7 @@ export async function buildAutodataDataset(
   }
 
   const ledger = new CostLedger()
+  const style: ChallengerStyle = config.style ?? 'causal'
 
   const roles = buildAutodataRoles({
     apiKey: config.apiKey,
@@ -118,13 +143,27 @@ export async function buildAutodataDataset(
     weakModel: config.models?.weak,
     strongModel: config.models?.strong,
     judgeModel: config.models?.judge,
+    challengerStyle: style,
     ledger,
     onCall,
   })
 
+  // Per-attempt autopsy trail: every candidate (accepted or rejected) is appended as one JSONL row
+  // with both solvers' answer text + scores, so a null is diagnosable from the raw answers.
+  const attempts: AttemptRecord[] = []
+  const attemptsPath = config.attemptsPath ?? null
+  if (attemptsPath) {
+    await mkdir(dirname(attemptsPath), { recursive: true })
+    await rm(attemptsPath, { force: true })
+  }
+  const onAttempt = async (rec: AttemptRecord): Promise<void> => {
+    attempts.push(rec)
+    if (attemptsPath) await appendFile(attemptsPath, `${JSON.stringify({ ...rec, style })}\n`)
+  }
+
   const result = await createDataCreationLoop({
     doc: source.doc,
-    baseInstruction: challengerInstruction,
+    baseInstruction: instructionFor(style),
     challenger: roles.challenger,
     weakSolver: roles.weakSolver,
     strongSolver: roles.strongSolver,
@@ -134,6 +173,7 @@ export async function buildAutodataDataset(
     samples: config.samples ?? 3,
     maxRetries: config.maxRetries ?? 4,
     cost: ledger,
+    onAttempt,
     signal: config.signal,
   })
 
@@ -164,9 +204,11 @@ export async function buildAutodataDataset(
     plainGaps: result.plainGaps,
     agenticGaps: result.agenticGaps,
     refinedGaps: result.refinedGaps,
+    attempts,
     cost: result.cost,
     costPerExampleUsd: result.cost.costPerCompletedTask(),
     callProvenance: provenance,
     outPath: config.outPath,
+    attemptsPath,
   }
 }
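
Note on the accept bar that `DiscriminativeThresholds` parameterises: the results doc states it as weak < 0.5, strong ≥ 0.65, gap ≥ 0.2. A minimal sketch of that rule as a pure function follows. It is an illustrative stand-in, not the repo's `discriminativeAcceptRule` (whose signature is not shown in this diff); `sketchAcceptRule`, `SketchThresholds`, `maxWeak` and `minGap` are hypothetical names, and the reason strings are only assumed to resemble the ones the fold matches on.

```typescript
// Hypothetical sketch (NOT the repo's discriminativeAcceptRule).
interface SketchThresholds {
  minStrong: number // strong solver must score at least this
  maxWeak: number // weak solver must score strictly below this
  minGap: number // strong minus weak must be at least this
}

interface SketchDecision {
  accept: boolean
  reason: string
}

// Values taken from the results doc's stated accept thresholds.
const DOC_THRESHOLDS: SketchThresholds = { minStrong: 0.65, maxWeak: 0.5, minGap: 0.2 }

function sketchAcceptRule(
  weakMean: number,
  strongMean: number,
  t: SketchThresholds = DOC_THRESHOLDS,
): SketchDecision {
  const gap = strongMean - weakMean
  if (strongMean < t.minStrong) {
    return { accept: false, reason: `too hard: strong ${strongMean.toFixed(2)} < ${t.minStrong}` }
  }
  if (weakMean >= t.maxWeak) {
    return { accept: false, reason: `too easy: weak ${weakMean.toFixed(2)} >= ${t.maxWeak}` }
  }
  if (gap < t.minGap) {
    return { accept: false, reason: `not discriminative: gap ${gap.toFixed(2)} < ${t.minGap}` }
  }
  return { accept: true, reason: `accepted: gap ${gap.toFixed(2)}` }
}

// The autopsied example from the results doc: weak mean 0.24, strong mean 1.00.
console.log(sketchAcceptRule(0.24, 1.0))
```

With these thresholds a weak mean of 0.75 is rejected as too easy no matter how well the strong solver scores, which matches the near-miss described in the results doc.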
diff --git a/src/autodata/calibrate.ts b/src/autodata/calibrate.ts
new file mode 100644
index 0000000..b024aa7
--- /dev/null
+++ b/src/autodata/calibrate.ts
@@ -0,0 +1,127 @@
+/**
+ * Autodata calibration: does the non-extractive CAUSAL challenger widen the strong/weak gap vs the
+ * old RECALL challenger, on the SAME grounded document? This is the lever's proof — the prior null
+ * was "recall question → answer leaks into context → an 8B reads it out → gap ~0". If the fix works,
+ * the causal arm's mean gap (and accepted count) clears the recall arm's by a clear margin.
+ *
+ * Run (key never printed):
+ *   dotenvx run -f <secrets>.env -- \
+ *     pnpm tsx src/autodata/calibrate.ts
+ *
+ * Env knobs: AUTODATA_URL, AUTODATA_FOCUS, AUTODATA_TARGET, AUTODATA_SAMPLES, AUTODATA_MAXRETRIES,
+ *            AUTODATA_{WEAK,STRONG,CHALLENGER,JUDGE}_MODEL, TANGLE_API_KEY.
+ */
+
+import { type AutodataDatasetResult, buildAutodataDataset } from './build-dataset'
+import type { AttemptRecord } from './data-creation-loop'
+import { DEFAULT_SOURCE_URL, groundDoc } from './grounding'
+import {
+  CHALLENGER_MODEL,
+  type ChallengerStyle,
+  JUDGE_MODEL,
+  STRONG_SOLVER_MODEL,
+  smokeTestModels,
+  WEAK_SOLVER_MODEL,
+} from './router-roles'
+
+function envInt(name: string, fallback: number): number {
+  const raw = process.env[name]
+  if (!raw) return fallback
+  const n = Number.parseInt(raw, 10)
+  if (!/^\d+$/.test(raw.trim()) || !Number.isFinite(n) || n <= 0) throw new Error(`${name}='${raw}' is not a positive integer`)
+  return n
+}
+
+function mean(xs: number[]): number | null {
+  return xs.length === 0 ? null : xs.reduce((a, b) => a + b, 0) / xs.length
+}
+
+function fmt(x: number | null, d = 3): string {
+  return x === null ? 'n/a' : x.toFixed(d)
+}
+
+/** Mean strong/weak gap over the quality-clean attempts of one arm — the discriminating power. */
+function armGap(attempts: AttemptRecord[]): number | null {
+  return mean(attempts.filter((a) => a.qualityOk).map((a) => a.gap))
+}
+
+async function runArm(args: {
+  apiKey: string
+  source: Awaited<ReturnType<typeof groundDoc>>
+  style: ChallengerStyle
+  target: number
+  samples: number
+  maxRetries: number
+}): Promise<AutodataDatasetResult> {
+  return buildAutodataDataset({
+    apiKey: args.apiKey,
+    source: args.source,
+    outPath: `data/autodata-calib-${args.style}.jsonl`,
+    attemptsPath: `data/autodata-calib-${args.style}-attempts.jsonl`,
+    style: args.style,
+    target: args.target,
+    samples: args.samples,
+    maxRetries: args.maxRetries,
+  })
+}
+
+async function main(): Promise<void> {
+  const apiKey = process.env.TANGLE_API_KEY ?? process.env.TANGLE_ROUTER_KEY
+  if (!apiKey) throw new Error('no TANGLE_API_KEY or TANGLE_ROUTER_KEY in env; run under dotenvx so the key is set')
+
+  const url = process.env.AUTODATA_URL ?? DEFAULT_SOURCE_URL
+  const focus = process.env.AUTODATA_FOCUS ?? 'attention'
+  const target = envInt('AUTODATA_TARGET', 2)
+  const samples = envInt('AUTODATA_SAMPLES', 2)
+  const maxRetries = envInt('AUTODATA_MAXRETRIES', 2)
+
+  console.log('Autodata calibration · recall vs causal challenger (same doc)\n')
+  console.log(
+    `  challenger/judge=${CHALLENGER_MODEL}/${JUDGE_MODEL}  weak=${WEAK_SOLVER_MODEL}  strong=${STRONG_SOLVER_MODEL}`,
+  )
+
+  const smoke = await smokeTestModels({
+    apiKey,
+    models: Array.from(new Set([CHALLENGER_MODEL, JUDGE_MODEL, WEAK_SOLVER_MODEL, STRONG_SOLVER_MODEL])),
+  })
+  const dead = smoke.filter((s) => !s.ok)
+  if (dead.length > 0)
+    throw new Error(`cost gate failed — empty from: ${dead.map((d) => d.model).join(', ')}`)
+  console.log('  cost gate ok (all models returned content)\n')
+
+  const source = await groundDoc({ url, focus })
+  console.log(`Grounded on ${source.url}  section='${source.headingPath}'\n`)
+
+  // Same doc, same budget, only the challenger prompt changes — a clean A/B on the lever.
+  const recall = await runArm({ apiKey, source, style: 'recall', target, samples, maxRetries })
+  const causal = await runArm({ apiKey, source, style: 'causal', target, samples, maxRetries })
+
+  const recallGap = armGap(recall.attempts)
+  const causalGap = armGap(causal.attempts)
+
+  console.log('— Calibration result —')
+  console.log(
+    `  RECALL  attempts=${recall.attempts.length}  mean gap=${fmt(recallGap)}  accepted=${recall.accepted.length}`,
+  )
+  console.log(
+    `  CAUSAL  attempts=${causal.attempts.length}  mean gap=${fmt(causalGap)}  accepted=${causal.accepted.length}`,
+  )
+  if (recallGap !== null && causalGap !== null) {
+    const delta = causalGap - recallGap
+    console.log(
+      `  Δ (causal − recall) = ${delta >= 0 ? '+' : ''}${delta.toFixed(3)}  ` +
+        (delta >= 0.1
+          ? '→ the causal challenger WIDENS the gap (the lever works)'
+          : '→ no meaningful widening from the causal challenger (honest null)'),
+    )
+  }
+
+  const totalUsd = recall.cost.summary().totalCostUsd + causal.cost.summary().totalCostUsd
+  console.log(`\n  spend: $${totalUsd.toFixed(4)} (both arms)`)
+  console.log(`  trails: ${recall.attemptsPath}  ${causal.attemptsPath}`)
+}
+
+main().catch((err) => {
+  console.error(err)
+  process.exit(1)
+})
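
Note on the attempt trails printed at the end of `calibrate.ts`: each JSONL row is a serialized `AttemptRecord`, so the arm's mean gap can be recomputed offline from the file text. A minimal sketch follows, assuming only the `gap` and `qualityOk` fields defined on `AttemptRecord` in this diff; `parseTrail`, `meanCleanGap` and `TrailRow` are hypothetical names, not repo exports.

```typescript
// Hypothetical offline helpers (NOT in the repo) for re-deriving the arm gap from a trail.
// Only the two AttemptRecord fields needed for the mean are read; other fields are ignored.
interface TrailRow {
  gap: number
  qualityOk: boolean
}

function parseTrail(jsonl: string): TrailRow[] {
  const rows: TrailRow[] = []
  jsonl.split('\n').forEach((line, index) => {
    if (line.trim() === '') return // tolerate the trailing newline
    const parsed: unknown = JSON.parse(line) // malformed JSON throws: fail loud
    const row = parsed as { gap?: unknown; qualityOk?: unknown } | null
    if (
      row === null ||
      typeof row !== 'object' ||
      typeof row.gap !== 'number' ||
      typeof row.qualityOk !== 'boolean'
    ) {
      throw new Error(`trail line ${index + 1}: missing numeric gap or boolean qualityOk`)
    }
    rows.push({ gap: row.gap, qualityOk: row.qualityOk })
  })
  return rows
}

// Mean gap over quality-clean attempts, mirroring what armGap does in calibrate.ts.
function meanCleanGap(rows: TrailRow[]): number | null {
  const clean = rows.filter((r) => r.qualityOk).map((r) => r.gap)
  return clean.length === 0 ? null : clean.reduce((a, b) => a + b, 0) / clean.length
}
```

In practice the text would come from reading one of the `data/autodata-calib-*-attempts.jsonl` files; keeping the parser a pure function of the string makes it easy to check without touching disk.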
diff --git a/src/autodata/data-creation-loop.test.ts b/src/autodata/data-creation-loop.test.ts
index df917cc..ec42325 100644
--- a/src/autodata/data-creation-loop.test.ts
+++ b/src/autodata/data-creation-loop.test.ts
@@ -1,5 +1,6 @@
 import { describe, expect, it } from 'vitest'
 import {
+  type AttemptRecord,
   createDataCreationLoop,
   discriminativeAcceptRule,
   qualityCheck,
@@ -127,4 +128,45 @@ describe('createDataCreationLoop (offline)', () => {
     expect(stored).toHaveLength(2)
     expect(result.cost.summary().totalCostUsd).toBeGreaterThan(0)
   })
+
+  it('emits a per-attempt record (with both solvers’ answers) for every candidate, accepted or rejected', async () => {
+    const attempts: AttemptRecord[] = []
+    const result = await createDataCreationLoop({
+      doc: groundingDoc,
+      baseInstruction,
+      challenger: challengerClient(),
+      weakSolver: solverClient('weak'),
+      strongSolver: solverClient('strong'),
+      judge: buildRubricJudge(),
+      target: 1,
+      samples: 3,
+      maxRetries: 4,
+      onAttempt: (rec) => {
+        attempts.push(rec)
+      },
+    })
+
+    // The first slot's first draft is the EASY example (rejected "too easy"), then the fold steers
+    // to a HARD example (accepted) — so we observe at least one reject AND one accept.
+    expect(attempts.length).toBeGreaterThanOrEqual(2)
+    const rejected = attempts.filter((a) => !a.decision.accept)
+    const accepted = attempts.filter((a) => a.decision.accept)
+    expect(rejected.length).toBeGreaterThanOrEqual(1)
+    expect(accepted.length).toBeGreaterThanOrEqual(1)
+    expect(result.accepted).toHaveLength(1)
+
+    // Every emitted attempt carries the raw answers + scores the autopsy needs.
+    for (const a of attempts) {
+      expect(a.weak.samples).toHaveLength(3)
+      expect(a.strong.samples).toHaveLength(3)
+      expect(typeof a.weak.samples[0]?.answer).toBe('string')
+      expect(a.gap).toBeCloseTo(a.strong.mean - a.weak.mean, 6)
+    }
+
+    // The first attempt (iteration 0) is the plain draft and should NOT discriminate; a later
+    // attempt should — the fold widened the gap.
+    const plain = attempts.find((a) => a.iteration === 0)
+    expect(plain?.decision.accept).toBe(false)
+    expect(Math.max(...attempts.map((a) => a.gap))).toBeGreaterThan(plain?.gap ?? 0)
+  })
 })
diff --git a/src/autodata/data-creation-loop.ts b/src/autodata/data-creation-loop.ts
index 9022cbb..4d026a7 100644
--- a/src/autodata/data-creation-loop.ts
+++ b/src/autodata/data-creation-loop.ts
@@ -168,12 +168,26 @@ const solverOutput: OutputAdapter<{ answer: string }> = {
   },
 }
 
+/** One solver attempt's recorded answer + judge score — the unit of the per-example autopsy dump. */
+export interface SolverSample {
+  readonly answer: string
+  readonly score: number
+  readonly notes?: string
+}
+
+/** A solver's N× sampled result: the variance-reduced mean the accept rule compares + the raw samples. */
+export interface SolverEval {
+  readonly mean: number
+  readonly samples: readonly SolverSample[]
+}
+
 // ── N× solver sampling = an inline FANOUT driver over runLoop ────────────────────────────────
 //
 // A "round" returns N independent solver tasks (no fold between them) → the kernel runs all N,
 // the `llmJudge`-as-validator scores each against the rubric, and we AVERAGE the N scores (the
 // variance-reduced estimate the accept rule compares — not argmax). runLoop already aggregated
-// the N calls' cost, so we roll its total into the ledger under this solver's channel.
+// the N calls' cost, so we roll its total into the ledger under this solver's channel. Each sample's
+// ANSWER TEXT and score are captured (not just the mean) so a null is autopsy-able per example.
 async function sampleSolverScore(args: {
   solver: SandboxClient
   solverSpec: AgentRunSpec<SolverTask>
@@ -183,9 +197,10 @@ async function sampleSolverScore(args: {
   channel: string
   ledger: CostLedger
   signal?: AbortSignal
-}): Promise<number> {
+}): Promise<SolverEval> {
   const { solver, solverSpec, example, judge, samples, channel, ledger } = args
 
+  const collected: SolverSample[] = []
   const validator: Validator<{ answer: string }> = {
     async validate(out, ctx) {
       const score = await judge.score({
@@ -193,6 +208,7 @@ async function sampleSolverScore(args: {
         scenario: solveScenario,
         signal: ctx.signal,
       })
+      collected.push({ answer: out.answer, score: score.composite, notes: score.notes })
       return { valid: !score.failed, score: score.composite, notes: score.notes }
     },
   }
@@ -225,16 +241,52 @@ async function sampleSolverScore(args: {
     tags: { role: channel },
   })
 
-  const scored = result.iterations.filter((it) => it.verdict).map((it) => it.verdict?.score ?? 0)
-  if (scored.length === 0)
+  if (collected.length === 0)
     throw new Error(`${channel}: every solver sample errored — no score to average`)
-  return scored.reduce((a, b) => a + b, 0) / scored.length
+  const mean = collected.reduce((a, b) => a + b.score, 0) / collected.length
+  return { mean, samples: collected }
 }
 
 // ── The challenger refine driver — the FOLD ──────────────────────────────────────────────────
 
 type ChallengerDecision = 'refine' | 'accept' | 'reject'
 
+/**
+ * The steer the fold appends per reject reason. The accept rule and this map are a matched pair:
+ * "too easy" almost always means the answer leaked into the context (recall), so the steer is to go
+ * non-extractive; "too hard" eases the derivation; "not discriminative" sharpens the contrast.
+ */
+function foldGuidance(why: string): string {
+  if (/leaked/i.test(why)) {
+    return (
+      'The reference answer leaked into the context. Rewrite so the CONTEXT holds ONLY the premises ' +
+      'and the answer must be DERIVED, never quoted.'
+    )
+  }
+  if (/too easy/i.test(why)) {
+    return (
+      'The weak solver scored too high — the answer is extractable from the context (recall / ' +
+      'leakage). Ask a CAUSAL or COMPARATIVE question whose answer is NOT stated in the context and ' +
+      'must be derived, and delete any sentence from the context that states the conclusion.'
+    )
+  }
+  if (/too hard/i.test(why)) {
+    return (
+      'Even the strong solver missed it — ease it. Keep it causal, but shorten the required ' +
+      'reasoning chain and make sure EVERY premise needed to derive the answer is present in the ' +
+      'context (without stating the conclusion).'
+    )
+  }
+  if (/not discriminative/i.test(why)) {
+    return (
+      'Both solvers scored similarly — sharpen the contrast: the question must need a multi-step ' +
+      'derivation a small model gets wrong, while keeping every premise a strong model needs in the ' +
+      'context.'
+    )
+  }
+  return 'Write a new example that fixes exactly the stated problem.'
+}
+
 function challengerDriver(
   maxRetries: number,
   baseInstruction: (doc: string) => string,
@@ -247,9 +299,9 @@ function challengerDriver(
       if (last?.verdict?.valid) return [] // accepted → stop
       if (history.length >= maxRetries) return [] // out of budget → stop
       // THE FOLD: read WHY the last example was rejected and rewrite the instruction to target it.
-      // "too easy" → make it harder; "too hard" → ease it; "leaked" → keep the answer out of context.
+      // Each accept-rule reason maps to a specific steer toward "just right".
       const why = last?.verdict?.notes ?? 'rejected'
-      const prompt = `${baseInstruction(task.doc)}\n\nYour previous example was REJECTED: ${why}. Write a new example that fixes exactly that.`
+      const prompt = `${baseInstruction(task.doc)}\n\nYour previous example was REJECTED: ${why}.\n${foldGuidance(why)}`
       return [{ ...task, prompt }]
     },
     decide(history) {
@@ -269,6 +321,26 @@ export interface ExampleEvaluation {
   readonly decision: AcceptDecision
 }
 
+/**
+ * One fully-evaluated challenger attempt — emitted to `onAttempt` for EVERY candidate, accepted or
+ * rejected. The whole point is diagnosability: a null is only a finding if you can read the actual
+ * answers and see WHY the gap didn't open (weak read it out of the context, judge couldn't separate,
+ * strong erred, …). This carries the answer text both solvers produced, not just the scalar scores.
+ */
+export interface AttemptRecord {
+  /** Which target slot (outer loop index). */
+  readonly slotIndex: number
+  /** Which refine iteration within the slot (0 = first draft / plain). */
+  readonly iteration: number
+  readonly example: DataExample
+  readonly weak: SolverEval
+  readonly strong: SolverEval
+  readonly gap: number
+  readonly decision: AcceptDecision
+  /** Whether the deterministic quality gate (no leak, real rubric) passed before solving. */
+  readonly qualityOk: boolean
+}
+
 // ── The loop ────────────────────────────────────────────────────────────────────────────────
 
 export interface DataCreationConfig {
@@ -301,16 +373,27 @@ export interface DataCreationConfig {
   readonly corpus?: Corpus
   /** Cost ledger to record into. Default a fresh `CostLedger`. */
   readonly cost?: CostLedger
+  /**
+   * Observability hook: fired once per evaluated candidate (accepted OR rejected) with both solvers'
+   * answer text + scores. Wire a JSONL writer here so every null is autopsy-able. Errors are not
+   * swallowed — a throwing observer fails the run loud.
+   */
+  readonly onAttempt?: (rec: AttemptRecord) => void | Promise<void>
   readonly signal?: AbortSignal
 }
 
-/** Default solver prompt: ground the answer in the context, score against the numbered rubric. */
+/**
+ * Default solver prompt: the premises + the question — never the rubric. The rubric is the grading
+ * key; showing it to the solver hands the weak model the mark scheme and closes the very gap the
+ * discriminative reward is trying to open. The answer is not stated in the context (the challenger
+ * withholds the conclusion), so the prompt tells the solver to DERIVE it.
+ */
 function defaultRenderSolverPrompt(example: DataExample, sampleIndex: number): string {
   return (
-    `Answer the QUESTION using only the CONTEXT.\n\n` +
+    `Answer the QUESTION by reasoning from the CONTEXT. The answer is NOT stated verbatim in the ` +
+    `context — you must derive it.\n\n` +
     `CONTEXT:\n${example.context}\n\n` +
-    `QUESTION:\n${example.question}\n\n` +
-    `RUBRIC (you are graded on each):\n${example.rubric.map((r, i) => `${i + 1}. ${r}`).join('\n')}\n` +
+    `QUESTION:\n${example.question}\n` +
     `[sample ${sampleIndex}]`
   )
 }
@@ -370,10 +453,11 @@ export async function createDataCreationLoop(
     // apply the accept rule. It stashes each iteration's evaluation so the loop can read back the
     // ACCEPTED one (the agentic arm) and the FIRST draft (the plain calibration baseline).
     const evaluations = new Map<number, ExampleEvaluation>()
+    const emptyEval: SolverEval = { mean: 0, samples: [] }
     const validator: Validator<DataExample> = {
       async validate(example, ctx) {
         const quality = qualityCheck(example)
-        const weakScore = quality.ok
+        const weak = quality.ok
           ? await sampleSolverScore({
               solver: config.weakSolver,
               solverSpec: weakSolverSpec,
@@ -384,8 +468,8 @@ export async function createDataCreationLoop(
               ledger: cost,
               signal: ctx.signal,
             })
-          : 0
-        const strongScore = quality.ok
+          : emptyEval
+        const strong = quality.ok
           ? await sampleSolverScore({
               solver: config.strongSolver,
               solverSpec: strongSolverSpec,
@@ -396,12 +480,27 @@ export async function createDataCreationLoop(
               ledger: cost,
               signal: ctx.signal,
             })
-          : 0
+          : emptyEval
+        const weakScore = weak.mean
+        const strongScore = strong.mean
         const decision = quality.ok
           ? accept({ strongScore, weakScore })
           : { accept: false, reason: quality.reason }
         const gap = strongScore - weakScore
         evaluations.set(ctx.iteration, { example, weakScore, strongScore, gap, decision })
+        // Emit the full attempt (accept OR reject) so the null is diagnosable from the raw answers.
+        if (config.onAttempt) {
+          await config.onAttempt({
+            slotIndex: i,
+            iteration: ctx.iteration,
+            example,
+            weak,
+            strong,
+            gap,
+            decision,
+            qualityOk: quality.ok,
+          })
+        }
         return { valid: decision.accept, score: gap, notes: decision.reason }
       },
     }
@@ -424,7 +523,12 @@ export async function createDataCreationLoop(
       tags: { role: 'challenger' },
     })
 
-    const plain = evaluations.get(0)
+    // The "plain" (un-refined) baseline is the FIRST candidate that was actually evaluated — the
+    // earliest recorded iteration. Reading a hardcoded index 0 silently drops the baseline whenever
+    // the first challenger draft errored (e.g. unparseable JSON), which is exactly when the slot's
+    // first SUCCESSFUL draft sits at a later index. Take the min recorded iteration instead.
+    const firstIteration = [...evaluations.keys()].sort((a, b) => a - b)[0]
+    const plain = firstIteration === undefined ? undefined : evaluations.get(firstIteration)
     if (plain) plainGaps.push(plain.gap)
 
     const slotGaps = [...evaluations.values()].map((e) => e.gap)
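
Review note on the baseline fix above: the change reads the earliest recorded iteration instead of a hardcoded index 0. Below is a minimal standalone sketch of that selection, assuming evaluations live in a `Map` keyed by iteration number. `earliestEntry` is a hypothetical helper name for illustration and is not part of this PR; it uses a linear scan where the diff uses `sort(...)[0]`, which should pick the same key.

```typescript
// Hypothetical helper (not in this PR): return the value stored under the
// smallest numeric key, or undefined when nothing was recorded.
function earliestEntry<V>(byIteration: Map<number, V>): V | undefined {
  let minKey: number | undefined
  for (const key of byIteration.keys()) {
    if (minKey === undefined || key < minKey) minKey = key
  }
  return minKey === undefined ? undefined : byIteration.get(minKey)
}

// Assumed scenario: iteration 0 errored before evaluation, so the first
// recorded draft sits at iteration 1 (insertion order is deliberately shuffled).
const recorded = new Map<number, { gap: number }>([
  [2, { gap: 0.4 }],
  [1, { gap: 0.1 }],
])
const plainBaseline = earliestEntry(recorded)
const emptyBaseline = earliestEntry(new Map<number, { gap: number }>())
```

With a hardcoded `get(0)` the first map would yield no baseline at all, which is the silent drop the diff comment describes.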
diff --git a/src/autodata/grounding.ts b/src/autodata/grounding.ts
index 130a177..29a7e3d 100644
--- a/src/autodata/grounding.ts
+++ b/src/autodata/grounding.ts
@@ -3,17 +3,23 @@
  * (`politeFetch` → `htmlToText` → `chunkMarkdown`). Fetches the page, strips it to text, chunks it,
  * and selects ONE content-rich chunk as the grounding excerpt the challenger writes questions from.
  *
- * The default source is the "Attention Is All You Need" paper via ar5iv (arXiv's LaTeX→HTML service),
- * a stable real paper with multi-step-reasoning content that affords genuinely discriminating
- * questions. Any arXiv / ar5iv URL works; pass a `focus` term to bias chunk selection toward a section.
+ * The default source is the Mixtral-of-Experts paper (arXiv 2401.04088) via ar5iv. The doc CHOICE is
+ * load-bearing: a hard question only separates a small solver from a frontier one if the small solver
+ * cannot just RECALL the answer from pretraining. The canonical "Attention Is All You Need" paper is
+ * the worst case — an 8B has memorized it, so even reasoning questions are answerable from memory and
+ * the strong/weak gap collapses (an empirically verified null). Mixtral (Jan 2024) post-dates the 8B
+ * weak solver's knowledge cutoff, so it must reason from the provided context — which is where a
+ * non-extractive causal question opens a real gap. Any arXiv / ar5iv URL works; pass `focus` to bias
+ * chunk selection toward a section.
  */
 
 import { chunkMarkdown } from '../chunking'
 import { htmlToText } from '../sources/html'
 import { politeFetch } from '../sources/http'
 
-/** A stable real arXiv paper (Transformer / "Attention Is All You Need") rendered to HTML by ar5iv. */
-export const DEFAULT_SOURCE_URL = 'https://ar5iv.labs.arxiv.org/html/1706.03762'
+/** A stable real arXiv paper (Mixtral of Experts) rendered to HTML by ar5iv — see the note above on
+ *  why a NON-memorized doc is required for the strong/weak gap to open. */
+export const DEFAULT_SOURCE_URL = 'https://ar5iv.labs.arxiv.org/html/2401.04088'
 
 export interface GroundDocOptions {
   url: string
diff --git a/src/autodata/index.ts b/src/autodata/index.ts
index e211f33..c1372a7 100644
--- a/src/autodata/index.ts
+++ b/src/autodata/index.ts
@@ -18,6 +18,7 @@ export {
 } from './build-dataset'
 export {
   type AcceptDecision,
+  type AttemptRecord,
   createDataCreationLoop,
   type DataCreationConfig,
   type DataCreationResult,
@@ -26,6 +27,8 @@ export {
   type ExampleEvaluation,
   qualityCheck,
   type SolverArtifact,
+  type SolverEval,
+  type SolverSample,
 } from './data-creation-loop'
 export {
   DEFAULT_SOURCE_URL,
@@ -37,6 +40,7 @@ export {
   type AutodataRoles,
   buildAutodataRoles,
   CHALLENGER_MODEL,
+  type ChallengerStyle,
   DEFAULT_BASE_URL,
   JUDGE_MODEL,
   parseDataExample,
diff --git a/src/autodata/router-roles.ts b/src/autodata/router-roles.ts
index 39801ef..30b5769 100644
--- a/src/autodata/router-roles.ts
+++ b/src/autodata/router-roles.ts
@@ -3,16 +3,20 @@
  *
  * One transport seam — `routerChat` — POSTs `/chat/completions` and returns content + exact token
  * usage + a per-call USD cost (the router's own cost when it returns one, else a documented
- * rate-table estimate over the exact token counts; the source is flagged, never silently faked).
- * The four roles are materialized on top of it:
- *   • challenger (glm-5.2) → an `inProcessSandboxClient` that asks for ONE JSON example and parses it
- *   • weak solver (qwen-2.5-7b) / strong solver (qwen3-235b) → `inProcessSandboxClient` answer workers
- *   • judge (glm-5.2) → an `llmJudge` `JudgeConfig` whose transport is a `sandbox-sdk` ChatClient
- *     wrapping `routerChat`; the judge's own spend is recorded into the same `CostLedger` (the loop
- *     only aggregates challenger + solver spend, so the judge channel would otherwise be invisible).
+ * rate-table estimate over the exact token counts; the source is flagged, never silently faked). It
+ * retries only TRANSIENT failures (the router's "upstream capacity, retry shortly" 503s, 429/502/504,
+ * network blips, per-request timeouts) with bounded backoff; a non-transient non-2xx fails loud.
+ * The four roles are materialized on top of it (all models env-overridable — see the constants below):
+ *   • challenger (`deepseek-v4-flash`) → an `inProcessSandboxClient` that authors ONE NON-EXTRACTIVE
+ *     causal/comparative/mechanism/thesis-consistency JSON example and parses it.
+ *   • weak solver (`groq/llama-3.1-8b-instant`) / strong solver (`gemini-2.5-pro`) → answer workers.
+ *   • judge (`deepseek-v4-flash`) → an `llmJudge` `JudgeConfig` whose transport is a `sandbox-sdk`
+ *     ChatClient wrapping `routerChat`; the judge's own spend is recorded into the same `CostLedger`
+ *     (the loop only aggregates challenger + solver spend, so the judge channel is recorded here).
  *
- * glm-5.2 returns empty content unless `max_tokens` is generous, so every glm call is floored and the
- * judge is built with an explicit `maxTokens`.
+ * A reasoning model spends its budget on hidden reasoning and returns EMPTY visible content when
+ * `max_tokens` is too low (a glm/gemini footgun), so every call is floored and solvers fail loud on
+ * empty content rather than scoring a non-answer as 0 (which would corrupt the gap).
  */
 
 import {
@@ -30,20 +34,20 @@ import type { DataExample, SolverArtifact } from './data-creation-loop'
 
 export const DEFAULT_BASE_URL = 'https://router.tangle.tools/v1'
 
-// A genuine small-vs-large tier in one model family. The brief specified the Qwen tier
-// (`qwen/qwen-2.5-7b-instruct` weak, `qwen/qwen3-235b-a22b` strong), but on the live Tangle router
-// EVERY Qwen id 401s `No API key configured for model` for this key — the Qwen upstream is not
-// provisioned (verified by probing `/v1/chat/completions` across the `/v1/models` catalog). The
-// GLM family IS served, so the real tier here is the smallest GLM (`glm-4.5-air`) as the weak solver
-// vs the latest (`glm-5.2`) as the strong solver. Same family, a real generational/size gap; swap
-// these constants back to the Qwen ids once the router provisions that upstream.
+// The proven-working tier on the live Tangle router, every id env-overridable:
+//   • weak solver `groq/llama-3.1-8b-instant` — an 8B whose knowledge cutoff predates the default
+//     grounding doc, so on non-memorized content it must REASON from the context (it can't recall),
+//     which is what lets a hard causal question separate it from a frontier solver.
+//   • strong solver `gemini-2.5-pro` — a frontier reasoner (a real wide capability gap vs the 8B).
+//   • challenger + judge `deepseek-v4-flash` — a capable, fast, RELIABLE author/grader that is a
+//     DIFFERENT family from both solvers (so the judge does not favour either solver's style). The
+//     brief's `glm-5.2` works too when the router has GLM capacity; swap it back via env when it is up.
 // The solver tier is the experiment's load-bearing knob — a real strong>weak capability gap is
-// required for any example to clear the discriminative bar. Overridable by env so the tier can be
-// swept without a code change (e.g. AUTODATA_STRONG_MODEL=gemini-2.5-pro AUTODATA_WEAK_MODEL=groq/llama-3.1-8b-instant).
-export const WEAK_SOLVER_MODEL = process.env.AUTODATA_WEAK_MODEL ?? 'glm-4.5-air'
-export const STRONG_SOLVER_MODEL = process.env.AUTODATA_STRONG_MODEL ?? 'glm-5.2'
-export const CHALLENGER_MODEL = process.env.AUTODATA_CHALLENGER_MODEL ?? 'glm-5.2'
-export const JUDGE_MODEL = process.env.AUTODATA_JUDGE_MODEL ?? 'glm-5.2'
+// required for any example to clear the discriminative bar.
+export const WEAK_SOLVER_MODEL = process.env.AUTODATA_WEAK_MODEL ?? 'groq/llama-3.1-8b-instant'
+export const STRONG_SOLVER_MODEL = process.env.AUTODATA_STRONG_MODEL ?? 'gemini-2.5-pro'
+export const CHALLENGER_MODEL = process.env.AUTODATA_CHALLENGER_MODEL ?? 'deepseek-v4-flash'
+export const JUDGE_MODEL = process.env.AUTODATA_JUDGE_MODEL ?? 'deepseek-v4-flash'
 
 interface ModelPrice {
   /** USD per 1M input tokens. */
@@ -60,7 +64,9 @@ interface ModelPrice {
  */
 const PRICE_TABLE: Record<string, ModelPrice> = {
   'glm-4.5-air': { inputPerM: 0.2, outputPerM: 0.6 },
+  'glm-4.6': { inputPerM: 0.6, outputPerM: 2.2 },
   'glm-5.2': { inputPerM: 0.95, outputPerM: 3.0 },
+  'deepseek-v4-flash': { inputPerM: 0.27, outputPerM: 0.41 },
   // Wide-tier solver pair (a genuine small-vs-frontier capability gap). Approximate router rates.
   'groq/llama-3.1-8b-instant': { inputPerM: 0.05, outputPerM: 0.08 },
   'gemini-2.5-pro': { inputPerM: 1.25, outputPerM: 10.0 },
@@ -87,6 +93,10 @@ export interface RouterChatInput {
   jsonMode?: boolean
   signal?: AbortSignal
   onCall?: (rec: RouterCallRecord) => void
+  /** Per-request deadline so a stalled upstream can't hang the loop. Default 60s. */
+  timeoutMs?: number
+  /** Bounded retries on TRANSIENT failures (503/429/502/504, network, timeout). Default 4. */
+  maxRetries?: number
 }
 
 export interface RouterChatResult {
@@ -128,27 +138,78 @@ function estimateCostUsd(model: string, promptTokens: number, completionTokens:
  * exact prompt/completion token counts, and a USD cost (router-reported when present, else
  * rate-estimated over the real token counts) with its source flagged.
  */
+/** Transient upstream statuses the router itself tells us to "retry shortly" — safe to re-issue. */
+const transientStatuses = new Set([429, 502, 503, 504])
+
+function sleep(ms: number): Promise<void> {
+  return new Promise((resolve) => setTimeout(resolve, ms))
+}
+
+/** Backoff: ~1s, 2s, 4s, 8s (base capped at 10s) plus <250ms jitter; count bounded by maxRetries. */
+function backoffMs(attempt: number): number {
+  return Math.min(10_000, 2 ** attempt * 1000) + Math.floor(Math.random() * 250)
+}
+
 export async function routerChat(input: RouterChatInput): Promise<RouterChatResult> {
   const baseUrl = (input.baseUrl ?? DEFAULT_BASE_URL).replace(/\/$/, '')
   const max_tokens = Math.max(input.maxTokens, maxTokensFloor(input.model))
-  const res = await fetch(`${baseUrl}/chat/completions`, {
-    method: 'POST',
-    headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${input.apiKey}` },
-    signal: input.signal,
-    body: JSON.stringify({
-      model: input.model,
-      messages: input.messages,
-      max_tokens,
-      temperature: input.temperature ?? 0.2,
-      stream: false,
-      ...(input.jsonMode ? { response_format: { type: 'json_object' } } : {}),
-    }),
+  const timeoutMs = input.timeoutMs ?? 60_000
+  const maxRetries = input.maxRetries ?? 4
+  const payload = JSON.stringify({
+    model: input.model,
+    messages: input.messages,
+    max_tokens,
+    temperature: input.temperature ?? 0.2,
+    stream: false,
+    ...(input.jsonMode ? { response_format: { type: 'json_object' } } : {}),
   })
-  if (!res.ok) {
-    const detail = await res.text().catch(() => res.statusText)
-    throw new Error(`router ${res.status} for ${input.model}: ${detail.slice(0, 400)}`)
+
+  // One non-streaming chat call, retried only on TRANSIENT failures (the router's own
+  // "upstream capacity, retry shortly" 503s, plus 429/502/504, network errors, and per-request
+  // timeouts). A non-transient non-2xx (401/400/404) fails loud immediately — never silently.
+  let body: Record<string, unknown> | undefined
+  let lastTransient = ''
+  for (let attempt = 0; attempt <= maxRetries; attempt++) {
+    // Combine the caller's abort with a per-request deadline so a stalled upstream can't hang us.
+    const deadline = AbortSignal.timeout(timeoutMs)
+    const signal = input.signal ? AbortSignal.any([input.signal, deadline]) : deadline
+    let res: Response
+    try {
+      res = await fetch(`${baseUrl}/chat/completions`, {
+        method: 'POST',
+        headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${input.apiKey}` },
+        signal,
+        body: payload,
+      })
+    } catch (err) {
+      // The caller's own abort is final; a timeout/network blip is transient and retryable.
+      if (input.signal?.aborted) throw err
+      lastTransient = `network/timeout: ${String(err).slice(0, 120)}`
+      if (attempt < maxRetries) {
+        await sleep(backoffMs(attempt))
+        continue
+      }
+      throw new Error(
+        `router call for ${input.model} failed after ${attempt + 1} tries — ${lastTransient}`,
+      )
+    }
+    if (!res.ok) {
+      const detail = await res.text().catch(() => res.statusText)
+      if (transientStatuses.has(res.status) && attempt < maxRetries) {
+        lastTransient = `${res.status}: ${detail.slice(0, 120)}`
+        await sleep(backoffMs(attempt))
+        continue
+      }
+      throw new Error(`router ${res.status} for ${input.model}: ${detail.slice(0, 400)}`)
+    }
+    body = (await res.json()) as Record<string, unknown>
+    break
+  }
+  if (!body) {
+    throw new Error(
+      `router call for ${input.model} exhausted ${maxRetries + 1} tries — ${lastTransient}`,
+    )
   }
-  const body = (await res.json()) as Record<string, unknown>
   const choice = (body.choices as { message?: { content?: string }; finish_reason?: string }[])?.[0]
   const usage = (body.usage ?? {}) as { prompt_tokens?: number; completion_tokens?: number }
   const promptTokens = usage.prompt_tokens ?? 0
@@ -232,20 +293,68 @@ export function parseDataExample(text: string): DataExample {
 
 // ── The roles ─────────────────────────────────────────────────────────────────────────────────
 
+// The non-extractive challenger. The prior loop nulled because the context LEAKED the answer:
+// the question was recall/lookup, so an 8B read it out as well as a frontier model and the gap
+// collapsed to ~0. The fix is the paper's: ask CAUSAL / COMPARATIVE / MECHANISM / THESIS-CONSISTENCY
+// questions whose answer is an INFERENCE, and withhold the conclusion from the context the solver
+// sees ("problems only, no solution") so the answer must be DERIVED, never quoted.
 const challengerSystem =
-  'You write ONE hard exam question from a source document. The question must require multi-step ' +
-  'reasoning a small model would get wrong but a strong model would get right — never a verbatim ' +
-  'lookup. Return STRICT JSON and nothing else: ' +
-  '{"context": string, "question": string, "reference": string, "rubric": string[] }. ' +
-  'The context is a short excerpt from the document; the question must NOT be answerable by copying ' +
-  'a sentence; the reference is the correct answer; the rubric is 2-3 scoring criteria. ' +
-  'Do NOT put the reference answer verbatim inside the context.'
+  'You are an exam author. From a source excerpt, write ONE hard question that tests REASONING, ' +
+  'not recall.\n\n' +
+  'The question MUST be exactly one of these kinds:\n' +
+  '  • CAUSAL — "why does X fail / what breaks if Y is omitted".\n' +
+  '  • COMPARATIVE — "how does the tradeoff of X differ from Y, and why".\n' +
+  '  • MECHANISM — "walk through how X produces Y; what fails if a step is skipped".\n' +
+  '  • THESIS-CONSISTENCY — "which of two explanations the text offers is more consistent with its ' +
+  'overall conclusion, and how would the other undermine it".\n' +
+  'It must NEVER be recall / lookup / definition / enumeration ("what is X", "which header", ' +
+  '"list the steps", "name the ...").\n\n' +
+  'ANTI-LEAKAGE (mandatory):\n' +
+  '  • The CONTEXT must contain ONLY the premises/evidence the solver needs. It MUST NOT contain a ' +
+  'sentence that states the answer or the conclusion — the answer has to be DERIVED from the ' +
+  'premises, not quotable from the context.\n' +
+  '  • The REFERENCE is the correct DERIVED conclusion plus its reasoning chain. It must NOT appear ' +
+  'verbatim in the context.\n\n' +
+  'The RUBRIC is 2-3 criteria that reward the REASONING STEPS (e.g. "identifies that X depends on ' +
+  'Y", "explains why removing Y causes the failure", "ties it to the stated conclusion") — never ' +
+  '"mentions the keyword".\n\n' +
+  'Return STRICT JSON and nothing else: ' +
+  '{"context": string, "question": string, "reference": string, "rubric": string[] }.'
+
+// The recall / extractive challenger — the prior (nulling) behavior, kept ONLY as the calibration
+// baseline. It writes a normal question answerable straight from the excerpt, so the answer is in
+// the context and a small model reads it out as well as a frontier one. The causal vs recall gap is
+// the calibration that proves the lever.
+const recallChallengerSystem =
+  'You write ONE exam question from a source excerpt. Provide a short CONTEXT excerpt the question ' +
+  'is answerable from, the QUESTION, the REFERENCE answer, and a 2-3 item RUBRIC. ' +
+  'Return STRICT JSON and nothing else: ' +
+  '{"context": string, "question": string, "reference": string, "rubric": string[] }.'
+
+/** Which challenger prompt to materialize: the non-extractive causal author, or the recall baseline. */
+export type ChallengerStyle = 'causal' | 'recall'
 
+function challengerSystemFor(style: ChallengerStyle): string {
+  return style === 'recall' ? recallChallengerSystem : challengerSystem
+}
+
+// The reasoning judge. It sees the CONTEXT the solver saw, so it can tell a derived answer from one
+// that merely restates the context. The `reasoning` dimension IS the negative criterion: an answer
+// that paraphrases the context without deriving the conclusion scores near 0 there — which is what
+// pulls a recall-style weak answer below the strong model's derivation and opens the gap.
 const judgeSystem =
-  'You are grading a candidate ANSWER to a question against a RUBRIC and a REFERENCE answer. ' +
-  'Return JSON {"dimensions":{"rubric_coverage":N,"correctness":N},"notes":"..."} with each score ' +
-  'in [0,1]. rubric_coverage = the fraction of rubric criteria the answer satisfies; correctness = ' +
-  'how well the answer agrees with the reference. Be strict: a vague or partial answer scores low.'
+  'You grade a candidate ANSWER to a REASONING question. You are given the CONTEXT the solver was ' +
+  'shown, the REFERENCE answer, and the RUBRIC.\n' +
+  'Return JSON {"dimensions":{"rubric_coverage":N,"correctness":N,"reasoning":N},"notes":"..."} ' +
+  'with each score in [0,1].\n' +
+  '  • rubric_coverage = fraction of the rubric criteria the answer genuinely satisfies.\n' +
+  '  • correctness = how well the DERIVED conclusion agrees with the reference.\n' +
+  '  • reasoning = quality of the DERIVATION. Score HIGH only if the answer works through WHY/HOW ' +
+  'from the premises. Score near 0 if it merely RESTATES or QUOTES the context, asserts the ' +
+  'conclusion without justifying it, or is vague.\n' +
+  'NEGATIVE CRITERION: an answer that just paraphrases the context without deriving the conclusion ' +
+  'is a recall answer — it must score LOW on reasoning AND correctness, no matter how many keywords ' +
+  'it echoes. Be strict.'
 
 export interface RouterRolesConfig {
   apiKey: string
@@ -254,6 +363,8 @@ export interface RouterRolesConfig {
   weakModel?: string
   strongModel?: string
   judgeModel?: string
+  /** Challenger prompt: 'causal' (non-extractive, default) or 'recall' (the calibration baseline). */
+  challengerStyle?: ChallengerStyle
   /** Judge spend is recorded here directly (the loop captures only challenger + solver spend). */
   ledger: CostLedger
   /** Optional sink for every router call's cost provenance. */
@@ -309,6 +420,7 @@ function solverClient(cfg: RouterRolesConfig, model: string): SandboxClient {
 
 function challengerClient(cfg: RouterRolesConfig): SandboxClient {
   const model = cfg.challengerModel ?? CHALLENGER_MODEL
+  const system = challengerSystemFor(cfg.challengerStyle ?? 'causal')
   return inProcessSandboxClient({
     onPrompt: async (prompt, ctx): Promise<SandboxEvent[]> => {
       const r = await routerChat({
@@ -316,7 +428,7 @@ function challengerClient(cfg: RouterRolesConfig): SandboxClient {
         baseUrl: cfg.baseUrl,
         model,
         messages: [
-          { role: 'system', content: challengerSystem },
+          { role: 'system', content: system },
           { role: 'user', content: prompt },
         ],
         maxTokens: 1500,
@@ -399,9 +511,18 @@ function rubricJudge(cfg: RouterRolesConfig): JudgeConfig<SolverArtifact> {
         description: 'fraction of the rubric criteria the answer satisfies',
       },
       { key: 'correctness', description: 'agreement with the reference answer' },
+      {
+        key: 'reasoning',
+        description:
+          'quality of the derivation; near 0 if the answer merely restates/quotes the context',
+      },
     ],
     scale: 'unit',
+    // The judge sees the CONTEXT so it can distinguish a derived answer from a restated one (the
+    // negative criterion). Without it, a paraphrase of the context is indistinguishable from real
+    // reasoning and the gap stays closed.
     renderUser: ({ artifact }) =>
+      `CONTEXT THE SOLVER WAS GIVEN:\n${artifact.example.context}\n\n` +
       `REFERENCE ANSWER:\n${artifact.example.reference}\n\n` +
       `RUBRIC:\n${artifact.example.rubric.map((r, i) => `${i + 1}. ${r}`).join('\n')}\n\n` +
       `CANDIDATE ANSWER:\n${artifact.answer}`,
@@ -412,6 +533,7 @@ function rubricJudge(cfg: RouterRolesConfig): JudgeConfig<SolverArtifact> {
 export function buildAutodataRoles(cfg: RouterRolesConfig): AutodataRoles {
   return {
     challenger: challengerClient(cfg),
+    // weak/strong solvers + judge are style-independent; only the challenger prompt changes.
     weakSolver: solverClient(cfg, cfg.weakModel ?? WEAK_SOLVER_MODEL),
     strongSolver: solverClient(cfg, cfg.strongModel ?? STRONG_SOLVER_MODEL),
     judge: rubricJudge(cfg),
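
Review note on the retry policy added to `routerChat` in this file: it reduces to two pure decisions, whether a status is retryable and how long to wait before the next try. The sketch below mirrors the constants in the diff but omits the jitter so the schedule is deterministic. `isTransientStatus` and `baseBackoffMs` are hypothetical names used only for illustration, not identifiers from this PR.

```typescript
// Statuses the diff treats as transient; any other non-2xx fails immediately.
const TRANSIENT = new Set<number>([429, 502, 503, 504])

// Hypothetical name: true when a failed response is worth re-issuing.
function isTransientStatus(status: number): boolean {
  return TRANSIENT.has(status)
}

// Hypothetical name: the deterministic part of the delay. The diff adds
// Math.floor(Math.random() * 250) ms of jitter on top of this value.
function baseBackoffMs(attempt: number): number {
  return Math.min(10_000, 2 ** attempt * 1000)
}

// With maxRetries = 4 there are up to 5 tries, with a sleep before each of the 4 retries.
const maxRetries = 4
const schedule = Array.from({ length: maxRetries }, (_, attempt) => baseBackoffMs(attempt))
```

Under the diff's defaults (`maxRetries` 4, `timeoutMs` 60s) this amounts to roughly 15s of backoff in total, on top of up to five 60s request deadlines if every try times out.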
diff --git a/src/autodata/run.ts b/src/autodata/run.ts
index cca767f..928b69c 100644
--- a/src/autodata/run.ts
+++ b/src/autodata/run.ts
@@ -8,7 +8,8 @@
  *     pnpm tsx src/autodata/run.ts
  *
  * Env knobs: AUTODATA_URL, AUTODATA_FOCUS, AUTODATA_TARGET, AUTODATA_SAMPLES, AUTODATA_MAXRETRIES,
- *            AUTODATA_OUT, TANGLE_API_KEY (or TANGLE_ROUTER_KEY).
+ *            AUTODATA_OUT, AUTODATA_ATTEMPTS (per-attempt autopsy JSONL),
+ *            AUTODATA_{WEAK,STRONG,CHALLENGER,JUDGE}_MODEL, TANGLE_API_KEY (or TANGLE_ROUTER_KEY).
  */
 
 import { buildAutodataDataset } from './build-dataset'
@@ -42,6 +43,7 @@ async function main(): Promise<void> {
   const samples = envInt('AUTODATA_SAMPLES', 3)
   const maxRetries = envInt('AUTODATA_MAXRETRIES', 4)
   const outPath = process.env.AUTODATA_OUT ?? 'data/autodata-dataset.jsonl'
+  const attemptsPath = process.env.AUTODATA_ATTEMPTS ?? 'data/autodata-attempts.jsonl'
 
   // ── 1. COST GATE: one cheap call per model, all must return non-empty content before the burn ──
   console.log('Autodata · cost gate (one call per model)\n')
@@ -77,6 +79,7 @@ async function main(): Promise<void> {
     apiKey,
     source: grounded,
     outPath,
+    attemptsPath,
     target,
     samples,
     maxRetries,
@@ -92,6 +95,27 @@ async function main(): Promise<void> {
     console.log(`      ${ex.decision.reason}`)
   }
 
+  // ── 4b. Autopsy: the single widest-gap attempt, with BOTH solvers' actual answers ──
+  // A gap number is only a finding if you can read why it opened. Show the strongest discrimination
+  // we saw (highest gap, accepted or not) so a human can confirm it is real reasoning, not an
+  // artifact: the weak model should genuinely fail the reasoning and the strong model get it.
+  const best = result.attempts.filter((a) => a.qualityOk).sort((a, b) => b.gap - a.gap)[0]
+  if (best) {
+    const oneLine = (s: string): string => s.replace(/\s+/g, ' ').trim()
+    console.log('\n— Autopsy: widest-gap attempt (read the answers, confirm real discrimination) —')
+    console.log(`  Q: ${oneLine(best.example.question)}`)
+    console.log(`  reference: ${oneLine(best.example.reference).slice(0, 240)}`)
+    console.log(
+      `  gap=${best.gap.toFixed(2)}  (${best.decision.accept ? 'ACCEPTED' : 'rejected'}: ${best.decision.reason})`,
+    )
+    console.log(`  WEAK   mean=${best.weak.mean.toFixed(2)}`)
+    for (const [i, s] of best.weak.samples.entries())
+      console.log(`    [w${i} score=${s.score.toFixed(2)}] ${oneLine(s.answer).slice(0, 220)}`)
+    console.log(`  STRONG mean=${best.strong.mean.toFixed(2)}`)
+    for (const [i, s] of best.strong.samples.entries())
+      console.log(`    [s${i} score=${s.score.toFixed(2)}] ${oneLine(s.answer).slice(0, 220)}`)
+  }
+
   // ── 5. The empirical calibration (paper Table 1) ──
   console.log('\n— Calibration: plain first-draft gap vs agentic loop-accepted gap —')
   console.log(
@@ -118,7 +142,8 @@ async function main(): Promise<void> {
   }
   if (result.accepted.length === 0) {
     console.log(
-      '  NOTE: 0 examples cleared the discriminative accept bar — the two GLM tiers did not separate.',
+      `  NOTE: 0 examples cleared the discriminative accept bar — ${WEAK_SOLVER_MODEL} and ` +
+        `${STRONG_SOLVER_MODEL} did not separate on these questions (see the autopsy trail).`,
     )
   }
 
@@ -142,6 +167,11 @@ async function main(): Promise<void> {
   )
 
   console.log(`\n— Dataset — ${result.rows.length} row(s) written to ${result.outPath}`)
+  if (result.attemptsPath) {
+    console.log(
+      `— Autopsy trail — ${result.attempts.length} attempt(s) (accepted + rejected) at ${result.attemptsPath}`,
+    )
+  }
 }
 
 main().catch((err) => {