tangle-network · drewstone · Jun 26, 2026 · Jun 26, 2026
diff --git a/docs/results/autodata-live.md b/docs/results/autodata-live.md
@@ -1,17 +1,16 @@
-# Autodata live result: the causal challenger widens the gap (reproduced) — but clearing the accept bar is noisy at this n/tier (NOT robust)
+# Autodata live result: the causal-challenger loop reliably discriminates at power — 38% accept-rate, CI [23%, 55%] (NOT a coin-flip)
 
-Running the agentic data-creation loop (`src/autodata/`) on a real arXiv doc with real two-tier
+Running the agentic data-creation loop (`src/autodata/`) on real arXiv docs with real two-tier
 solvers, to manufacture training examples that separate a strong solver from a weak one (the
 discriminative reward of the Autodata / Agentic-Self-Instruct method).
 
-**Honest headline (two independent runs):** the non-extractive causal challenger + the refine fold
-**reliably widen the strong/weak gap by ~+0.20 vs plain generation** (reproduced in both runs — the
-method's Table-1 *direction* holds). BUT **clearing the hard accept bar** (weak < 0.5 ∧ strong ≥ 0.65
-∧ gap ≥ 0.2) is **noisy and marginal**: one run accepted 1–2 of 3, an **independent re-run accepted
-0 of 3**. The reason is in the answers — `llama-3.1-8b` on these MoE questions sometimes flails
-(0.24) and sometimes answers *competently* (0.75), straddling the 0.5 "weak must struggle" line. So:
-**directionally confirmed, not a robust positive at n=3 / this tier.** This is the same small-n
-mirage that bit the earlier two-agent A/B (positive at n=1, washes at power) — flagged, not buried.
+**Powered headline (32 independent slots, 2 docs, samples=4):** the loop **reliably manufactures
+discriminating examples — accept-rate 38%, Wilson 95% CI [23%, 55%]** (12 of 32 slots cleared the
+hard accept bar: weak < 0.5 ∧ strong ≥ 0.65 ∧ gap ≥ 0.2). The interval's lower bound (23%) sits well
+above 0, so this is a **real, repeatable rate**, not the 1–2 lucky accepts that made acceptance look
+like a coin-flip at n=3.
+Acceptance is **doc-dependent** (mixtral 19%, deepseek-v3 56%) and gated by **whether the weak model
+struggles** (it does on only 39% of attempts), but it is decisively above zero on both docs. This
+**replaces** the earlier n=3 result, which was too noisy to tell "real rate" from "coin-flip ~0".
 
 ## The two levers that turned the null into a positive
 
@@ -29,22 +28,30 @@ both fixed here:
 
 2. **The grounding doc was memorized.** The default was "Attention Is All You Need" — the most
    canonical paper in ML, which an 8B has memorized, so even reasoning questions are answerable from
-   pretraining and capability cannot separate. Fix — **ground on a doc the weak solver has not
-   memorized**: the new default is the Mixtral-of-Experts paper (arXiv 2401.04088, Jan 2024), which
-   post-dates `llama-3.1-8b`'s knowledge cutoff, forcing it to reason from the context.
+   pretraining and capability cannot separate. Fix — **ground on docs the weak solver has not
+   memorized**: the Mixtral-of-Experts paper (arXiv 2401.04088, Jan 2024) and the DeepSeek-V3 paper
+   (arXiv 2412.19437, Dec 2024), both post-dating `llama-3.1-8b`'s knowledge cutoff, forcing it to
+   reason from the context.
 
 ## Setup (all env-overridable)
 
 | role | model | why |
 |---|---|---|
-| weak solver | `groq/llama-3.1-8b-instant` | small; cutoff predates the 2024 doc → must reason, can't recall |
+| weak solver | `groq/llama-3.1-8b-instant` | small; cutoff predates the 2024 docs → must reason, can't recall |
 | strong solver | `gemini-2.5-pro` | frontier reasoner; a real wide capability gap |
 | challenger + judge | `deepseek-v4-flash` | capable, fast, reliable, a DIFFERENT family from both solvers (no judge-bias) |
-| grounding doc | Mixtral-of-Experts (2401.04088) | non-memorized, reasoning-rich (MoE routing / gating) |
+| grounding doc A | Mixtral-of-Experts (2401.04088) | non-memorized; MoE expert routing / gating (`focus=expert`) |
+| grounding doc B | DeepSeek-V3 (2412.19437) | non-memorized; auxiliary-loss-free load balancing / expert specialization (`focus=auxiliary`) |
 
-Accept thresholds (the paper's): strong >= 0.65, weak < 0.50, gap >= 0.20. (`glm-5.2`, the brief's
-challenger/judge, was returning upstream-capacity 503s during this run; `deepseek-v4-flash` is the
-live, neutral substitute. `routerChat` now retries transient 503/429/timeout with bounded backoff.)
+Accept thresholds (the paper's): strong ≥ 0.65, weak < 0.50, gap ≥ 0.20. (`glm-5.2`, the brief's
+challenger/judge, was returning upstream-capacity 503s; `deepseek-v4-flash` is the live, neutral
+substitute. `routerChat` retries transient 503/429/timeout with bounded backoff.)
+
+The grounding chunk must be **prose, not equations**: an equation-dense chunk (e.g. DeepSeek-V3's MLA
+section) breaks the challenger's strict-JSON output (LaTeX backslashes), so both `focus` terms select
+the prose description of an MoE-expert mechanism. Even so, 5 of 32 slots (~16%) still hit a
+LaTeX-in-JSON failure and produced no example — those count as rejects in the headline (the
+conservative floor); see below.
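
A minimal sketch of the LaTeX-in-JSON failure described above, assuming the challenger's reply is passed straight to `JSON.parse`. The reply strings and the `tryParse` helper are hypothetical illustrations, not code from `src/autodata/`. JSON permits only a fixed set of backslash escapes, so a raw `\sum` is a syntax error, while `\frac` is worse: `\f` is a legal escape (form feed), so it parses and silently corrupts the text.

```typescript
// String.raw keeps backslashes exactly as a model would emit them.
const brokenReply = String.raw`{"question": "Why is \sum_i G(x)_i E_i(x) sparse?"}`
const escapedReply = String.raw`{"question": "Why is \\sum_i G(x)_i E_i(x) sparse?"}`
const silentReply = String.raw`{"question": "\frac{a}{b}"}`

type ParseResult = { ok: true; value: unknown } | { ok: false; error: string }

// Hypothetical helper: parse without throwing, so a slot can record the failure.
function tryParse(text: string): ParseResult {
  try {
    return { ok: true, value: JSON.parse(text) }
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) }
  }
}

console.log(tryParse(brokenReply).ok)  // false: "\s" is not a valid JSON escape
console.log(tryParse(escapedReply).ok) // true: "\\s" decodes to a literal backslash + "s"
console.log(tryParse(silentReply).ok)  // true, but "\f" became a form-feed character
```

Selecting prose chunks avoids the problem at the source; catching the parse error per slot, as sketched here, is what lets a failed slot be counted as a reject instead of aborting the run.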
 
 ## The judge is reliable (checked before trusting any gap)
 
@@ -54,70 +61,94 @@ each: `deepseek-v4-flash` returned strong `[1.00, 1.00, 1.00]` (mean 1.00) vs we
 measured gap reflects answer quality, not judge noise. (`gemini-2.5-flash` as judge threw parse
 errors — `deepseek` is the better grader here.)
 
-## The result — the gap opens, examples are accepted
-
-**Memorized doc (Transformer paper), recall challenger — reproduces the null:** mean gap **0.117**,
-**0 accepted**; the weak solver scored 0.68–0.78 (it has the content memorized — reading beats
-reasoning).
-
-**Non-memorized doc (Mixtral), non-extractive causal challenger — three runs, NOT consistent:**
-
-| run | accepted | gap widening (plain → refined) | note |
-|---|---|---|---|
-| target=3, samples=2, maxRetries=3 | **1 / 3** | 0.306 → 0.508 (Δ +0.202) | fold steered a too-easy draft (weak 0.78) to an accepted one (weak 0.24) |
-| target=1, samples=3, maxRetries=4 | **1 / 1** | — | first causal draft already separated |
-| **target=3 — independent re-run** | **0 / 3** | 0.052 → 0.246 (Δ +0.194) | gap widened the same, but **no slot cleared the bar**; weak scored **0.75** on a near-miss — a competent, correct answer, not a struggle |
-
-**What reproduces:** the +0.19–0.20 gap-widening from the fold (both runs). **What does not:** the
-accepted count (0 to 2 of 3). The accept bar requires the weak model to *struggle* (< 0.5), and on
-these MoE-reasoning questions `llama-3.1-8b` is too often competent (0.75) to fall below it — so
-acceptance is close to a coin-flip at n=3. Total live spend ≈ **$0.25** across all runs.
+## The powered result — a real ~38% accept-rate
 
-## An autopsied accepted example (real discrimination, both answers read)
+**Design (fixed-slots, not until-N-accepted):** run a fixed K = 32 independent slots (each slot = one
+full challenger → refine → accept cycle), split 16 / 16 across the two docs, with samples = 4 per
+solver (to stabilise the weak mean) and maxRetries = 2 (up to 3 challenger attempts per slot). Fixing
+K up front bounds the cost and keeps the rate unbiased, because the run cannot stop on a lucky
+streak. Each slot records its outcome (accept / reject) and its best gap. Runnable:
+`src/autodata/powered.ts`, with a per-attempt autopsy JSONL per doc; the CIs are agent-eval's
+published estimators (`wilson` for the binomial accept-rate, `pairedBootstrap` for the paired widening).
 
-> **Q:** Walk through how the MoE layer processes a single token. If the router's gating network were
-> broken and always output uniform weights (G(x)_i = 1/8 for all 8 experts), how would the layer's
-> output differ from the intended behavior, and why is this failure mode problematic?
-
-- **strong (`gemini-2.5-pro`): [1.00, 1.00, 1.00]** — walks through top-2 routing, then derives that
-  uniform weights make the layer average ALL 8 experts (dense, no specialization/sparsity), losing
-  the point of the MoE. Correct.
-- **weak (`llama-3.1-8b`): [0.21, 0.27], mean 0.24** — restates the routing steps but does NOT derive
-  the failure consequence; it never reaches "all experts averaged → specialization lost."
-
-When the gap *does* open, it is real discrimination — not a judge artifact (judge verified above) or
-leakage (the answer is not in the context). **But it does not open reliably.** In the independent
-re-run, the analogous near-miss question drew a *competent* weak answer (0.75): `llama-3.1-8b`
-correctly explained that high positional locality routes consecutive tokens to the same expert →
-over-subscription, and that uniform routing would balance the load. On that draw the 8B reasoned
-fine, so weak ≮ 0.5 and nothing was accepted. The weak model's competence on these questions is the
-variance that makes acceptance a coin-flip.
+| metric | value | read |
+|---|---|---|
+| **accept-rate (headline)** | **38%  CI [23%, 55%]** (12 / 32) | excludes ~0 → **reliable, not a coin-flip** |
+| accept-rate (producing slots) | 44%  CI [28%, 63%] (12 / 27) | excludes the 5 challenger-stage (LaTeX) failures |
+| — mixtral | 19%  CI [7%, 43%]  (3 / 16) | the harder doc; still excludes 0 |
+| — deepseek-v3 | 56%  CI [33%, 77%] (9 / 16) | the easier-to-discriminate doc |
+| best gap / slot (n=27) | min −0.23 · median **0.42** · p90 0.80 · max 0.95 | how far each slot separated the tiers |
+| plain (first-draft) gap / slot | min −0.23 · median 0.19 · p90 0.61 · max 0.95 | the un-refined baseline |
+| **gap-widening Δ (plain → best-refined)** | mean **+0.103**  CI [+0.029, +0.193] (paired bootstrap, n=27) | the fold's lift; **excludes 0** (median Δ 0 — it helps a minority) |
+| weak score / attempt (n=33) | min 0.05 · median **0.55** · max 1.00 | the variance source: at or above the 0.50 bar on ~61% of attempts |
+| strong score / attempt (n=33) | min 0.21 · median **0.99** · max 1.00 | the strong solver almost always derives |
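
The accept-rate intervals in the table can be checked by hand with the standard Wilson score formula. The sketch below is an independent re-derivation at z = 1.96; `wilsonInterval` and `pct` are hypothetical names and are not agent-eval's `wilson` export, whose signature is not shown in this diff.

```typescript
interface Interval { rate: number; lo: number; hi: number }

// Wilson score interval for a binomial proportion (z = 1.96 gives ~95% coverage).
function wilsonInterval(successes: number, trials: number, z = 1.96): Interval {
  if (trials <= 0) throw new Error('trials must be positive')
  const p = successes / trials
  const z2 = z * z
  const denom = 1 + z2 / trials
  const centre = (p + z2 / (2 * trials)) / denom
  const half = (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom
  return { rate: p, lo: centre - half, hi: centre + half }
}

const pct = (x: number): string => `${Math.round(x * 100)}%`

// Accepted / total counts copied from the table rows.
const rows: Array<[string, number, number]> = [
  ['headline', 12, 32],
  ['producing slots', 12, 27],
  ['mixtral', 3, 16],
  ['deepseek-v3', 9, 16],
]

for (const [label, accepted, total] of rows) {
  const { rate, lo, hi } = wilsonInterval(accepted, total)
  console.log(`${label}: ${pct(rate)} CI [${pct(lo)}, ${pct(hi)}]`)
}
```

Rounded to whole percents, these four calls reproduce the intervals reported in the table.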
+
+**Accept-rule decomposition (33 quality-clean attempts):** strong ≥ 0.65 = **88%**, weak < 0.50 =
+**39%** ← the binding gate, gap ≥ 0.20 = 52%, all-three (= accept) = 36%. The strong solver derives
+almost everything; the bottleneck is the weak model failing — which happens on only ~39% of
+attempts, so the per-slot accept-rate is set by **how often `llama-3.1-8b` actually struggles**, not
+by the challenger or judge. **Total live spend: $0.57** for the 32-slot run (~$1.0 including pilots).
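
The accept rule and its per-gate decomposition can be sketched as follows. The thresholds are the ones quoted in this doc; the `Attempt` shape, the function names, and the sample scores are hypothetical (the first two attempts echo the autopsied examples, the rest are made up to exercise each gate), so the shares printed here are not the run's numbers.

```typescript
interface Attempt { weak: number; strong: number } // mean judge score per solver

const THRESHOLDS = { weakMax: 0.5, strongMin: 0.65, gapMin: 0.2 }

// Evaluate each gate separately so the binding one can be identified.
function gates(a: Attempt) {
  const weakStruggles = a.weak < THRESHOLDS.weakMax
  const strongDerives = a.strong >= THRESHOLDS.strongMin
  const gapOpen = a.strong - a.weak >= THRESHOLDS.gapMin
  return { weakStruggles, strongDerives, gapOpen, accept: weakStruggles && strongDerives && gapOpen }
}

// Share of attempts passing each gate, and all three together.
function decompose(attempts: Attempt[]) {
  const share = (pick: (g: ReturnType<typeof gates>) => boolean): number =>
    attempts.filter((a) => pick(gates(a))).length / attempts.length
  return {
    strongDerives: share((g) => g.strongDerives),
    weakStruggles: share((g) => g.weakStruggles),
    gapOpen: share((g) => g.gapOpen),
    accept: share((g) => g.accept),
  }
}

// Illustrative data only.
const sample: Attempt[] = [
  { weak: 0.07, strong: 1.0 },  // weak fails, strong derives: accepted
  { weak: 0.05, strong: 1.0 },  // accepted
  { weak: 0.75, strong: 0.99 }, // gap is open, but the weak solver is competent: rejected
  { weak: 0.55, strong: 0.6 },  // neither solver clears its gate
  { weak: 0.4, strong: 0.5 },   // weak struggles, but so does the strong solver
]

console.log(decompose(sample))
```

Reporting the gates separately is what makes the bottleneck visible: an attempt can open a wide gap and still be rejected because the weak score sits above 0.50.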
+
+## Two autopsied accepted examples (real discrimination, both answers read)
+
+**deepseek-v3 — gap 0.93 (weak 0.07, strong 1.00):**
+> **Q:** Why does using a *sequence-wise* auxiliary loss lead to a higher validation loss than a
+> *batch-wise* auxiliary loss or the auxiliary-loss-free method in MoE models?
+
+- **strong (`gemini-2.5-pro`): 1.00** — derives that the sequence-wise loss imposes a *stricter,
+  less flexible* per-sequence balance constraint that *hinders the emergence of expert
+  specialisation*. Correct, matches the reference.
+- **weak (`llama-3.1-8b`): [0.10, 0.03, 0.10, 0.03]** — *restates the question* and never derives the
+  reason. A recall-shaped non-answer; the judge's `reasoning` criterion floors it.
+
+**mixtral — gap 0.95 (weak 0.05, strong 1.00):**
+> **Q:** The text says each input is routed to 2 of 8 experts, yet the output sums `G(x)_i · E_i(x)`
+> over all `n` experts. Are these consistent? If not, which should be revised?
+
+- **strong: 1.00** — derives YES, consistent: the gating vector `G(x)` is *sparse* (nonzero only for
+  the 2 selected experts), so the full-`n` sum effectively includes only those 2. Correct.
+- **weak: [0.03, 0.07, 0.03, 0.07]** — concludes the statements are *inconsistent*; it never grasps
+  the sparse-gating equivalence. A genuine reasoning error, not a judge artifact or leakage (the
+  answer is derived, not in the context).
+
+These are real weak-fails-strong-derives examples on both docs — the loop is manufacturing genuine
+discrimination, not gaming the gap.
 
 ## The finding
 
-The two levers are **directionally confirmed and necessary**: a non-extractive causal challenger
-(no leakage) AND a grounding doc the weak solver hasn't memorized — drop either and it nulls hard
-(recall challenger leaks; the memorized Transformer paper lets the 8B recall). With both, the fold
-**reliably widens the strong/weak gap by ~+0.20** (reproduced in both runs).
-
-But "the discriminative reward works" is **NOT** established. Clearing the accept bar (weak must
-*struggle*, < 0.5) is noisy: 0–2 accepted of 3 across runs, because `llama-3.1-8b` answers these
-MoE-reasoning questions competently (0.75) about as often as it flails (0.24). At n=3 that is a
-coin-flip, not a result. Honest verdict: **promising, directionally right, under-powered** — the
-exact small-n shape that has repeatedly looked positive here and washed out at power.
+The question "does the causal-challenger loop reliably manufacture discriminating examples, or is
+acceptance a coin-flip ~0?" is now **settled at power: it reliably works.** Accept-rate **38%, CI
+[23%, 55%]** over 32 slots — the lower bound excludes ~0, and even the harder of the two docs
+(mixtral, 19% [7%, 43%]) excludes 0. The fold also **reliably widens the gap** (mean +0.103, CI
+[+0.029, +0.193]), reproducing the n=3 direction at power. Most of the discrimination, however, comes
+from the first causal draft already separating: the median widening is 0, so the refine step helps
+only a minority of slots.
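
A paired bootstrap over per-slot widening can be sketched as below, assuming a plain percentile interval on the mean of the paired deltas. This is not agent-eval's `pairedBootstrap`, whose signature and resampling details are not shown in this diff; the function names, the seeded PRNG, and the eight sample slots are all hypothetical.

```typescript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Percentile bootstrap CI for the mean of (refined - plain), resampling whole pairs.
function pairedBootstrapCI(plain: number[], refined: number[], resamples = 2000, seed = 1) {
  if (plain.length !== refined.length || plain.length === 0) {
    throw new Error('need equal-length, non-empty pairs')
  }
  const deltas = refined.map((r, i) => r - plain[i])
  const n = deltas.length
  const rand = mulberry32(seed)
  const means: number[] = []
  for (let b = 0; b < resamples; b++) {
    let sum = 0
    for (let i = 0; i < n; i++) sum += deltas[Math.floor(rand() * n)]
    means.push(sum / n)
  }
  means.sort((x, y) => x - y)
  const at = (q: number): number => means[Math.min(means.length - 1, Math.floor(q * means.length))]
  const meanDelta = deltas.reduce((s, d) => s + d, 0) / n
  return { meanDelta, lo: at(0.025), hi: at(0.975) }
}

// Illustrative gaps only: six slots unchanged by the refine step, two lifted.
const plainGap = [0.19, 0.42, 0.05, 0.61, 0.3, 0.1, 0.8, 0.25]
const refinedGap = [0.19, 0.42, 0.45, 0.61, 0.3, 0.5, 0.8, 0.25]

console.log(pairedBootstrapCI(plainGap, refinedGap))
```

The sample has the same shape as the reported result: the median delta is 0 because most slots are unchanged, while the mean is positive because a minority of slots are lifted substantially. Resampling whole pairs keeps each slot's plain and refined gap together, which is what makes the comparison paired.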
+
+Two honest caveats, both quantified; neither overturns the verdict:
+
+1. **Doc-dependence.** The rate ranges 19% (mixtral) → 56% (deepseek-v3). The pooled 38% is a real
+   average across two non-memorized MoE papers, not a single lucky doc — but expect the rate to move
+   with the source material's difficulty for the 8B.
+2. **The binding constraint is the weak model's competence, not the method.** `llama-3.1-8b` answers
+   these MoE-reasoning questions competently (weak median 0.55) somewhat more often than it flails,
+   so only ~39% of attempts clear the "weak must struggle" gate. A weaker weak model (or harder docs)
+   would raise the rate; a stronger one would lower it. The loop's discriminative reward works as
+   designed: the rate is a property of the **tier gap**, which is exactly what it should measure.
 
 ## Status
 
-Mechanism + observability: solid (gap-widening reproduced, judge reliability checked, every attempt
-dumped to a JSONL autopsy trail via `AUTODATA_ATTEMPTS` — which is how the over-claim was caught).
-Empirical positive: **not yet** — acceptance is too noisy at n=3. To actually settle it: raise
-`samples` (stabilize the weak mean per question), raise the slot count to n≥24, and report the
-*accepted-rate* with a confidence interval — not a single lucky run. Until then this is a confirmed
-direction, not a confirmed win.
+Mechanism + observability + **power**: solid. The accept-rate is measured at n=32 with a Wilson CI
+that excludes ~0, the gap-widening with a paired-bootstrap CI that excludes 0, every attempt dumped
+to a JSONL autopsy trail, and the two headline accepted examples read end-to-end (real
+discrimination). The n=3 "coin-flip ~0?" worry is **resolved: ~38% accept-rate, not zero.**
 
 ## Reproduce
 
 ```
-dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/run.ts        # causal, default Mixtral doc
-dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/calibrate.ts  # recall-vs-causal A/B, same doc
+# Powered accept-rate + CIs (32 slots, 2 docs, samples=4) — the headline result:
+dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/powered.ts
+#   knobs: AUTODATA_SLOTS_PER_DOC=16  AUTODATA_SAMPLES=4  AUTODATA_MAXRETRIES=2
+
+# Single-doc builder + recall-vs-causal calibration (the lever's A/B):
+dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/run.ts
+dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/calibrate.ts
 ```
diff --git a/src/autodata/index.ts b/src/autodata/index.ts
@@ -36,6 +36,7 @@ export {
   type GroundedDoc,
   groundDoc,
 } from './grounding'
+export { analyzeTrails, type DocTrail, type PoweredStats } from './powered'
 export {
   type AutodataRoles,
   buildAutodataRoles,