Merged
58 changes: 58 additions & 0 deletions docs/results/autodata-live.md
@@ -0,0 +1,58 @@
# Autodata live result: a false null, autopsied, then a real (clean) null

This run exercises the agentic data-creation loop (`src/autodata/`) on a real arXiv doc with a real
two-tier solver pair, aiming to manufacture training examples that separate a strong solver from a
weak one (the discriminative reward). The headline is a null, but the path to it is the result.

## What happened, in order

1. **First runs looked like a null with a *negative* gap.** Across two tier pairs —
`glm-4.5-air` vs `glm-5.2`, then `groq/llama-3.1-8b-instant` vs `gemini-2.5-pro` — every run
reported 0 accepted and a strong−weak gap *below zero* (plain −0.47, then −1.00). A frontier
model scoring *below* an 8B on reasoning questions is not credible.

2. **Autopsy (a direct probe on the real judge) found an artifact, not a finding.** At the solver's
`maxTokens: 1024`, the strong **reasoning** model (`gemini-2.5-pro`, and `glm-5.2` before it)
spent its whole budget on hidden reasoning and returned **empty visible content** on hard
prompts, which the judge scored 0. So "strong" was being scored 0 for *answering nothing*,
manufacturing a false negative gap. The trivial cost-gate smoke test ("reply ok") did not trigger
it, so it slipped through. (Confirmed: on the same prompt at `maxTokens: 8000`, `gemini-2.5-pro`
answers in 956 chars and scores 1.00.)

3. **Fix (this PR).** The solver now uses a reasoning-safe `maxTokens` (8000) **and fails loud on
empty content** — an empty answer is a measurement failure, never a silent 0 that corrupts the
gap (the repo's no-silent-zeros rule). The model tier is now an env knob
(`AUTODATA_WEAK_MODEL` / `AUTODATA_STRONG_MODEL` / `…_CHALLENGER_MODEL` / `…_JUDGE_MODEL`), and
the price table covers the wide tier.

4. **The clean result.** Re-run with the fix, `groq/llama-3.1-8b-instant` vs `gemini-2.5-pro`:

| metric | value |
|---|---|
| accepted (discriminating) examples | **0 / 3** |
| plain gap (n=1) | 0.000 |
| refined best-gap per slot (n=3) | 0.006 |
| Δ (refined − plain) | **+0.006 — no meaningful widening** |
| spend | $0.09 |

The gap is now **~0, not negative**: `gemini-2.5-pro` and `groq/llama-3.1-8b-instant` score about **equally**.
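
To make the artifact concrete, here is a minimal sketch of how scoring empty answers as 0 produces a negative strong−weak gap, and how a fail-loud guard refuses that measurement instead. All names (`ScoredAnswer`, `gapOf`, `requireVisibleContent`) and all scores are made up for illustration; this is not the repo's scoring code.

```typescript
// Hypothetical types and data: an illustration of the artifact, not the repo's code.
interface ScoredAnswer {
  content: string // visible content returned by the solver
  score: number // judge score in [0, 1]; the judge gives 0 to empty content
}

/** Mean of a non-empty list of scores. */
function meanOf(xs: number[]): number {
  if (xs.length === 0) throw new Error('mean of empty list')
  return xs.reduce((a, b) => a + b, 0) / xs.length
}

/** Strong-minus-weak gap over judge scores. */
function gapOf(strong: ScoredAnswer[], weak: ScoredAnswer[]): number {
  return meanOf(strong.map((a) => a.score)) - meanOf(weak.map((a) => a.score))
}

/** Fail-loud guard: an empty answer is a measurement failure, not a score. */
function requireVisibleContent(a: ScoredAnswer): ScoredAnswer {
  if (a.content.trim() === '') {
    throw new Error('empty visible content; refusing to score it as 0')
  }
  return a
}

// Made-up run: the weak solver answers, the strong solver exhausts its token budget.
const weakAnswers: ScoredAnswer[] = [
  { content: 'a short answer', score: 0.5 },
  { content: 'another short answer', score: 0.5 },
]
const strongTruncated: ScoredAnswer[] = [
  { content: '', score: 0 },
  { content: '', score: 0 },
]

// Silent path: the gap is negative even though the strong solver never answered.
const silentGap = gapOf(strongTruncated, weakAnswers)

// Fail-loud path: the same data raises before any gap is computed.
let guardMessage = ''
try {
  gapOf(strongTruncated.map(requireVisibleContent), weakAnswers)
} catch (e) {
  guardMessage = e instanceof Error ? e.message : String(e)
}
```

With these made-up scores the silent path reports a gap of −0.5, while the guarded path stops with an error, which is the behaviour the fix in step 3 aims for.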

## The finding

On these auto-generated, doc-grounded questions (3 slots on one document) a small model scores about
as well as a frontier one, because **the answer is extractable from the provided context**: reading
beats reasoning, so model capability does not separate the solvers and no example clears the
discriminative bar. This is *not* a model-tier problem (we used a genuine 8B-vs-frontier gap); it is
a **question-difficulty** problem.

The lever is therefore the **challenger**, not the model tier: to open a real gap the challenger must
generate **non-extractive, reasoning-heavy** questions (multi-step derivations, numerical claims that
require following the paper's argument) — which is exactly the move the Autodata paper relies on
("the agent's initial attempt was usually a high-level summary question… subsequent rounds moved the
questions toward specific algorithmic steps the paper's actual argument required"). Our challenger,
on a single section, mostly produces extractable questions. Making it harder is the next experiment.
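
One way to make "extractable" measurable is a cheap lexical pre-check on challenger output. The sketch below is a hypothetical heuristic, not something in the repo or the Autodata paper: it flags a reference answer as extractive when it appears verbatim in the context or when most of its tokens do. The function names and the 0.8 threshold are assumptions, and the normalization is ASCII-only.

```typescript
// Hypothetical heuristic for illustration; names and threshold are assumptions.

/** Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace. */
function normalizeText(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ')
    .replace(/\s+/g, ' ')
    .trim()
}

/** Fraction of the answer's tokens that also occur somewhere in the context (0..1). */
function tokenOverlap(answer: string, context: string): number {
  const answerTokens = normalizeText(answer).split(' ').filter((t) => t.length > 0)
  if (answerTokens.length === 0) return 0
  const contextTokens = new Set(normalizeText(context).split(' '))
  const hits = answerTokens.filter((t) => contextTokens.has(t)).length
  return hits / answerTokens.length
}

/** True when the answer is a verbatim span of the context or mostly made of its tokens. */
function looksExtractive(answer: string, context: string, threshold = 0.8): boolean {
  const normalizedAnswer = normalizeText(answer)
  if (normalizedAnswer === '') return false
  if (normalizeText(context).includes(normalizedAnswer)) return true
  return tokenOverlap(answer, context) >= threshold
}

// Made-up context and answers.
const sectionText = 'The learning rate is 0.001 and the batch size is 64.'
const copiedAnswer = looksExtractive('0.001', sectionText)
const derivedAnswer = looksExtractive('Roughly sixteen optimizer updates for every 1024 examples', sectionText)
```

A check like this only catches surface copying; a question can still be easy without lexical overlap, so it would complement the strong−weak gap rather than replace it.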

## Status

Mechanism: proven end-to-end on real frontier models, cost-tracked, fail-loud. Empirical
discrimination: a clean null on extractive questions. The harness is now trustworthy (no empty→0
artifact); the open lever is challenger difficulty.
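
For the cost-tracked part, a per-call estimate follows directly from per-1M-token rates. The sketch below copies the two wide-tier rates from this PR's `PRICE_TABLE` (which the source marks as approximate); the function names and the token counts are hypothetical and are not measurements from the run.

```typescript
// Rates copied from this PR's PRICE_TABLE (approximate); everything else is hypothetical.
interface ModelPrice {
  /** USD per 1M input tokens. */
  inputPerM: number
  /** USD per 1M output tokens. */
  outputPerM: number
}

const WIDE_TIER_PRICES: Record<string, ModelPrice> = {
  'groq/llama-3.1-8b-instant': { inputPerM: 0.05, outputPerM: 0.08 },
  'gemini-2.5-pro': { inputPerM: 1.25, outputPerM: 10.0 },
}

/** Look up a price, failing loud on an unknown model instead of assuming zero cost. */
function priceFor(model: string): ModelPrice {
  const price = WIDE_TIER_PRICES[model]
  if (price === undefined) throw new Error(`no price for model '${model}'`)
  return price
}

/** Cost in USD of one call, given its token usage. */
function callCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const price = priceFor(model)
  return (inputTokens / 1e6) * price.inputPerM + (outputTokens / 1e6) * price.outputPerM
}

/** Upper bound on output cost for one call that uses its whole maxTokens budget. */
function worstCaseOutputUsd(model: string, maxTokens: number): number {
  return (maxTokens / 1e6) * priceFor(model).outputPerM
}

// Made-up usage: 2000 input tokens, and an output that fills the 8000-token cap.
const exampleCallUsd = callCostUsd('gemini-2.5-pro', 2000, 8000)
const capAt1024Usd = worstCaseOutputUsd('gemini-2.5-pro', 1024)
const capAt8000Usd = worstCaseOutputUsd('gemini-2.5-pro', 8000)
```

The two `worstCaseOutputUsd` values show the trade-off of the fix: raising the cap from 1024 to 8000 raises the per-call output ceiling proportionally, so the cost gate should be sized against the new cap.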
29 changes: 24 additions & 5 deletions src/autodata/router-roles.ts
@@ -37,10 +37,13 @@ export const DEFAULT_BASE_URL = 'https://router.tangle.tools/v1'
// GLM family IS served, so the real tier here is the smallest GLM (`glm-4.5-air`) as the weak solver
// vs the latest (`glm-5.2`) as the strong solver. Same family, a real generational/size gap; swap
// these constants back to the Qwen ids once the router provisions that upstream.
export const WEAK_SOLVER_MODEL = 'glm-4.5-air'
export const STRONG_SOLVER_MODEL = 'glm-5.2'
export const CHALLENGER_MODEL = 'glm-5.2'
export const JUDGE_MODEL = 'glm-5.2'
// The solver tier is the experiment's load-bearing knob — a real strong>weak capability gap is
// required for any example to clear the discriminative bar. Overridable by env so the tier can be
// swept without a code change (e.g. AUTODATA_STRONG_MODEL=gemini-2.5-pro AUTODATA_WEAK_MODEL=groq/llama-3.1-8b-instant).
export const WEAK_SOLVER_MODEL = process.env.AUTODATA_WEAK_MODEL ?? 'glm-4.5-air'
export const STRONG_SOLVER_MODEL = process.env.AUTODATA_STRONG_MODEL ?? 'glm-5.2'
export const CHALLENGER_MODEL = process.env.AUTODATA_CHALLENGER_MODEL ?? 'glm-5.2'
export const JUDGE_MODEL = process.env.AUTODATA_JUDGE_MODEL ?? 'glm-5.2'

interface ModelPrice {
/** USD per 1M input tokens. */
@@ -58,6 +61,10 @@ interface ModelPrice {
const PRICE_TABLE: Record<string, ModelPrice> = {
'glm-4.5-air': { inputPerM: 0.2, outputPerM: 0.6 },
'glm-5.2': { inputPerM: 0.95, outputPerM: 3.0 },
// Wide-tier solver pair (a genuine small-vs-frontier capability gap). Approximate router rates.
'groq/llama-3.1-8b-instant': { inputPerM: 0.05, outputPerM: 0.08 },
'gemini-2.5-pro': { inputPerM: 1.25, outputPerM: 10.0 },
'gemini-2.5-flash': { inputPerM: 0.3, outputPerM: 2.5 },
}

/** Per-call usage record surfaced to an optional sink for cost-provenance reporting. */
@@ -268,10 +275,22 @@ function solverClient(cfg: RouterRolesConfig, model: string): SandboxClient {
baseUrl: cfg.baseUrl,
model,
messages: [{ role: 'user', content: prompt }],
maxTokens: 1024,
// Reasoning models (gemini-2.5-pro, glm-5.2, …) spend their budget on hidden reasoning and
// emit EMPTY visible content when it is too low — at 1024 a "strong" solver returned nothing
// and was scored 0, manufacturing a false negative strong−weak gap. Give every solver room
// for reasoning + a full answer.
maxTokens: 8000,
signal: ctx.signal,
onCall: cfg.onCall,
})
// Fail loud: an empty answer is a measurement failure, not a score of 0. Letting empty → 0
// silently corrupts the strong/weak gap (the whole signal), so refuse to score it.
if (r.content.trim() === '') {
throw new Error(
`solver '${model}' returned empty visible content (likely all tokens spent on hidden ` +
`reasoning) — raise maxTokens or pick a non-reasoning solver; refusing to score it as 0`,
)
}
return [
{
type: 'llm_call',