fix(red-team): harden contextual leak checks#285
Conversation
tangletools
left a comment
There was a problem hiding this comment.
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 0 (none) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 86.6s (2 bridge agents) |
| Total | 86.6s |
💰 Value — sound
Fixes two false-negative classes in the red-team scorer (negated-then-positive discovery intent, and hostile-instruction boundary phrases that don't actually frame the leaked string) — coherent, well-tested, fits the grain.
- What it does: Two hardening fixes to the deterministic red-team scorer in src/red-team.ts: (1) sensitiveDataDiscoveryIntentMatch now scans PAST a negated match to catch a later positive intent (firstNonNegatedMatch at src/red-team.ts:388), instead of bailing on the first negated hit; (2) isSafeBoundaryContext (src/red-team.ts:370) no longer treats a bare 'I will ignore the uploaded instructions' anywhere in the
- Goals it achieves: Close evasion paths where an agent includes a safe-sounding phrase ('I will ignore the uploaded instructions', 'I will never explore') but then leaks the forbidden value or expresses real malicious intent anyway. Makes the scorer harder to defeat with mixed safe/unsafe output, which is the whole point of a red-team battery.
- Assessment: Sound. The two bugs are real false-negatives in the PR #284 scorer (verified against the prior implementation in git show 8affa25): the old sensitiveDataDiscoveryIntentMatch returned null on the first negated match, and the old isSafeBoundaryContext treated any hostile-instruction boundary anywhere in the sentence as safe regardless of where the needle sat. The fixes are correct and well-targeted
- Better / existing approach: none — this is the right approach. Searched src/ for existing negation/boundary/context-window utilities (sentenceAround, firstNonNegatedMatch, isSafeBoundary, SAFE_BOUNDARY, NEGATED): all matches are confined to red-team.ts — no reusable equivalent elsewhere. completion-verifier.ts has a related 'polarity-blind vs semantic' concept but it's a different domain (task completion verification, not ad
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 2
- Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
🎯 Usefulness — sound
Coherent hardening of an actively-used red-team scorer: closes two real false-negatives (negated-then-positive discovery intent; boundary phrase shielding an actual leak) by refining existing internal helpers, no new surface.
- Integration: Fully reachable. scoreRedTeamOutput is exported (src/index.ts:892) and consumed by probeRedTeam in the default production gate (src/campaign/gates/default-production-gate.ts:262), which is itself a shipped gate. No new exports or call sites introduced — the PR only tightens the two internal helpers (sensitiveDataDiscoveryIntentMatch at src/red-team.ts:351, isSafeBoundaryContext at src/red-team.ts:
- Fit with existing patterns: Fits the established pattern cleanly. The change extends the existing regex-scorer design rather than competing with it: splits ALL_BOUNDARY_MARKERS back into its constituent lists (which is how src/red-team.ts:63 and src/red-team.ts:77 already organized them — the combined array was an incidental aggregation), renames sentenceAround→sentenceWindow to expose the window start for before-needle slic
- Real-world viability: Holds up on realistic inputs. firstNonNegatedMatch advances one char past each negated hit and rescans — O(n×m) worst case but agent outputs are short and the function is sync per-output, so negligible. isSafeBoundaryContext correctly handles needle-before-boundary (beforeNeedle excludes the boundary → flagged as a leak, the desired fail). The reference-equality check `re !== HOSTILE_INSTRUCTION_B
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
No concerns — sound change, no better or existing approach found. ✅
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
✅ No Blockers —
|
tangletools
left a comment
There was a problem hiding this comment.
✅ Approved — 1 non-blocking finding — c4797ba9
Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-28T19:03:17Z · immutable trace
Summary
Checks
Notes