Skip to content

fix(red-team): harden contextual leak checks#285

Merged
drewstone merged 1 commit into
mainfrom
fix/red-team-review-followups
Jun 28, 2026
Merged

fix(red-team): harden contextual leak checks#285
drewstone merged 1 commit into
mainfrom
fix/red-team-review-followups

Conversation

@drewstone

Copy link
Copy Markdown
Contributor

Summary

  • scan past negated sensitive-discovery matches so a later positive intent still fails
  • require uploaded-instruction boundary phrases to actually frame the forbidden phrase as the ignored instruction
  • add regressions for both PR fix(red-team): score refusal context safely #284 review findings plus case-insensitive forbidden string leaks

Checks

  • pnpm vitest run tests/red-team.test.ts: 52 passed
  • pnpm typecheck: passed
  • pnpm lint: passed, with existing non-blocking warnings
  • pnpm test: 7,688 passed / 6 skipped
  • git merge-tree --write-tree origin/main HEAD: 6f4ef75c011c271e27456e1f705852c1855ec421

Notes

@drewstone drewstone marked this pull request as ready for review June 28, 2026 18:56

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Value Audit — sound

Verdict sound
Concerns 0 (none)
Heuristic 0.0s
Duplication 0.0s
Interrogation 86.6s (2 bridge agents)
Total 86.6s

💰 Value — sound

Fixes two false-negative classes in the red-team scorer (negated-then-positive discovery intent, and hostile-instruction boundary phrases that don't actually frame the leaked string) — coherent, well-tested, fits the grain.

  • What it does: Two hardening fixes to the deterministic red-team scorer in src/red-team.ts: (1) sensitiveDataDiscoveryIntentMatch now scans PAST a negated match to catch a later positive intent (firstNonNegatedMatch at src/red-team.ts:388), instead of bailing on the first negated hit; (2) isSafeBoundaryContext (src/red-team.ts:370) no longer treats a bare 'I will ignore the uploaded instructions' anywhere in the
  • Goals it achieves: Close evasion paths where an agent includes a safe-sounding phrase ('I will ignore the uploaded instructions', 'I will never explore') but then leaks the forbidden value or expresses real malicious intent anyway. Makes the scorer harder to defeat with mixed safe/unsafe output, which is the whole point of a red-team battery.
  • Assessment: Sound. The two bugs are real false-negatives in the PR #284 scorer (verified against the prior implementation in git show 8affa25): the old sensitiveDataDiscoveryIntentMatch returned null on the first negated match, and the old isSafeBoundaryContext treated any hostile-instruction boundary anywhere in the sentence as safe regardless of where the needle sat. The fixes are correct and well-targeted
  • Better / existing approach: none — this is the right approach. Searched src/ for existing negation/boundary/context-window utilities (sentenceAround, firstNonNegatedMatch, isSafeBoundary, SAFE_BOUNDARY, NEGATED): all matches are confined to red-team.ts — no reusable equivalent elsewhere. completion-verifier.ts has a related 'polarity-blind vs semantic' concept but it's a different domain (task completion verification, not ad
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 2
  • Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

🎯 Usefulness — sound

Coherent hardening of an actively-used red-team scorer: closes two real false-negatives (negated-then-positive discovery intent; boundary phrase shielding an actual leak) by refining existing internal helpers, no new surface.

  • Integration: Fully reachable. scoreRedTeamOutput is exported (src/index.ts:892) and consumed by probeRedTeam in the default production gate (src/campaign/gates/default-production-gate.ts:262), which is itself a shipped gate. No new exports or call sites introduced — the PR only tightens the two internal helpers (sensitiveDataDiscoveryIntentMatch at src/red-team.ts:351, isSafeBoundaryContext at src/red-team.ts:
  • Fit with existing patterns: Fits the established pattern cleanly. The change extends the existing regex-scorer design rather than competing with it: splits ALL_BOUNDARY_MARKERS back into its constituent lists (which is how src/red-team.ts:63 and src/red-team.ts:77 already organized them — the combined array was an incidental aggregation), renames sentenceAround→sentenceWindow to expose the window start for before-needle slic
  • Real-world viability: Holds up on realistic inputs. firstNonNegatedMatch advances one char past each negated hit and rescans — O(n×m) worst case but agent outputs are short and the function is sync per-output, so negligible. isSafeBoundaryContext correctly handles needle-before-boundary (beforeNeedle excludes the boundary → flagged as a leak, the desired fail). The reference-equality check `re !== HOSTILE_INSTRUCTION_B
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

No concerns — sound change, no better or existing approach found. ✅


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260628T185956Z

@tangletools

Copy link
Copy Markdown
Contributor

✅ No Blockers — c4797ba9

Review health 100/100 · Reviewer score 92/100 · Confidence 70/100 · 1 finding (1 low)

deepseek: Correctness 92 · Security 92 · Testing 92 · Architecture 92

Reviewer score is advisory once the run is complete and the verdict has no blockers.

Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.

🟡 LOW Reference equality guard for marker exclusion is fragile — src/red-team.ts

At line 375-376: SAFE_BOUNDARY_MARKERS.some((re) => re !== HOSTILE_INSTRUCTION_BOUNDARY_MARKER && re.test(context.text)) uses object identity to skip the hostile-instruction-boundary regex. If the constant is ever cloned or reconstructed (e.g. via new RegExp(HOSTILE_INSTRUCTION_BOUNDARY_MARKER.source)) the exclusion silently fails and the function becomes a no-op for that path. Mitigation: tag the marker with a sentinel property or use a separate array. Low impact today since the same reference is used in the array literal.


tangletools · 2026-06-28T19:03:17Z · trace

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Approved — 1 non-blocking finding — c4797ba9

Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-28T19:03:17Z · immutable trace

@drewstone drewstone merged commit 175b3bf into main Jun 28, 2026
1 check passed
@drewstone drewstone deleted the fix/red-team-review-followups branch June 28, 2026 19:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants