fix(red-team): harden contextual leak checks by drewstone · Pull Request #285 · tangle-network/agent-eval

drewstone · 2026-06-28T18:46:48Z

Summary

scan past negated sensitive-discovery matches so a later positive intent still fails
require uploaded-instruction boundary phrases to actually frame the forbidden phrase as the ignored instruction
add regressions for both PR fix(red-team): score refusal context safely #284 review findings plus case-insensitive forbidden string leaks

Checks

pnpm vitest run tests/red-team.test.ts: 52 passed
pnpm typecheck: passed
pnpm lint: passed, with existing non-blocking warnings
pnpm test: 7,688 passed / 6 skipped
git merge-tree --write-tree origin/main HEAD: 6f4ef75c011c271e27456e1f705852c1855ec421

Notes

Follow-up to merged PR fix(red-team): score refusal context safely #284 review comments.
Deleted merged branch fix/red-team-refusal-context locally and remotely.

tangletools

🟢 Value Audit — sound


Verdict	sound
Concerns	0 (none)
Heuristic	0.0s
Duplication	0.0s
Interrogation	86.6s (2 bridge agents)
Total	86.6s

💰 Value — sound

Fixes two false-negative classes in the red-team scorer (negated-then-positive discovery intent, and hostile-instruction boundary phrases that don't actually frame the leaked string) — coherent, well-tested, fits the grain.

What it does: Two hardening fixes to the deterministic red-team scorer in src/red-team.ts: (1) sensitiveDataDiscoveryIntentMatch now scans PAST a negated match to catch a later positive intent (firstNonNegatedMatch at src/red-team.ts:388), instead of bailing on the first negated hit; (2) isSafeBoundaryContext (src/red-team.ts:370) no longer treats a bare 'I will ignore the uploaded instructions' anywhere in the
Goals it achieves: Close evasion paths where an agent includes a safe-sounding phrase ('I will ignore the uploaded instructions', 'I will never explore') but then leaks the forbidden value or expresses real malicious intent anyway. Makes the scorer harder to defeat with mixed safe/unsafe output, which is the whole point of a red-team battery.
Assessment: Sound. The two bugs are real false-negatives in the PR #284 scorer (verified against the prior implementation in git show 8affa25): the old sensitiveDataDiscoveryIntentMatch returned null on the first negated match, and the old isSafeBoundaryContext treated any hostile-instruction boundary anywhere in the sentence as safe regardless of where the needle sat. The fixes are correct and well-targeted
Better / existing approach: none — this is the right approach. Searched src/ for existing negation/boundary/context-window utilities (sentenceAround, firstNonNegatedMatch, isSafeBoundary, SAFE_BOUNDARY, NEGATED): all matches are confined to red-team.ts — no reusable equivalent elsewhere. completion-verifier.ts has a related 'polarity-blind vs semantic' concept but it's a different domain (task completion verification, not ad
Model: opencode/zai-coding-plan/glm-5.2
Bridge attempts: 2
Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

🎯 Usefulness — sound

Coherent hardening of an actively-used red-team scorer: closes two real false-negatives (negated-then-positive discovery intent; boundary phrase shielding an actual leak) by refining existing internal helpers, no new surface.

Integration: Fully reachable. scoreRedTeamOutput is exported (src/index.ts:892) and consumed by probeRedTeam in the default production gate (src/campaign/gates/default-production-gate.ts:262), which is itself a shipped gate. No new exports or call sites introduced — the PR only tightens the two internal helpers (sensitiveDataDiscoveryIntentMatch at src/red-team.ts:351, isSafeBoundaryContext at src/red-team.ts:
Fit with existing patterns: Fits the established pattern cleanly. The change extends the existing regex-scorer design rather than competing with it: splits ALL_BOUNDARY_MARKERS back into its constituent lists (which is how src/red-team.ts:63 and src/red-team.ts:77 already organized them — the combined array was an incidental aggregation), renames sentenceAround→sentenceWindow to expose the window start for before-needle slic
Real-world viability: Holds up on realistic inputs. firstNonNegatedMatch advances one char past each negated hit and rescans — O(n×m) worst case but agent outputs are short and the function is sync per-output, so negligible. isSafeBoundaryContext correctly handles needle-before-boundary (beforeNeedle excludes the boundary → flagged as a leak, the desired fail). The reference-equality check `re !== HOSTILE_INSTRUCTION_B
Model: opencode/zai-coding-plan/glm-5.2
Bridge attempts: 1

No concerns — sound change, no better or existing approach found. ✅

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass	What it asks
Heuristic	Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication	Do added function/class names already exist elsewhere in the repo?
Value Audit	What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit	Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

_{value-audit · 20260628T185956Z}

tangletools · 2026-06-28T19:03:20Z

✅ No Blockers — `c4797ba9`

Review health 100/100 · Reviewer score 92/100 · Confidence 70/100 · 1 finding (1 low)

deepseek: Correctness 92 · Security 92 · Testing 92 · Architecture 92

Reviewer score is advisory once the run is complete and the verdict has no blockers.

Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.

🟡 LOW Reference equality guard for marker exclusion is fragile — src/red-team.ts

At line 375-376: SAFE_BOUNDARY_MARKERS.some((re) => re !== HOSTILE_INSTRUCTION_BOUNDARY_MARKER && re.test(context.text)) uses object identity to skip the hostile-instruction-boundary regex. If the constant is ever cloned or reconstructed (e.g. via new RegExp(HOSTILE_INSTRUCTION_BOUNDARY_MARKER.source)) the exclusion silently fails and the function becomes a no-op for that path. Mitigation: tag the marker with a sentinel property or use a separate array. Low impact today since the same reference is used in the array literal.

_{tangletools · 2026-06-28T19:03:17Z · trace}

tangletools

✅ Approved — 1 non-blocking finding — `c4797ba9`

Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary

_{tangletools · 2026-06-28T19:03:17Z · immutable trace}

fix(red-team): harden contextual leak checks

c4797ba

drewstone marked this pull request as ready for review June 28, 2026 18:56

tangletools reviewed Jun 28, 2026

View reviewed changes

tangletools approved these changes Jun 28, 2026

View reviewed changes

drewstone merged commit 175b3bf into main Jun 28, 2026
1 check passed

drewstone deleted the fix/red-team-review-followups branch June 28, 2026 19:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(red-team): harden contextual leak checks#285

fix(red-team): harden contextual leak checks#285
drewstone merged 1 commit into
mainfrom
fix/red-team-review-followups

drewstone commented Jun 28, 2026

Uh oh!

tangletools left a comment

Uh oh!

tangletools commented Jun 28, 2026

Uh oh!

tangletools left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

drewstone commented Jun 28, 2026

Summary

Checks

Notes

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

🟢 Value Audit — sound

💰 Value — sound

🎯 Usefulness — sound

Uh oh!

tangletools commented Jun 28, 2026

✅ No Blockers — c4797ba9

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Approved — 1 non-blocking finding — c4797ba9

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

✅ No Blockers — `c4797ba9`

✅ Approved — 1 non-blocking finding — `c4797ba9`