Proposal: optional embedding/kNN prompt-injection signal from research corpus

## Project hypothesis

This started from a narrow prompt-injection detection hypothesis:

> AGT's current rules-only prompt-injection detector is useful for obvious strings and known patterns, but will miss many semantically equivalent or lightly obfuscated attacks. A local embedding + nearest-neighbour signal over a clean annotated corpus may provide additional detection evidence while staying optional, tunable, default-off, and auditable.

The goal was not to replace AGT's rules or governance layer. The goal was to test whether a semantic similarity signal could add measurable value over the current rules-only baseline, especially for attacks that do not share the exact trigger words the rules expect.

To test that, I built a synthetic annotated prompt-injection research corpus, ran the existing AGT Rust prompt-injection rules as the baseline, then compared that baseline with a local embedding/kNN scoring path. The research repo is here:

https://github.com/kerberosmansour/AGT-Embeddings-Experiment

I am opening this issue first to ask whether this is something you would like contributed upstream. If it is useful, I am happy to submit a PR in the form you prefer.

## What the research repo contains

The central artifact is a large annotated prompt-injection evaluation dataset:

- 44,800 labelled examples
- 17,600 attack examples across direct override, prompt leakage, indirect injection, tool abuse, tool-result injection, output exfiltration, memory poisoning, and data-boundary abuse
- 27,200 benign examples covering security discussions, tool-use requests, quoted injection examples, documentation/code fixtures, support urgency, high-entropy structured data, and other non-attack controls
- exemplar-bank, validation, and frozen test partitions
- family/group leakage checks plus exact and near-duplicate cross-split checks

The repo also includes a smoke-test reproduction path, metadata-only artifact validators, a claims mapping, and reports.
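
For readers unfamiliar with cross-split duplicate checks, here is a minimal sketch of the idea, not the research repo's actual validator. It flags a test example as an exact duplicate if its normalised text hashes to the same value as an exemplar, and as a near duplicate if the Jaccard overlap of word 3-grams exceeds a threshold. The function names, the 3-gram choice, and the `near_threshold=0.7` default are all illustrative assumptions.

```python
import hashlib
import re
from typing import List, Set, Tuple


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits do not hide duplicates."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_key(text: str) -> str:
    """Stable hash of the normalised text, used for exact-duplicate lookup."""
    return hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()


def shingles(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Set of word n-grams; short texts fall back to a single shingle."""
    tokens = normalise(text).split()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cross_split_leaks(
    exemplars: List[str],
    test: List[str],
    near_threshold: float = 0.7,  # illustrative value, not taken from the repo
) -> List[Tuple[str, str]]:
    """Return (test_text, kind) pairs where kind is 'exact' or 'near'."""
    exemplar_keys = {exact_key(t) for t in exemplars}
    exemplar_shingles = [shingles(t) for t in exemplars]
    leaks = []
    for text in test:
        if exact_key(text) in exemplar_keys:
            leaks.append((text, "exact"))
            continue
        candidate = shingles(text)
        if any(jaccard(candidate, s) >= near_threshold for s in exemplar_shingles):
            leaks.append((text, "near"))
    return leaks
```

A pairwise comparison like this is quadratic in corpus size; at tens of thousands of examples, a real check would more likely use an index such as MinHash, which is outside this sketch.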

## Result snapshot

Using the migrated AGT Rust prompt-injection rules as a rules-only baseline, the research repo reports:

| Approach | Catch rate | False positive rate | Notes |
|---|---:|---:|---|
| AGT rules-only baseline | about 1% | about 8% | Catches obvious patterns but misses most held-out attacks in this corpus. |
| Embeddings at Youden's J point | about 88% | about 16% | Strong separation point, but too noisy for default blocking. |
| Embeddings at zero-FP point | about 14% | 0% observed | Conservative high-confidence routing signal. |

The conservative zero-FP operating point is the most interesting comparison: on the frozen test split, it raises the observed attack catch rate from about 1% (rules-only, at about 8% false positives) to about 14%, with 0 observed benign false positives in this synthetic corpus.
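
To make the two operating points in the table concrete, here is a minimal sketch of how such thresholds can be picked from validation scores. It assumes that a higher score means "more attack-like" and that labels are 1 for attack and 0 for benign; the research repo's actual score definition and selection code may differ, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple


def rates_at(scores: List[float], labels: List[int], tau: float) -> Tuple[float, float]:
    """Return (catch rate, false positive rate) when flagging every score >= tau."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= tau)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= tau)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


def youden_threshold(scores: List[float], labels: List[int]) -> float:
    """Observed threshold maximising Youden's J = catch rate - false positive rate."""
    best_tau, best_j = None, float("-inf")
    for tau in sorted(set(scores)):
        tpr, fpr = rates_at(scores, labels, tau)
        if tpr - fpr > best_j:  # strict '>' keeps the lowest threshold on ties
            best_tau, best_j = tau, tpr - fpr
    return best_tau


def zero_fp_threshold(scores: List[float], labels: List[int]) -> Optional[float]:
    """Lowest observed threshold that flags no benign example, or None if none exists."""
    for tau in sorted(set(scores)):
        if rates_at(scores, labels, tau)[1] == 0.0:
            return tau
    return None


# Toy validation data (made up for illustration only).
toy_scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
```

On the toy data, the Youden's J threshold flags more attacks at the cost of some benign hits, while the zero-FP threshold flags fewer attacks and no benign examples, which mirrors the trade-off in the table. Thresholds chosen this way should be selected on validation data and only then evaluated on the frozen test split.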

## Method at a glance

- Embedding model: `BAAI/bge-small-en-v1.5`
- Runtime: `fastembed/onnxruntime-local`
- Model source: `qdrant/bge-small-en-v1.5-onnx-q`
- Embedding dimension: 384
- Nearest-neighbour setting selected on validation: `k=5`
- Conservative validation-selected threshold: `threshold_tau=0.08026763573288917`
- No hosted inference or provider scoring; only local model download/cache
- Embedding/governance readout artifacts are metadata-only and intentionally avoid raw prompt text
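
As a rough illustration of the kNN step, the sketch below scores a query vector by the fraction of its `k` nearest exemplars (by cosine similarity) that are labelled as attacks. This is one common kNN scoring rule chosen for illustration; the scoring function used in the research repo, and the meaning of its `threshold_tau`, may be different. The vectors here are 2-dimensional toys, whereas real inputs would be the 384-dimensional embeddings produced locally by the model listed above.

```python
import math
from typing import List, Sequence, Tuple

Exemplar = Tuple[Sequence[float], int]  # (embedding vector, label: 1 = attack, 0 = benign)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_attack_score(query: Sequence[float], bank: List[Exemplar], k: int = 5) -> float:
    """Fraction of the k most similar exemplars that are labelled as attacks."""
    ranked = sorted(bank, key=lambda item: cosine(query, item[0]), reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


# Toy exemplar bank: attacks cluster near (1, 0), benign examples near (0, 1).
toy_bank: List[Exemplar] = [
    ([1.0, 0.0], 1),
    ([0.9, 0.1], 1),
    ([0.8, 0.2], 1),
    ([0.0, 1.0], 0),
    ([0.1, 0.9], 0),
    ([0.2, 0.8], 0),
]
```

A score like this stays auditable because the nearest exemplars that produced it can be logged by ID without storing the raw prompt text. The linear scan is fine for a sketch; a larger exemplar bank would typically use a vector index instead.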

## What I am proposing

I am not proposing that AGT make this a default-blocking detector.

The useful contribution could be one of several shapes, depending on what maintainers want:

1. Contribute the research corpus, validators, and evidence reports as an AGT evaluation benchmark.
2. Contribute an optional, default-off embedding/kNN scoring layer that can feed review routing or downstream policy.
3. Contribute only the methodology/docs so AGT maintainers can independently rerun or adapt the experiment.
4. Leave the work as an external research reference if that is preferable.

## Important boundaries

This work is research evidence only:

- not validated on real traffic
- not a production safety claim
- not a certification or benchmark-coverage claim
- not a recommendation to auto-block by default
- not a replacement for AGT rules/governance

The embedding signal is best viewed as optional, tunable, auditable evidence that may augment existing AGT policy or review routing.

## Question for maintainers

Would you like this contributed to `microsoft/agent-governance-toolkit`?

If yes, I can prepare a PR and would appreciate guidance on the preferred scope: evaluation corpus only, harness/reporting only, or an optional default-off embedding signal.

