Project hypothesis
This started from a narrow prompt-injection detection hypothesis:
AGT's current rules-only prompt-injection detector is useful for obvious strings and known patterns, but will miss many semantically equivalent or lightly obfuscated attacks. A local embedding + nearest-neighbour signal over a clean annotated corpus may provide additional detection evidence while staying optional, tunable, default-off, and auditable.
The goal was not to replace AGT's rules or governance layer. The goal was to test whether a semantic similarity signal could add measurable value over the current rules-only baseline, especially for attacks that do not share the exact trigger words the rules expect.
To test that, I built a synthetic annotated prompt-injection research corpus, ran the existing AGT Rust prompt-injection rules as the baseline, then compared that baseline with a local embedding/kNN scoring path. The research repo is here:
https://github.com/kerberosmansour/AGT-Embeddings-Experiment
I am opening this issue first to ask whether this is something you would like contributed upstream. If it is useful, I am happy to submit a PR in the form you prefer.
What the research repo contains
The central artifact is a large annotated prompt-injection evaluation dataset:
- 44,800 labelled examples
- 17,600 attack examples across direct override, prompt leakage, indirect injection, tool abuse, tool-result injection, output exfiltration, memory poisoning, and data-boundary abuse
- 27,200 benign examples covering security discussions, tool-use requests, quoted injection examples, documentation/code fixtures, support urgency, high-entropy structured data, and other non-attack controls
- exemplar-bank, validation, and frozen test partitions
- family/group leakage checks plus exact and near-duplicate cross-split checks
The repo also includes smoke reproduction, metadata-only artifact validators, claims mapping, and reports.
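As a rough illustration of what the exact and near-duplicate cross-split checks listed above might look like, here is a minimal sketch. It is not the repo's validator: the function names, the word-trigram shingling, and the 0.7 Jaccard cutoff are all assumptions chosen for the example.

```python
import re
from itertools import product


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-gram shingles of the normalized text (hypothetical choice of n)."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cross_split_leaks(split_a, split_b, near_dup_threshold=0.7):
    """Return (index_a, index_b, kind) for exact or near-duplicate pairs.

    The 0.7 threshold is an illustrative assumption, not a value from the repo.
    """
    leaks = []
    for (i, a), (j, b) in product(enumerate(split_a), enumerate(split_b)):
        if normalize(a) == normalize(b):
            leaks.append((i, j, "exact"))
        elif jaccard(shingles(a), shingles(b)) >= near_dup_threshold:
            leaks.append((i, j, "near"))
    return leaks
```

The pairwise loop is quadratic, so on a corpus of this size a real check would likely need hashing or blocking first; the sketch only shows the comparison itself.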
Result snapshot
Using the migrated AGT Rust prompt-injection rules as a rules-only baseline, the research repo reports:
| Approach | Catch rate | False positive rate | Notes |
| --- | --- | --- | --- |
| AGT rules-only baseline | about 1% | about 8% | Catches obvious patterns but misses most held-out attacks in this corpus. |
| Embeddings at Youden's J point | about 88% | about 16% | Strong separation point, but too noisy for default blocking. |
| Embeddings at zero-FP point | about 14% | 0% observed | Conservative high-confidence routing signal. |
The conservative zero-FP operating point is the most interesting comparison: on the frozen test split, it raises observed attack catch rate from about 1% to about 14% while keeping observed benign false positives at 0 in this synthetic corpus.
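For readers unfamiliar with the two operating points, the sketch below shows one way such thresholds can be picked from validation scores. It assumes a higher score means "more attack-like" and that labels are 1 for attack and 0 for benign; the function names are hypothetical and this is not the repo's selection code.

```python
def rates(scores, labels, tau):
    """Catch rate (TPR) and false positive rate when flagging score >= tau."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= tau)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= tau)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


def youden_threshold(scores, labels):
    """Observed score value that maximizes Youden's J = TPR - FPR."""
    best_tau, best_j = None, float("-inf")
    for tau in sorted(set(scores)):
        tpr, fpr = rates(scores, labels, tau)
        if tpr - fpr > best_j:
            best_tau, best_j = tau, tpr - fpr
    return best_tau


def zero_fp_threshold(scores, labels):
    """Lowest observed score strictly above every benign score.

    Returns None when no attack outscores the highest benign example.
    """
    max_benign = max(s for s, y in zip(scores, labels) if y == 0)
    candidates = [s for s in set(scores) if s > max_benign]
    return min(candidates) if candidates else None


# Toy validation scores, invented purely for illustration.
val_scores = [0.9, 0.8, 0.4, 0.3, 0.5, 0.2, 0.1, 0.05]
val_labels = [1, 1, 1, 1, 0, 0, 0, 0]
tau_j = youden_threshold(val_scores, val_labels)
tau_zero_fp = zero_fp_threshold(val_scores, val_labels)
```

A threshold chosen this way has zero false positives only on the data it was selected on, so it would then need to be applied unchanged to the frozen test split to get an honest "0% observed" reading.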
Method at a glance
- Embedding model: BAAI/bge-small-en-v1.5
- Runtime: fastembed/onnxruntime-local
- Model source: qdrant/bge-small-en-v1.5-onnx-q
- Embedding dimension: 384
- Nearest-neighbour setting selected on validation: k=5
- Conservative validation-selected threshold: threshold_tau=0.08026763573288917
- No hosted inference or provider scoring; only local model download/cache
- Embedding/governance readout artifacts are metadata-only and intentionally avoid raw prompt text
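To make the kNN signal concrete, here is a minimal sketch of one plausible scoring rule over precomputed embeddings: the mean cosine similarity to the k nearest attack exemplars minus the mean to the k nearest benign exemplars. This margin formula is an assumption for illustration only; the text above does not spell out the repo's score, so the reported threshold_tau should not be read as applying to this sketch. The function names and the toy 2-dimensional vectors are likewise hypothetical; in practice the vectors would come from the local 384-dimensional embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_mean(query, exemplars, k):
    """Mean similarity between the query and its k most similar exemplars."""
    sims = sorted((cosine(query, e) for e in exemplars), reverse=True)[:k]
    return sum(sims) / len(sims) if sims else 0.0


def knn_margin_score(query, attack_bank, benign_bank, k=5):
    """Assumed score: attack-side top-k mean minus benign-side top-k mean."""
    return top_k_mean(query, attack_bank, k) - top_k_mean(query, benign_bank, k)


def route(query, attack_bank, benign_bank, tau, k=5):
    """Return the score plus a review flag; never blocks on its own."""
    score = knn_margin_score(query, attack_bank, benign_bank, k)
    return {"score": score, "flag_for_review": score >= tau}


# Toy exemplar banks standing in for real embeddings.
attack_bank = [[1.0, 0.0], [0.9, 0.1]]
benign_bank = [[0.0, 1.0], [0.1, 0.9]]
attack_like = route([1.0, 0.05], attack_bank, benign_bank, tau=0.08)
benign_like = route([0.0, 1.0], attack_bank, benign_bank, tau=0.08)
```

Returning a score and a flag rather than a block decision matches the intent that the signal feed review routing or downstream policy instead of acting as a default-blocking detector.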
What I am proposing
I am not proposing that AGT make this a default-blocking detector.
The useful contribution could be one of several shapes, depending on what maintainers want:
- Contribute the research corpus, validators, and evidence reports as an AGT evaluation benchmark.
- Contribute an optional, default-off embedding/kNN scoring layer that can feed review routing or downstream policy.
- Contribute only the methodology/docs so AGT maintainers can independently rerun or adapt the experiment.
- Leave the work as an external research reference if that is preferable.
Important boundaries
This work is research evidence only:
- not validated on real traffic
- not a production safety claim
- not a certification or benchmark-coverage claim
- not a recommendation to auto-block by default
- not a replacement for AGT rules/governance
The embedding signal is best viewed as optional, tunable, auditable evidence that may augment existing AGT policy or review routing.
Question for maintainers
Would you like this contributed to microsoft/agent-governance-toolkit?
If yes, I can prepare a PR and would appreciate guidance on the preferred scope: evaluation corpus only, harness/reporting only, or an optional default-off embedding signal.