Proposal: optional embedding/kNN prompt-injection signal from research corpus #2918

@kerberosmansour

Project hypothesis

This started from a narrow prompt-injection detection hypothesis:

AGT's current rules-only prompt-injection detector is useful for obvious strings and known patterns, but will miss many semantically equivalent or lightly obfuscated attacks. A local embedding + nearest-neighbour signal over a clean annotated corpus may provide additional detection evidence while staying optional, tunable, default-off, and auditable.

The goal was not to replace AGT's rules or governance layer. The goal was to test whether a semantic similarity signal could add measurable value over the current rules-only baseline, especially for attacks that do not share the exact trigger words the rules expect.

To test that, I built a synthetic annotated prompt-injection research corpus, ran the existing AGT Rust prompt-injection rules as the baseline, then compared that baseline with a local embedding/kNN scoring path. The research repo is here:

https://github.com/kerberosmansour/AGT-Embeddings-Experiment

I am opening this issue first to ask whether this is something you would like contributed upstream. If it is useful, I am happy to submit a PR in the form you prefer.

What the research repo contains

The central artifact is a large annotated prompt-injection evaluation dataset:

  • 44,800 labelled examples
  • 17,600 attack examples across direct override, prompt leakage, indirect injection, tool abuse, tool-result injection, output exfiltration, memory poisoning, and data-boundary abuse
  • 27,200 benign examples covering security discussions, tool-use requests, quoted injection examples, documentation/code fixtures, support urgency, high-entropy structured data, and other non-attack controls
  • exemplar-bank, validation, and frozen test partitions
  • family/group leakage checks plus exact and near-duplicate cross-split checks
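A minimal sketch of what the family/group and exact-duplicate cross-split checks could look like. The field names (`split`, `family`, `text`) and the normalisation rule are assumptions for illustration, not the research repo's actual schema, and near-duplicate detection (for example via shingle overlap) is left out to keep the example short.

```python
import hashlib
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def find_cross_split_leaks(examples):
    """Return (family_leaks, duplicate_leaks) for a list of example dicts.

    Each example is assumed (hypothetically) to carry 'split', 'family',
    and 'text' keys. A leak is any family or normalised text that appears
    in more than one split.
    """
    splits_by_family = defaultdict(set)
    splits_by_hash = defaultdict(set)
    for ex in examples:
        splits_by_family[ex["family"]].add(ex["split"])
        digest = hashlib.sha256(normalize(ex["text"]).encode("utf-8")).hexdigest()
        splits_by_hash[digest].add(ex["split"])
    family_leaks = {f: sorted(s) for f, s in splits_by_family.items() if len(s) > 1}
    duplicate_leaks = {h: sorted(s) for h, s in splits_by_hash.items() if len(s) > 1}
    return family_leaks, duplicate_leaks
```

A clean partitioning would return two empty dicts; anything else names the families or text hashes that span the exemplar-bank, validation, and frozen test splits.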

The repo also includes smoke reproduction, metadata-only artifact validators, claims mapping, and reports.

Result snapshot

Using the migrated AGT Rust prompt-injection rules as a rules-only baseline, the research repo reports:

| Approach | Catch rate | False positive rate | Notes |
| --- | --- | --- | --- |
| AGT rules-only baseline | about 1% | about 8% | Catches obvious patterns but misses most held-out attacks in this corpus. |
| Embeddings at Youden's J point | about 88% | about 16% | Strong separation point, but too noisy for default blocking. |
| Embeddings at zero-FP point | about 14% | 0% observed | Conservative high-confidence routing signal. |

The conservative zero-FP operating point is the most interesting comparison: on the frozen test split, it raises observed attack catch rate from about 1% to about 14% while keeping observed benign false positives at 0 in this synthetic corpus.
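For readers unfamiliar with the two operating points, here is a rough sketch of how they could be picked from scored validation examples. Youden's J is the catch rate minus the false positive rate, so the figures above correspond to a J of roughly 0.88 - 0.16 = 0.72. The function names and the `score >= threshold` flagging rule are assumptions for illustration, not the research repo's code.

```python
def rates(scores, labels, threshold):
    """Return (catch_rate, false_positive_rate) when flagging score >= threshold.

    labels use 1 for attack and 0 for benign; both classes must be present.
    """
    attacks = [s for s, y in zip(scores, labels) if y == 1]
    benign = [s for s, y in zip(scores, labels) if y == 0]
    catch_rate = sum(s >= threshold for s in attacks) / len(attacks)
    false_positive_rate = sum(s >= threshold for s in benign) / len(benign)
    return catch_rate, false_positive_rate


def youden_threshold(scores, labels):
    """Observed score that maximises J = catch rate - false positive rate."""
    def j(threshold):
        catch_rate, false_positive_rate = rates(scores, labels, threshold)
        return catch_rate - false_positive_rate
    return max(sorted(set(scores)), key=j)


def zero_fp_threshold(scores, labels):
    """Lowest attack score strictly above every benign score, or None."""
    max_benign = max(s for s, y in zip(scores, labels) if y == 0)
    above = [s for s, y in zip(scores, labels) if y == 1 and s > max_benign]
    return min(above) if above else None
```

Both thresholds would be chosen on the validation partition and only then applied to the frozen test split, which is why the zero-FP result is reported as "observed" rather than guaranteed.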

Method at a glance

  • Embedding model: BAAI/bge-small-en-v1.5
  • Runtime: fastembed/onnxruntime-local
  • Model source: qdrant/bge-small-en-v1.5-onnx-q
  • Embedding dimension: 384
  • Nearest-neighbour setting selected on validation: k=5
  • Conservative validation-selected threshold: threshold_tau=0.08026763573288917
  • No hosted inference or provider scoring; only local model download/cache
  • Embedding/governance readout artifacts are metadata-only and intentionally avoid raw prompt text
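The list above does not spell out how the nearest-neighbour score is formed, so the following is only one plausible shape: the mean cosine similarity to the k nearest attack exemplars minus the mean cosine similarity to the k nearest benign exemplars. The function names are hypothetical, and short toy vectors stand in for the 384-dimensional embeddings that a local model would produce. The `threshold_tau` value quoted above would not necessarily transfer to this score definition.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_top_k(query, bank, k):
    """Mean similarity between the query and its k most similar bank vectors.

    The bank is assumed to be non-empty.
    """
    sims = sorted((cosine(query, v) for v in bank), reverse=True)[:k]
    return sum(sims) / len(sims)


def knn_margin_score(query, attack_bank, benign_bank, k=5):
    """Assumed score: higher means the query sits closer to attack exemplars."""
    return mean_top_k(query, attack_bank, k) - mean_top_k(query, benign_bank, k)
```

Because the score only depends on similarities to the exemplar bank, a readout can record the score, k, and the threshold without storing raw prompt text, which fits the metadata-only constraint described above.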

What I am proposing

I am not proposing that AGT make this a default-blocking detector.

The useful contribution could be one of several shapes, depending on what maintainers want:

  1. Contribute the research corpus, validators, and evidence reports as an AGT evaluation benchmark.
  2. Contribute an optional, default-off embedding/kNN scoring layer that can feed review routing or downstream policy.
  3. Contribute only the methodology/docs so AGT maintainers can independently rerun or adapt the experiment.
  4. Leave the work as an external research reference if that is preferable.
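To make option 2 more concrete, here is a hedged sketch of how a default-off signal might be configured and consumed. None of these names come from AGT; `EmbeddingSignalConfig`, `evaluate_signal`, the `route_to_review` action, and the `score >= threshold_tau` comparison are all hypothetical and would be adapted to whatever interface maintainers prefer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EmbeddingSignalConfig:
    """Hypothetical settings for an optional embedding/kNN signal."""
    enabled: bool = False                       # default-off: must be opted into
    k: int = 5                                  # neighbours, as selected on validation
    threshold_tau: float = 0.08026763573288917  # conservative threshold from the report
    action: str = "route_to_review"             # routing hint, not a block decision


def evaluate_signal(score: Optional[float], config: EmbeddingSignalConfig) -> dict:
    """Return a metadata-only readout that downstream policy may consume."""
    if not config.enabled or score is None:
        return {"signal": "embedding_knn", "evaluated": False, "action": "none"}
    flagged = score >= config.threshold_tau
    return {
        "signal": "embedding_knn",
        "evaluated": True,
        "score": score,
        "k": config.k,
        "threshold_tau": config.threshold_tau,
        "flagged": flagged,
        "action": config.action if flagged else "none",
    }
```

With the default configuration the function reports that nothing was evaluated, so existing rules-only behaviour is unchanged unless an operator explicitly enables the signal and tunes the threshold.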

Important boundaries

This work is research evidence only:

  • not validated on real traffic
  • not a production safety claim
  • not a certification or benchmark-coverage claim
  • not a recommendation to auto-block by default
  • not a replacement for AGT rules/governance

The embedding signal is best viewed as optional, tunable, auditable evidence that may augment existing AGT policy or review routing.

Question for maintainers

Would you like this contributed to microsoft/agent-governance-toolkit?

If yes, I can prepare a PR and would appreciate guidance on the preferred scope: evaluation corpus only, harness/reporting only, or an optional default-off embedding signal.
