Skip to content

Add a prompt-injection evaluation fixture #2923

@kerberosmansour

Description

@kerberosmansour

Summary

This issue tracks adding a standalone prompt-injection evaluation fixture for
AGT.

The fixture would provide:

  • a synthetic labelled prompt-injection corpus;
  • manifest and corpus hygiene checks;
  • a baseline harness for the existing Rust prompt-injection detector;
  • summary metrics for the rules-only detector on this fixture;
  • documentation for reproducing the benchmark and interpreting its limits.

Scope

This is an evaluation fixture only. It does not propose a runtime behavior
change, embedding detector, default threshold, policy-routing integration, or
default blocking behavior.

Why

AGT's existing rules layer is intentionally high-precision and low-recall. A
standalone fixture would make that trade-off measurable on a reproducible
benchmark and give future prompt-injection detector changes a stable baseline.

Acceptance Criteria

  • The fixture can be run from the repository with documented commands.
  • Corpus split, duplicate, and leakage checks are reproducible.
  • Baseline metrics are tied to an exact AGT commit, detector file hash, corpus
    manifest hash, and command.
  • Documentation states that results are corpus-specific and not a general claim
    about AGT detection quality.
  • No AGT runtime behavior changes are included.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions