Bug: No way to detect run quality regression after model updates or config changes #1394

@sauravbhattacharya001

Bug Description

When Anthropic ships a model update, or when action config changes, run quality can silently degrade with zero signal. Runs that used to produce 200-line PRs now produce 40-line PRs. Runs that used to touch 5 files now touch 1. Everything still reports as 'success' because the only signal is exit code 0.

Reproduction

  1. Run claude-code-action on the same set of tasks over weeks
  2. Model update happens (or you change a prompt, or context changes)
  3. Output quality drops — shorter responses, fewer changes, less thorough
  4. No alert, no metric, no way to notice until you manually compare before/after

Expected Behavior

The action should emit structured run metadata (duration, tokens used, files changed, lines added/removed, model version, truncation flag) so users can track quality over time and detect drift.
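To make the shape of that metadata concrete, here is a minimal sketch of one per-run record serialized as a JSON line. The `RunMetadata` class and `to_json_line` helper are hypothetical, the field names simply mirror the outputs proposed in this issue, and the values (including the model identifier) are illustrative placeholders, not real measurements.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunMetadata:
    """Hypothetical per-run record; fields mirror the outputs proposed in this issue."""
    run_duration_ms: int
    tokens_used: int
    files_changed: int
    lines_added: int
    lines_removed: int
    model_version: str
    truncated: bool


def to_json_line(record: RunMetadata) -> str:
    """Serialize one record as a single JSON line with sorted keys."""
    return json.dumps(asdict(record), sort_keys=True)


# Illustrative values only; the model identifier is a placeholder.
example = RunMetadata(
    run_duration_ms=184_000,
    tokens_used=52_300,
    files_changed=5,
    lines_added=200,
    lines_removed=35,
    model_version="example-model-v1",
    truncated=False,
)
print(to_json_line(example))
```

One JSON line per run appends cleanly to a log file or artifact, which is enough to build a time series without committing to a particular metrics backend.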

Actual Behavior

You get a pass/fail exit code and a PR comment. No time-series data. No structured metrics. No drift signal. You're blind to regression until it's obvious.

Impact

  • Model updates cause silent quality regressions across entire organizations
  • Config changes (prompt tweaks, context limits) have no measurable feedback loop
  • Multi-repo deployments have no way to compare quality across repos
  • 'It worked last week' is the only drift detection mechanism available

Suggested Fix

Emit structured telemetry as action outputs:

```yaml
- uses: anthropics/claude-code-action@v1
  id: claude  # needed so that steps.claude.outputs.* resolves
  with:
    emit_telemetry: true  # proposed input

# Proposed outputs:
# steps.claude.outputs.run_duration_ms
# steps.claude.outputs.tokens_used
# steps.claude.outputs.files_changed
# steps.claude.outputs.lines_added
# steps.claude.outputs.lines_removed
# steps.claude.outputs.model_version
# steps.claude.outputs.truncated
```

Users pipe these to Datadog/Grafana/CSV and set alerts on drift. A run going from 200 lines to 40 lines is a signal — today there's no way to see it.
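As a rough illustration of the kind of alert this would enable, here is a sketch of a drift check over one metric. It assumes the values have already been collected somewhere (a CSV, for instance). The `detect_drift` function, the median baseline, and the `min_runs` and `drop_ratio` thresholds are all my own assumptions, not anything the action provides, and the history values are made up.

```python
from statistics import median


def detect_drift(history, latest, min_runs=5, drop_ratio=0.5):
    """Return True when `latest` is below drop_ratio * median(history).

    history: earlier values of one metric (e.g. lines_added), oldest first.
    Returns False with fewer than min_runs prior runs, since a baseline
    built from a handful of runs is too noisy to alert on.
    """
    if len(history) < min_runs:
        return False
    baseline = median(history)
    if baseline <= 0:
        return False  # no meaningful baseline to compare against
    return latest < drop_ratio * baseline


# Illustrative history of lines_added across earlier runs (median is 205).
lines_added_history = [210, 190, 205, 198, 220]
print(detect_drift(lines_added_history, 40))   # True: 40 < 0.5 * 205
print(detect_drift(lines_added_history, 180))  # False: within normal range
```

The median is used rather than the mean so that a single unusually large or small run does not shift the baseline much. A real setup would probably also segment by task type and by `model_version`.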

This completes the triad: #1392 validates output, #1393 retries failures, this issue observes trends over time.

Building drift detection in agent-eval. Happy to contribute a reference integration.
