Draft

test #11797

143 changes: 143 additions & 0 deletions .claude/skills/investigate-continuation-leakage/SKILL.md
@@ -0,0 +1,143 @@
---
name: investigate-continuation-leakage
description: >
Investigate scope-continuation leaks in an instrumentation. Use when asked to "investigate
continuation leakage", "find scope leaks", "why does this integration leak continuations",
"debug a leaked trace / pendingReferenceCount", or when a test needed strictTraceWrites(false)
to pass. Runs the chosen instrumentation test with the scope-continuation diagnostic enabled,
reads the logged full timeline, recaps the findings, and renders a Gantt or DAG (works whether
or not anything leaked).
user-invocable: true
context: fork
allowed-tools:
- Bash
- Read
- Edit
- Glob
- Grep
- AskUserQuestion
---

# Investigate scope-continuation leakage

dd-trace-java moves trace scopes across threads via *continuations*: a scope is **captured** on
one thread (`ScopeContinuation`, bumping `PendingTrace.pendingReferenceCount`) and later
**activated** and/or **cancelled** on another. Four things can go wrong: a continuation is never
resolved (the classic leak), resolved twice, resolved after its root span was written, or activated
after it was resolved. Each either keeps a trace alive or drops a late span, and in tests forces
`strictTraceWrites(false)`, which masks the bug instead of locating it.
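
To make the reference-counting idea concrete, here is a minimal toy model. `ToyPendingTrace` and `ToyContinuation` are hypothetical stand-ins invented for this sketch, not the dd-trace-java types, and the real bookkeeping is likely more involved. It only illustrates why one unresolved continuation keeps a trace from being written and why a second resolution should be a no-op.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model only: these classes are hypothetical, not the dd-trace-java types. */
public class ContinuationLeakSketch {

  /** Stand-in for a pending trace that cannot be written while references are outstanding. */
  public static final class ToyPendingTrace {
    public final AtomicInteger pendingReferenceCount = new AtomicInteger();

    public boolean canBeWritten() {
      return pendingReferenceCount.get() == 0;
    }
  }

  /** Stand-in for a captured continuation; resolving it releases its reference once. */
  public static final class ToyContinuation {
    private final ToyPendingTrace trace;
    private boolean resolved;

    public ToyContinuation(ToyPendingTrace trace) {
      this.trace = trace;
      trace.pendingReferenceCount.incrementAndGet(); // capture bumps the count
    }

    /** Returns false when the continuation was already resolved (the "double" case). */
    public synchronized boolean cancel() {
      if (resolved) {
        return false;
      }
      resolved = true;
      trace.pendingReferenceCount.decrementAndGet();
      return true;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    ToyPendingTrace trace = new ToyPendingTrace();
    ToyContinuation handedOff = new ToyContinuation(trace); // resolved on another thread
    ToyContinuation leaked = new ToyContinuation(trace); // never resolved

    Thread worker = new Thread(handedOff::cancel, "worker");
    worker.start();
    worker.join();

    // The leaked continuation still holds one reference, so the trace stays pending.
    System.out.println("pending=" + trace.pendingReferenceCount.get());
    System.out.println("writable=" + trace.canBeWritten());
    System.out.println("second cancel accepted=" + handedOff.cancel());
  }
}
```

In this model the leaked continuation is the only thing standing between the trace and a write, which is the situation the diagnostic timeline is meant to expose.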

The test-time diagnostic in `datadog.trace.agent.test.scopediag` records the lifecycle and logs a
full timeline of every continuation and scope (regardless of whether anything leaked). This skill
drives that diagnostic, reads the logged timeline, recaps it in plain language, and renders a
diagram. **The Java code no longer renders Gantt/Mermaid — you (the LLM) produce the diagram from
the timeline.**

Background: `docs/superpowers/specs/2026-06-10-scope-continuation-leak-diagnostic-design.md`.
Test-run conventions: `docs/how_to_test.md`.

## Step 1 — Select the target

Identify the suspect instrumentation. If the user named it, resolve the module directory under
`dd-java-agent/instrumentation/<framework>/<framework>-<minVersion>` with Glob. If ambiguous, list
the candidate test classes (Glob `**/src/test/**/*Test.{java,groovy}` in the module) and ask the
user which test to run with `AskUserQuestion`. You want one concrete test class (and ideally one
method) plus its Gradle module path, e.g. `:dd-java-agent:instrumentation:google-pubsub-1.116`.

> **Note:** tracking is now **always-on** for every instrumentation test (`@TrackScopeContinuations`
> sits on the `AbstractInstrumentationTest` / `InstrumentationSpecification` base classes, report-only).
> If that base-class annotation is present, **skip Step 2** — just run the test (Step 3) and read the
> timeline. Only do Step 2 when tracking is *not* already inherited (e.g. the base annotation was
> removed) or you want method-level `failOnLeak=true` enforcement.

## Step 2 — Enable tracking (only if not already inherited)

With `Edit`, add the opt-in annotation to the chosen test class (or a single method):

- Add import `datadog.trace.agent.test.scopediag.TrackScopeContinuations`.
- Annotate the class/method with `@TrackScopeContinuations`. Leave the default `failOnLeak=false` —
you want the report, not a failing test (a red test would still print the report, but the default
keeps the run green so the build doesn't stop early).

This works for both JUnit 5 Java tests (extension is auto-registered on
`AbstractInstrumentationTest`) and Groovy `InstrumentationSpecification` subclasses. **Record the
exact file path** — you will revert it in Step 7.
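
The method-then-class lookup that decides whether a test is tracked can be sketched in plain Java. The `Track` annotation and the test classes below are hypothetical stand-ins that mirror the shape described above (an inherited annotation with a `failOnLeak` flag defaulting to `false`); the real `@TrackScopeContinuations` may carry other members.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Inherited;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

public class TrackingLookupSketch {

  /** Hypothetical stand-in for the tracking annotation; not the real one. */
  @Inherited
  @Retention(RetentionPolicy.RUNTIME)
  @Target({ElementType.TYPE, ElementType.METHOD})
  public @interface Track {
    boolean failOnLeak() default false;
  }

  /** Plays the role of the annotated base class: report-only tracking for all subclasses. */
  @Track
  public abstract static class BaseTest {}

  public static class PubSubTest extends BaseTest {
    public void reportOnly() {}

    @Track(failOnLeak = true)
    public void enforced() {}
  }

  /** A method-level annotation wins; otherwise fall back to the (possibly inherited) class one. */
  public static Track resolve(Class<?> testClass, String methodName) {
    try {
      Method method = testClass.getDeclaredMethod(methodName);
      Track onMethod = method.getAnnotation(Track.class);
      return onMethod != null ? onMethod : testClass.getAnnotation(Track.class);
    } catch (NoSuchMethodException e) {
      throw new IllegalArgumentException("No such test method: " + methodName, e);
    }
  }

  public static void main(String[] args) {
    // Inherited from BaseTest: tracked, but a leak does not fail the test.
    System.out.println(resolve(PubSubTest.class, "reportOnly").failOnLeak());
    // Method-level override: a leak fails this test.
    System.out.println(resolve(PubSubTest.class, "enforced").failOnLeak());
  }
}
```

Because `@Inherited` only applies to class-level annotations, the method-level form is the way to tighten enforcement for a single test without touching the base class.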

## Step 3 — Run the test, capturing the diagnostic output

The diagnostic does **not** write a file; at the end of every tracked test it logs the **full
timeline** (`ScopeDiagnosticsReport.renderTimeline()`) — every continuation and scope with its
events, threads, relative timing, and callsites, regardless of whether anything leaked. Run with
output captured:

```bash
./gradlew :dd-java-agent:instrumentation:<framework>-<minVersion>:test --tests '<FQCN-or-pattern>' --info 2>&1 | tee /tmp/scopediag-run.txt
```

(For the diagnostic harness's own tests the module is `:dd-java-agent:instrumentation-testing`.)
If the timeline is not visible in the console output, read the per-test captured stdout under
`<module>/build/test-results/**/*.xml` (the `<system-out>` element) or the HTML report under
`<module>/build/reports/tests/`.

## Step 4 — Collect the diagnostic output

Grep the captured output for `Scope/continuation timeline` — one block per test. Shape:

```
Scope/continuation timeline (N continuations, M scopes; X leaked, Y late, Z double,
W activate-after-resolve | scopes: P never-closed, Q wrong-thread)

#<seq> <STATUS> trace=<id> span=<id> "<spanName>" src=<INSTRUMENTATION|MANUAL|ITERATION|CONTEXT> [ORPHAN] [handoff] {failures} cap->resume=<ms> age=<ms>
capture +<Δms> @ <thread> at <Class.method(File.java:line)>
resume +<Δms> @ <thread> at <...>
finish +<Δms> @ <thread> at <...> (or cancel / DOUBLE / act-fail)
scope#<seq> <src> "<spanName>" open +<Δms> @ <thread> close +<Δms> @ <thread> (active <ms>) [handoff] {failures}
LEAKED (never finished or cancelled) (only when unresolved)
...
Non-continuation scopes:
scope#<seq> ...
```

Every record is listed (not just flagged ones), so you can reconstruct the full graph whether or not
anything leaked. `+Δms` is relative to the first recorded event. `renderSummary()`, the
problem-only view, still backs `assertNoLeaks` failure messages, but the timeline is what Steps 5
and 6 consume.
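
If you prefer to extract the header counts mechanically rather than by eye, a small parser along these lines could work. It assumes the header keeps the `<number> <label>` shape shown above and may wrap across lines; the class and method names are made up for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimelineHeaderSketch {
  private static final String MARKER = "Scope/continuation timeline (";
  // One "<number> <label>" pair, e.g. "1 leaked" or "0 activate-after-resolve".
  private static final Pattern COUNT = Pattern.compile("(\\d+)\\s+([a-z][a-z-]*)");

  /** Returns label -> count for the first timeline header found, in header order. */
  public static Map<String, Integer> headerCounts(String output) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    int start = output.indexOf(MARKER);
    if (start < 0) {
      return counts;
    }
    int from = start + MARKER.length();
    int end = output.indexOf(')', from);
    if (end < 0) {
      return counts;
    }
    Matcher matcher = COUNT.matcher(output.substring(from, end));
    while (matcher.find()) {
      counts.put(matcher.group(2), Integer.parseInt(matcher.group(1)));
    }
    return counts;
  }

  /** True when any count other than the two totals is non-zero. */
  public static boolean hasProblems(Map<String, Integer> counts) {
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      boolean isTotal = entry.getKey().equals("continuations") || entry.getKey().equals("scopes");
      if (!isTotal && entry.getValue() > 0) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String sample =
        "Scope/continuation timeline (2 continuations, 3 scopes; 1 leaked, 0 late, 0 double,\n"
            + "0 activate-after-resolve | scopes: 0 never-closed, 0 wrong-thread)";
    Map<String, Integer> counts = headerCounts(sample);
    System.out.println(counts);
    System.out.println("problems=" + hasProblems(counts));
  }
}
```

The per-record lines are deliberately left unparsed here; their exact layout is better read directly from the captured output.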

## Step 5 — Summarize (recap)

Give a plain-language recap:

- The header counts (leaked / late / double / activate-after-resolve / never-closed / wrong-thread).
- The dominant flow: where continuations are captured (callsite/thread) and where they're resumed /
resolved (thread), plus any thread handoffs.
- For each flagged record (if any): its failure set and capture/open callsite (cite `file:line`).
- A one-line hypothesis when there's a problem: which advice captured the continuation and where it
should have resolved it.

## Step 6 — Visualize (auto-pick, user may override)

Build the diagram from the **timeline** (works whether or not there are leaks):

- **Gantt** — when the signal is **temporal / cross-thread** (thread handoffs, late-after-root,
never-closed, or the user wants the time view). Mermaid `gantt`, one `section` per thread; a bar
per continuation from capture→resolve and per scope from open→close using the `+Δms` offsets. Mark
leaks / never-closed `crit` to the window end; late / wrong-thread `active`; resolved-on-time
`done`; capture-only points as `milestone`.
- **DAG** — when the signal is **structural / ownership** (orphans, double-finish,
activate-after-resolve, or continuation→scope lineage). Mermaid `flowchart LR`: a node per
continuation (`#seq spanName`), its spawned scopes (linked via the nested `scope#` lines), edges
capture→resume→resolve labelled with thread + `+Δms`. Color leaked / double red, late amber,
resolved green.

If there are no problems, the diagram simply shows the healthy capture→resume→resolve flow (all
green) — that is the expected "regardless of leak" output. If unsure which shape, ask with
`AskUserQuestion`.
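
As a rough illustration of the Gantt shape, the sketch below turns a few hand-filled bars into Mermaid `gantt` text, one `section` per thread. The `Bar` type is hypothetical (the diagnostic exposes no such class), the offsets are the `+Δms` values from the timeline, and the `dateFormat x` / `axisFormat %L` directives are an assumption about how to plot millisecond offsets; adjust them if the rendered axis looks wrong.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GanttSketch {

  /** Hypothetical, hand-filled record of one bar; not a type from the diagnostic. */
  public static final class Bar {
    final String thread;
    final String label;
    final String tag; // Mermaid task tag: done, active, crit or milestone
    final long startMs;
    final long endMs;

    public Bar(String thread, String label, String tag, long startMs, long endMs) {
      this.thread = thread;
      this.label = label;
      this.tag = tag;
      this.startMs = startMs;
      this.endMs = endMs;
    }
  }

  /** Renders Mermaid gantt text with one section per thread, in first-seen order. */
  public static String render(List<Bar> bars) {
    Map<String, List<Bar>> byThread = new LinkedHashMap<>();
    for (Bar bar : bars) {
      byThread.computeIfAbsent(bar.thread, k -> new ArrayList<>()).add(bar);
    }
    StringBuilder out = new StringBuilder("gantt\n  dateFormat x\n  axisFormat %L\n");
    for (Map.Entry<String, List<Bar>> entry : byThread.entrySet()) {
      out.append("  section ").append(entry.getKey()).append('\n');
      for (Bar bar : entry.getValue()) {
        // A colon separates the task title from its metadata, so keep it out of the label.
        out.append("  ").append(bar.label.replace(':', ' '))
            .append(" :").append(bar.tag)
            .append(", ").append(bar.startMs)
            .append(", ").append(bar.endMs)
            .append('\n');
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    List<Bar> bars = new ArrayList<>();
    bars.add(new Bar("main", "c1 publish", "done", 0, 12)); // resolved on time
    bars.add(new Bar("worker", "c2 receive", "crit", 5, 40)); // leaked, drawn to window end
    System.out.print(render(bars));
  }
}
```

Paste the output into a Mermaid code fence to view it; the DAG variant would follow the same pattern with `flowchart LR` nodes and edges instead of bars.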

## Step 7 — Revert

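Undo the temporary annotation from Step 2 so the working tree is clean. If Step 2 was skipped,
there is nothing to revert. Note that `git checkout --` discards every uncommitted change in that
file, so use it only when the annotation was the sole edit: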

```bash
git checkout -- <test file path from Step 2>
```

Report: the findings summary, the diagram, and that the annotation was reverted.
@@ -34,6 +34,8 @@ import datadog.metrics.impl.DDSketchHistograms
import datadog.metrics.impl.MonitoringImpl
import datadog.trace.agent.test.asserts.ListWriterAssert
import datadog.trace.agent.test.asserts.TagsAssert
import datadog.trace.agent.test.scopediag.ScopeDiagnostics
import datadog.trace.agent.test.scopediag.TrackScopeContinuations
import datadog.trace.agent.test.datastreams.MockFeaturesDiscovery
import datadog.trace.agent.test.datastreams.RecordingDatastreamsPayloadWriter
import datadog.trace.agent.tooling.AgentInstaller
@@ -112,6 +114,9 @@ import spock.lang.Shared
@SuppressWarnings('UnnecessaryDotClass')
@ExtendWith(TestClassShadowingExtension.class)
@ExtendWith(TooManyInvocationsErrorHandler.class)
// Track scope continuations for every instrumentation spec (report-only: failOnLeak defaults to
// false). @Inherited, so all specs get the full timeline dumped after each test.
@TrackScopeContinuations
abstract class InstrumentationSpecification extends DDSpecification implements AgentBuilder.Listener {
private static final long TIMEOUT_MILLIS = TimeUnit.SECONDS.toMillis(20)

@@ -466,6 +471,9 @@ abstract class InstrumentationSpecification extends DDSpecification implements A
}

TEST_WRITER.start()
if (scopeDiagConfig() != null) {
ScopeDiagnostics.startRecording()
}
TEST_DATA_STREAMS_WRITER.clear()
TEST_DATA_STREAMS_MONITORING.clear()

@@ -499,6 +507,8 @@ abstract class InstrumentationSpecification extends DDSpecification implements A
}
TEST_TRACER.flush()

reportScopeDiagnostics()

def util = new MockUtil()
util.detachMock(STATS_D_CLIENT)

@@ -522,6 +532,34 @@ abstract class InstrumentationSpecification extends DDSpecification implements A
assert InstrumentationErrors.noErrors(): InstrumentationErrors.describeErrors()
}

/** Resolves the {@link TrackScopeContinuations} annotation from the feature method or spec class. */
private TrackScopeContinuations scopeDiagConfig() {
def method = specificationContext?.currentFeature?.featureMethod?.reflection
def ann = method?.getAnnotation(TrackScopeContinuations)
if (ann == null) {
ann = this.class.getAnnotation(TrackScopeContinuations)
}
return ann
}

private void reportScopeDiagnostics() {
def config = scopeDiagConfig()
if (config == null) {
return
}
ScopeDiagnostics.stop()
def report = ScopeDiagnostics.report()
// Always dump the full timeline so a graph/report can be built regardless of leaks.
println(report.renderTimeline())
try {
if (config.failOnLeak()) {
ScopeDiagnostics.assertNoLeaks()
}
} finally {
ScopeDiagnostics.reset()
}
}

private void doCheckRepeatedFinish() {
for (Map.Entry<DDSpan, List<Exception>> entry: this.spanFinishLocations.entrySet()) {
if (entry.value.size() == 1) {
@@ -7,6 +7,8 @@
import datadog.instrument.classinject.ClassInjector;
import datadog.trace.agent.test.assertions.TraceAssertions;
import datadog.trace.agent.test.assertions.TraceMatcher;
import datadog.trace.agent.test.scopediag.ScopeDiagnosticsExtension;
import datadog.trace.agent.test.scopediag.TrackScopeContinuations;
import datadog.trace.agent.tooling.AgentInstaller;
import datadog.trace.agent.tooling.InstrumenterModule;
import datadog.trace.agent.tooling.TracerInstaller;
@@ -54,7 +56,14 @@
* </ul>
*/
@WithConfig(key = "detailed.instrumentation.errors", value = "true")
@ExtendWith({TestClassShadowingExtension.class, AllowContextTestingExtension.class})
@ExtendWith({
TestClassShadowingExtension.class,
AllowContextTestingExtension.class,
ScopeDiagnosticsExtension.class
})
// Track scope continuations for every instrumentation test (report-only: failOnLeak defaults to
// false). @Inherited, so all subclasses get the full timeline dumped after each test.
@TrackScopeContinuations
public abstract class AbstractInstrumentationTest {
static final Instrumentation INSTRUMENTATION = ByteBuddyAgent.getInstrumentation();

@@ -0,0 +1,39 @@
package datadog.trace.agent.test.scopediag;

import net.bytebuddy.asm.Advice;

/**
* Test-only ByteBuddy advice woven into {@code datadog.trace.core.scopemanager.ContinuableScope}
* (and, by inheritance, {@code ContinuingScope}) to track the scope activation lifecycle.
*
* <p>The target type is package-private, so {@code this} is typed as {@link Object} and re-cast
* inside {@link ScopeContinuationProbe}. {@code afterActivated} is the open point (first call per
* scope identity), {@code onProperClose} the pop, and {@code close} the wrong-thread check.
*/
public final class ContinuableScopeAdvice {
private ContinuableScopeAdvice() {}

/** {@code afterActivated()} — the scope became active. */
public static final class AfterActivated {
@Advice.OnMethodExit(suppress = Throwable.class)
public static void exit(@Advice.This Object scope) {
ScopeContinuationProbe.onScopeOpen(scope);
}
}

/** {@code onProperClose()} — the scope was popped from its thread's stack. */
public static final class OnProperClose {
@Advice.OnMethodExit(suppress = Throwable.class)
public static void exit(@Advice.This Object scope) {
ScopeContinuationProbe.onScopeClose(scope);
}
}

/** {@code close()} — check for an out-of-order / wrong-thread close. */
public static final class Close {
@Advice.OnMethodEnter(suppress = Throwable.class)
public static void enter(@Advice.This Object scope) {
ScopeContinuationProbe.onScopeClosing(scope);
}
}
}
@@ -0,0 +1,76 @@
package datadog.trace.agent.test.scopediag;

import net.bytebuddy.asm.Advice;

/**
* Test-only ByteBuddy advice woven into {@code datadog.trace.core.scopemanager.ScopeContinuation}.
*
* <p>The target type is package-private and cannot be named here, so {@code this} is typed as
* {@link Object} and re-cast to the public {@code AgentScope.Continuation} supertype inside {@link
* ScopeContinuationProbe}. {@link Advice.FieldValue} reads the private {@code count} field — legal
* because the advice is inlined into the field's own class.
*/
public final class ContinuationAdvice {
private ContinuationAdvice() {}

/** {@code register()} — the continuation was captured. */
public static final class Register {
@Advice.OnMethodExit(suppress = Throwable.class)
public static void exit(@Advice.This Object self) {
ScopeContinuationProbe.onCapture(self);
}
}

/**
* {@code activate()} — a (possibly noop) activation; the probe filters the rollback branch.
*
* <p>The activation timestamp is captured at method <em>entry</em>, not exit: the same-span reuse
* optimization ({@code ContinuableScopeManager.continueSpan}) cancels the continuation from
* <em>inside</em> {@code activate()} before it returns, so timestamping the resume at exit would
* order it after that internal resolution and spuriously flag {@code ACTIVATE_AFTER_RESOLVE}.
*/
public static final class Activate {
@Advice.OnMethodEnter
public static long enter() {
return System.nanoTime();
}

@Advice.OnMethodExit(suppress = Throwable.class)
public static void exit(
@Advice.This Object self, @Advice.Enter long ddActivateNanos, @Advice.Return Object scope) {
ScopeContinuationProbe.onActivate(self, scope, ddActivateNanos);
}
}

/**
* Resolution detected via the {@code count} transition. Applied to both {@code cancel()} and
* {@code cancelFromContinuedScopeClose()} — they need identical before/after observation. The
* originating method name ({@code #m}) distinguishes an explicit cancel from a normal
* finish-on-scope-close.
*
* <p>The resolve timestamp is captured at method <em>entry</em> (the {@code ddResolveNanos}
* local), not at exit: the body itself may call {@code removeContinuation() ->
* PendingTrace.write()}, which is exactly where the root-written timestamp is taken. Timestamping
* at exit would place the resolution after the root write it triggered, producing a spurious
* late-finish.
*/
public static final class Cancel {
@Advice.OnMethodEnter
public static int enter(
@Advice.FieldValue("count") int count,
@Advice.Local("ddResolveNanos") long ddResolveNanos) {
ddResolveNanos = System.nanoTime();
return count;
}

@Advice.OnMethodExit(suppress = Throwable.class)
public static void exit(
@Advice.This Object self,
@Advice.Origin("#m") String method,
@Advice.Enter int countBefore,
@Advice.Local("ddResolveNanos") long ddResolveNanos,
@Advice.FieldValue("count") int countAfter) {
ScopeContinuationProbe.onResolve(self, method, countBefore, countAfter, ddResolveNanos);
}
}
}