DataDog · amarziali · Jun 11, 2026 · Jun 12, 2026
@@ -0,0 +1,143 @@
+---
+name: investigate-continuation-leakage
+description: >
+  Investigate scope-continuation leaks in an instrumentation. Use when asked to "investigate
+  continuation leakage", "find scope leaks", "why does this integration leak continuations",
+  "debug a leaked trace / pendingReferenceCount", or when a test needed strictTraceWrites(false)
+  to pass. Runs the chosen instrumentation test with the scope-continuation diagnostic enabled,
+  reads the logged full timeline, recaps the findings, and renders a Gantt or DAG (works whether
+  or not anything leaked).
+user-invocable: true
+context: fork
+allowed-tools:
+  - Bash
+  - Read
+  - Edit
+  - Glob
+  - Grep
+  - AskUserQuestion
+---
+
+# Investigate scope-continuation leakage
+
+dd-trace-java moves trace scopes across threads via *continuations*: a scope is **captured** on
+one thread (`ScopeContinuation`, bumping `PendingTrace.pendingReferenceCount`) and later
+**activated** and/or **cancelled** on another. A continuation that is never resolved (the classic
+leak), resolved twice, resolved after its root span was written, or activated after resolve, keeps
+a trace alive or drops a late span — and in tests forces `strictTraceWrites(false)`, masking the
+bug instead of locating it.
+
+The test-time diagnostic in `datadog.trace.agent.test.scopediag` records the lifecycle and logs a
+full timeline of every continuation and scope (regardless of whether anything leaked). This skill
+drives that diagnostic, reads the logged timeline, recaps it in plain language, and renders a
+diagram. **The Java code no longer renders Gantt/Mermaid — you (the LLM) produce the diagram from
+the timeline.**
+
+Background: `docs/superpowers/specs/2026-06-10-scope-continuation-leak-diagnostic-design.md`.
+Test-run conventions: `docs/how_to_test.md`.
+
+## Step 1 — Select the target
+
+Identify the suspect instrumentation. If the user named it, resolve the module directory under
+`dd-java-agent/instrumentation/<framework>/<framework>-<minVersion>` with Glob. If ambiguous, list
+the candidate test classes (Glob `**/src/test/**/*Test.{java,groovy}` in the module) and ask the
+user which test to run with `AskUserQuestion`. You want one concrete test class (and ideally one
+method) plus its Gradle module path, e.g. `:dd-java-agent:instrumentation:google-pubsub-1.116`.
+
+> **Note:** tracking is now **always-on** for every instrumentation test (`@TrackScopeContinuations`
+> sits on the `AbstractInstrumentationTest` / `InstrumentationSpecification` base classes, report-only).
+> If that base-class annotation is present, **skip Step 2** — just run the test (Step 3) and read the
+> timeline. Only do Step 2 when tracking is *not* already inherited (e.g. the base annotation was
+> removed) or you want method-level `failOnLeak=true` enforcement.
+
+## Step 2 — Enable tracking (only if not already inherited)
+
+With `Edit`, add the opt-in annotation to the chosen test class (or a single method):
+
+- Add import `datadog.trace.agent.test.scopediag.TrackScopeContinuations`.
+- Annotate the class/method with `@TrackScopeContinuations`. Leave the default `failOnLeak=false` —
+  you want the report, not a failing test (a red test would still print the report, but the default
+  keeps the run green so the build doesn't stop early).
+
+This works for both JUnit 5 Java tests (extension is auto-registered on
+`AbstractInstrumentationTest`) and Groovy `InstrumentationSpecification` subclasses. **Record the
+exact file path** — you will revert it in Step 7.
+
+## Step 3 — Run the test, capturing the diagnostic output
+
+The diagnostic does **not** write a file; at the end of every tracked test it logs the **full
+timeline** (`ScopeDiagnosticsReport.renderTimeline()`) — every continuation and scope with its
+events, threads, relative timing, and callsites, regardless of whether anything leaked. Run with
+output captured:
+
+```bash
+./gradlew :dd-java-agent:instrumentation:<framework>-<minVersion>:test --tests '<FQCN-or-pattern>' --info 2>&1 | tee /tmp/scopediag-run.txt
+```
+
+(For the diagnostic harness's own tests the module is `:dd-java-agent:instrumentation-testing`.)
+If the timeline is not visible in the console output, read the per-test captured stdout under
+`<module>/build/test-results/**/*.xml` (the `<system-out>` element) or the HTML report under
+`<module>/build/reports/tests/`.
+
+## Step 4 — Collect the diagnostic output
+
+Grep the captured output for `Scope/continuation timeline` — one block per test. Shape:
+
+```
+Scope/continuation timeline (N continuations, M scopes; X leaked, Y late, Z double,
+    W activate-after-resolve | scopes: P never-closed, Q wrong-thread)
+
+#<seq> <STATUS> trace=<id> span=<id> "<spanName>" src=<INSTRUMENTATION|MANUAL|ITERATION|CONTEXT> [ORPHAN] [handoff] {failures}  cap->resume=<ms>  age=<ms>
+  capture   +<Δms>  @ <thread>  at <Class.method(File.java:line)>
+  resume    +<Δms>  @ <thread>  at <...>
+  finish    +<Δms>  @ <thread>  at <...>          (or cancel / DOUBLE / act-fail)
+  scope#<seq> <src> "<spanName>"  open +<Δms> @ <thread>  close +<Δms> @ <thread> (active <ms>) [handoff] {failures}
+  LEAKED   (never finished or cancelled)          (only when unresolved)
+...
+Non-continuation scopes:
+  scope#<seq> ...
+```
+
+Every record is listed (not just flagged ones), so you can reconstruct the full graph whether or not
+anything leaked. `+Δms` is relative to the first recorded event. (`renderSummary()` — the
+problem-only view — still backs `assertNoLeaks` failure messages, but the timeline is the feed.)
+
+## Step 5 — Summarize (recap)
+
+Give a plain-language recap:
+
+- The header counts (leaked / late / double / activate-after-resolve / never-closed / wrong-thread).
+- The dominant flow: where continuations are captured (callsite/thread) and where they're resumed /
+  resolved (thread), plus any thread handoffs.
+- For each flagged record (if any): its failure set and capture/open callsite (cite `file:line`).
+- A one-line hypothesis when there's a problem: which advice captured the continuation and where it
+  should have resolved it.
+
+## Step 6 — Visualize (auto-pick, user may override)
+
+Build the diagram from the **timeline** (works whether or not there are leaks):
+
+- **Gantt** — when the signal is **temporal / cross-thread** (thread handoffs, late-after-root,
+  never-closed, or the user wants the time view). Mermaid `gantt`, one `section` per thread; a bar
+  per continuation from capture→resolve and per scope from open→close using the `+Δms` offsets. Mark
+  leaks / never-closed `crit` to the window end; late / wrong-thread `active`; resolved-on-time
+  `done`; capture-only points as `milestone`.
+- **DAG** — when the signal is **structural / ownership** (orphans, double-finish,
+  activate-after-resolve, or continuation→scope lineage). Mermaid `flowchart LR`: a node per
+  continuation (`#seq spanName`), its spawned scopes (linked via the nested `scope#` lines), edges
+  capture→resume→resolve labelled with thread + `+Δms`. Color leaked / double red, late amber,
+  resolved green.
+
+If there are no problems, the diagram simply shows the healthy capture→resume→resolve flow (all
+green); that is the expected "regardless of leak" output. If unsure which shape, ask with
+`AskUserQuestion`.
+
+## Step 7 — Revert
+
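+If Step 2 edited a test file, undo that temporary annotation so the working tree is clean; skip
+this step when tracking was inherited and nothing was edited. Note that this discards every
+uncommitted change in that file: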
+
+```bash
+git checkout -- <test file path from Step 2>
+```
+
+Report: the findings summary, the diagram, and whether the annotation was added and reverted.
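
The header shape from Step 4 and the auto-pick rule from Step 6 can be sketched in Java as follows. This is a minimal sketch and not part of the PR: `TimelineHeaderSketch`, `parse`, and `pickDiagram` are hypothetical names, the regular expression assumes the header follows the documented shape exactly (wrapped or on one line), and defaulting to Gantt when both or neither kind of signal is present is an assumption of this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of dd-trace-java.
final class TimelineHeaderSketch {
  // Mirrors the header shape documented in Step 4; assumes the wording never varies.
  private static final Pattern HEADER =
      Pattern.compile(
          "Scope/continuation timeline \\((\\d+) continuations, (\\d+) scopes; "
              + "(\\d+) leaked, (\\d+) late, (\\d+) double, (\\d+) activate-after-resolve "
              + "\\| scopes: (\\d+) never-closed, (\\d+) wrong-thread\\)");

  /** Returns the eight counts in header order, or null when no header is present. */
  public static int[] parse(String capturedOutput) {
    // Collapse line wraps and indentation so a wrapped header still matches.
    String flat = capturedOutput.replaceAll("\\s+", " ");
    Matcher m = HEADER.matcher(flat);
    if (!m.find()) {
      return null;
    }
    int[] counts = new int[8];
    for (int i = 0; i < counts.length; i++) {
      counts[i] = Integer.parseInt(m.group(i + 1));
    }
    return counts;
  }

  /** Picks a diagram shape from the header counts alone (assumed tie-break: Gantt). */
  public static String pickDiagram(int[] c) {
    boolean temporal = c[3] > 0 || c[6] > 0; // late, never-closed
    boolean structural = c[4] > 0 || c[5] > 0; // double, activate-after-resolve
    return (structural && !temporal) ? "DAG" : "Gantt";
  }

  public static void main(String[] args) {
    String sample =
        "Scope/continuation timeline (3 continuations, 4 scopes; 1 leaked, 0 late, 0 double,\n"
            + "    0 activate-after-resolve | scopes: 2 never-closed, 0 wrong-thread)";
    System.out.println(pickDiagram(parse(sample)));
  }
}
```

Thread handoffs and orphans appear only as per-record flags, not in the header, so a fuller version would also scan the record lines before choosing a shape.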
@@ -34,6 +34,8 @@ import datadog.metrics.impl.DDSketchHistograms
 import datadog.metrics.impl.MonitoringImpl
 import datadog.trace.agent.test.asserts.ListWriterAssert
 import datadog.trace.agent.test.asserts.TagsAssert
+import datadog.trace.agent.test.scopediag.ScopeDiagnostics
+import datadog.trace.agent.test.scopediag.TrackScopeContinuations
 import datadog.trace.agent.test.datastreams.MockFeaturesDiscovery
 import datadog.trace.agent.test.datastreams.RecordingDatastreamsPayloadWriter
 import datadog.trace.agent.tooling.AgentInstaller
@@ -112,6 +114,9 @@ import spock.lang.Shared
 @SuppressWarnings('UnnecessaryDotClass')
 @ExtendWith(TestClassShadowingExtension.class)
 @ExtendWith(TooManyInvocationsErrorHandler.class)
+// Track scope continuations for every instrumentation spec (report-only: failOnLeak defaults to
+// false). @Inherited, so all specs get the full timeline dumped after each test.
+@TrackScopeContinuations
 abstract class InstrumentationSpecification extends DDSpecification implements AgentBuilder.Listener {
   private static final long TIMEOUT_MILLIS = TimeUnit.SECONDS.toMillis(20)
 
@@ -466,6 +471,9 @@ abstract class InstrumentationSpecification extends DDSpecification implements A
     }
 
     TEST_WRITER.start()
+    if (scopeDiagConfig() != null) {
+      ScopeDiagnostics.startRecording()
+    }
     TEST_DATA_STREAMS_WRITER.clear()
     TEST_DATA_STREAMS_MONITORING.clear()
 
@@ -499,6 +507,8 @@ abstract class InstrumentationSpecification extends DDSpecification implements A
     }
     TEST_TRACER.flush()
 
+    reportScopeDiagnostics()
+
     def util = new MockUtil()
     util.detachMock(STATS_D_CLIENT)
 
@@ -522,6 +532,34 @@ abstract class InstrumentationSpecification extends DDSpecification implements A
     assert InstrumentationErrors.noErrors(): InstrumentationErrors.describeErrors()
   }
 
+  /** Resolves the {@link TrackScopeContinuations} annotation from the feature method or spec class. */
+  private TrackScopeContinuations scopeDiagConfig() {
+    def method = specificationContext?.currentFeature?.featureMethod?.reflection
+    def ann = method?.getAnnotation(TrackScopeContinuations)
+    if (ann == null) {
+      ann = this.class.getAnnotation(TrackScopeContinuations)
+    }
+    return ann
+  }
+
+  private void reportScopeDiagnostics() {
+    def config = scopeDiagConfig()
+    if (config == null) {
+      return
+    }
+    try {
+      ScopeDiagnostics.stop()
+      // Always dump the full timeline so a graph/report can be built regardless of leaks.
+      println(ScopeDiagnostics.report().renderTimeline())
+      if (config.failOnLeak()) {
+        ScopeDiagnostics.assertNoLeaks()
+      }
+    } finally {
+      // Reset even if reporting or the assertion throws, so state never carries into the next test.
+      ScopeDiagnostics.reset()
+    }
+  }
+
   private void doCheckRepeatedFinish() {
     for (Map.Entry<DDSpan, List<Exception>> entry: this.spanFinishLocations.entrySet()) {
       if (entry.value.size() == 1) {

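The report-only versus `failOnLeak` behaviour of `reportScopeDiagnostics()` can be illustrated with a small, self-contained Java sketch. `ReportFlowSketch` and its nested `Recorder` are hypothetical stand-ins and do not reproduce the real `ScopeDiagnostics` API. Placing the whole stop, render, assert sequence inside the `try` is a choice of this sketch, made so that the reset always runs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the reporting flow; not the real ScopeDiagnostics class.
final class ReportFlowSketch {
  public static final class Recorder {
    public final List<String> leaks = new ArrayList<>();
    public boolean recording = true;

    void stop() {
      recording = false;
    }

    String renderTimeline() {
      return "timeline: " + leaks.size() + " leaked";
    }

    void assertNoLeaks() {
      if (!leaks.isEmpty()) {
        throw new AssertionError("leaked continuations: " + leaks);
      }
    }

    void reset() {
      leaks.clear();
    }
  }

  /** Returns the timeline; throws only when failOnLeak is set and something leaked. */
  public static String report(Recorder recorder, boolean failOnLeak) {
    try {
      recorder.stop();
      String timeline = recorder.renderTimeline(); // rendered whether or not anything leaked
      if (failOnLeak) {
        recorder.assertNoLeaks();
      }
      return timeline;
    } finally {
      recorder.reset(); // always runs, so one test's state never reaches the next
    }
  }

  public static void main(String[] args) {
    Recorder recorder = new Recorder();
    recorder.leaks.add("#1 captured in SomeAdvice"); // made-up record for the demo
    System.out.println(report(recorder, false));
  }
}
```

With `failOnLeak` left at `false` the call returns the timeline and the run stays green; with `true` the same leak surfaces as an `AssertionError`, and the recorder is cleared in both cases.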
@@ -7,6 +7,8 @@
 import datadog.instrument.classinject.ClassInjector;
 import datadog.trace.agent.test.assertions.TraceAssertions;
 import datadog.trace.agent.test.assertions.TraceMatcher;
+import datadog.trace.agent.test.scopediag.ScopeDiagnosticsExtension;
+import datadog.trace.agent.test.scopediag.TrackScopeContinuations;
 import datadog.trace.agent.tooling.AgentInstaller;
 import datadog.trace.agent.tooling.InstrumenterModule;
 import datadog.trace.agent.tooling.TracerInstaller;
@@ -54,7 +56,14 @@
  * </ul>
  */
 @WithConfig(key = "detailed.instrumentation.errors", value = "true")
-@ExtendWith({TestClassShadowingExtension.class, AllowContextTestingExtension.class})
+@ExtendWith({
+  TestClassShadowingExtension.class,
+  AllowContextTestingExtension.class,
+  ScopeDiagnosticsExtension.class
+})
+// Track scope continuations for every instrumentation test (report-only: failOnLeak defaults to
+// false). @Inherited, so all subclasses get the full timeline dumped after each test.
+@TrackScopeContinuations
 public abstract class AbstractInstrumentationTest {
   static final Instrumentation INSTRUMENTATION = ByteBuddyAgent.getInstrumentation();
 

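Both base classes rely on an `@Inherited` class-level annotation, and the Spock side additionally checks the feature method first. The lookup order can be sketched with plain reflection. This is a sketch under stated assumptions: `Track`, `BaseTest`, `MyTest`, and `resolve` are hypothetical names that only imitate `@TrackScopeContinuations` and `scopeDiagConfig()`.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Inherited;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical types that imitate the annotation lookup; not the PR's classes.
final class AnnotationLookupSketch {
  @Retention(RetentionPolicy.RUNTIME)
  @Target({ElementType.TYPE, ElementType.METHOD})
  @Inherited // class-level use is visible on subclasses
  public @interface Track {
    boolean failOnLeak() default false;
  }

  @Track // report-only default for every subclass
  public abstract static class BaseTest {}

  public static class MyTest extends BaseTest {
    @Track(failOnLeak = true) // method-level override
    public void strictCase() {}

    public void plainCase() {}
  }

  /** Method annotation wins; otherwise fall back to the (possibly inherited) class annotation. */
  public static Track resolve(Class<?> testClass, String methodName) {
    try {
      Method method = testClass.getMethod(methodName);
      Track onMethod = method.getAnnotation(Track.class);
      return onMethod != null ? onMethod : testClass.getAnnotation(Track.class);
    } catch (NoSuchMethodException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(resolve(MyTest.class, "strictCase").failOnLeak());
    System.out.println(resolve(MyTest.class, "plainCase").failOnLeak());
  }
}
```

Note that `@Inherited` only affects class-level lookups through superclasses; an annotation on an overridden method is not inherited, which is why the method check and the class fallback are separate steps.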
@@ -0,0 +1,39 @@
+package datadog.trace.agent.test.scopediag;
+
+import net.bytebuddy.asm.Advice;
+
+/**
+ * Test-only ByteBuddy advice woven into {@code datadog.trace.core.scopemanager.ContinuableScope}
+ * (and, by inheritance, {@code ContinuingScope}) to track the scope activation lifecycle.
+ *
+ * <p>The target type is package-private, so {@code this} is typed as {@link Object} and re-cast
+ * inside {@link ScopeContinuationProbe}. {@code afterActivated} is the open point (first call per
+ * scope identity), {@code onProperClose} the pop, and {@code close} the wrong-thread check.
+ */
+public final class ContinuableScopeAdvice {
+  private ContinuableScopeAdvice() {}
+
+  /** {@code afterActivated()} — the scope became active. */
+  public static final class AfterActivated {
+    @Advice.OnMethodExit(suppress = Throwable.class)
+    public static void exit(@Advice.This Object scope) {
+      ScopeContinuationProbe.onScopeOpen(scope);
+    }
+  }
+
+  /** {@code onProperClose()} — the scope was popped from its thread's stack. */
+  public static final class OnProperClose {
+    @Advice.OnMethodExit(suppress = Throwable.class)
+    public static void exit(@Advice.This Object scope) {
+      ScopeContinuationProbe.onScopeClose(scope);
+    }
+  }
+
+  /** {@code close()} — check for an out-of-order / wrong-thread close. */
+  public static final class Close {
+    @Advice.OnMethodEnter(suppress = Throwable.class)
+    public static void enter(@Advice.This Object scope) {
+      ScopeContinuationProbe.onScopeClosing(scope);
+    }
+  }
+}
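
The three hooks above (open on first activation per scope identity, proper close, and a wrong-thread check on `close()`) suggest bookkeeping along the following lines. The real `ScopeContinuationProbe` is not shown in this diff, so everything here is an assumption: `ScopeLifecycleSketch` is a hypothetical class, it keys scopes by identity, and it treats "opened but never properly closed" as never-closed.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical bookkeeping; the real ScopeContinuationProbe may work differently.
final class ScopeLifecycleSketch {
  private final Map<Object, Thread> openedOn = new IdentityHashMap<>();
  private final Set<Object> closed = Collections.newSetFromMap(new IdentityHashMap<>());
  private int wrongThreadCloses;

  /** Open point: only the first call per scope identity is recorded. */
  public synchronized void onScopeOpen(Object scope) {
    openedOn.putIfAbsent(scope, Thread.currentThread());
  }

  /** Entry to close(): flags a close attempted from a thread other than the opener. */
  public synchronized void onScopeClosing(Object scope) {
    Thread opener = openedOn.get(scope);
    if (opener != null && opener != Thread.currentThread()) {
      wrongThreadCloses++;
    }
  }

  /** The scope was popped from its thread's stack. */
  public synchronized void onScopeClose(Object scope) {
    if (openedOn.containsKey(scope)) {
      closed.add(scope);
    }
  }

  public synchronized int neverClosed() {
    return openedOn.size() - closed.size();
  }

  public synchronized int wrongThreadCloses() {
    return wrongThreadCloses;
  }

  /** Closes one scope properly and attempts to close another from a different thread. */
  public static ScopeLifecycleSketch demo() {
    ScopeLifecycleSketch probe = new ScopeLifecycleSketch();
    Object ok = new Object();
    Object stray = new Object();
    probe.onScopeOpen(ok);
    probe.onScopeOpen(ok); // repeated activation of the same identity is ignored
    probe.onScopeOpen(stray);
    probe.onScopeClosing(ok);
    probe.onScopeClose(ok);
    Thread other = new Thread(() -> probe.onScopeClosing(stray));
    other.start();
    try {
      other.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return probe;
  }

  public static void main(String[] args) {
    ScopeLifecycleSketch probe = demo();
    System.out.println(
        "never-closed=" + probe.neverClosed() + " wrong-thread=" + probe.wrongThreadCloses());
  }
}
```

Identity-based maps are used because two distinct scope objects may compare equal or share a span, while the lifecycle being tracked belongs to the individual object.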
@@ -0,0 +1,76 @@
+package datadog.trace.agent.test.scopediag;
+
+import net.bytebuddy.asm.Advice;
+
+/**
+ * Test-only ByteBuddy advice woven into {@code datadog.trace.core.scopemanager.ScopeContinuation}.
+ *
+ * <p>The target type is package-private and cannot be named here, so {@code this} is typed as
+ * {@link Object} and re-cast to the public {@code AgentScope.Continuation} supertype inside {@link
+ * ScopeContinuationProbe}. {@link Advice.FieldValue} reads the private {@code count} field — legal
+ * because the advice is inlined into the field's own class.
+ */
+public final class ContinuationAdvice {
+  private ContinuationAdvice() {}
+
+  /** {@code register()} — the continuation was captured. */
+  public static final class Register {
+    @Advice.OnMethodExit(suppress = Throwable.class)
+    public static void exit(@Advice.This Object self) {
+      ScopeContinuationProbe.onCapture(self);
+    }
+  }
+
+  /**
+   * {@code activate()} — a (possibly noop) activation; the probe filters the rollback branch.
+   *
+   * <p>The activation timestamp is captured at method <em>entry</em>, not exit: the same-span reuse
+   * optimization ({@code ContinuableScopeManager.continueSpan}) cancels the continuation from
+   * <em>inside</em> {@code activate()} before it returns, so timestamping the resume at exit would
+   * order it after that internal resolution and spuriously flag {@code ACTIVATE_AFTER_RESOLVE}.
+   */
+  public static final class Activate {
+    @Advice.OnMethodEnter
+    public static long enter() {
+      return System.nanoTime();
+    }
+
+    @Advice.OnMethodExit(suppress = Throwable.class)
+    public static void exit(
+        @Advice.This Object self, @Advice.Enter long ddActivateNanos, @Advice.Return Object scope) {
+      ScopeContinuationProbe.onActivate(self, scope, ddActivateNanos);
+    }
+  }
+
+  /**
+   * Resolution detected via the {@code count} transition. Applied to both {@code cancel()} and
+   * {@code cancelFromContinuedScopeClose()} — they need identical before/after observation. The
+   * originating method name ({@code #m}) distinguishes an explicit cancel from a normal
+   * finish-on-scope-close.
+   *
+   * <p>The resolve timestamp is captured at method <em>entry</em> (the {@code ddResolveNanos}
+   * local), not at exit: the body itself may call {@code removeContinuation() ->
+   * PendingTrace.write()}, which is exactly where the root-written timestamp is taken. Timestamping
+   * at exit would place the resolution after the root write it triggered, producing a spurious
+   * late-finish.
+   */
+  public static final class Cancel {
+    @Advice.OnMethodEnter
+    public static int enter(
+        @Advice.FieldValue("count") int count,
+        @Advice.Local("ddResolveNanos") long ddResolveNanos) {
+      ddResolveNanos = System.nanoTime();
+      return count;
+    }
+
+    @Advice.OnMethodExit(suppress = Throwable.class)
+    public static void exit(
+        @Advice.This Object self,
+        @Advice.Origin("#m") String method,
+        @Advice.Enter int countBefore,
+        @Advice.Local("ddResolveNanos") long ddResolveNanos,
+        @Advice.FieldValue("count") int countAfter) {
+      ScopeContinuationProbe.onResolve(self, method, countBefore, countAfter, ddResolveNanos);
+    }
+  }
+}
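
The Javadoc's argument for timestamping at method entry can be checked with a small model. This is a sketch under stated assumptions, not the PR's code: `EntryTimestampSketch` and `FakeContinuation` are hypothetical, a logical counter replaces `System.nanoTime()` so the ordering is deterministic, and the `count` semantics (1 = pending, 0 = resolved) are invented for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model; the count semantics here are not those of the real ScopeContinuation.
final class EntryTimestampSketch {
  // Logical clock standing in for System.nanoTime() so the ordering is deterministic.
  static final AtomicLong CLOCK = new AtomicLong();

  public static final class FakeContinuation {
    int count = 1; // assumption: 1 = pending, 0 = resolved
    long rootWrittenAt = -1;

    void cancel() {
      if (count == 1) {
        count = 0;
        // Stands in for the root write that the method body itself may trigger.
        rootWrittenAt = CLOCK.incrementAndGet();
      }
    }
  }

  public static final class Observation {
    public final boolean resolvedNow;
    public final boolean lateByEntryStamp;
    public final boolean lateByExitStamp;

    Observation(boolean resolvedNow, boolean lateByEntryStamp, boolean lateByExitStamp) {
      this.resolvedNow = resolvedNow;
      this.lateByEntryStamp = lateByEntryStamp;
      this.lateByExitStamp = lateByExitStamp;
    }
  }

  /** Wraps cancel() the way enter/exit advice would: read count and clock before and after. */
  public static Observation observeCancel(FakeContinuation c) {
    int countBefore = c.count;
    long entryStamp = CLOCK.incrementAndGet(); // taken at method entry
    c.cancel();
    long exitStamp = CLOCK.incrementAndGet(); // what an exit-only probe would record
    int countAfter = c.count;
    boolean resolvedNow = countBefore != countAfter; // a transition means this call resolved it
    boolean rootWritten = c.rootWrittenAt >= 0;
    return new Observation(
        resolvedNow,
        resolvedNow && rootWritten && entryStamp > c.rootWrittenAt,
        resolvedNow && rootWritten && exitStamp > c.rootWrittenAt);
  }

  public static void main(String[] args) {
    FakeContinuation c = new FakeContinuation();
    Observation first = observeCancel(c);
    System.out.println("entry-stamped late: " + first.lateByEntryStamp);
    System.out.println("exit-stamped late: " + first.lateByExitStamp);
  }
}
```

In this model the entry stamp precedes the root write that the call itself triggers, so the resolution is not classified as late, whereas an exit stamp would follow the write and produce the spurious late-finish the Javadoc describes. A second `cancel()` on the same object shows no `count` transition, which is how a repeated resolve can be told apart from the first.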