
profiling: request carrier-scoped ddprof context storage on JDK 21+ #11826

Open: rkennke wants to merge 3 commits into master from rkennke/PROF-15271

@rkennke rkennke commented Jul 1, 2026

Contributor

What Does This Do

On JDK 21+, before the ddprof profiler is loaded, this:

  1. exports java.base/jdk.internal.misc to the classloader that loads com.datadoghq.profiler.*, and
  2. sets the profiler's ddprof.context.storage.mode system property to carrier (only when unset).

Together these request carrier-scoped OTEL context storage in the profiler, backed by jdk.internal.misc.CarrierThreadLocal.

Implemented as a small prepareDatadogProfilerContextStorage(Instrumentation) helper in Agent, invoked in InstallDatadogTracerCallback right before installDatadogTracer (which triggers the profiler load, so the export/property are in place before JavaProfiler is constructed).
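
A minimal sketch of the two steps, for illustration only. It is not the code in this PR: the class and method names are hypothetical, it uses raw `System` properties where the PR uses `SystemProperties`, and it assumes the profiler classes live in the unnamed module of a known classloader.

```java
import java.lang.instrument.Instrumentation;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; names are not taken from dd-trace-java.
final class ContextStorageSketch {
  // Property name as given in the PR description.
  static final String MODE_PROPERTY = "ddprof.context.storage.mode";

  /** Requests carrier mode only when nothing was configured; returns the effective value. */
  static String requestCarrierMode() {
    String existing = System.getProperty(MODE_PROPERTY);
    if (existing != null) {
      return existing; // an explicit operator choice always wins
    }
    System.setProperty(MODE_PROPERTY, "carrier");
    return "carrier";
  }

  /** Exports java.base/jdk.internal.misc to the unnamed module of the given loader. */
  static void exportInternalMisc(Instrumentation inst, ClassLoader profilerLoader) {
    Module javaBase = Object.class.getModule();
    Module target = profilerLoader.getUnnamedModule();
    Map<String, Set<Module>> extraExports = Map.of("jdk.internal.misc", Set.of(target));
    // Only the exports argument is populated; reads, opens, uses and provides stay empty.
    inst.redefineModule(javaBase, Set.of(), extraExports, Map.of(), Set.of(), Map.of());
  }
}
```

Both calls would have to run before the profiler class is first touched, which is why the helper is invoked ahead of installDatadogTracer.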

Motivation

The profiler exposes its OTEL context ThreadContext as a DirectByteBuffer over the carrier thread's native record, which is the record the (carrier-bound) sampler reads. When that storage is keyed by the virtual thread (the current default), a vthread's buffer stays bound to the record of whichever carrier it first ran on:

  • Wrong after migration — writes land on the old carrier, so a sampler on the new carrier sees stale/empty context; and
  • Unsafe once the old carrier's OS thread exits — the native record is freed while the buffer keeps being written, a use-after-free that can corrupt JVM-owned native memory (seen in the field as a SIGSEGV in ThreadsSMRSupport::free_list under a Loom workload).

The java-profiler fix (PROF-15271) adds carrier-scoped storage via CarrierThreadLocal, so a mounted vthread always resolves to its current carrier's live record. CarrierThreadLocal is in a non-exported package, hence the export; and the profiler defaults to auto (graceful), so to get the strict carrier mode (which fails fast if it cannot be honored) dd-trace-java must request it explicitly.
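
To make the stale-binding problem concrete, here is a toy model in plain Java. It is an assumption-laden simplification: carriers and vthreads are just string names, the "native record" is an ordinary object, and nothing here reflects the profiler's real data structures. It only contrasts resolving the record once per virtual thread with resolving it on every write.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only; all names are hypothetical.
final class ContextStorageModel {
  /** Stand-in for a carrier thread's native context record. */
  static final class CarrierRecord {
    String context; // what a sampler running on this carrier would read
  }

  private final Map<String, CarrierRecord> recordsByCarrier = new HashMap<>();
  private final Map<String, CarrierRecord> boundByVthread = new HashMap<>();

  private CarrierRecord recordOf(String carrier) {
    return recordsByCarrier.computeIfAbsent(carrier, c -> new CarrierRecord());
  }

  /** Thread-keyed: the record is resolved on the vthread's first write, then reused. */
  void writeThreadKeyed(String vthread, String currentCarrier, String context) {
    CarrierRecord bound = boundByVthread.computeIfAbsent(vthread, v -> recordOf(currentCarrier));
    bound.context = context;
  }

  /** Carrier-keyed: the record is resolved from the current carrier on every write. */
  void writeCarrierKeyed(String currentCarrier, String context) {
    recordOf(currentCarrier).context = context;
  }

  /** What a sampler bound to the given carrier observes. */
  String sample(String carrier) {
    return recordOf(carrier).context;
  }
}
```

If vthread v1 writes on carrier A and then again after migrating to carrier B, the thread-keyed variant leaves B's record empty and keeps updating A's, while the carrier-keyed variant updates B's. The model does not capture the use-after-free, which additionally requires A's record to be freed when its OS thread exits.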

Additional Notes

  • Safe to ship ahead of the profiler bump. Both actions are inert against the currently-bundled ddprof (1.44.0): the property is ignored and the export goes unused. They activate automatically when a ddprof release containing the fix is bumped in libs.versions.toml — that version bump is the real activation gate and where CARRIER + fail-fast should be validated on JDK 21+.
  • Fail-fast is inherited, not implemented here. With mode=carrier, the profiler's create() throws if CarrierThreadLocal is inaccessible; the existing initJavaProfiler() try/catch turns that into reasonNotLoaded UnsupportedOperationException, so ddprof disables itself cleanly rather than crashing the app.
  • Single kill-switch, no new config key. Operators disable/override via the profiler's own -Dddprof.context.storage.mode=thread (or =auto); we only set the property when unset, so an explicit choice always wins. No dd.* passthrough was added (a thin one is a trivial follow-up if env-var/remote-config control is later wanted).
  • Guarded by isProfilingEnabled() && isDatadogProfilerEnabled() && !isWindows() && isJavaVersionAtLeast(21). JDK9ModuleAccess (internal-api-9) is only referenced under the 21+ guard, so it never loads on Java 8. Uses SystemProperties.get/set (not raw System.*).
  • Exporting jdk.internal.misc via redefineModule follows existing precedent (AdvancedAgentChecks does the same for CDS detection).
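
The guard and the inherited fail-fast behaviour can be sketched roughly as follows. This is a hedged illustration, not the agent's code: the class, the `LoadResult` holder, and the boolean parameters are hypothetical stand-ins for isProfilingEnabled(), isDatadogProfilerEnabled(), isWindows() and isJavaVersionAtLeast(21), and the try/catch only mimics the shape of what initJavaProfiler() is described as doing.

```java
import java.util.Locale;
import java.util.function.Supplier;

// Hypothetical illustration; names are not taken from dd-trace-java.
final class ProfilerGateSketch {
  /** Combines the four conditions listed above into one decision. */
  static boolean shouldRequestCarrierStorage(
      boolean profilingEnabled, boolean ddprofEnabled, String osName, int javaFeatureVersion) {
    boolean windows = osName.toLowerCase(Locale.ROOT).contains("windows");
    return profilingEnabled && ddprofEnabled && !windows && javaFeatureVersion >= 21;
  }

  /** Holds either a loaded profiler or the reason it could not be loaded. */
  static final class LoadResult<T> {
    final T profiler;
    final Throwable reasonNotLoaded;

    LoadResult(T profiler, Throwable reasonNotLoaded) {
      this.profiler = profiler;
      this.reasonNotLoaded = reasonNotLoaded;
    }
  }

  /** Runs a factory that may fail fast and records the failure instead of propagating it. */
  static <T> LoadResult<T> tryLoad(Supplier<T> factory) {
    try {
      return new LoadResult<>(factory.get(), null);
    } catch (Throwable t) {
      // The profiler is treated as unavailable; the application keeps running.
      return new LoadResult<>(null, t);
    }
  }
}
```

In a real caller the inputs could come from `System.getProperty("os.name")` and `Runtime.version().feature()`; they are passed in here so the decision stays easy to test.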

Follow-up: bump ddprof in libs.versions.toml to the release containing the fix (separate PR), plus a JDK 21+ test asserting carrier mode is active once bumped.

Contributor Checklist

  • Title formatted per contribution guidelines
  • Assign type: / comp: labels
  • No close/fix linking keywords used

Jira ticket: PROF-15271

🤖 Generated with Claude Code

…PROF-15271)

The ddprof profiler exposes its OTEL context ThreadContext as a DirectByteBuffer
over the carrier thread's native record. When that storage is keyed by the
virtual thread (the pre-fix default), a mounted vthread pins to its first
carrier: writes land on the wrong carrier after migration, and become a
use-after-free once that carrier's OS thread exits. The java-profiler fix
(PROF-15271) adds carrier-scoped storage via jdk.internal.misc.CarrierThreadLocal,
selected by the ddprof.context.storage.mode system property.

Before the profiler is loaded, on JDK 21+ (profiler enabled, non-Windows):
- export java.base/jdk.internal.misc to the classloader that loads
  com.datadoghq.profiler.* so CarrierThreadLocal is reachable, and
- set ddprof.context.storage.mode=carrier (only if unset), which requests
  carrier scoping and makes the profiler fail fast if it cannot honor it.

Both actions are inert against ddprof builds that predate carrier support (the
property is ignored; the export goes unused), so this is safe to ship ahead of
the ddprof version bump — it activates automatically when that lands.

Operators opt out with -Dddprof.context.storage.mode=thread (or =auto); we only
set the property when unset, so an explicit choice always wins. No separate dd.*
config key: the profiler's own system property is the single kill-switch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

datadog-prod-us1-3 Bot commented Jul 1, 2026

🎯 Code Coverage (details)
Patch Coverage: 0.00%
Overall Coverage: 55.02% (-0.02%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 7c7fc81

dd-octo-sts Bot commented Jul 1, 2026

Contributor

🟢 Java Benchmark SLOs — All performance SLOs passed

| Suite | Status |
|---|---|
| Startup | 🟢 pass |

SLO thresholds are defined based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results

| Scenario | Candidate | master | Δ (95% CI of mean) |
|---|---|---|---|
| startup:insecure-bank:iast:Agent | 13.93 s | 13.99 s | [-1.3%; +0.4%] (no difference) |
| startup:insecure-bank:tracing:Agent | 12.93 s | 13.02 s | [-1.6%; +0.1%] (no difference) |
| startup:petclinic:appsec:Agent | 16.91 s | 16.53 s | [+1.5%; +3.2%] (significantly worse) |
| startup:petclinic:iast:Agent | 16.79 s | 16.87 s | [-1.2%; +0.4%] (no difference) |
| startup:petclinic:profiling:Agent | 16.70 s | 16.80 s | [-1.5%; +0.3%] (no difference) |
| startup:petclinic:sca:Agent | 16.92 s | 16.60 s | [+0.9%; +2.9%] (maybe worse) |
| startup:petclinic:tracing:Agent | 16.00 s | 16.23 s | [-2.6%; -0.2%] (maybe better) |

Commit: 7c7fc819


Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

@rkennke rkennke added the comp: profiling and type: bug labels Jul 1, 2026
…dogTracer

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rkennke rkennke added the tag: no release notes label Jul 1, 2026
@rkennke rkennke marked this pull request as ready for review July 1, 2026 12:18
@rkennke rkennke requested a review from a team as a code owner July 1, 2026 12:18
@rkennke rkennke requested a review from ygree July 1, 2026 12:18
@dd-octo-sts dd-octo-sts Bot added the tag: ai generated label Jul 1, 2026
…ROF-15271)

The java-profiler side renamed the selector to ddprof.debug.context.storage.mode
(ddprof.debug.*, signalling an internal knob). Match the property name we export
and set here so the carrier request still takes effect.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>