Skip to content

Store known span tags densely in TagMap by tag-id#11814

Draft
dougqh wants to merge 14 commits into
masterfrom
dougqh/dense-store
Draft

Store known span tags densely in TagMap by tag-id#11814
dougqh wants to merge 14 commits into
masterfrom
dougqh/dense-store

Conversation

@dougqh

@dougqh dougqh commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Draft for preservation — not for merge yet. Per plan: lands after next week's release, with both the read-through and dense pieces behind experimental flags (off by default, flipped on internally to bake).

This branch is stacked: read-through (level-split) commits + the dense known-tag store on top.

Read-through (level-split phase 1) — stacked, tracked separately

The mechanism is draft #11789. This branch also carries the consumer wiring (mergedTracerTags as a read-through parent) + benchmark; those will be split into the read-through PR(s), which land first. Still TODO: add an experimental flag gating the read-through parent wiring (currently gated only on !needsIntercept).

Dense known-tag store (this PR's focus)

Known tags (those KnownTags.keyOf resolves to a stored id) store in dense parallel arrays (knownIds/knownValues, insertion-ordered, cap-8 ×2) with no per-tag Entry — the #1 tracer allocator. Coexists with the hash buckets (arbitrary tags); disjoint by global known-ness, so read-through shadow checks stay within-region.

  • keyOf substrateKnownTags (tagId encoding) + KnownTagIds (registry + open-addressed keyOf via StringIndex)
  • dense store woven into OptimizedTagMap (set/get/remove/forEach/size/copy/putAll/clear/iterate)
  • lazy buckets — shared EMPTY_BUCKETS sentinel + copy-on-write; all-known maps allocate zero buckets
  • flyweight EntryReader iterator — alloc-free dense reads on the serialize path; entrySet() keeps real retain-safe Entry
  • dormant behind dd.trace.dense.tags.enabled (experimental system property; → promote to a Config flag before ship)

Validation

  • Deterministic -prof gc (real resolver, realistic mixed span): per-build alloc −8% (7 tags) → −41% (12 tags); all-known / trace-tier shape −57%/−43%.
  • PetClinic macro (Xmx96m, counterbalanced A-B-B-A): GC pauses −2.8% (vs +6.6% before the flyweight fix), throughput break-even.
  • Coverage: OptimizedTagMapDenseForkedTest (real registry + read-through union), TagMapDenseFuzzForkedTest (3 key regimes, HashMap oracle + checkIntegrity), DenseStoreAllocBenchmark.

The CPU upside (skip keyOf by setting tags via setTag(KnownTagIds.X)) arrives via incremental instrumentation migration over time — no big-bang.

🤖 Generated with Claude Code

dougqh and others added 14 commits June 29, 2026 15:33
Adds optional read-through to OptimizedTagMap: a child map references a frozen
parent (withParent) and reads through to it on a local miss, while local entries
shadow the parent. The enabler for level-split phase 1 (a span no longer copies
the shared trace-level tags down per span).

Single-parent by design in phase 1 (anti-false-generalization); written so
generalizing to multiple flattened parents is additive. Inert when parent == null
(every existing map), so no behavior change off the read-through path.

- Read path: getEntry falls through to the parent on a local miss (get*/containsKey
  inherit it). isDefinitelyEmpty()/estimateSize() added as cheap conservative/upper-
  bound variants; isEmpty()/size() stay exact (Map contract) and resolve the union.
- Removal: a lazily-allocated removedFromParent side-set (also the gate) tombstones
  a parent-exposed key so it no longer reads through; re-setting clears it. Entry and
  BucketGroup stay untouched (the side-set is shape-agnostic vs the bare/group duality).
- Bulk reads (forEach x3, iterators, collection views): bucket-aligned self-vs-parent
  merge, first-occurrence-wins, exploiting universal hashing so the shadow check is
  scoped to the same-index local bucket (no re-hash, no global seen-set). Alloc-free.
- checkIntegrity asserts the local emptiness invariant (size==0), not union isEmpty().

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
copy() went through putAllIntoEmptyMap, which clones only the local buckets and
the local size — so a copy of a read-through map dropped the inherited parent
tags and the tombstones. Fix copy() to share the (frozen, immutable) parent and
copy the tombstone set, so the copy is observationally identical to the original
(same union) and independently mutable.

Adds behavior-identical-to-flat-map tests: copy equivalence, independent
mutability, tombstone preservation, equality with an equivalent flat map, and
immutableCopy of a read-through map. This is the safety contract for flipping
mergedTracerTags to a parent in the consumer change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The read-through parent loop appended to each forEach variant grew forEach to
~242 bytes. Move it into per-variant forEachParent(...) methods called only when
parent != null, so the common parent == null forEach is unchanged (~90 bytes,
inlines/compiles exactly as before) and the parent-loop code lives out of line.

Confirmed via PrintInlining: the hot read chain getInt -> getIntOrDefault ->
getEntry -> getLocalEntry -> findInBucket all inline (hot); forEach compiles as
its own loop unit as before.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Version is conditionally managed by InternalTagsAdder (added only when
service == DD_SERVICE and not set during the request). It was being applied to
every span via the trace-level bundle, then immediately stripped by a per-span
removeTag(VERSION) — apply-then-remove, net zero.

Exclude it from the bundle once at config build (withTracerTags) instead. This
is behavior-preserving today, and it prepares the bundle to become a read-through
parent: a removeTag on a key that lived in the parent would mint a per-span
tombstone. With version excluded, the retained per-span removeTag(VERSION) (which
still wipes a builder-set version, preserving the existing semantics) is a cheap
local op, never a tombstone.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Level-split phase 1: when the trace-level bundle has no interceptable tags
(!mergedTracerTagsNeedsIntercept), attach it as a read-through parent of the
span's tags (shared by reference) instead of copying its entries into every
span. When it does need interception, fall back to the copy path (the
interceptor's per-span side-effects can't be shared).

- TagMap.withParent promoted to the interface (was package-private on
  OptimizedTagMap) — the public wiring deferred from the mechanism change.
- DDSpanContext.parentTags attaches the (frozen) parent to unsafeTags.
- CoreTracer span-build gates copy-vs-share on needsIntercept.

mergedTracerTags is the precedence floor (overridden by all other contributors),
so attaching it as the lowest-precedence parent preserves ordering. Version was
already excluded from the bundle, so the per-span removeTag never tombstones.
Tag-touching tests (CoreSpanBuilder/CoreTracer/DDSpan/InternalTagsAdder) green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Models span-build tag assembly: copyDown (putAll the frozen trace-level bundle
into a fresh span map, then span tags) vs readThrough (attach the bundle as a
read-through parent, span tags local only). Run with -prof gc to isolate the
per-span allocation read-through saves. Swept by traceTagCount {3,7,15} so the
win can be seen scaling with the bundle size; 7 ~ a realistic mergedTracerTags.

Quick read (noisy box): readThrough ~38% less alloc (deterministic) and ~1.7x
throughput. The alloc delta is bucket structure (BucketGroup clones + fewer local
collisions), not Entry objects -- putAll-into-empty already shares frozen entries.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Generic open-addressed string->id/payload index (datadog.trace.util.StringIndex)
+ tests, footprint test (JOL), and the switch-vs-index benchmark. This is the
keyOf (name->id) substrate the dense store routes known tags through. Ported onto
the post-kill-Legacy + read-through base (dense-store branch). Adds the
org.openjdk.jol:jol-core test dep for the footprint test.

Not yet wired into keyOf/KnownTagIds — that + the dense-store port come next.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Slice 1 of the dense-store rebuild. Ports the name<->id resolution
substrate from the (pre-kill-Legacy, far-behind) attribute-value-table
branch onto the post-kill-Legacy + read-through base:

- KnownTags: tagId encoding (globalSerial/fieldPos/nameHash/INTERCEPTED)
  + the Resolver registry. Depends only on TagMap.Entry._hash.
- KnownTagIds: the 27-tag hand-assigned registry + open-addressed keyOf
  table, adapted from the old TagSet to our StringIndex (a clean
  superset). keyOf uses the static-final-raw-arrays Support path (JIT
  folds the refs).
- KnownTagIdsTest: JUnit 5 parity (92 cases) — keyOf<->nameOf round-trip
  for all 27 tags, nameHash == Entry._hash, INTERCEPTED flags,
  reserved-vs-stored, serial uniqueness, unknown->0/null.

No dense store consumes this yet. The universal fieldPos/SLOT_COUNT
layout is ported but marked PROVISIONAL/dormant: keyOf/nameOf depend
only on globalSerial + name, not fieldPos, so the ids are stable across
any layout scheme. The dense-store slice supersedes the universal layout
with role/type sizing (per the over-provision finding).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Slice 2a of the dense-store rebuild. Known tags (those KnownTags.keyOf
resolves to a stored id) now live in dense parallel arrays
(knownIds[]/knownValues[], insertion-ordered, cap-8 x2) with NO per-tag
Entry object -- the allocation win, attacking the #1 tracer allocator.
Linear scan by globalSerial (reads aren't hot; positional/role-sizing
deferred -> no codegen). Coexists with the hash buckets, which hold
arbitrary tags.

Disjoint by construction: known-ness is global (keyOf deterministic), so
a known tag is always dense and never bucketed. That keeps read-through
shadow checks within-region -- a parent dense entry can only be shadowed
by a local dense entry of the same id, a parent bucket entry only by a
local bucket entry -- so the existing bucket read-through code is
unchanged; only local-dense and parent-dense emission are added.

Woven through: getAndSet/getLocalEntry/removeLocal (route via
keyOf/isStored), forEach x3 (local+parent dense via a reused
EntryReadingHelper flyweight -> alloc-free serialize; the iterator
materializes per-slot, the rare/compat path), size/isEmpty/estimateSize/
isDefinitelyEmpty/visibleParentCount, copy/putAll/clear/fillMap/
fillStringMap/checkIntegrity. size stays bucket-only; knownCount tracks
dense; local total = size + knownCount.

DORMANT in production: no resolver is registered, so keyOf returns 0,
nothing routes to dense, and behavior is byte-identical to today.
Validated by OptimizedTagMapDenseForkedTest (isolated JVM, resolver
live) -- 12 scenarios incl. the read-through union (read-through to
parent dense, child-dense shadows parent-dense, remove-tombstones-
parent-dense). Caught + fixed an accounting bug: putAllOptimizedMap's
empty-check ignored knownCount, clobbering a dense-only map on merge.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reuses TagMapFuzzTest's oracle machinery (replay a random action
sequence against a HashMap, verify each step + checkIntegrity) but runs
it with a LIVE resolver so the dense store actually engages -- the base
fuzz uses only key-N names, which never resolve to a known id.

Uses a synthetic prefix resolver (known-N -> stored/dense, else bucket)
instead of the real KnownTagIds: unbounded known key space (so the dense
array grows past cap-8 and the linear scan gets long) + precise routing
control. Three regimes cover paths the mixed run alone misses:
- known-only: the all-dense map (growth, dense-only putAll/copy/clear/
  iterate, knownCount-only size -- where the empty-check bug hid)
- custom-only: dense branches stay inert under a live resolver
- mixed: both regions + interaction

Each regime: 1500 single-map + 400 merge cases. Forked (isolated JVM)
since resolver registration is a global static. Complements
OptimizedTagMapDenseForkedTest, which validates the real KnownTagIds
registry + the read-through union.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Registering the KnownTags resolver is what flips the (already-woven,
validated) dense store from dormant to live: once registered, keyOf
resolves known tags and they store without a per-tag Entry. Nothing on
this branch references KnownTagIds yet, so it stays dormant.

Adds KnownTagIds.init() (invoking it runs <clinit> -> register,
idempotent) and calls it from CoreTracer construction gated by the
system property dd.trace.dense.tags.enabled (default false), alongside
the existing PublishState preload. Off -> keyOf is a no-op, tag storage
byte-identical to today; on -> dense store engages.

Same-jar A/B for benchmarking: run PetClinic with the property false
(baseline) vs true (treatment). Expected: alloc win (Entry eliminated)
regardless of integration migration, since keyOf is CPU-only (a
StringIndex probe, zero alloc); throughput plausibly break-even
un-migrated (read-through already removed the per-span trace-tag merge,
offsetting the transitional keyOf tax), with CPU upside reserved for
when integrations adopt setTag(long). Promote to a Config flag for a
permanent rollout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
set/getAndSet(String, ...) now resolve keyOf FIRST and route a known tag
straight to the dense store, constructing no Entry; only the custom
fallback builds the typed Entry (preserving no-boxing for primitives).
getAndSet(Entry) splits into getAndSetKnown / getAndSetBucket so the
entry-based callers share the same routing. This was alloc-neutral in a
shallow microbenchmark (escape analysis already scalar-replaced the
throwaway Entry), but makes the win EA-independent on the deep real hot
path (DDSpanContext.setTag -> ...), where inline-budget pressure can keep
EA from firing.

DenseStoreAllocBenchmark: deterministic -prof gc A/B, today (all bucket)
vs dense (~70% real known tag names), using the REAL allocation-free
KnownTagIds resolver. Result (B/op): 7 tags 408->376 (-8%), 12 tags
704->416 (-41%); buildAndSerialize == buildMap (flyweight emit is
alloc-free). The win is real and scales with known-tag count. (An earlier
synthetic prefix resolver allocated in keyOf/nameOf and masked it.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
buckets is no longer eagerly allocated: every map points at a shared
immutable EMPTY_BUCKETS (length 16, all null) until its first custom-tag
write copies-on-write to a private array (materializeBuckets). An
all-known / known-heavy map -- notably the trace-tier read-through parent
(all trace-constant known tags) -- now allocates ZERO buckets, realizing
the design's "all-known allocates zero buckets". Length stays 16, so
reads need no null guard and read-through bucket alignment (hash & 15)
holds; only the four write sites COW (getAndSetBucket, putAllMerge,
putAllIntoEmptyMap is gated on source having bucket entries, clear resets
to the sentinel).

EmptyHolder allocates its OWN length-16 array rather than reading
EMPTY_BUCKETS: the nested holder can initialize before OptimizedTagMap's
<clinit> sets that static (same static-init-order hazard as the EMPTY
cycle), which would leave EMPTY with null buckets -- caught by the fuzz
(buildImmutable returns TagMap.EMPTY; iterating it NPE'd).

-prof gc (buildMap, B/op), real resolver: all-known 7 tags 408->176
(-57%), 12 tags 704->400 (-43%); realistic mixed span unchanged at
376/416 (it materializes buckets for its custom tags anyway). Full
internal-api test + forkedTest green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The EntryReader iterator (TagMap.iterator()/keySet()/values()) now emits
dense entries via a reused flyweight instead of materializing an Entry
per dense tag. EntryReader is a "use-now" view, so this is contract-safe.
This was the PetClinic macro regression: the real serializer's count
pre-pass (TraceMapperV0_4:95 / V0_5:238) iterates `for (EntryReader :
tags)`, which materialized one Entry per dense tag per span at serialize
-- clawing back the build alloc win and then some.

entrySet() (Iterator<Map.Entry>) keeps returning real, retain-safe Entry
objects: EntriesIterator sits on top of the EntryReader iterator and
materializes via .entry() per next() (bucket -> the real stored Entry,
free; dense -> a fresh Entry). Deliberately not alloc-optimized -- bulk
reads use forEach/EntryReader; manual instrumentation does point get/set.
Replaces the old `(Iterator) map.iterator()` cast, which would hand out
reused flyweights as Map.Entry once the EntryReader iterator flyweights.

denseReader flyweight is lazily created (first dense emit), so bucket-only
iterators allocate nothing extra.

-prof gc, iterator-serialize path (buildAndSerializeViaIterator), dense:
12 tags 736->440 B/op (now -40% vs today, was +5%); allKnown 12t 936->480
(-34%, was +33%). Win restored on the materializing path. Full
internal-api test + forkedTest green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dougqh dougqh added type: enhancement Enhancements and improvements comp: core Tracer core tag: no release notes Changes to exclude from release notes tag: ai generated Largely based on code generated by an AI or LLM labels Jun 30, 2026
@dd-octo-sts

dd-octo-sts Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

🟢 Java Benchmark SLOs — All performance SLOs passed

Suite Status
Startup 🟢 pass

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results
Scenario Candidate master Δ (95% CI of mean)
startup:insecure-bank:iast:Agent 13.95 s 13.89 s [-0.3%; +1.1%] (no difference)
startup:insecure-bank:tracing:Agent 12.94 s 12.99 s [-1.4%; +0.5%] (no difference)
startup:petclinic:appsec:Agent 16.87 s 16.48 s [+1.3%; +3.4%] (significantly worse)
startup:petclinic:iast:Agent 16.85 s 16.93 s [-1.3%; +0.4%] (no difference)
startup:petclinic:profiling:Agent 16.85 s 16.14 s [+0.1%; +8.7%] (maybe worse)
startup:petclinic:sca:Agent 16.88 s 16.73 s [+0.1%; +1.7%] (maybe worse)
startup:petclinic:tracing:Agent 16.19 s 16.11 s [-0.4%; +1.4%] (no difference)

Commit: 1c0a96e9 · CI Pipeline · Benchmarking Platform UI


Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp: core Tracer core tag: ai generated Largely based on code generated by an AI or LLM tag: no release notes Changes to exclude from release notes type: enhancement Enhancements and improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant