Store known span tags densely in TagMap by tag-id#11814
Draft
dougqh wants to merge 14 commits into
Draft
Conversation
Adds optional read-through to OptimizedTagMap: a child map references a frozen parent (withParent) and reads through to it on a local miss, while local entries shadow the parent. The enabler for level-split phase 1 (a span no longer copies the shared trace-level tags down per span). Single-parent by design in phase 1 (anti-false-generalization); written so generalizing to multiple flattened parents is additive. Inert when parent == null (every existing map), so no behavior change off the read-through path. - Read path: getEntry falls through to the parent on a local miss (get*/containsKey inherit it). isDefinitelyEmpty()/estimateSize() added as cheap conservative/upper- bound variants; isEmpty()/size() stay exact (Map contract) and resolve the union. - Removal: a lazily-allocated removedFromParent side-set (also the gate) tombstones a parent-exposed key so it no longer reads through; re-setting clears it. Entry and BucketGroup stay untouched (the side-set is shape-agnostic vs the bare/group duality). - Bulk reads (forEach x3, iterators, collection views): bucket-aligned self-vs-parent merge, first-occurrence-wins, exploiting universal hashing so the shadow check is scoped to the same-index local bucket (no re-hash, no global seen-set). Alloc-free. - checkIntegrity asserts the local emptiness invariant (size==0), not union isEmpty(). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
copy() went through putAllIntoEmptyMap, which clones only the local buckets and the local size — so a copy of a read-through map dropped the inherited parent tags and the tombstones. Fix copy() to share the (frozen, immutable) parent and copy the tombstone set, so the copy is observationally identical to the original (same union) and independently mutable. Adds behavior-identical-to-flat-map tests: copy equivalence, independent mutability, tombstone preservation, equality with an equivalent flat map, and immutableCopy of a read-through map. This is the safety contract for flipping mergedTracerTags to a parent in the consumer change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The read-through parent loop appended to each forEach variant grew forEach to ~242 bytes. Move it into per-variant forEachParent(...) methods called only when parent != null, so the common parent == null forEach is unchanged (~90 bytes, inlines/compiles exactly as before) and the parent-loop code lives out of line. Confirmed via PrintInlining: the hot read chain getInt -> getIntOrDefault -> getEntry -> getLocalEntry -> findInBucket all inline (hot); forEach compiles as its own loop unit as before. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Version is conditionally managed by InternalTagsAdder (added only when service == DD_SERVICE and not set during the request). It was being applied to every span via the trace-level bundle, then immediately stripped by a per-span removeTag(VERSION) — apply-then-remove, net zero. Exclude it from the bundle once at config build (withTracerTags) instead. This is behavior-preserving today, and it prepares the bundle to become a read-through parent: a removeTag on a key that lived in the parent would mint a per-span tombstone. With version excluded, the retained per-span removeTag(VERSION) (which still wipes a builder-set version, preserving the existing semantics) is a cheap local op, never a tombstone. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Level-split phase 1: when the trace-level bundle has no interceptable tags (!mergedTracerTagsNeedsIntercept), attach it as a read-through parent of the span's tags (shared by reference) instead of copying its entries into every span. When it does need interception, fall back to the copy path (the interceptor's per-span side-effects can't be shared). - TagMap.withParent promoted to the interface (was package-private on OptimizedTagMap) — the public wiring deferred from the mechanism change. - DDSpanContext.parentTags attaches the (frozen) parent to unsafeTags. - CoreTracer span-build gates copy-vs-share on needsIntercept. mergedTracerTags is the precedence floor (overridden by all other contributors), so attaching it as the lowest-precedence parent preserves ordering. Version was already excluded from the bundle, so the per-span removeTag never tombstones. Tag-touching tests (CoreSpanBuilder/CoreTracer/DDSpan/InternalTagsAdder) green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Models span-build tag assembly: copyDown (putAll the frozen trace-level bundle
into a fresh span map, then span tags) vs readThrough (attach the bundle as a
read-through parent, span tags local only). Run with -prof gc to isolate the
per-span allocation read-through saves. Swept by traceTagCount {3,7,15} so the
win can be seen scaling with the bundle size; 7 ~ a realistic mergedTracerTags.
Quick read (noisy box): readThrough ~38% less alloc (deterministic) and ~1.7x
throughput. The alloc delta is bucket structure (BucketGroup clones + fewer local
collisions), not Entry objects -- putAll-into-empty already shares frozen entries.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Generic open-addressed string->id/payload index (datadog.trace.util.StringIndex) + tests, footprint test (JOL), and the switch-vs-index benchmark. This is the keyOf (name->id) substrate the dense store routes known tags through. Ported onto the post-kill-Legacy + read-through base (dense-store branch). Adds the org.openjdk.jol:jol-core test dep for the footprint test. Not yet wired into keyOf/KnownTagIds — that + the dense-store port come next. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Slice 1 of the dense-store rebuild. Ports the name<->id resolution substrate from the (pre-kill-Legacy, far-behind) attribute-value-table branch onto the post-kill-Legacy + read-through base: - KnownTags: tagId encoding (globalSerial/fieldPos/nameHash/INTERCEPTED) + the Resolver registry. Depends only on TagMap.Entry._hash. - KnownTagIds: the 27-tag hand-assigned registry + open-addressed keyOf table, adapted from the old TagSet to our StringIndex (a clean superset). keyOf uses the static-final-raw-arrays Support path (JIT folds the refs). - KnownTagIdsTest: JUnit 5 parity (92 cases) — keyOf<->nameOf round-trip for all 27 tags, nameHash == Entry._hash, INTERCEPTED flags, reserved-vs-stored, serial uniqueness, unknown->0/null. No dense store consumes this yet. The universal fieldPos/SLOT_COUNT layout is ported but marked PROVISIONAL/dormant: keyOf/nameOf depend only on globalSerial + name, not fieldPos, so the ids are stable across any layout scheme. The dense-store slice supersedes the universal layout with role/type sizing (per the over-provision finding). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Slice 2a of the dense-store rebuild. Known tags (those KnownTags.keyOf resolves to a stored id) now live in dense parallel arrays (knownIds[]/knownValues[], insertion-ordered, cap-8 x2) with NO per-tag Entry object -- the allocation win, attacking the #1 tracer allocator. Linear scan by globalSerial (reads aren't hot; positional/role-sizing deferred -> no codegen). Coexists with the hash buckets, which hold arbitrary tags. Disjoint by construction: known-ness is global (keyOf deterministic), so a known tag is always dense and never bucketed. That keeps read-through shadow checks within-region -- a parent dense entry can only be shadowed by a local dense entry of the same id, a parent bucket entry only by a local bucket entry -- so the existing bucket read-through code is unchanged; only local-dense and parent-dense emission are added. Woven through: getAndSet/getLocalEntry/removeLocal (route via keyOf/isStored), forEach x3 (local+parent dense via a reused EntryReadingHelper flyweight -> alloc-free serialize; the iterator materializes per-slot, the rare/compat path), size/isEmpty/estimateSize/ isDefinitelyEmpty/visibleParentCount, copy/putAll/clear/fillMap/ fillStringMap/checkIntegrity. size stays bucket-only; knownCount tracks dense; local total = size + knownCount. DORMANT in production: no resolver is registered, so keyOf returns 0, nothing routes to dense, and behavior is byte-identical to today. Validated by OptimizedTagMapDenseForkedTest (isolated JVM, resolver live) -- 12 scenarios incl. the read-through union (read-through to parent dense, child-dense shadows parent-dense, remove-tombstones- parent-dense). Caught + fixed an accounting bug: putAllOptimizedMap's empty-check ignored knownCount, clobbering a dense-only map on merge. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reuses TagMapFuzzTest's oracle machinery (replay a random action sequence against a HashMap, verify each step + checkIntegrity) but runs it with a LIVE resolver so the dense store actually engages -- the base fuzz uses only key-N names, which never resolve to a known id. Uses a synthetic prefix resolver (known-N -> stored/dense, else bucket) instead of the real KnownTagIds: unbounded known key space (so the dense array grows past cap-8 and the linear scan gets long) + precise routing control. Three regimes cover paths the mixed run alone misses: - known-only: the all-dense map (growth, dense-only putAll/copy/clear/ iterate, knownCount-only size -- where the empty-check bug hid) - custom-only: dense branches stay inert under a live resolver - mixed: both regions + interaction Each regime: 1500 single-map + 400 merge cases. Forked (isolated JVM) since resolver registration is a global static. Complements OptimizedTagMapDenseForkedTest, which validates the real KnownTagIds registry + the read-through union. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Registering the KnownTags resolver is what flips the (already-woven, validated) dense store from dormant to live: once registered, keyOf resolves known tags and they store without a per-tag Entry. Nothing on this branch references KnownTagIds yet, so it stays dormant. Adds KnownTagIds.init() (invoking it runs <clinit> -> register, idempotent) and calls it from CoreTracer construction gated by the system property dd.trace.dense.tags.enabled (default false), alongside the existing PublishState preload. Off -> keyOf is a no-op, tag storage byte-identical to today; on -> dense store engages. Same-jar A/B for benchmarking: run PetClinic with the property false (baseline) vs true (treatment). Expected: alloc win (Entry eliminated) regardless of integration migration, since keyOf is CPU-only (a StringIndex probe, zero alloc); throughput plausibly break-even un-migrated (read-through already removed the per-span trace-tag merge, offsetting the transitional keyOf tax), with CPU upside reserved for when integrations adopt setTag(long). Promote to a Config flag for a permanent rollout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
set/getAndSet(String, ...) now resolve keyOf FIRST and route a known tag straight to the dense store, constructing no Entry; only the custom fallback builds the typed Entry (preserving no-boxing for primitives). getAndSet(Entry) splits into getAndSetKnown / getAndSetBucket so the entry-based callers share the same routing. This was alloc-neutral in a shallow microbenchmark (escape analysis already scalar-replaced the throwaway Entry), but makes the win EA-independent on the deep real hot path (DDSpanContext.setTag -> ...), where inline-budget pressure can keep EA from firing. DenseStoreAllocBenchmark: deterministic -prof gc A/B, today (all bucket) vs dense (~70% real known tag names), using the REAL allocation-free KnownTagIds resolver. Result (B/op): 7 tags 408->376 (-8%), 12 tags 704->416 (-41%); buildAndSerialize == buildMap (flyweight emit is alloc-free). The win is real and scales with known-tag count. (An earlier synthetic prefix resolver allocated in keyOf/nameOf and masked it.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
buckets is no longer eagerly allocated: every map points at a shared immutable EMPTY_BUCKETS (length 16, all null) until its first custom-tag write copies-on-write to a private array (materializeBuckets). An all-known / known-heavy map -- notably the trace-tier read-through parent (all trace-constant known tags) -- now allocates ZERO buckets, realizing the design's "all-known allocates zero buckets". Length stays 16, so reads need no null guard and read-through bucket alignment (hash & 15) holds; only the four write sites COW (getAndSetBucket, putAllMerge, putAllIntoEmptyMap is gated on source having bucket entries, clear resets to the sentinel). EmptyHolder allocates its OWN length-16 array rather than reading EMPTY_BUCKETS: the nested holder can initialize before OptimizedTagMap's <clinit> sets that static (same static-init-order hazard as the EMPTY cycle), which would leave EMPTY with null buckets -- caught by the fuzz (buildImmutable returns TagMap.EMPTY; iterating it NPE'd). -prof gc (buildMap, B/op), real resolver: all-known 7 tags 408->176 (-57%), 12 tags 704->400 (-43%); realistic mixed span unchanged at 376/416 (it materializes buckets for its custom tags anyway). Full internal-api test + forkedTest green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The EntryReader iterator (TagMap.iterator()/keySet()/values()) now emits dense entries via a reused flyweight instead of materializing an Entry per dense tag. EntryReader is a "use-now" view, so this is contract-safe. This was the PetClinic macro regression: the real serializer's count pre-pass (TraceMapperV0_4:95 / V0_5:238) iterates `for (EntryReader : tags)`, which materialized one Entry per dense tag per span at serialize -- clawing back the build alloc win and then some. entrySet() (Iterator<Map.Entry>) keeps returning real, retain-safe Entry objects: EntriesIterator sits on top of the EntryReader iterator and materializes via .entry() per next() (bucket -> the real stored Entry, free; dense -> a fresh Entry). Deliberately not alloc-optimized -- bulk reads use forEach/EntryReader; manual instrumentation does point get/set. Replaces the old `(Iterator) map.iterator()` cast, which would hand out reused flyweights as Map.Entry once the EntryReader iterator flyweights. denseReader flyweight is lazily created (first dense emit), so bucket-only iterators allocate nothing extra. -prof gc, iterator-serialize path (buildAndSerializeViaIterator), dense: 12 tags 736->440 B/op (now -40% vs today, was +5%); allKnown 12t 936->480 (-34%, was +33%). Win restored on the materializing path. Full internal-api test + forkedTest green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
🟢 Java Benchmark SLOs — All performance SLOs passed
PR vs. master results
Commit: Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Draft for preservation — not for merge yet. Per plan: lands after next week's release, with both the read-through and dense pieces behind experimental flags (off by default, flipped on internally to bake).
This branch is stacked: read-through (level-split) commits + the dense known-tag store on top.
Read-through (level-split phase 1) — stacked, tracked separately
The mechanism is draft #11789. This branch also carries the consumer wiring (
mergedTracerTagsas a read-through parent) + benchmark; those will be split into the read-through PR(s), which land first. Still TODO: add an experimental flag gating the read-through parent wiring (currently gated only on!needsIntercept).Dense known-tag store (this PR's focus)
Known tags (those
KnownTags.keyOfresolves to a stored id) store in dense parallel arrays (knownIds/knownValues, insertion-ordered, cap-8 ×2) with no per-tagEntry— the #1 tracer allocator. Coexists with the hash buckets (arbitrary tags); disjoint by global known-ness, so read-through shadow checks stay within-region.KnownTags(tagId encoding) +KnownTagIds(registry + open-addressedkeyOfviaStringIndex)OptimizedTagMap(set/get/remove/forEach/size/copy/putAll/clear/iterate)EMPTY_BUCKETSsentinel + copy-on-write; all-known maps allocate zero bucketsEntryReaderiterator — alloc-free dense reads on the serialize path;entrySet()keeps real retain-safeEntrydd.trace.dense.tags.enabled(experimental system property; → promote to a Config flag before ship)Validation
-prof gc(real resolver, realistic mixed span): per-build alloc −8% (7 tags) → −41% (12 tags); all-known / trace-tier shape −57%/−43%.OptimizedTagMapDenseForkedTest(real registry + read-through union),TagMapDenseFuzzForkedTest(3 key regimes, HashMap oracle +checkIntegrity),DenseStoreAllocBenchmark.The CPU upside (skip
keyOfby setting tags viasetTag(KnownTagIds.X)) arrives via incremental instrumentation migration over time — no big-bang.🤖 Generated with Claude Code