Reduce log volume #6568
Open
PSeitz-dd wants to merge 6 commits into
The rate-limited tracing macros emitted a separate "suppressed N similar log messages" line before the next allowed line. Attach the count to that line instead, as a `suppressed_in_last_min` field. This makes the suppression rate visible inline on the message it belongs to and removes a distinct log pattern, slightly lowering volume.
…EM-759)
A fleet-wide extract showed ~1.07B log lines, dominated by a single per-gossip-tick INFO (~907M, ~85%). Reclassify high-frequency operational logs to cut default-level volume by ~20x while preserving actionable signal:
- INFO -> DEBUG for routine per-operation chatter with no liveness value: node pool-add (x4), send-to-index-serializer, spawning pipeline, merge schedule/download, leaf split-finished/offsets, offload-to-lambda, resetting pipeline, adding shards assignment.
- INFO -> rate-limited INFO (1/min) for logs worth a heartbeat: new-split, actor-exit (success), assigning shards, env-var defaults, truncated-shard.
- ERROR -> rate-limited ERROR (1/min) for the recurring lambda invocation failure, keeping it visible without flooding.
Error/failure branches are untouched (e.g. actor-exit failure stays at ERROR). Stage/publish, merge-completion, and cluster lifecycle stay at INFO.
The publish-new-splits log carried no fields. Add num_splits, num_docs, and total on-disk split size to give operators visibility into publish throughput and split sizing without adding a new log line. num_splits is >1 only for partitioned sources (a single commit produces one split per partition); merges publish a single output split, with the merged inputs recorded in replaced_split_ids.
This is a low-volume (~80K), operationally meaningful event: the searcher spilling search work to Lambda. It carries capacity/cost signal and pairs with the lambda invocation error we deliberately keep visible. Demoting it to DEBUG bought no real volume reduction, so revert it to INFO.
Keep visibility into ingester, searcher, and generic-service pool membership at INFO but cap each to 1/min. These are far lower volume than the indexer pool-add (~907M), which stays at DEBUG.
…uppressed
Rate-limit the indexer pool-add log (~907M, the dominant pattern) at INFO 1/min instead of demoting to DEBUG. At 1/min this collapses to ~1.4K/day (the same volume win as DEBUG) while keeping pool membership visible at INFO, consistent with the other three pool-add logs.
Rename the rate-limit suppressed-count field from `suppressed_in_last_min` to `num_suppressed`: the count is messages suppressed since the call site last emitted, and since the window only resets on the next call, it can span more than a minute, so the old name was misleading.
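The `num_suppressed` semantics described in the commits above can be illustrated with a small stand-alone sketch. This is not the macro from this PR: `RateLimiter`, `check`, and the field layout are hypothetical names chosen for illustration, and the real implementation may track time and state differently. It only shows a limiter that hands the suppressed count to the next emitted line, and why that count can cover more than one minute.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-call-site limiter (illustrative only, not the PR's macro).
struct RateLimiter {
    period: Duration,
    last_emit: Option<Instant>,
    num_suppressed: u64,
}

impl RateLimiter {
    fn new(period: Duration) -> Self {
        Self { period, last_emit: None, num_suppressed: 0 }
    }

    /// Returns `Some(n)` when the caller may log now, where `n` is the number
    /// of calls suppressed since the last emitted line. Returns `None` (and
    /// counts the call as suppressed) otherwise.
    fn check(&mut self, now: Instant) -> Option<u64> {
        let allowed = match self.last_emit {
            None => true,
            Some(last) => now.duration_since(last) >= self.period,
        };
        if allowed {
            self.last_emit = Some(now);
            // Hand the counter to the emitted line and reset it to zero.
            Some(std::mem::take(&mut self.num_suppressed))
        } else {
            self.num_suppressed += 1;
            None
        }
    }
}

fn main() {
    let start = Instant::now();
    let mut limiter = RateLimiter::new(Duration::from_secs(60));
    // Simulated call times in seconds; the gap before t=300 exceeds a minute.
    for secs in [0u64, 10, 20, 300] {
        if let Some(n) = limiter.check(start + Duration::from_secs(secs)) {
            println!("t={}s adding node to indexer pool num_suppressed={}", secs, n);
        }
    }
    // Prints:
    // t=0s adding node to indexer pool num_suppressed=0
    // t=300s adding node to indexer pool num_suppressed=2
}
```

In this sketch the line emitted at t=300s reports two suppressed calls that happened almost five minutes earlier, which is the case the rename from `suppressed_in_last_min` to `num_suppressed` accounts for.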
What problem(s) was I solving?
CloudPrem (BYOC) emits an enormous volume of operational logs: a fleet-wide extract showed ~1.07B log lines, of which a single `INFO` line (`adding node … to indexer pool`) accounted for ~907M (~85%). CLOUDPREM-759 targets a 10x reduction by reclassifying verbose `INFO` chatter and rate-limiting recurring `WARN`/`ERROR` floods, without losing operationally useful signal.
Changes
Rate Limiting Macro: the suppressed count is now folded into the emitted line as a `num_suppressed=N` field, instead of a separate preceding line. In combination with #6549, this lets us aggregate on the field and recover the true pre-suppression log count.
3 Types of Log Changes:
- `INFO` → `DEBUG` (routine per-operation chatter): node pool-add ×4 (the ~907M dominant source), send-to-index-serializer, spawning pipeline, merge schedule/download, leaf split-finished/offsets, offload-to-lambda, resetting pipeline, adding-shards-assignment.
- `INFO` → rate-limited `INFO` (1/min) (worth a heartbeat): new-split, actor-exit (success), assigning shards, env-var defaults, truncated-shard.
- `ERROR` → rate-limited `ERROR` (1/min): the recurring lambda invocation failure (kept visible, not flooding).
Publish observability (`log_publisher_impl.rs`): the bare `publish-new-splits` line now carries `num_splits`, `num_docs`, and total on-disk `split_size`.
Error/failure branches are untouched (e.g. actor-exit failure stays `error!`). Stage/publish, merge-completion, and cluster join/ready stay at `INFO`.
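As a rough illustration of the aggregation mentioned under Rate Limiting Macro, the sketch below recovers an approximate pre-suppression count from emitted lines. `EmittedLine` and `pre_suppression_count` are hypothetical names, and the sketch assumes the lines have already been parsed and grouped per call site. In practice this would more likely be a query in the log backend than Rust code.

```rust
/// Hypothetical parsed view of one emitted, rate-limited log line.
struct EmittedLine {
    /// Value of the `num_suppressed` field on that line.
    num_suppressed: u64,
}

/// Each emitted line stands for itself plus the calls suppressed before it.
fn pre_suppression_count(lines: &[EmittedLine]) -> u64 {
    lines.iter().map(|line| 1 + line.num_suppressed).sum()
}

fn main() {
    let lines = [
        EmittedLine { num_suppressed: 0 },
        EmittedLine { num_suppressed: 59 },
        EmittedLine { num_suppressed: 120 },
    ];
    // 3 emitted lines + 179 suppressed calls
    println!("{}", pre_suppression_count(&lines)); // prints 182
}
```

One caveat under these assumptions: calls suppressed after the last emitted line are not reflected until the call site emits again, so the recovered total is a lower bound at any given moment.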