Reduce log volume#6568

Open
PSeitz-dd wants to merge 6 commits into
quickwit-oss:mainfrom
PSeitz:reduce_logs

PSeitz-dd commented Jun 30, 2026

Contributor

What problem(s) was I solving?

CloudPrem (BYOC) emits an enormous volume of operational logs — a fleet-wide extract showed ~1.07B log lines, of which a single INFO line (adding node … to indexer pool) accounted for ~907M (~85%). CLOUDPREM-759 targets a 10x reduction by reclassifying verbose INFO chatter and rate-limiting recurring WARN/ERROR floods — without losing operationally useful signal.

Changes

Rate Limiting Macro: the suppressed count is now folded into the emitted line as a num_suppressed=N field, instead of a separate preceding line.
In combination with #6549, this lets us aggregate on the field and recover the true pre-suppression log count.
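
To illustrate how a folded-in count allows the pre-suppression total to be recovered, here is a minimal sketch using only the standard library. It is not the macro from this PR: `RateLimiter`, `should_emit`, and `simulate` are hypothetical names, and the 60-second period is assumed from the "1/min" limits described here.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-call-site limiter: at most one emitted line per `period`.
struct RateLimiter {
    period: Duration,
    last_emit: Option<Instant>,
    num_suppressed: u64,
}

impl RateLimiter {
    fn new(period: Duration) -> Self {
        RateLimiter { period, last_emit: None, num_suppressed: 0 }
    }

    /// Returns `Some(n)` when the caller may emit, where `n` is the number of
    /// calls suppressed since the previous emitted line; returns `None` and
    /// counts the call as suppressed otherwise.
    fn should_emit(&mut self, now: Instant) -> Option<u64> {
        let allowed = match self.last_emit {
            None => true,
            Some(last) => now.duration_since(last) >= self.period,
        };
        if allowed {
            self.last_emit = Some(now);
            // Hand the counter to the emitted line and reset it to zero.
            Some(std::mem::take(&mut self.num_suppressed))
        } else {
            self.num_suppressed += 1;
            None
        }
    }
}

/// Simulates `calls` log attempts spaced `gap` apart and returns
/// (emitted lines, sum of num_suppressed over the emitted lines).
fn simulate(calls: u64, gap: Duration) -> (u64, u64) {
    let mut limiter = RateLimiter::new(Duration::from_secs(60));
    let start = Instant::now();
    let (mut emitted, mut suppressed_sum) = (0u64, 0u64);
    for i in 0..calls {
        let now = start + gap * (i as u32);
        if let Some(n) = limiter.should_emit(now) {
            // A real call site would attach `n` as the num_suppressed field.
            emitted += 1;
            suppressed_sum += n;
        }
    }
    (emitted, suppressed_sum)
}

fn main() {
    let (emitted, suppressed_sum) = simulate(600, Duration::from_secs(1));
    println!("emitted={emitted} suppressed_sum={suppressed_sum}");
}
```

With 600 attempts one second apart, this prints `emitted=10 suppressed_sum=531`. Adding the two gives 541 of the 600 attempts: the 59 calls suppressed after the last emitted line are only reported once the call site emits again, so an aggregate on the field trails the true count by at most one window per call site.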

3 Types of Log Changes:

  1. INFO → DEBUG (routine per-operation chatter): node pool-add ×4 (the ~907M dominant source), send-to-index-serializer, spawning pipeline, merge schedule/download, leaf split-finished/offsets, offload-to-lambda, resetting pipeline, adding-shards-assignment.

  2. INFO → rate-limited INFO (1/min) (worth a heartbeat): new-split, actor-exit (success), assigning shards, env-var defaults, truncated-shard.

  3. ERROR → rate-limited ERROR (1/min): the recurring lambda invocation failure (kept visible, not flooding).

Publish observability (log_publisher_impl.rs): the bare publish-new-splits line now carries num_splits, num_docs, and total on-disk split_size.
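
As a rough illustration of the three fields, the sketch below sums per-split values into one log line. `SplitInfo`, its field names, and the message text are hypothetical stand-ins, not the types or wording used in log_publisher_impl.rs.

```rust
/// Hypothetical stand-in for per-split metadata; field names are illustrative.
struct SplitInfo {
    num_docs: u64,
    size_bytes: u64,
}

/// Builds the text of a single publish log line carrying the aggregate fields.
fn publish_log_fields(splits: &[SplitInfo]) -> String {
    let num_splits = splits.len();
    // Totals across all splits published together.
    let num_docs: u64 = splits.iter().map(|s| s.num_docs).sum();
    let split_size: u64 = splits.iter().map(|s| s.size_bytes).sum();
    format!(
        "publish-new-splits num_splits={num_splits} num_docs={num_docs} split_size={split_size}"
    )
}

fn main() {
    let splits = vec![
        SplitInfo { num_docs: 1_000, size_bytes: 4_096 },
        SplitInfo { num_docs: 2_500, size_bytes: 10_240 },
    ];
    println!("{}", publish_log_fields(&splits));
}
```

Keeping the totals on the existing line, rather than logging one line per split, gives publish throughput and split sizing without adding volume.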

Error/failure branches are untouched (e.g. actor-exit failure stays error!). Stage/publish, merge-completion, and cluster join/ready stay at INFO.

PSeitz added 3 commits June 30, 2026 18:55
The rate-limited tracing macros emitted a separate "suppressed N similar
log messages" line before the next allowed line. Attach the count to that
line instead, as a `suppressed_in_last_min` field. This makes the
suppression rate visible inline on the message it belongs to and removes a
distinct log pattern, slightly lowering volume.
…EM-759)

A fleet-wide extract showed ~1.07B log lines, dominated by a single
per-gossip-tick INFO (~907M, ~85%). Reclassify high-frequency operational
logs to cut default-level volume by ~20x while preserving actionable signal:

- INFO -> DEBUG for routine per-operation chatter with no liveness value:
  node pool-add (x4), send-to-index-serializer, spawning pipeline, merge
  schedule/download, leaf split-finished/offsets, offload-to-lambda,
  resetting pipeline, adding shards assignment.
- INFO -> rate-limited INFO (1/min) for logs worth a heartbeat: new-split,
  actor-exit (success), assigning shards, env-var defaults, truncated-shard.
- ERROR -> rate-limited ERROR (1/min) for the recurring lambda invocation
  failure, keeping it visible without flooding.

Error/failure branches are untouched (e.g. actor-exit failure stays at
ERROR). Stage/publish, merge-completion, and cluster lifecycle stay at INFO.

The publish-new-splits log carried no fields. Add num_splits, num_docs,
and total on-disk split size to give operators visibility into publish
throughput and split sizing without adding a new log line.

num_splits is >1 only for partitioned sources (a single commit produces
one split per partition); merges publish a single output split, with the
merged inputs recorded in replaced_split_ids.
@PSeitz-dd PSeitz-dd requested a review from a team as a code owner June 30, 2026 17:27
PSeitz added 3 commits June 30, 2026 20:24
This is a low-volume (~80K), operationally meaningful event: the searcher
spilling search work to Lambda. It carries capacity/cost signal and pairs
with the lambda invocation error we deliberately keep visible. Demoting it
to DEBUG bought no real volume reduction, so revert it to INFO.

Keep visibility into ingester, searcher, and generic-service pool membership
at INFO but cap each to 1/min. These are far lower volume than the indexer
pool-add (~907M), which stays at DEBUG.
…uppressed

Rate-limit the indexer pool-add log (~907M, the dominant pattern) at INFO
1/min instead of demoting to DEBUG. At 1/min this collapses to ~1.4K/day --
the same volume win as DEBUG -- while keeping pool membership visible at INFO,
consistent with the other three pool-add logs.

Rename the rate-limit suppressed-count field from `suppressed_in_last_min`
to `num_suppressed`: the count is messages suppressed since the call site
last emitted, and since the window only resets on the next call it can span
more than a minute, so the old name was misleading.
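
The ~1.4K/day figure in the commit message above is consistent with a simple upper bound for a single rate-limited call site. This sketch assumes the limit applies per call site per process (an assumption, not stated in the source), so a fleet-wide total would scale with the number of emitting processes. `max_lines_per_day` is a hypothetical helper.

```rust
/// Upper bound on lines per day from one call site limited to `per_minute`
/// emitted lines per minute.
fn max_lines_per_day(per_minute: u64) -> u64 {
    per_minute * 60 * 24
}

fn main() {
    // 1 line/min * 60 min/h * 24 h/day = 1440 lines/day.
    println!("{}", max_lines_per_day(1));
}
```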
@PSeitz PSeitz changed the title Reduce default log volume by reclassifying verbose INFO logs (CLOUDPREM-759) Reduce log volume Jun 30, 2026