Skip to content

Fix two verified metric-drift bugs: RDS blocking fallback + high-CPU view definition (architecture review B1, B2)#1265

Merged
erikdarlingdata merged 1 commit into
devfrom
feature/fix-metric-drift-bugs
Jun 30, 2026
Merged

Fix two verified metric-drift bugs: RDS blocking fallback + high-CPU view definition (architecture review B1, B2)#1265
erikdarlingdata merged 1 commit into
devfrom
feature/fix-metric-drift-bugs

Conversation

@erikdarlingdata

Copy link
Copy Markdown
Owner

Two verified latent bugs the architecture-altitude review surfaced (plans/architecture-review.md, B1 + B2). Both confirmed against source before and after the fix.

B1 — blocking disappears for the anomaly engine on AWS RDS

Lite/Analysis/AnomalyDetector.cs counted blocking from v_blocked_process_reports only, while the alert/overview path (LocalDataService.Overview.cs) uses a COALESCE fallback to the always-on v_dmv_blocking_snapshots — which is the blocking source on AWS RDS, where the blocked-process-report XE session is empty. So on RDS the alert fired while the anomaly engine saw zero.

  • Fix: current_blocking now uses the same NULLIF/COALESCE fallback over both views (same collection_time window; current_deadlocks untouched).
  • Backward-compatible — the fallback only engages when the BPR count is 0, so existing behavior (and all 618 Lite tests) is unchanged.
  • Parity: Dashboard has no equivalent gap — its blocking alert computes a live sys.sysprocesses count (not collected BPR), so there's nothing to mirror. Left unchanged (verified).

B2 — "high CPU events" defined two different ways

report.daily_summary.high_cpu_events used ISNULL(total_cpu_utilization, sqlserver_cpu_utilization) >= 80 (total CPU, Linux-safe), but report.daily_summary_v2.high_cpu_count (the today-vs-yesterday tab) used sqlserver_cpu_utilization >= 80 (SQL-only) — the two views disagreed for the same metric.

  • Fix: both daily_summary_v2 occurrences now use the same ISNULL(total, sqlserver) definition (comments cite [BUG] Linux Host CPU is shown at 100% all the time (SQL Server CPU consumption probably correct) #1048). CREATE OR ALTER VIEW, reruns on upgrade — no upgrades/ script needed.
  • Parity: Lite has a single high-CPU definition and its cpu_utilization_stats table has no total_cpu_utilization column, so there's no internal split to reconcile. (Whether Lite's SQL-only definition should become total-CPU is a real semantic question — flagged in the review, not changed here.)

Verification

B1: Lite.Tests 618/618, build green. B2: SQL view change, verified by inspection (not run against SQL2022 per the no-touch-the-test-server rule).

Held for review per the overnight batch — not auto-merged.

🤖 Generated with Claude Code

B1 -- Lite anomaly engine missed the AWS RDS blocking fallback. The anomaly
detector counted blocking from v_blocked_process_reports only, while the
alert/overview path uses a COALESCE fallback to the always-on
v_dmv_blocking_snapshots (the blocking source on RDS, where the
blocked-process-report XE session is empty). So on RDS the alert saw blocking
while the anomaly engine saw zero. AnomalyDetector's current_blocking now uses
the same NULLIF/COALESCE fallback over both views (collection_time window and
current_deadlocks unchanged). Backward-compatible -- the fallback only engages
when the BPR count is 0. Parity: Dashboard has NO equivalent gap (its blocking
alert computes a live sys.sysprocesses count, not collected BPR), so nothing to
mirror; left unchanged.

B2 -- "high CPU events" was defined two ways. report.daily_summary.high_cpu_events
uses ISNULL(total_cpu_utilization, sqlserver_cpu_utilization) >= 80 (total CPU,
Linux-safe), but report.daily_summary_v2.high_cpu_count (the today-vs-yesterday
tab) used sqlserver_cpu_utilization >= 80 (SQL-only) -- the two views disagreed
for the same metric. Both v2 occurrences now use the same ISNULL(total, sqlserver)
definition. CREATE OR ALTER VIEW, reruns on upgrade -- no upgrades/ script needed.
Parity: Lite has a single high-CPU definition and its cpu_utilization_stats table
has no total_cpu_utilization column, so there is no internal split to reconcile
(the total-vs-SQL semantic question for Lite is flagged in the review, not changed).

From the architecture review (plans/architecture-review.md, local).
Lite.Tests 618/618; both fixes verified against source.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@erikdarlingdata erikdarlingdata merged commit a6d3a17 into dev Jun 30, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant