Fix two verified metric-drift bugs: RDS blocking fallback + high-CPU view definition (architecture review B1, B2)#1265
Merged
Conversation
B1 -- Lite anomaly engine missed the AWS RDS blocking fallback. The anomaly detector counted blocking from v_blocked_process_reports only, while the alert/overview path uses a COALESCE fallback to the always-on v_dmv_blocking_snapshots (the blocking source on RDS, where the blocked-process-report XE session is empty). So on RDS the alert saw blocking while the anomaly engine saw zero. AnomalyDetector's current_blocking now uses the same NULLIF/COALESCE fallback over both views (collection_time window and current_deadlocks unchanged). Backward-compatible -- the fallback only engages when the BPR count is 0. Parity: Dashboard has NO equivalent gap (its blocking alert computes a live sys.sysprocesses count, not collected BPR), so nothing to mirror; left unchanged. B2 -- "high CPU events" was defined two ways. report.daily_summary.high_cpu_events uses ISNULL(total_cpu_utilization, sqlserver_cpu_utilization) >= 80 (total CPU, Linux-safe), but report.daily_summary_v2.high_cpu_count (the today-vs-yesterday tab) used sqlserver_cpu_utilization >= 80 (SQL-only) -- the two views disagreed for the same metric. Both v2 occurrences now use the same ISNULL(total, sqlserver) definition. CREATE OR ALTER VIEW, reruns on upgrade -- no upgrades/ script needed. Parity: Lite has a single high-CPU definition and its cpu_utilization_stats table has no total_cpu_utilization column, so there is no internal split to reconcile (the total-vs-SQL semantic question for Lite is flagged in the review, not changed). From the architecture review (plans/architecture-review.md, local). Lite.Tests 618/618; both fixes verified against source. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Two verified latent bugs the architecture-altitude review surfaced (
plans/architecture-review.md, B1 + B2). Both confirmed against source before and after the fix.B1 — blocking disappears for the anomaly engine on AWS RDS
Lite/Analysis/AnomalyDetector.cscounted blocking fromv_blocked_process_reportsonly, while the alert/overview path (LocalDataService.Overview.cs) uses aCOALESCEfallback to the always-onv_dmv_blocking_snapshots— which is the blocking source on AWS RDS, where the blocked-process-report XE session is empty. So on RDS the alert fired while the anomaly engine saw zero.current_blockingnow uses the sameNULLIF/COALESCEfallback over both views (samecollection_timewindow;current_deadlocksuntouched).sys.sysprocessescount (not collected BPR), so there's nothing to mirror. Left unchanged (verified).B2 — "high CPU events" defined two different ways
report.daily_summary.high_cpu_eventsusedISNULL(total_cpu_utilization, sqlserver_cpu_utilization) >= 80(total CPU, Linux-safe), butreport.daily_summary_v2.high_cpu_count(the today-vs-yesterday tab) usedsqlserver_cpu_utilization >= 80(SQL-only) — the two views disagreed for the same metric.daily_summary_v2occurrences now use the sameISNULL(total, sqlserver)definition (comments cite [BUG] Linux Host CPU is shown at 100% all the time (SQL Server CPU consumption probably correct) #1048).CREATE OR ALTER VIEW, reruns on upgrade — noupgrades/script needed.cpu_utilization_statstable has nototal_cpu_utilizationcolumn, so there's no internal split to reconcile. (Whether Lite's SQL-only definition should become total-CPU is a real semantic question — flagged in the review, not changed here.)Verification
B1: Lite.Tests 618/618, build green. B2: SQL view change, verified by inspection (not run against SQL2022 per the no-touch-the-test-server rule).
Held for review per the overnight batch — not auto-merged.
🤖 Generated with Claude Code