Release v3.1.0 #1250

Merged
erikdarlingdata merged 250 commits into main from dev on Jun 29, 2026

@erikdarlingdata

Owner

Release v3.1.0. Full notes in CHANGELOG.md.

Highlights

Testing: installer fresh/upgrade/multi-hop/uninstall (embedded + CLI), data-survival, Azure SQL DB + AWS RDS cloud, embedded-resource upgrade discovery (#772 guard), nightly -- all green.

WARNING: Release cut -- do not merge until approved. Head is dev (required by check-pr-branch.yml). On merge: tag v3.1.0 + publish GitHub Release (triggers SignPath signing).

🤖 Generated with Claude Code

erikdarlingdata and others added 30 commits June 17, 2026 05:50
…ing sweep

The v3.0.0 collector ran its entire multi-database sweep as ONE SqlCommand
under the global 30s CommandTimeoutSeconds, cursoring every online database
into a #temp and returning a single final SELECT. Because nothing streamed
back until the end, the 30s was a cumulative, all-or-nothing budget across
every database: on larger estates the sweep exceeded 30s, failed with SQL error -2
(Execution Timeout Expired), and discarded results from EVERY database, not
just the slow one. Enabled by default and "never-run = due immediately," it
failed on first connect after upgrade and kept retrying the timeout.

Lite now collects one command per database, mirroring CollectQueryStoreAsync:
- On-prem enumerates online/accessible databases into a list on one
  connection, then runs each via [db].sys.sp_executesql with its own command,
  a dedicated 300s timeout, and per-database try/catch. Azure SQL DB connects
  to each database individually. A slow or inaccessible database now fails
  only itself; the rest still persist.
- Within each database the three DMVs (dm_db_partition_stats,
  dm_db_index_usage_stats, dm_db_index_operational_stats) are staged into
  #temp tables with single scans and then joined, giving the optimizer real
  cardinality instead of the bad plans the old monolithic multi-DMV join
  produced on large databases (the sp_IndexCleanup technique).
- Dedicated 300s timeout (matching the FinOps sp_IndexCleanup path) replaces
  the 30s meant for lightweight DMV reads.
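
The per-database isolation described above can be sketched roughly as follows. This is an illustrative Python sketch, not the project's C# collector: `collect_per_database`, `SweepResult`, and `fake_runner` are hypothetical names, and the only detail taken from the commit is that each database gets its own command, its own 300s timeout, and its own try/catch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SweepResult:
    rows: Dict[str, List[tuple]] = field(default_factory=dict)  # database -> collected rows
    failures: Dict[str, str] = field(default_factory=dict)      # database -> error text


def collect_per_database(
    databases: List[str],
    run_for_database: Callable[[str, int], List[tuple]],
    timeout_seconds: int = 300,
) -> SweepResult:
    """Run one command per database so a failure stays local to that database."""
    result = SweepResult()
    for name in databases:
        try:
            result.rows[name] = run_for_database(name, timeout_seconds)
        except Exception as exc:  # assumption: a real collector would catch narrower errors
            result.failures[name] = str(exc)
    return result


def fake_runner(name: str, timeout_seconds: int) -> List[tuple]:
    # Stand-in for the real per-database query; "slow_db" simulates a timeout.
    if name == "slow_db":
        raise TimeoutError("Execution Timeout Expired")
    return [(name, "index_stats_row")]


result = collect_per_database(["sales", "slow_db", "hr"], fake_runner)
```

In this sketch the slow database lands in `result.failures` while the other two still have rows, which is the behavior change the commit describes.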

The Dashboard's equivalent SQL collector (install/55) was not subject to the
bug (it runs under SQL Agent and persists per database), but is brought to
parity with the same DMV-staging technique for plan quality on large databases.

Validated against SQL Server 2022: install proc collected 585 rows across 12
databases; the Lite [db].sys.sp_executesql + temp-staging wrapper returns rows
in the correct database context. Lite build + 447 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ct-stats

Fix #1135: Lite index_object_stats collector times out as all-or-nothing sweep

The low-disk "Volume Free Space" alert was absent from the severity map and
fell through to INFO for every breach, under-prioritizing a condition that can
take a database into recovery/suspect and mis-routing severity-based webhooks.

It now renders WARNING for a normal breach and CRITICAL when the worst breached
volume is critically low (<=3% free or <=2GB free), via a shared
LowDiskAlertGate.IsCriticallyLow rule and an AlertContext.SeverityOverride that
rides through the email badge, Teams card, and Slack sidebar. The metric name is
unchanged, so mute rules, cooldowns, and Alert-History matching are untouched.

Fixed identically in Lite and Dashboard. Covered by AlertSeverityTests and
LowDiskAlertGateTests.
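
A minimal sketch of the grading rule, in Python for illustration only. The thresholds (<=3% free or <=2GB free) come from the message above; the function names and the reading of "worst breached volume" as "any breached volume that is critically low" are assumptions.

```python
from typing import List, Tuple


def is_critically_low(percent_free: float, free_gb: float) -> bool:
    """Critical when at most 3% free or at most 2 GB free."""
    return percent_free <= 3.0 or free_gb <= 2.0


def volume_alert_severity(breached: List[Tuple[float, float]]) -> str:
    """Grade a breach given (percent_free, free_gb) for every breached volume."""
    if not breached:
        raise ValueError("expected at least one breached volume")
    if any(is_critically_low(pct, gb) for pct, gb in breached):
        return "CRITICAL"
    return "WARNING"
```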

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e-space-severity

Grade Volume Free Space alert severity (WARNING/CRITICAL) (#1136)
…, persistence)

Adds the app-agnostic core for the alert dedup fingerprint + involved-objects
feature, with no app wiring yet:

- AlertIncident record + AlertContext.Incidents (the per-incident unit).
- AlertFingerprint: ForObjects/ForKey/Hash. SHA-256 idiom reused from
  InferenceEngine; server+incident-type scoped; case/order/whitespace-insensitive;
  volatile per-sample fields excluded; original casing preserved for display.
- AlertIncidentRenderer: projects Incidents into AlertContext.Details so the
  fingerprint renders on Teams/Slack/email(x2)/dialog with no renderer changes.
- AlertContextSerializer: persists Incidents (trailing-optional DTO, backward-
  compatible round-trip).

19 unit tests (Lite.Tests): fingerprint determinism/order/case/scoping/volatile-
exclusion, renderer projection across surfaces, serializer round-trip + legacy null.
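
The fingerprint properties listed above (server and incident-type scoped; case, order, and whitespace insensitive) can be illustrated with a short Python sketch. The canonical string layout and the function name are assumptions; only the use of SHA-256 and the insensitivity rules come from the message.

```python
import hashlib
from typing import Iterable


def _normalize(value: str) -> str:
    # Collapse whitespace runs, trim, and fold case.
    return " ".join(value.split()).lower()


def fingerprint_for_objects(server: str, incident_type: str, objects: Iterable[str]) -> str:
    """Hash a canonical form so equivalent incidents share one key."""
    parts = sorted({_normalize(name) for name in objects})  # order-insensitive, de-duplicated
    canonical = "|".join([_normalize(server), _normalize(incident_type), *parts])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = fingerprint_for_objects("PROD01", "deadlock", ["Sales.dbo.Orders", "Sales.dbo.Customers"])
b = fingerprint_for_objects(" prod01 ", "Deadlock", ["sales.dbo.customers ", "SALES.dbo.orders"])
c = fingerprint_for_objects("PROD02", "deadlock", ["Sales.dbo.Orders", "Sales.dbo.Customers"])
```

Here `a` and `b` are equal while `c` differs because the server differs. A real implementation would keep the original casing separately for display, as the message notes.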

Refs #1140

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Extracts the grouping + fingerprint logic out of the (untestable) WPF builders into
shared, unit-tested helpers both apps' live builders will call, so grouping/fingerprint
are identical across Lite and Dashboard:

- BlockingIncidentGrouper: collapses blocked-process samples that are one chain into a
  single incident with the true occurrence count + wait range (fixes gotqn's "same chain
  shown 3x, count says 8"). Identity = resolved contentious object, else database +
  literal-stripped blocked/blocking query pair.
- DeadlockIncidentGrouper: groups deadlocks by sorted involved-object set (multi-DB
  deadlock = one incident; recurrences collapse with a count).
- DeadlockObjectExtractor: pulls db.schema.object names from a deadlock graph
  resource-list (Lite's source; Dashboard already has DeadlockItem.ObjectNames).

10 unit tests covering chain collapse, literal-varying grouping, object-vs-query-pair
identity, multi-DB deadlock, order-independence, and XML extraction.
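
As an illustration of grouping deadlocks by their sorted involved-object set, here is a small Python sketch. `group_deadlocks` is a hypothetical name, and lower-casing the names is a simplification for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def group_deadlocks(deadlocks: Iterable[Iterable[str]]) -> List[Tuple[Tuple[str, ...], int]]:
    """Collapse deadlocks that involve the same object set into one incident with a count."""
    counts: Counter = Counter()
    for objects in deadlocks:
        # Sorting makes the key independent of the order objects appear in the graph.
        key = tuple(sorted({name.strip().lower() for name in objects}))
        counts[key] += 1
    return sorted(counts.items())


incidents = group_deadlocks([
    ["Sales.dbo.Orders", "Hr.dbo.Staff"],   # multi-database deadlock
    ["hr.dbo.staff", "sales.dbo.orders"],   # same objects, different order and casing
    ["Sales.dbo.Invoices"],
])
```

The first two samples collapse into one incident with a count of 2; the third stays separate.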

Refs #1140 #1141

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Wires the shared groupers/fingerprint into the live "Detected"/threshold builders in
Lite (the path consumers actually receive):

- BuildBlockingContextAsync: groups blocked-process samples via BlockingIncidentGrouper
  so one chain shows once with its true occurrence count + wait range (was listed once
  per sample, capped at 3 while the count said more), and surfaces "+N more" instead of
  silently dropping. Attaches the dedup fingerprint. (Object identity arrives once the
  Lite collector resolves contentious_object, plan §5.3; falls back to db+query-pair now.)
- BuildDeadlockContextAsync: fingerprints by involved-object set parsed from the deadlock
  graph (DeadlockObjectExtractor) across ALL deadlocks in the window, grouped + counted.
- BuildVolumeFreeSpaceContext / BuildAnomalousJobContext: per-volume / per-job dedup key.

serverName threaded into all four builders (the fingerprint scopes on it). Lite builds
clean, 0 warnings. LRQ builder still pending its query_hash collection change (§5.2).

Refs #1140 #1141

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ity with Lite)

Mirrors the Lite wiring in the Dashboard's live builders so both apps emit identical
fingerprints:

- BuildBlockingContextAsync: dedup by the resolved contentious_object (already produced by
  sp_HumanEventsBlockViewer and surfaced on BlockingEventItem) via the shared grouper.
- BuildDeadlockContextAsync: dedup by involved-object set parsed from the deadlock graph
  with the same shared DeadlockObjectExtractor Lite uses.
- BuildLongRunningQueryContext: dedup key = query_hash, newly captured from
  sys.dm_exec_requests (CONVERT(varchar(18), r.query_hash, 1)) into LongRunningQueryInfo.
- BuildVolumeFreeSpaceContext / BuildAnomalousJobContext: per-drive / per-job key.

serverName threaded into all five builders + their five call sites. Dashboard needed no
schema change (object_names + contentious_object already collected). Builds clean, 0 warnings.
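
For readers wondering why the target type is varchar(18): query_hash is an 8-byte binary value, and CONVERT style 1 renders binary as a "0x" prefix followed by two hex digits per byte, so 2 + 16 = 18 characters. A Python equivalent of that rendering, with a made-up sample value:

```python
def format_query_hash(raw: bytes) -> str:
    """Render an 8-byte hash as '0x' followed by 16 upper-case hex digits."""
    if len(raw) != 8:
        raise ValueError("query_hash is expected to be 8 bytes")
    return "0x" + raw.hex().upper()


# Hypothetical sample value, not taken from any real server.
example = format_query_hash(bytes.fromhex("1a2b3c4d5e6f7081"))
```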

Refs #1140

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ry_hash

Completes the Lite live-path parity by collecting the two identity fields Lite was missing
(validated against SQL 2022):

- blocked_process_reports: capture the blocked_process_report event's own object_id/database_id
  and resolve contentious_object server-side in the collection query, mirroring
  sp_HumanEventsBlockViewer EXACTLY (2-part schema.object + identical 'Unresolved: ...' fallback)
  so the fingerprint matches the Dashboard for the same object. New columns added at the end of
  the table + appender; v30 migration (ALTER ADD COLUMN); v_ views union BY NAME so old parquet
  reads back NULL. BuildBlockingContextAsync now uses the resolved object as the identity.
- query_snapshots: capture query_hash (CONVERT(varchar(18), query_hash, 1)) in both the on-prem
  and Azure (#req) snapshot queries; surface it through GetLongRunningQueriesAsync; the Lite LRQ
  builder now emits a query_hash dedup key.

Schema v29 -> v30. Lite builds clean, 0 warnings.

Refs #1140

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…p-fingerprint

#1140/#1141: stable dedup fingerprint + involved-objects on alert payloads
…oth apps)

Completes #1140 by giving the secondary anomaly path (ANOMALY_*_SPIKE / CPU findings via
AnalysisNotificationService) the same fingerprints as the live "Detected" path:

- DrillDownCollector (both apps): top_deadlocks now carries the involved objects (parsed
  from the deadlock graph via the shared DeadlockObjectExtractor — raw XML NOT surfaced),
  and top_blocking_chains now carries contentious_object. Source columns already existed.
- FindingMessageFormatter.BuildContext (shared): derives context.Incidents from the
  drill-down — deadlock -> involved-object set, blocking -> contentious object / query
  pair, query/CPU -> distinct query_hash — reusing the same shared groupers/fingerprint
  as the live builders, so either path produces an identical key. Incidents are appended
  after the detail items (existing Diagnosis->Advice->drill-down order preserved).

Shared code, so Lite/Dashboard parity is automatic. +4 finding-path tests; 1 existing
count-based test updated (its top_cpu_queries drill-down now yields 2 query incidents).
Lite 492 + Dashboard 487 tests green; both apps build 0-warnings.

Refs #1140

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…th-incidents

#1140: dedup fingerprints on the anomaly/finding alert path (both apps)
…al, both apps)

Adds an opt-in Per-event delivery mode (default stays Summary) that sends one notification
per distinct incident instead of the batched per-cycle card, so downstream automation can
open/track one ticket per incident and count recurrences via the #1140 fingerprint.

- PerEventNotification.Split (shared): one message per incident, capped at the configured
  max-per-cycle, with a trailing "+N more" message that still carries the remaining
  fingerprints so none are silently dropped. Recurrence handling is left to the existing
  edge-triggered gating + the consumer's fingerprint dedup.
- Settings: AlertDeliveryMode (Summary|PerEvent) + AlertPerEventMaxPerCycle (default 10) in
  both apps (Lite App statics + JSON; Dashboard UserPreferences), with load/save/reset.
- Settings UI: a delivery-mode dropdown + per-cycle cap in both SettingsWindows.
- Firing: a SendDetectedAlertAsync helper in each MainWindow routes the "Blocking Detected"
  and "Deadlocks Detected" sends through Per-event when enabled; alert-history recording is
  unchanged (one row per fire).
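
A rough Python sketch of the split behavior described above. The message shape (a dict with `title` and `fingerprints`) is hypothetical, and whether the trailing "+N more" message counts toward the cap is an assumption; here it does not.

```python
from typing import Dict, List


def split_per_event(fingerprints: List[str], max_per_cycle: int = 10) -> List[Dict]:
    """One message per incident up to the cap, then one trailing overflow message."""
    messages = [{"title": fp, "fingerprints": [fp]} for fp in fingerprints[:max_per_cycle]]
    remaining = fingerprints[max_per_cycle:]
    if remaining:
        # The overflow message still carries every remaining fingerprint,
        # so a downstream consumer never loses an incident key.
        messages.append({"title": f"+{len(remaining)} more", "fingerprints": remaining})
    return messages


capped = split_per_event(["a", "b", "c", "d", "e"], max_per_cycle=3)
uncapped = split_per_event(["a"], max_per_cycle=3)
```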

Scope: GLOBAL setting (per-server override is a tracked fast-follow). 5 unit tests for the
split helper. Lite 497 + Dashboard 487 tests green; both apps build 0-warnings.

Refs #1141

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…notifications

#1141: per-event notification mode for deadlock/blocking alerts (global)
…tqn feedback)

Addresses gotqn's two findings from testing the dev build (the #1140 fingerprint itself
tested great — stable key + climbing Occurrences):

1. Per-event cards no longer carry LESS detail than Summary. AlertIncident now carries
   transient DetailFields (forensic facts), populated by the groupers from the
   representative event: blocking -> Database / Contentious Object / Blocked Query /
   Blocking Query / Lock Mode; deadlock -> Victim SQL / Processes (Lite), Query / Wait
   Resource / Lock Mode (Dashboard). PerEventNotification.Split renders them onto each
   per-incident card and now also carries the source AttachmentXml/FileName so per-event
   email keeps the deadlock_graph.xml / blocked_process_report.xml. Summary rendering is
   untouched (AlertIncidentRenderer.Apply leaves DetailFields off to avoid duplicating the
   builder's own items), and DetailFields are not persisted.
2. Per-event "Current Value" is now the occurrence count (a number, matching Summary),
   not the involved-objects string (which already shows as its own fact).

Lite 500 + Dashboard 487 tests green (3 new: detail preserved, attachment carried,
Current Value = count); both apps build 0 warnings.

Refs #1141 #1140

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…detail

#1141: per-event cards keep forensic detail + numeric Current Value (gotqn feedback)

#981 added restart-dedup for the email channel only; a restart cleared the
two guards that suppress a webhook re-send, so reopening Lite re-posted a
Teams/Slack alert already delivered before the restart (identical Dedup Key
and Occurrences). Two-part fix:

1. Webhook cooldown seed (shared, BOTH apps). WebhookAlertService now seeds
   its per-(serverId, metricName) cooldown from alert history on first use,
   mirroring the email seed, via a new IAlertHistoryStore.GetLastWebhookSentUtcAsync.
   Lite filters notification_type IN ('webhook','email+webhook'); Dashboard
   filters NotificationType == "webhook". send_error is NOT filtered on -- it
   tracks the email channel, so an email-failed-but-webhook-sent row must still
   seed. Wired into the WebhookAlertService DI in both MainWindows.

2. Edge-trigger watermark persistence (Lite). The rolling-count gate's
   in-memory watermark (#1091) reset to 0 on restart, so the first sweep
   re-fired for events still in the 1-hour lookback -- and because that gap can
   exceed the cooldown, the seed alone (time-bounded) does not cover it. The
   watermark now persists to a new config_edge_trigger_watermarks DuckDB table
   (upsert on change), seeded before the first sweep at startup.

Dashboard needs no watermark persistence: its deadlock gate re-baselines on
restart (raw delta) or is 5-min-windowed (always within the cooldown the seed
now covers), and blocking is level+cooldown -- none produce the byte-identical
duplicate the Lite edge-trigger gate does.

Tests: Lite 505 + Dashboard 487 green. New: webhook-row history filter +
watermark save/load/upsert round-trips + WebhookAlertService seed-suppresses /
seed-older-than-cooldown-does-not / null-store-attempts.
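
The cooldown seed in part 1 can be sketched as below, in Python for illustration. `WebhookCooldown` and its `load_last_sent` callback are hypothetical stand-ins for WebhookAlertService and IAlertHistoryStore.GetLastWebhookSentUtcAsync; the three scenarios mirror the new tests named above.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, Optional, Set, Tuple

Key = Tuple[int, str]  # (serverId, metricName)


class WebhookCooldown:
    def __init__(
        self,
        cooldown: timedelta,
        load_last_sent: Optional[Callable[[int, str], Optional[datetime]]] = None,
    ) -> None:
        self._cooldown = cooldown
        self._load_last_sent = load_last_sent  # None models a missing history store
        self._last_sent: Dict[Key, datetime] = {}
        self._seeded: Set[Key] = set()

    def should_send(self, server_id: int, metric: str, now: datetime) -> bool:
        key = (server_id, metric)
        if key not in self._seeded:
            # Seed once, on first use, from persisted alert history.
            self._seeded.add(key)
            if self._load_last_sent is not None:
                seeded = self._load_last_sent(server_id, metric)
                if seeded is not None:
                    self._last_sent.setdefault(key, seeded)
        last = self._last_sent.get(key)
        return last is None or now - last >= self._cooldown

    def mark_sent(self, server_id: int, metric: str, now: datetime) -> None:
        self._seeded.add((server_id, metric))
        self._last_sent[(server_id, metric)] = now


now = datetime(2026, 6, 20, 12, 0, tzinfo=timezone.utc)
recent = WebhookCooldown(timedelta(minutes=15), lambda s, m: now - timedelta(minutes=5))
stale = WebhookCooldown(timedelta(minutes=15), lambda s, m: now - timedelta(minutes=60))
no_store = WebhookCooldown(timedelta(minutes=15))
```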

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…send

Fix #1145: Webhook (Teams/Slack) alerts re-fire after an app restart
…surfacing it

Upgrading the app appeared not to take effect: the apps are single-instance
(constant mutex) and minimize to tray, so an old build kept running after the
user "closed" it, and launching the new build just surfaced the old in-memory
version via the mutex. Fix: a version-aware handoff at startup — a newer build
closes an older tray-resident one and takes over, instead of being handed back
the stale version.

Shared PerformanceMonitor.Ui:
- SingleInstanceDecision: pure, unit-tested decision (older->take over;
  same/newer->surface; older-but-higher-integrity->actionable error).
- ProcessInspector: Win32 — read the other instance's release version from its
  on-disk exe (QueryFullProcessImageNameW, cross-integrity for same user),
  measure integrity level directly, detect split-token admin. Fails closed.
- SingleInstanceCoordinator: acquire-or-handoff; prompt; graceful exit signal
  (old runs its real shutdown), bounded wait, force-kill last resort; mutex
  take-over; elevated relaunch (--upgrade-takeover) for the elevated-old case.
- MessageBoxHandoffPrompts: shared dialogs (both apps, parity).
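
The three outcomes listed for SingleInstanceDecision can be written as a pure function. The Python below is only a sketch of that decision table; the enum, the tuple-based version comparison, and the parameter names are assumptions, and the elevated-relaunch path is left out.

```python
from enum import Enum
from typing import Tuple


class Decision(Enum):
    TAKE_OVER = "take over"
    SURFACE_EXISTING = "surface existing"
    INTEGRITY_ERROR = "actionable error"


def decide(
    new_version: Tuple[int, ...],
    running_version: Tuple[int, ...],
    running_is_higher_integrity: bool,
) -> Decision:
    """Decide what a newly launched build does about an already running instance."""
    if running_version >= new_version:
        return Decision.SURFACE_EXISTING   # same or newer: just bring it to the front
    if running_is_higher_integrity:
        return Decision.INTEGRITY_ERROR    # older, but this process cannot close it
    return Decision.TAKE_OVER              # older: close it and take over
```

Keeping the decision free of Win32 calls is what makes it unit-testable; the process inspection and the handoff itself live in the other two classes.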

Both apps: OnStartup runs the coordinator (replacing the inline mutex/surface
block) synchronously before any window/DB/port init; OnExit disposes it;
MainWindow opens the exit-for-upgrade channel only after init (so a newer build
won't disturb a mid-initializing instance). Scoped by exe name so Lite never
targets Dashboard and vice-versa. Local\ session scoping kept intentionally.

Two adversarial plan reviews + one implementation review folded in (version
field = ProductVersion not FileVersion; mutex-throw vs an elevated instance ->
integrity-error path not crash; deferred exit listener; direct IL measure;
runas gated to split-token admins; UAC-cancel handled; handles disposed).

Tests: Lite 524 + Dashboard 487 green; 0 new warnings. Manual smoke testing of
the upgrade/elevation paths still required before merge (see
plans/single-instance-upgrade-handoff.md).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-upgrade-handoff

Single-instance upgrade handoff: close the stale instance instead of surfacing it

Low-risk dependency refresh (Velopack 0.x→1.2.0 and DuckDB held as separate efforts):

- Microsoft.Extensions.* (Configuration, Configuration.Json, Hosting, Logging,
  Logging.Abstractions): 10.0.8 -> 10.0.9 (tracks the .NET 10 servicing line)
- ModelContextProtocol + ModelContextProtocol.AspNetCore: 1.3.0 -> 1.4.0
- Microsoft.NET.Test.Sdk: 18.5.1 -> 18.6.0 (test projects)

Lock files regenerated (--force-evaluate) for --locked-mode CI restore.
Build clean (0 new warnings); Lite 524 + Dashboard 487 + Installer.Tests (fast
subset) 61 green. MCP 1.4.0 compiled with no source changes needed.

ScottPlot.WPF, Microsoft.Data.SqlClient, Hardcodet, CredentialManagement, xunit
are already at latest. WPF/.NET stays on .NET 10 (11 is preview).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…y-bumps

Bump minor/patch NuGet dependencies (Extensions 10.0.9, MCP 1.4.0, Test.Sdk 18.6.0)

Velopack 1.x is the stable line; this also corrects a latent mismatch — build.yml
ran `dotnet tool install -g vpk` unpinned, so releases were already packed with vpk
1.x while the app library trailed at 0.0.1298. This aligns the reader library with
the packer (now pinned to vpk 1.2.0) without changing the feed format.

- Dashboard + Lite: Velopack PackageReference 0.0.1298 -> 1.2.0
- build.yml: `dotnet tool install -g vpk --version 1.2.0` (was unpinned)
- Lock files regenerated (--force-evaluate); they shrink because Velopack 1.x
  dropped the NuGet.Versioning transitive dep (custom SemanticVersion, 1.0.1).

No source changes needed — VelopackApp.Build().Run(), UpdateManager, GithubSource,
CheckForUpdatesAsync/DownloadUpdatesAsync/ApplyUpdatesAndRestart all unchanged.
Build clean (0 new warnings); Lite 524 + Dashboard 487 green.

Live cross-release auto-update to be validated at the next release (checklist 8b).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bump Velopack 0.0.1298 → 1.2.0 and pin the vpk CLI to match

The only build warning across the solution: UnforceFunc was read in
UnforcePlanAsync but never assigned by any test, so it was always null
(CS0649). Removed the field and its no-op invoke; UnforcePlanAsync returns
the same default outcome as before. Solution now builds at 0 warnings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eanup

Build health: remove dead test seam (solution to 0 warnings)

The repo's .gitattributes already normalizes text (`* text=auto eol=crlf`) and
the tree is already normalized (`git add --renormalize .` is a no-op). Gap: with
the global eol=crlf rule, a future *.sh would be checked out CRLF and fail under
Git Bash, which this repo uses heavily. Add `*.sh text eol=lf`. No tracked .sh
files today, so this changes nothing now — it's a latent-footgun guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Harden .gitattributes: force LF for shell scripts

Both fixes bring Dashboard in line with the already-correct Lite copies;
surfaced by a Lite<->Dashboard code-sharing drift audit.

1. SqlServerBaselineProvider: the full (hour, day-of-week) bucket tier was
   assigned via a copy-paste ternary that returned BaselineTier.HourOnly for
   sparse buckets (count < CollapseThreshold), so they were mislabeled HourOnly
   in baseline_tier. Every bucket on this path is Full; HourOnly/Flat are
   assigned only on the collapse/flat paths. Matches Lite.

2. SqlServerAnomalyDetector.DetectBlockingAnomalies: blocking/deadlock spike
   ratios compared a raw window count against a per-hour baseline mean, so the
   ratio scaled with window length (default 4h) and a steady event rate could
   trip the spike threshold. Normalize current counts to per-hour before the
   ratio, mirroring Lite.
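
A worked example of the second fix, in Python for illustration (the function name is hypothetical). With a steady 2 events per hour and the default 4h window, the raw comparison gives 8 / 2 = 4.0, which looks like a spike; normalizing first gives (8 / 4) / 2 = 1.0.

```python
def spike_ratio(window_count: int, window_hours: float, baseline_mean_per_hour: float) -> float:
    """Compare like with like: convert the window count to a per-hour rate first."""
    if window_hours <= 0 or baseline_mean_per_hour <= 0:
        raise ValueError("window_hours and baseline_mean_per_hour must be positive")
    current_per_hour = window_count / window_hours
    return current_per_hour / baseline_mean_per_hour


raw_ratio = 8 / 2.0                    # the old comparison: scales with window length
normalized = spike_ratio(8, 4.0, 2.0)  # the fixed comparison
```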

Dashboard build + 487 Dashboard.Tests pass. No Lite changes (already correct).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…erprint

The email/webhook cooldown was keyed on (serverId, metricName), ignoring the
#1140 per-incident dedup fingerprint, so a genuinely distinct deadlock/blocking/
query/job/disk incident arriving inside the EmailCooldownMinutes window was
silently dropped from email/Teams/Slack (the tray still fired). Per-event mode
also collapsed to one notification per cycle because each per-incident send
shared the single metric key.

Introduce a shared IncidentCooldown (PerformanceMonitor.Notifications) keyed per
fingerprint: send if any incident in the alert is outside its window, stamp every
candidate key on success, and fall back to the metric-level key when an alert
carries no fingerprintable incident (CPU/memory/poison-wait/tempdb/failed-job --
behavior unchanged). The restart seed (#981 email, #1145 webhook) is now
per-fingerprint, reconstructed from the persisted ContextJson via an anchored
"DedupKey" match (null-guarded for the Dashboard scan's null-context rows); the
webhook null-store no-seed path is preserved. The per-fingerprint dict is bounded
by the 2x-window eviction idiom reused from AnalysisNotificationService.
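
The per-fingerprint decision can be sketched as follows. This is a Python illustration under stated assumptions: the class mirrors the name IncidentCooldown but its methods, the key prefixes, and the exact eviction rule are hypothetical; the three behaviors (send if any incident is outside its window, stamp every candidate key, fall back to the metric key) come from the message.

```python
from datetime import datetime, timedelta
from typing import Dict, List


class IncidentCooldown:
    def __init__(self, window: timedelta) -> None:
        self._window = window
        self._last_sent: Dict[str, datetime] = {}

    @staticmethod
    def _keys(metric_key: str, fingerprints: List[str]) -> List[str]:
        # Fall back to the metric-level key when nothing is fingerprintable.
        if fingerprints:
            return [f"fp:{fp}" for fp in fingerprints]
        return [f"metric:{metric_key}"]

    def should_send(self, metric_key: str, fingerprints: List[str], now: datetime) -> bool:
        for key in self._keys(metric_key, fingerprints):
            last = self._last_sent.get(key)
            if last is None or now - last >= self._window:
                return True
        return False

    def stamp(self, metric_key: str, fingerprints: List[str], now: datetime) -> None:
        for key in self._keys(metric_key, fingerprints):
            self._last_sent[key] = now
        # Bound the dictionary: drop entries older than twice the window.
        cutoff = now - 2 * self._window
        self._last_sent = {k: t for k, t in self._last_sent.items() if t >= cutoff}


t0 = datetime(2026, 6, 20, 12, 0)
later = t0 + timedelta(minutes=5)
cooldown = IncidentCooldown(timedelta(minutes=15))
first = cooldown.should_send("Deadlocks Detected", ["fpA"], t0)
cooldown.stamp("Deadlocks Detected", ["fpA"], t0)
repeat = cooldown.should_send("Deadlocks Detected", ["fpA"], later)
distinct = cooldown.should_send("Deadlocks Detected", ["fpA", "fpB"], later)
```

In this sketch `repeat` is suppressed while `distinct` goes out, because the second alert carries an incident (`fpB`) that has never been sent.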

Both apps and both channels run the identical shared decision; only the seed query
differs (Lite DuckDB LIKE vs Dashboard in-memory scan), both pinned to the real
serializer output by tests.

Tests: Lite 544/544, Dashboard 488/488.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
erikdarlingdata and others added 29 commits June 26, 2026 14:31
The per-event vs summary delivery mode shipped as a global setting (#1141).
This adds the optional per-server override the original request mentioned:
Per-event for one noisy prod box while the global default stays Summary.

- ServerConnection (both apps) gains a nullable AlertDeliveryModeOverride;
  null inherits the global, persisted in the existing servers.json (no new store).
- Shared AlertDeliveryModeResolver (Notifications) centralizes the precedence
  (override wins, null inherits) so Lite and Dashboard can't drift.
- SendDetectedAlertAsync in both apps resolves the effective mode per server
  before splitting per-event; Lite maps its int serverId hash back to the server.
- Add/Edit Server dialog (both apps) gets an "Alert delivery" combo
  (Use global setting / Summary / Per-event), wired through load + save.
- Tests: resolver precedence + ServerConnection JSON round-trip incl.
  legacy-file-without-field inherits global (both suites).
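
The precedence rule is small enough to show in full. The Python below is an illustrative sketch: the function name and the sample JSON documents are made up, and only the field name AlertDeliveryModeOverride and the "override wins, null inherits" rule come from the message.

```python
import json
from typing import Optional


def resolve_delivery_mode(global_mode: str, server_override: Optional[str]) -> str:
    """Override wins; a missing or null override inherits the global setting."""
    return server_override if server_override is not None else global_mode


# A legacy server entry has no override field at all, so .get() yields None.
legacy_server = json.loads('{"Name": "prod01"}')
noisy_server = json.loads('{"Name": "prod02", "AlertDeliveryModeOverride": "PerEvent"}')

legacy_mode = resolve_delivery_mode("Summary", legacy_server.get("AlertDeliveryModeOverride"))
noisy_mode = resolve_delivery_mode("Summary", noisy_server.get("AlertDeliveryModeOverride"))
```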

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-alert-delivery-mode

Per-server override for alert delivery mode (#1236)

A fresh Lite install / nightly extract created %LOCALAPPDATA%\...\config\ empty
and never seeded it from the bundled config\ignored_wait_types.json. Because
LoadIgnoredWaitTypes() reads only the per-user path, a clean box returned an
empty set (cached by the Lazy), the wait filter became a no-op, and every benign
wait (SOS_WORK_DISPATCHER, DISPATCHER_QUEUE_SEMAPHORE, CLR_AUTO_EVENT, ...)
flooded collection and the wait stats tab.

- App: seed the per-user config dir from the bundled copies on first run
  (copy-if-absent; never clobber a user-edited file), via new ConfigSeeder.
- LoadIgnoredWaitTypes: fall back to the bundled copy if the per-user file is
  still missing, so the filter can never silently be empty; warn if neither.
- Tests: ConfigSeeder copies when absent, never overwrites, no-ops on a
  missing bundle.
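
A copy-if-absent seeder along these lines, sketched in Python for illustration (`seed_config` is a hypothetical name; the three behaviors match the tests listed above):

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def seed_config(bundled_dir: Path, user_dir: Path) -> List[str]:
    """Copy bundled config files into the per-user dir without overwriting anything."""
    if not bundled_dir.is_dir():
        return []  # missing bundle: nothing to seed
    user_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in sorted(bundled_dir.iterdir()):
        target = user_dir / source.name
        if source.is_file() and not target.exists():
            shutil.copy2(source, target)
            copied.append(source.name)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    bundled = Path(tmp) / "bundled"
    bundled.mkdir()
    (bundled / "ignored_wait_types.json").write_text('["CLR_AUTO_EVENT"]')
    user = Path(tmp) / "user"

    first_run = seed_config(bundled, user)                 # copies the file
    (user / "ignored_wait_types.json").write_text('["edited"]')
    second_run = seed_config(bundled, user)                # leaves the edit alone
    kept_text = (user / "ignored_wait_types.json").read_text()
    missing_bundle = seed_config(Path(tmp) / "nope", user)
```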

Lite-only: Dashboard seeds its list server-side into config.ignored_wait_types
during install, so it's unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Seeding/copying ignored_wait_types.json fixes collection going forward but can't
remove rows already in the DuckDB, and the wait-stats tab had no display-time
filter. So a box that collected benign waits before the filter was active (e.g. a
fresh extract that ran before the per-user JSON existed) kept showing
SOS_WORK_DISPATCHER, DISPATCHER_QUEUE_SEMAPHORE, CLR_AUTO_EVENT, etc. dominating
the tab even after the JSON was put in place.

- New IgnoredWaitTypes: one shared source for the ignored set (per-user copy then
  bundled fallback) plus a sanitized "AND wait_type NOT IN (...)" builder.
  Collection (RemoteCollectorService) and display (LocalDataService) both use it,
  so the two lists can't drift.
- LocalDataService wait queries (top list, picker distinct types, total trend)
  exclude ignored waits at query time. Non-destructive: rows stay in the DuckDB
  and age out via retention; nothing is deleted.
- Test for the exclusion-clause builder incl. injection-safety sanitization.
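
One way such an exclusion-clause builder might sanitize its input, shown in Python for illustration. The allow-list pattern (letters, digits, underscore) is an assumption about what a wait type name looks like; anything else is dropped rather than escaped.

```python
import re
from typing import Iterable

_WAIT_TYPE = re.compile(r"[A-Za-z0-9_]+")


def build_wait_exclusion(ignored: Iterable[str]) -> str:
    """Build an 'AND wait_type NOT IN (...)' fragment from allow-listed names only."""
    safe = sorted({w.strip() for w in ignored if _WAIT_TYPE.fullmatch(w.strip())})
    if not safe:
        return ""  # no filter rather than an invalid empty IN list
    quoted = ", ".join(f"'{name}'" for name in safe)
    return f"AND wait_type NOT IN ({quoted})"


clause = build_wait_exclusion([
    "SOS_WORK_DISPATCHER",
    "x'); DROP TABLE waits;--",   # rejected by the allow-list
    "CLR_AUTO_EVENT",
])
```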

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…red-waits

Fix #1240: Lite seeds per-user ignored_wait_types.json on first run

Clicking a multi-series trend chart's built-in legend key (or its line)
now dims the other series and auto-fits the Y axis to the clicked one, so
a series that sits flat under the big lines becomes readable. Clicking
again, switching to another series, or double-clicking (autoscale)
restores the full view. It is a transient view toggle only: it never
changes the picker selection and never refetches or deletes data, and any
re-render (picker/time-range change or background poll) resets it.

The mechanic lives entirely in the shared PerformanceMonitor.Ui
ChartHoverHelper that every dynamic-legend chart in both apps already
routes its series through, so Lite and Dashboard cannot drift:
- capture each series' identity color from MarkerStyle.FillColor in Add
- left-click handlers branch once on legend-panel containment, then run
  the ScottPlot 5.1.58 legend hit-test or the existing line hit-test
  (click-vs-drag < 5px; never sets e.Handled, so pan/zoom keep working)
- Isolate/Restore dims via each series' own color (Dark/Light/CoolBreeze
  safe) and clears+restores Dashboard's LockedVertical axis rule so the
  Y-fit actually sticks (Lite installs no rule, so it is a no-op there)
- a static ConditionalWeakTable<WpfPlot, ChartHoverHelper> + TryGetForChart
  lets the per-app autoscale handlers clear an active isolate first
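
The click-vs-drag test mentioned above can be illustrated in a few lines of Python. The 5px threshold is from the message; measuring Euclidean distance between the press and release points is an assumption about how the helper compares them.

```python
import math
from typing import Tuple


def is_click(down: Tuple[float, float], up: Tuple[float, float], threshold_px: float = 5.0) -> bool:
    """Treat a press/release pair as a click only if the pointer moved less than the threshold."""
    return math.hypot(up[0] - down[0], up[1] - down[1]) < threshold_px
```

Anything at or beyond the threshold is left to the chart's own pan/zoom handling, which is why the handler never needs to mark the event as handled.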

Per-app hooks: both the "Revert (Autoscale)" menu item and the
double-click handler in Dashboard TabHelpers and Lite ContextMenuHelper
call Restore() before AutoScale() (4 sites, symmetric).

Pure helpers (toggle transitions, dim/restore decision, Y-fit range math
incl. the degenerate-flat guard, axis-rules bookkeeping) are unit tested
in both suites via ChartClickIsolateTests (20 each). Both apps build
green; Dashboard.Tests 687 pass, Lite.Tests 628 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Post-review fixes on the chart click-to-isolate feature, found during
re-verification (the implementer's self-review and green tests had missed them):

- Restore was unfaithful for line-only charts. CollectorDuration and the trend
  charts build line-only (MarkerSize 0, no fill) and never call StyleScatter, but
  Isolate/Restore re-ran StyleScatter on every series, sprouting density markers
  and a gradient fill ribbon they never had (until the next poll-rebuild). Add now
  snapshots each series' full visual state (identity color, line color/width,
  marker size, FillY); RestoreSeriesVisual writes it back, re-running StyleScatter
  ONLY for fill charts (it regenerates the gradient from the unchanged data, so it
  reproduces the original). Faithful for fill, line-only, and flat-StyleScatter'd
  series. +2 headless regression tests per suite.

- Double-click no longer relies on e.ClickCount on the terminal up (uncertain WPF
  semantics). A MouseDoubleClick handler sets a suppress flag consumed at the TOP
  of OnLeftButtonUp, before the _leftPressed gate: the 2nd-down's
  PreviewMouseLeftButtonDown is marked Handled by Control.HandleDoubleClick so our
  press handler is skipped and _leftPressed is already false there -- consuming the
  flag later would leave it stuck and swallow the next genuine click.

Both apps build green; Dashboard.Tests 689, Lite.Tests 630, 0 failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…late

Add chart click-to-isolate on legend keys and series lines

Per UX feedback: isolating a series rescaled the Y axis to that series' own
high-water marks every time -- useful for a buried/flat line, but jarring when
the series is already prominent. Drop the auto-fit: isolate now just dims the
other series and leaves the axes alone (to inspect a buried wait, deselect the
big ones in the picker, which re-renders + autoscales).

Removes the now-unused Y-fit + axis-rule machinery (AutoFitYToSeries,
ComputeIsolateYLimits, SaveAndClearRules/RestoreAxisRules, the _preIsolateLimits
and _savedRules fields) and their unit tests. Restore now just un-dims.

Both apps build green; Dashboard.Tests 677, Lite.Tests 618, 0 failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make click-to-isolate dim-only (no Y auto-fit)
…ifest fix

- Bump Dashboard/Lite/Installer/Installer.Core to 3.1.0 (Version/AssemblyVersion/FileVersion/InformationalVersion)
- CHANGELOG: roll [Unreleased] -> [3.1.0] - 2026-06-27; add full-detail entries for the
  shipped-but-undocumented work since 3.0.0 (block-chain viewer, deadlock graph viewer,
  always-on DMV blocking-snapshot fallback, incident clustering, FinOps object-growth/locking
  heatmaps, per-server alert delivery override, MCP status envelope, Lite ignored-waits
  seeding, Lite picker-chart N+1 fix)
- README: collector counts 33->34 and 25->26 to match the schedule table; add the
  block-chain/deadlock viewers to both apps' tab lists; Recommendations "grouped by severity"
  -> "grouped into incidents" (#1214); note the always-on DMV blocking fallback in the AWS RDS section
- upgrades/3.0.0-to-3.1.0: add the missing upgrade.txt manifest (a folder with no manifest is
  silently skipped by ScriptProvider, so the blocking_ecid/monitor_loop ALTER would never run on
  upgrade) and add the required SET-options + USE PerformanceMonitor header to the ALTER script

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ging from human prose

The shipped 3.0 advice prose (FactAdvice.cs) read like folklore to a human card reader:
maybe-if hedging and MCP tool names (get_*) where it should have stated the measured number.
Root defect = unincorporated facts; the hedging and tool-names were symptoms.

- All 56 advice blocks now COMPOSE from the fact set at analysis time: they state the collected
  Value/Metadata (current MAXDOP/CTFP, max server memory, RCSI-off DB count, dominant lock mode,
  SOS signal-wait share, lead-blocker permutations) instead of telling the reader to run a tool.
  Tool/field names stripped from the composed path.
- Object-name bug fixed: ANOMALY_OBJECT_GROWTH/CONTENTION prose promised to name an object, but
  both collectors SELECTed schema/table/index then dropped them. New Fact.ObjectName carrier
  (Metadata stays doubles-only); both AnomalyDetectors read the dropped columns; composers state
  dbo.Orders / index IX_*.
- Remediation regrounded: each remediation states the co-fired findings that actually fired
  (PLAN_REGRESSION/PARAMETER_SENSITIVITY/MISSING_INDEX/PLAN_WARNING/CXPACKET) instead of a bag of
  guesses; SOS "more cores?" ties to the stated signal-wait share.
- Static-fallback scrub: the ~36 static _byKey blocks (render only for legacy empty-StoryText
  findings, which self-heal on the next analysis run) had the full bag-of-tricks + tool names;
  surgically scrubbed, legit DMV/sp_/perfmon refs kept. LCK_RANGE now routes via ComposeRangeLock.
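
The fact-composed approach and the new object-name carrier can be illustrated with a minimal sketch. It is written in Python for brevity and is not the FactAdvice.cs code: the `Fact` fields, the `growth_mb` metadata key, and the sentence wording are all hypothetical. It only shows the shape of the idea, with numeric metadata and a separate string carrier for the object name.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Fact:
    """Hypothetical stand-in for a collected fact."""
    key: str
    value: float
    # Metadata stays numeric-only, mirroring the "doubles-only" note above.
    metadata: Dict[str, float] = field(default_factory=dict)
    # Separate carrier for the object name, so strings never enter metadata.
    object_name: Optional[str] = None


def compose_object_growth(fact: Fact) -> str:
    """State the measured value and the object instead of naming a tool to run."""
    target = fact.object_name or "An object (name not collected)"
    growth_mb = fact.metadata.get("growth_mb")  # hypothetical metadata key
    if growth_mb is None:
        return f"{target} grew unusually fast."
    return f"{target} grew by {growth_mb:,.0f} MB in the analysis window."


fact = Fact("ANOMALY_OBJECT_GROWTH", 3.2, {"growth_mb": 1536.0}, "dbo.Orders")
print(compose_object_growth(fact))
```

Keeping the name out of the metadata dictionary means existing numeric consumers of the metadata need no changes.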

Dashboard.Tests 574 / Lite.Tests 547 green (+compose tests incl. a real LCK_M_RS_S regression guard).
Audience-dehedge rebuild; a larger prose rewrite is planned separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…dience-dehedge

Advice engine: compose all 56 blocks from facts; strip tool-names/hedging from human prose
…conciliation

My first changelog pass walked only the most recent commits and missed a large middle
band of post-3.0.0 work. Reconciled the full merge list (96 PRs since v3.0.0) against
the changelog and added every previously-missing user-facing entry:

- Advice engine: the sourced/fact-composed advice rebuild (#1244) + the correctness
  cluster (#1185 PLE, #1187 MAXDOP topology code-fix, #1192, #1194 five wrong claims,
  #1196-#1198/#1203 composer value-stating)
- In-app plan navigation on every query surface (#1184)
- Dashboard Queries/tab-load responsiveness (#1181/#1182/#1190)
- Active Queries refresh-on-view (#1183); themed resolved/cleared toasts (#1186)
- Desktop single-instance upgrade handoff (#1148)
- Lite interactive UI-thread offload (#1193/#1202); View Plan no-op fixes (#1181/#1190)
- FinOps Database Sizes init-order race (#1179)
- Failed-job alert dedup + restart-replay (#1157/#1173)
- Dashboard anomaly/baseline drift vs Lite (#1155)

[3.1.0] is now 8 Added / 11 Changed / 17 Fixed; all 46 issue/PR refs link-resolve.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The standalone InstallerGui project was retired in 2.9.0 and its directory deleted
from the repo; two stragglers still implied it exists. Drop the "GUI Installer" item
from the PR-template component checklist, and reword 99_installer_troubleshooting.sql's
header (it is a 99_ script, excluded from install, so no functional change). CHANGELOG
mentions are historical records of the retirement and are left intact.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…grid

The empty-state lane (most often Blocking/Deadlocking on a healthy server)
rendered as a dead black box: HideGrid() plus an EmptyTickGenerator on both
axes, and a "No Data" label pinned at (0,0) that SyncXAxes immediately shoved
off-screen when it overrode the X limits to the real time range.

Replace that with a live, gridded lane that matches the populated lanes: keep
the grid, set a 0-1 Y axis with normal numeric ticks, and use
DateTimeTicksBottomDateChange so the vertical gridlines align with the other
lanes (time labels still only on the bottom File I/O lane). Drop the
never-visible "No Data" text. ShowEmpty becomes an instance method in both apps
so it can reference FileIoChart.

Mirrored in Lite and Dashboard (sync-paired control). Both build clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mpty-grid

Overview lanes: render empty Blocking/Deadlocking lane as a live 0-1 grid
… 3.1.0 date 2026-06-28

#1245 landed on dev after the initial 3.1.0 changelog reconciliation, so it was
missing from the release notes. The empty-state Blocking/Deadlocking Overview
lane now renders as a live 0-1 grid matching the populated lanes instead of a
dead black box (both apps). Added as a [3.1.0] Fixed entry with its reference
link, and bumped the [3.1.0] date to today (finalized at the actual cut).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
….0 prep guard)

The embedded ScriptProvider (the path the Dashboard uses, distinct from the CLI's
filesystem provider) had zero test coverage for an actual release upgrade. Two new
tests, both reading the real resources compiled into Installer.Core.dll:

- EmbeddedUpgrades_3_0_0_To_3_1_0_DiscoverableWithManifestAndScript pins this
  release's 3.0.0->3.1.0 hop through GetApplicableUpgrades (the method #772 broke):
  folder discovered, not skipped for a missing upgrade.txt, manifest lists the
  script, script carries the USE PerformanceMonitor header + the real
  ALTER...blocking_ecid/monitor_loop columns. Guards the prep blocker fixed in d2feb63's branch.

- EmbeddedUpgrades_AllDiscoveredFoldersHaveReadableManifestAndScripts is a
  self-maintaining guard: every embedded upgrade folder must expose a readable
  manifest whose listed scripts all exist and are non-empty.
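
The self-maintaining guard can be sketched as follows. This is an illustrative Python version, not the actual Installer.Tests code: it assumes each upgrade folder is represented as a dictionary of file name to contents, and that upgrade.txt lists one script name per line. Both are assumptions made for the example.

```python
from typing import Dict, List

MANIFEST_NAME = "upgrade.txt"


def find_manifest_problems(folders: Dict[str, Dict[str, str]]) -> List[str]:
    """Return one message per problem found; an empty list means every folder is usable."""
    problems: List[str] = []
    for folder, files in sorted(folders.items()):
        manifest = files.get(MANIFEST_NAME)
        if manifest is None:
            # This is the case that would otherwise be skipped silently.
            problems.append(f"{folder}: missing {MANIFEST_NAME}")
            continue
        # Assumed manifest format: one script name per line, blank lines ignored.
        scripts = [line.strip() for line in manifest.splitlines() if line.strip()]
        if not scripts:
            problems.append(f"{folder}: manifest lists no scripts")
        for script in scripts:
            body = files.get(script)
            if body is None:
                problems.append(f"{folder}: {script} listed but not found")
            elif not body.strip():
                problems.append(f"{folder}: {script} is empty")
    return problems


folders = {
    "3.0.0-to-3.1.0": {
        "upgrade.txt": "01_alter.sql\n",
        "01_alter.sql": "USE PerformanceMonitor;",
    },
}
print(find_manifest_problems(folders))
```

Because the check iterates over whatever folders exist, a future upgrade folder added without a manifest fails the guard without anyone editing the test.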

Full Installer.Tests suite: 82 passed, 0 failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The app tracked collection HEALTH (did data arrive?) but never collection
STATE, so disabling the SQL Agent collector jobs was silent: the Dashboard
kept looking healthy (actually calmer, since the live cards read zero rows
from collect.* tables) until a collector aged into STALE on the Collection
Health tab after 24 hours.

Add an app-side check that survives the collector being off, because it's the
collector that fills every other table:
- Live msdb read of msdb.dbo.sysjobs.enabled for PerformanceMonitor% jobs
  (immediate, specific cause), skipped on Azure SQL DB and degrading gracefully
  on restricted msdb (RDS / no SQLAgentReaderRole) -- never reports "disabled"
  when it simply could not look.
- A config.collection_log freshness backstop (no run in 30+ min) that also
  catches the Agent service being stopped or collectors silently erroring.

Surfaced as a proactive "Collection Stopped" tray/email alert (new
NotifyOnCollectionStopped pref, default on, mirroring the Capture Down pattern:
cooldown, mute, and a "Collection Resumed" clear) plus a banner on the
Collection Health tab so it shows immediately, not only after the 24h STALE lag.

Decision logic extracted to DatabaseService.DecideCollectionStopped and unit
tested (9 cases). Dashboard-only -- Lite has no Agent jobs.
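
A rough sketch of that kind of decision function, in Python rather than the project's C#. The signature, the return shape, and two rules are assumptions: that any disabled job counts as stopped, and that an unreadable log yields no verdict. None of this is taken from DatabaseService.DecideCollectionStopped.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple

AZURE_SQL_DB = 5  # SERVERPROPERTY('EngineEdition') value for Azure SQL Database
FRESHNESS_LIMIT = timedelta(minutes=30)


def decide_collection_stopped(
    engine_edition: int,
    job_enabled_flags: Optional[Sequence[bool]],  # None = msdb could not be read
    last_run: Optional[datetime],                 # None = log could not be read
    now: datetime,
) -> Tuple[bool, str]:
    """Return (stopped, reason). Never reports 'disabled' when the jobs were not visible."""
    # 1. Specific cause: a collector job is disabled (skipped where msdb is unavailable).
    if engine_edition != AZURE_SQL_DB and job_enabled_flags:
        if not all(job_enabled_flags):  # assumption: any disabled job counts
            return True, "collector job disabled"
    # 2. Backstop: nothing has been logged recently, whatever the cause.
    if last_run is not None and now - last_run >= FRESHNESS_LIMIT:
        return True, "no collection run in 30+ minutes"
    return False, "collecting"


now = datetime(2026, 6, 28, 12, 0)
print(decide_collection_stopped(4, [True, False], now - timedelta(minutes=1), now))
```

Keeping the function free of I/O (the job flags and timestamps are passed in) is what makes it straightforward to unit test case by case.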

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ped-detection

Add "Collection Stopped" alert: warn when collector Agent jobs are disabled
#1246 (merged to dev as accd03d) adds a Full Dashboard "Collection Stopped"
alert: an app-side check (live msdb.dbo.sysjobs.enabled for PerformanceMonitor%
jobs + a config.collection_log 30-min freshness backstop) that survives the
collector being off, surfaced as a tray/email alert (new NotifyOnCollectionStopped
pref, default on, with a "Collection Resumed" clear) plus a Collection Health
tab banner. Added as a [3.1.0] Added entry + ref link, and a new row in the
README Alert Types table.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n real engine edition

Follow-up to #1246. GetCollectionStatusAsync (the Collection Health tab entry)
passed engineEdition 0, so on Azure SQL DB it issued a doomed msdb.dbo.sysjobs
query and relied on the catch to degrade -- diverging from the alert path and the
CPU/failed-job checks, which all skip cleanly on EngineEdition 5. It now resolves
the real edition via SERVERPROPERTY('EngineEdition') (the same idiom ServerManager
and FinOps.Inventory use) and passes it through, so the tab gates Azure the same
clean way. The msdb try/catch stays as a backstop, and a failed edition read
returns 0 (the inner check still runs), so it never disables the check.
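
The fallback behaviour described here can be sketched briefly. This is an illustrative Python version with hypothetical names (`resolve_engine_edition`, `should_query_msdb`, and the injected `read_edition` callable), not the Dashboard's code.

```python
from typing import Callable

AZURE_SQL_DB = 5  # SERVERPROPERTY('EngineEdition') value for Azure SQL Database
UNKNOWN_EDITION = 0


def resolve_engine_edition(read_edition: Callable[[], int]) -> int:
    """Read the engine edition; on any failure fall back to 0 (unknown)."""
    try:
        return int(read_edition())
    except Exception:
        # A failed read must not disable the downstream check.
        return UNKNOWN_EDITION


def should_query_msdb(engine_edition: int) -> bool:
    """Only Azure SQL DB is skipped; unknown (0) still runs the msdb check."""
    return engine_edition != AZURE_SQL_DB


def failing_read() -> int:
    raise TimeoutError("edition query timed out")


print(should_query_msdb(resolve_engine_edition(failing_read)))
```

With an unknown edition the msdb query still runs, and its own try/catch remains the backstop if the query fails.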

No functional change for supported editions -- the Full Dashboard already rejects
EngineEdition 5 at connection -- so this is a consistency/defense-in-depth fix that
removes the hardcoded-0 smell and keeps the tab in step with the alert path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ped-edition-gate

Dashboard: gate Collection Health tab collector-stopped check on real engine edition
…E for dedup collectors)

Dedup-snapshot collectors (server_properties + the config snapshots) log SKIPPED when
nothing changed -- a successful no-op. But the per-collector health computed
last_success_time from SUCCESS only, so a collector that's correctly skipping showed
STALE, then NEVER_RUN once its last real SUCCESS aged out of log retention. This is the
same "SKIPPED is fine" semantics #1246's freshness backstop already uses.

Dashboard (report.collection_health view, install/47): SKIPPED now counts toward
last_success_time and total_runs, and is included in the recent_failures window so a
skip-only collector doesn't fall through to the consecutive_failures FAILING branch.
Validated live: server_properties on SQL2016/2017/2025 flips STALE/NEVER_RUN -> HEALTHY.

Lite (LocalDataService.CollectionHealth): SKIPPED counts toward last_success_time too, so
a version-gated/dedup collector not on the OnLoadCollectors exemption list no longer
false-STALEs. Build clean, 618 Lite.Tests pass.

Parity fix, both apps.
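
The changed semantics can be shown with a small Python sketch. It is an assumption-laden illustration, not the report.collection_health view or LocalDataService code: the run tuples are hypothetical, the FAILING branch is omitted, and the 24-hour STALE threshold is borrowed from the Collection Health behaviour described for #1246.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

# SKIPPED is a successful no-op, so it counts toward the last healthy run.
HEALTHY_STATUSES = {"SUCCESS", "SKIPPED"}
STALE_AFTER = timedelta(hours=24)  # assumed threshold


def collector_health(runs: Iterable[Tuple[datetime, str]], now: datetime) -> str:
    """Classify one collector from its (time, status) log rows."""
    healthy_times = [when for when, status in runs if status in HEALTHY_STATUSES]
    if not healthy_times:
        return "NEVER_RUN"
    last_healthy = max(healthy_times)
    return "STALE" if now - last_healthy > STALE_AFTER else "HEALTHY"


now = datetime(2026, 6, 28, 12, 0)
runs = [
    (now - timedelta(days=3), "SUCCESS"),   # last real change
    (now - timedelta(hours=1), "SKIPPED"),  # nothing changed since
]
print(collector_health(runs, now))
```

If only SUCCESS counted, the same log would classify as STALE, which is the false positive this change removes.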

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…th-skipped-status

Collection Health: count SKIPPED as a healthy run (no false STALE for dedup collectors)
#1248 (merged to dev) makes per-collector Collection Health count SKIPPED as a
healthy run, so dedup / skip-if-unchanged collectors (server_properties + the
config snapshots) stop showing false STALE/NEVER_RUN in both apps. Added as a
[3.1.0] Fixed entry + reference link.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@erikdarlingdata erikdarlingdata merged commit d1e3eed into main Jun 29, 2026
8 checks passed