Enhance scale-down latency tracking to report specific reasons for node un-neediness in metrics.#9721
Enhance scale-down latency tracking to report specific reasons for node un-neediness in metrics.#9721ttetyanka wants to merge 3 commits into
Conversation
…de un-neediness in metrics.
|
This issue is currently awaiting triage. If SIG Autoscaling contributors determines this is a relevant issue, they will accept it by applying the The DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Hi @ttetyanka. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Tip We noticed you've done this a few times! Consider joining the org to skip this step and gain Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: ttetyanka The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/ok-to-test |
|
the code is making sense to me, i think one question is the cardinality of these metrics. have you done any research about the changes to this metric will affect storage for metrics? (i would assume that this will generate more metrics than previously, and more time-series for each unique label) |
| for nodeName, info := range t.unneededNodes { | ||
| if _, exists := currentSet[nodeName]; !exists { | ||
| t.recordAndCleanup(nodeName, false) | ||
| t.removedUnneededNodes[nodeName] = info |
There was a problem hiding this comment.
Is there a guarantee that Process() is called exactly once after every UpdateScaleDownCandidates()?
I am asking because If UpdateScaleDownCandidates is called twice before Process, the first set of removed nodes will be lost because the map is overwritten and if Process() is not called immediately or if it fails, these nodes might stay in removedUnneededNodes indefinitely or be overwritten.
There was a problem hiding this comment.
Under normal operation, they are guaranteed to execute sequentially within a single iteration of RunOnce:
Registration/Execution: UpdateScaleDownCandidates runs during the main execution of RunOnce (via ScaleDownCandidatesNotifier.Update), while Process is run last inside the deferred block of RunOnce.
Early exits: If RunOnce returns early before the defer is registered, neither method runs. If it exits after registration, Process is guaranteed to run.
I have updated UpdateScaleDownCandidates to re-initialize t.unneededNodesToReport = make(map[string]unneededNodeState) at the very beginning of the call. This ensures that even if Process is somehow skipped, the map is cleared at the start of the next cycle and we do not accumulate or report stale transitioned states.
| duration := time.Since(info.unneededSince) | ||
| latency := duration - info.removalThreshold | ||
|
|
||
| if latency > 0 { |
There was a problem hiding this comment.
removalThreshold is the "cooling-off" period or the grace period that a node must wait after becoming idle before it is actually allowed to be deleted.
Are we recording the metric for both cases of isRemoved only when latency > 0, I mean if a node became needed before removalThreshold would we report the latency for this one? and do we expect a node to not be deleted after removalThreshold?
There was a problem hiding this comment.
The node_removal_latency_seconds metric is designed to measure latency from when an unneeded node is eligible for scale down until it is removed or becomes needed again. A node only becomes eligible for scale down after it has remained unneeded for the removalThreshold duration (e.g. 2 or 10 minutes according to the current configuration). If a node becomes needed again before the removalThreshold is reached (i.e., duration < removalThreshold, meaning latency <= 0), the node was never eligible for scale down. Therefore, we do not record a metric for it.
And yes, a node might not be deleted immediately after removalThreshold has passed. This happens if scale-down is in a cooldown period, if the node group has reached its minimum size, if the node is blocked from deletion by a pod or resource limits etc. Or it can be that the CA is too slow to process the unneeded nodes, thats actually what we want to spot.
@elmiko Time-Series footprint: Since this is a histogram (20 buckets + sum/count = 22 series per label combo), the total series count will increase from 44 series (previously deleted=true|false) to a maximum of ~480 series (assuming all 20 unremovable reasons are hit in the cluster). This should be a negligible increase for Prometheus or any other OSS metrics backend. |
… state on each iteration
|
@ttetyanka: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
ack, thank you @ttetyanka , i appreciate the explanation of the details and i agree it doesn't seem egregious. |
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR enhances scale-down latency tracking by introducing detailed reasons to the
node_removal_latency_secondsPrometheus metric when a node stops being tracked as unneeded.Previously,
node_removal_latency_secondsonly recorded a booleandeletedlabel. This change adds areasonlabel to capture the exact reason a node's un-neediness was resolved (e.g., whether it was blocked by a pod, reached minimum group size, or simply became needed again).Key changes:
reasonlabel to thenode_removal_latency_secondsPrometheus metric.String()methods forUnremovableReasonto provide human-readable label values.NodeLatencyTrackerto determine and propagate the specific reason when a node stops being tracked as unneeded:BlockedByPod,NoPlaceToMovePods).needed_again.node_latency_tracker_test.goandmetrics_test.goto verify that label assignments and metrics tracking function correctly under all scenario flows.Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
node_removal_latency_secondsmetric signature is updated to acceptdeletedandreasonlabels.UnremovableReasontype now implementsfmt.Stringerto provide readable labels for metrics.Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: