Add dynamic watching and pluggable RBAC for scalable objects in capacity buffers by abdelrahman882 · Pull Request #9756 · kubernetes/autoscaler

abdelrahman882 · 2026-06-05T13:59:00Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

The CapacityBuffer controller faces challenges in defining and provisioning capacity when referencing scalable resources that are not part of the core Kubernetes objects, such as various Custom Resource Definitions (CRDs). Since the controller cannot know every potential GroupVersionKind (GVK) in advance, it struggles with hard-coded support and static RBAC permissions. Furthermore, polling approaches introduce latency and API server load, while watching every possible resource type is not scalable, creating a need for a solution that can reactively and efficiently observe target object states for accurate buffer calculation.

As a result of this, Capacity buffers controller latency for processing buffers when a scale sub resource changes can go up to 5 mins which is the informer sync time.

The implemented fix utilizes a Dynamic Watching approach that enables the controller to resolve unknown GVKs at runtime through a RESTMapper and initialize Dynamic Informers on-the-fly. This system is supported by a dynamic RBAC updater suggests logged for the user to create to ensure the controller has the necessary permissions to watch specific resources only when they are actively referenced in the cluster. This reactive architecture eliminates the dependency on slow polling intervals by triggering near-instantaneous re-reconciliation for buffers namespace whenever a state change is detected in a target workload.

In case the user has not applied the cluster role to give CA permissions to watch the referenced CRD, buffers controller will log the needed role and will set a condition EventDrivenReconciliation to false with descriptive message clarifying that for this buffer the reconciliation will be periodic and not event driven and suggests creating the role we logged before.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Cluster autoscaler RBAC needs to be updated to include the aggregator
The changes were tested on a GKE cluster after giving the correct RBAC permissions to CA component
On testing note that capacity buffers controller doesn't support scale from 0 for scale sub resource as the template can't be defined so we there should be at least one running pod

Does this PR introduce a user-facing change?

Improve Capacity Buffers controller reaction time for scalableRef

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2026-06-05T14:02:21Z

This issue is currently awaiting triage.

If SIG Autoscaling contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2026-06-05T14:02:21Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: abdelrahman882
Once this PR has been reviewed and has the lgtm label, please assign x13n for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

cluster-autoscaler/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

abdelrahman882 · 2026-06-10T02:06:19Z

/assign @norbertcyran

k8s-ci-robot · 2026-06-10T03:34:54Z

@abdelrahman882: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-autoscaling-e2e-gci-gce-ca-test	`5ff4865`	link	false	`/test pull-autoscaling-e2e-gci-gce-ca-test`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

norbertcyran · 2026-06-11T14:10:35Z

+	})
+
+	// Start the informer exactly once
+	go informer.Run(c.stopCh)


it's perfectly safe to call c.dynamicInformerFactory.Start(c.stopCh) here. It's non-blocking (it's the factory that starts and manages the goroutine), and it starts the informers that haven't been yet started, so it doesn't matter if the factory has already been started earlier

norbertcyran · 2026-06-11T14:14:39Z

+}
+
+// UpdateRBAC logs a guide for the user on how to manually provide necessary RBAC permissions.
+func (d *DefaultRBACUpdater) UpdateRBAC(mapping *meta.RESTMapping) error {


so UpdateRBAC just builds a message? I think it may be misleading, why do we need a separate interface for that? It looks like something that could be implemented in a simple, unexported function

norbertcyran · 2026-06-11T14:29:39Z

+
+	roleName := fmt.Sprintf("cluster-autoscaler-dynamic-%s-%s",
+		strings.ReplaceAll(strings.ToLower(gvr.Group), ".", "-"),
+		strings.ReplaceAll(strings.ToLower(gvr.Resource), ".", "-"))


that assumes that such a role exists, it has a correct binding, and it's correctly configured to aggregate roles with the DefaultAggregationLabel. I don't think it's a safe assumption, as there are many ways of deploying CA and its RBAC. We could add those to the helm chart, but that still would assume that CA is deployed with helm, which is also not a safe assumption.

In general, I don't think such a detailed guide (not sure if it will even get formatted correctly) should appear in the logs. The user who created the buffer may not even have access to the CA logs.

I'd suggest just surfacing a standard permission error via a condition, it should be enough information for the cluster admin to configure appropriate permissions. We can add some more detailed documentation to README/FAQ or scalableRef description in CRD

norbertcyran · 2026-06-11T14:33:35Z

 			Eventually(func(g Gomega) {
 				b, err := buffersClient.AutoscalingV1beta1().CapacityBuffers(namespace).Get(ctx, "b1", metav1.GetOptions{})
 				g.Expect(err).NotTo(HaveOccurred())
-				g.Expect(b.Status.Replicas).To(Equal(new(int32(2))))


why? new can be used to create pointers for primitive objects since go 1.26

norbertcyran · 2026-06-11T14:37:02Z

+			_ = k8sClient.CoreV1().PodTemplates(namespace).DeleteCollection(ctx, metav1.DeleteOptions{}, metav1.ListOptions{})
+		})
+
+		It("should establish a watch on a custom CRD", func() {


that suggest it's a success scenario, but as far as I see, it's a failure condition. Could you make it clear that it's a failure scenario and add a test case for a success scenario?

norbertcyran · 2026-06-11T15:33:52Z

+			return false, nil // retry
+		}
+
+		klog.V(2).Infof("Successfully established dynamic watch for GVR: %v", gvr)


do we ever clear EventDrivenReconciliation=False condition?

norbertcyran · 2026-06-11T15:36:35Z

+	}
+
+	if len(buffersToUpdate) > 0 {
+		_, errs := c.updater.Update(buffersToUpdate)


q: does it handle conflicts properly? I think either this might cause conflict in the main reconciler goroutine or the other way around

norbertcyran · 2026-06-11T15:38:12Z

+			found := false
+			for i, cond := range buffer.Status.Conditions {
+				if cond.Type == EventDrivenReconciliationCondition {
+					buffer.Status.Conditions[i] = newCondition
+					found = true
+					break
+				}
+			}
+			if !found {
+				buffer.Status.Conditions = append(buffer.Status.Conditions, newCondition)
+			}


nit: there's a meta.SetStatusCondition or something like that that does the same thing

norbertcyran · 2026-06-11T16:01:24Z

+
+	klog.V(4).Infof("Establishing dynamic watch for GVK: %v", gvk)
+	// Start watch establishment with RBAC guidance and retry mechanism
+	go c.establishWatchWithRetry(mapping)


I don't like much the fact that we do it in a goroutine, as we lose some of the benefits of the current setup, e.g. built-in workqueue backoff and we introduce a risk of conflicts between this goroutine and the main controller's goroutine. Though I understand that we're pretty limited as we have to reconcile the entire namespace, so doing it all synchronously could negatively affect the performance. Ideally we'd have a separate reconciler for allocating the quotas, and this reconciler would handle only a single buffer, but we do not have an intermediate status field that distinguish desired replicas and admitted replicas with respect to the quotas

norbertcyran · 2026-06-11T16:05:30Z

+	// Start the informer exactly once
+	go informer.Run(c.stopCh)
+
+	backoff := wait.Backoff{


Isn't this backoff too aggressive? I think it blocks for ~a minute in case when there's no RBAC configured. I think in such case we should fail fast and inform the user than there are missing permissions

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 5, 2026

k8s-ci-robot added do-not-merge/needs-area Indicates that a PR should not merge because it lacks an area label. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/cluster-autoscaler Issues or PRs related to the Cluster Autoscaler component labels Jun 5, 2026

k8s-ci-robot requested review from aleksandra-malinowska and x13n June 5, 2026 14:02

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed do-not-merge/needs-area Indicates that a PR should not merge because it lacks an area label. labels Jun 5, 2026

Add dynamic watching for scalable objects in capacity buffers

5ff4865

abdelrahman882 force-pushed the cbdynamic branch from 4e3fd01 to 5ff4865 Compare June 10, 2026 01:05

k8s-ci-robot assigned norbertcyran Jun 10, 2026

norbertcyran reviewed Jun 11, 2026

View reviewed changes

Conversation

abdelrahman882 commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Uh oh!

k8s-ci-robot commented Jun 5, 2026

Uh oh!

k8s-ci-robot commented Jun 5, 2026

Uh oh!

abdelrahman882 commented Jun 10, 2026

Uh oh!

k8s-ci-robot commented Jun 10, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

abdelrahman882 commented Jun 5, 2026 •

edited

Loading