Add dynamic watching and pluggable RBAC for scalable objects in capacity buffers#9756
Add dynamic watching and pluggable RBAC for scalable objects in capacity buffers#9756abdelrahman882 wants to merge 1 commit into
Conversation
|
This issue is currently awaiting triage. If SIG Autoscaling contributors determines this is a relevant issue, they will accept it by applying the The DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: abdelrahman882 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
4e3fd01 to
5ff4865
Compare
|
/assign @norbertcyran |
|
@abdelrahman882: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| }) | ||
|
|
||
| // Start the informer exactly once | ||
| go informer.Run(c.stopCh) |
There was a problem hiding this comment.
it's perfectly safe to call c.dynamicInformerFactory.Start(c.stopCh) here. It's non-blocking (it's the factory that starts and manages the goroutine), and it starts the informers that haven't been yet started, so it doesn't matter if the factory has already been started earlier
| } | ||
|
|
||
| // UpdateRBAC logs a guide for the user on how to manually provide necessary RBAC permissions. | ||
| func (d *DefaultRBACUpdater) UpdateRBAC(mapping *meta.RESTMapping) error { |
There was a problem hiding this comment.
so UpdateRBAC just builds a message? I think it may be misleading, why do we need a separate interface for that? It looks like something that could be implemented in a simple, unexported function
|
|
||
| roleName := fmt.Sprintf("cluster-autoscaler-dynamic-%s-%s", | ||
| strings.ReplaceAll(strings.ToLower(gvr.Group), ".", "-"), | ||
| strings.ReplaceAll(strings.ToLower(gvr.Resource), ".", "-")) |
There was a problem hiding this comment.
that assumes that such a role exists, it has a correct binding, and it's correctly configured to aggregate roles with the DefaultAggregationLabel. I don't think it's a safe assumption, as there are many ways of deploying CA and its RBAC. We could add those to the helm chart, but that still would assume that CA is deployed with helm, which is also not a safe assumption.
In general, I don't think such a detailed guide (not sure if it will even get formatted correctly) should appear in the logs. The user who created the buffer may not even have access to the CA logs.
I'd suggest just surfacing a standard permission error via a condition, it should be enough information for the cluster admin to configure appropriate permissions. We can add some more detailed documentation to README/FAQ or scalableRef description in CRD
| Eventually(func(g Gomega) { | ||
| b, err := buffersClient.AutoscalingV1beta1().CapacityBuffers(namespace).Get(ctx, "b1", metav1.GetOptions{}) | ||
| g.Expect(err).NotTo(HaveOccurred()) | ||
| g.Expect(b.Status.Replicas).To(Equal(new(int32(2)))) |
There was a problem hiding this comment.
why? new can be used to create pointers for primitive objects since go 1.26
| _ = k8sClient.CoreV1().PodTemplates(namespace).DeleteCollection(ctx, metav1.DeleteOptions{}, metav1.ListOptions{}) | ||
| }) | ||
|
|
||
| It("should establish a watch on a custom CRD", func() { |
There was a problem hiding this comment.
that suggest it's a success scenario, but as far as I see, it's a failure condition. Could you make it clear that it's a failure scenario and add a test case for a success scenario?
| return false, nil // retry | ||
| } | ||
|
|
||
| klog.V(2).Infof("Successfully established dynamic watch for GVR: %v", gvr) |
There was a problem hiding this comment.
do we ever clear EventDrivenReconciliation=False condition?
| } | ||
|
|
||
| if len(buffersToUpdate) > 0 { | ||
| _, errs := c.updater.Update(buffersToUpdate) |
There was a problem hiding this comment.
q: does it handle conflicts properly? I think either this might cause conflict in the main reconciler goroutine or the other way around
| found := false | ||
| for i, cond := range buffer.Status.Conditions { | ||
| if cond.Type == EventDrivenReconciliationCondition { | ||
| buffer.Status.Conditions[i] = newCondition | ||
| found = true | ||
| break | ||
| } | ||
| } | ||
| if !found { | ||
| buffer.Status.Conditions = append(buffer.Status.Conditions, newCondition) | ||
| } |
There was a problem hiding this comment.
nit: there's a meta.SetStatusCondition or something like that that does the same thing
|
|
||
| klog.V(4).Infof("Establishing dynamic watch for GVK: %v", gvk) | ||
| // Start watch establishment with RBAC guidance and retry mechanism | ||
| go c.establishWatchWithRetry(mapping) |
There was a problem hiding this comment.
I don't like much the fact that we do it in a goroutine, as we lose some of the benefits of the current setup, e.g. built-in workqueue backoff and we introduce a risk of conflicts between this goroutine and the main controller's goroutine. Though I understand that we're pretty limited as we have to reconcile the entire namespace, so doing it all synchronously could negatively affect the performance. Ideally we'd have a separate reconciler for allocating the quotas, and this reconciler would handle only a single buffer, but we do not have an intermediate status field that distinguish desired replicas and admitted replicas with respect to the quotas
| // Start the informer exactly once | ||
| go informer.Run(c.stopCh) | ||
|
|
||
| backoff := wait.Backoff{ |
There was a problem hiding this comment.
Isn't this backoff too aggressive? I think it blocks for ~a minute in case when there's no RBAC configured. I think in such case we should fail fast and inform the user than there are missing permissions
What type of PR is this?
/kind feature
What this PR does / why we need it:
The CapacityBuffer controller faces challenges in defining and provisioning capacity when referencing scalable resources that are not part of the core Kubernetes objects, such as various Custom Resource Definitions (CRDs). Since the controller cannot know every potential GroupVersionKind (GVK) in advance, it struggles with hard-coded support and static RBAC permissions. Furthermore, polling approaches introduce latency and API server load, while watching every possible resource type is not scalable, creating a need for a solution that can reactively and efficiently observe target object states for accurate buffer calculation.
As a result of this, Capacity buffers controller latency for processing buffers when a scale sub resource changes can go up to 5 mins which is the informer sync time.
The implemented fix utilizes a Dynamic Watching approach that enables the controller to resolve unknown GVKs at runtime through a RESTMapper and initialize Dynamic Informers on-the-fly. This system is supported by a dynamic RBAC updater suggests logged for the user to create to ensure the controller has the necessary permissions to watch specific resources only when they are actively referenced in the cluster. This reactive architecture eliminates the dependency on slow polling intervals by triggering near-instantaneous re-reconciliation for buffers namespace whenever a state change is detected in a target workload.
In case the user has not applied the cluster role to give CA permissions to watch the referenced CRD, buffers controller will log the needed role and will set a condition EventDrivenReconciliation to false with descriptive message clarifying that for this buffer the reconciliation will be periodic and not event driven and suggests creating the role we logged before.
Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: