Add observed status fields to the CQ proposal#9759
Conversation
|
This issue is currently awaiting triage. If SIG Autoscaling contributors determines this is a relevant issue, they will accept it by applying the The DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@DerekFrank @jmdeal could you please take a look if this API change makes sense from the Karpenter perspective? Thanks! |
|
/assign x13n |
|
Can you clarify the use case here? Why wouldn't invalid selectors be validated on admission? Keeping last valid spec as a part of status seems a bit non-standard, I'd like to understand why we need this. |
|
@x13n thanks for taking a look!
Other than basic label selector validation, cloud providers might configure some custom validation rules, for example blocklisting certain labels. In some managed k8s services, cloud providers offer pricing per resources requested by pod. One example being GKE Autopilot, another one being EKS Fargate (please correct me if I'm wrong). Using a CapacityQuota with a certain label selector could force the autoscaler to provision a machine that is larger than required for unschedulable pods
They could. Though adding a validation layer on the controller side offers stronger reliability: even if webhook configuration has been removed, controller will still mark an invalid quota as such and will prevent it from being enforced.
It's also nothing out of ordinary though. For example, |
|
Should this be closed in favor of #9788? |
|
@x13n to be honest I'd still add the observed fields for observability purposes and preventing races. Let me rephrase the proposal a bit |
99401f3 to
f69ba70
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: norbertcyran The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| memory: 128Gi | ||
| nodes: 50 | ||
| conditions: | ||
| - type: Valid |
There was a problem hiding this comment.
Since this is meant to be a shared resource across clusters, how do we think about this validation status condition in multi-tentant clusters? For example, I know of some users running both Karpenter and Cluster Autoscaler in the same cluster (or multiple versions of Karpenter). Having a single validation condition means that multiple autoscalers with conflicting validation policies may fight over this value.
There was a problem hiding this comment.
In addition to how we represent this via conditions, what do we think the semantics for validation in multi-tenant clusters should be? Should an autoscaler respect a limit so long as its own validation succeeds, or must all validations succeed for it to be respected by any autoscaler?
There was a problem hiding this comment.
Tbh I don't know much about multi-autoscaler setups. How do you route unschedulable pods to the corresponding autoscaler to prevent multiple scale ups?
At the simplest, we could add a label selector option to the controller, so it handles quotas only from a single autoscaler. Or an option to disable the reconciliation in one of the autoscalers, so only one takes ownership of the quotas. Though in such case the chosen owner would have to implement both validation policies
There was a problem hiding this comment.
Partitioning pods across autoscalers is currently done with taints and label selectors. That being said, I don't think that's the right solution for this API. If you want to set a restriction on something like the number of a certain instance type, you probably don't want to pre-partition that across autoscalers (e.g. if you allow 10 c5.larges you wouldn't want to say CAS gets 5 and Karpenter gets 5).
I had a couple of ideas of how this could be addressed, depending on the semantics that we want for multi-autoscaler. The cleanest solution from an API standpoint would be for each autoscaler to &= the validation condition for a given spec generation rather than setting it directly. If it fails validation for any autoscaler it would be set to false and subsequent updates could not set it to true until the generation is bumped. There would be a race where, if the first autoscaler marks it as valid and the second does not it could still be temporarily respected by the autoscaler that validated it, but I think that's fine. We could get the same semantics without the race using validating admission policies.
The other solution would be to have a validated condition per autoscaler (e.g. ValidatedCAS, ValidatedAWSKarpenter, ValidatedEKSKarpenter). Each autoscaler would only respect its own validation condition. It would be the responsibility of autoscaler maintainers to ensure keys don't conflict.
It will be used by the backend to enforce the quota
One higher-level question though - if this is being used to enforce the quota rather than an in-memory store, what's the advantage of this approach over VAPs? Users could edit the status directly in the same way that they could remove a VAP or webhook configuration. Would the autoscaler always revalidate the observedSpec before applying the limit? Using VAPs would remove these concerns since it ensures all autoscalers validations have passed before admission.
There was a problem hiding this comment.
++ to the notion of users running multiple autoscalers, we are seeing this as well with users wanting to use karpenter and cluster autoscaler in some topologies.
There was a problem hiding this comment.
The cleanest solution from an API standpoint would be for each autoscaler to &= the validation condition for a given spec generation rather than setting it directly
It breaks idempotency of the reconciler though. Also as you mentioned, it introduces a race. I think using VAP/webhook are better options for this specific use case
Using condition per autoscaler pollutes the API IMO, though it would work, I think. Autoscalers could accept a config param like --capacity-quota-valid-condition=cloud.google.com/cq-valid.
However, to keep it simple, I think I'd lean towards leaving the ownership of validation to a single, chosen autoscaler. In CAS, we implemented a Validator interface for custom CQ validation options:
Such a chosen autoscaler would need to implement all validations. All autoscalers would be required to enforce the quota, but only one would be validating it.
if this is being used to enforce the quota rather than an in-memory store, what's the advantage of this approach over VAPs?
I'd say it's a different use case, right now I'd keep the observed fields only to prevent races between enforcement and validation - i.e. the autoscaler would not use a new spec until the reconciler validates it. Their goal would be no longer to protect from manual status updates, since this can't really be fully protected without that in-memory store.
I think we should avoid changing the behavior of the quota depending of the autoscaler that processes it. IMO it can be slightly different from cluster to cluster, but within one cluster, that behavior should remain the same, despite the autoscaler looking at the quota. Having that said, we could drop the valid condition and observed fields and rely on VAPs/webhooks, though that leaves more space for races and bypassing the validation by the cluster admin
There was a problem hiding this comment.
I think for the complex setups with multiple node autoscalers and additional constraints on what CapacityQuota objects are allowed, this shouldn't be some consensus between multiple node autoscaler instances, but rather a single policy. The simplest approach would be to implement this on admission. In such case only the intersection of valid configurations would be allowed. Alternatively, cluster admin could decide on their own how to set up the admission. In my mental model, Valid means the object itself is well formed and any additional constraints are only a policy thing, to further restrict some configurations in a specific environment. I believe this is at least the case with Cluster Autoscaler - as far as I can tell, having a correct label selector is pretty much everything that is required. @jmdeal do you foresee any technical limitations that would prevent some versions of Karpenter from correctly handling some CapacityQuota configurations? Or is it only about policy?
What type of PR is this?
/kind documentation
What this PR does / why we need it:
Added observed* status fields to CapacityQuota proposal.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: