Skip to content

Add observed status fields to the CQ proposal#9759

Open
norbertcyran wants to merge 1 commit into
kubernetes:masterfrom
norbertcyran:cq-observed-limits
Open

Add observed status fields to the CQ proposal#9759
norbertcyran wants to merge 1 commit into
kubernetes:masterfrom
norbertcyran:cq-observed-limits

Conversation

@norbertcyran

Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind documentation

What this PR does / why we need it:

Added observed* status fields to CapacityQuota proposal.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/documentation Categorizes issue or PR as related to documentation. labels Jun 5, 2026
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

This issue is currently awaiting triage.

If SIG Autoscaling contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jun 5, 2026
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area Indicates that a PR should not merge because it lacks an area label. area/cluster-autoscaler Issues or PRs related to the Cluster Autoscaler component size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed do-not-merge/needs-area Indicates that a PR should not merge because it lacks an area label. labels Jun 5, 2026
@norbertcyran

Copy link
Copy Markdown
Contributor Author

@DerekFrank @jmdeal could you please take a look if this API change makes sense from the Karpenter perspective? Thanks!

@norbertcyran

Copy link
Copy Markdown
Contributor Author

/assign x13n

@x13n

x13n commented Jun 8, 2026

Copy link
Copy Markdown
Member

Can you clarify the use case here? Why wouldn't invalid selectors be validated on admission? Keeping last valid spec as a part of status seems a bit non-standard, I'd like to understand why we need this.

@norbertcyran

norbertcyran commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

@x13n thanks for taking a look!

Can you clarify the use case here?

Other than basic label selector validation, cloud providers might configure some custom validation rules, for example blocklisting certain labels. In some managed k8s services, cloud providers offer pricing per resources requested by pod. One example being GKE Autopilot, another one being EKS Fargate (please correct me if I'm wrong). Using a CapacityQuota with a certain label selector could force the autoscaler to provision a machine that is larger than required for unschedulable pods

Why wouldn't invalid selectors be validated on admission?

They could. Though adding a validation layer on the controller side offers stronger reliability: even if webhook configuration has been removed, controller will still mark an invalid quota as such and will prevent it from being enforced.

Keeping last valid spec as a part of status seems a bit non-standard, I'd like to understand why we need this.

It's also nothing out of ordinary though. For example, ResourceQuota has status.hard mirroring spec.hard, CRD has status.acceptedNames, which by the way has a similar use case, since the controller validates spec.names and populates the status field if spec.names is valid. In CRDs is quite common to see spec.field and status.field/status.observedField pairs

@x13n

x13n commented Jun 11, 2026

Copy link
Copy Markdown
Member

Should this be closed in favor of #9788?

@norbertcyran

Copy link
Copy Markdown
Contributor Author

@x13n to be honest I'd still add the observed fields for observability purposes and preventing races. Let me rephrase the proposal a bit

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: norbertcyran
Once this PR has been reviewed and has the lgtm label, please ask for approval from x13n. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

memory: 128Gi
nodes: 50
conditions:
- type: Valid

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is meant to be a shared resource across clusters, how do we think about this validation status condition in multi-tentant clusters? For example, I know of some users running both Karpenter and Cluster Autoscaler in the same cluster (or multiple versions of Karpenter). Having a single validation condition means that multiple autoscalers with conflicting validation policies may fight over this value.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to how we represent this via conditions, what do we think the semantics for validation in multi-tenant clusters should be? Should an autoscaler respect a limit so long as its own validation succeeds, or must all validations succeed for it to be respected by any autoscaler?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tbh I don't know much about multi-autoscaler setups. How do you route unschedulable pods to the corresponding autoscaler to prevent multiple scale ups?

At the simplest, we could add a label selector option to the controller, so it handles quotas only from a single autoscaler. Or an option to disable the reconciliation in one of the autoscalers, so only one takes ownership of the quotas. Though in such case the chosen owner would have to implement both validation policies

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partitioning pods across autoscalers is currently done with taints and label selectors. That being said, I don't think that's the right solution for this API. If you want to set a restriction on something like the number of a certain instance type, you probably don't want to pre-partition that across autoscalers (e.g. if you allow 10 c5.larges you wouldn't want to say CAS gets 5 and Karpenter gets 5).

I had a couple of ideas of how this could be addressed, depending on the semantics that we want for multi-autoscaler. The cleanest solution from an API standpoint would be for each autoscaler to &= the validation condition for a given spec generation rather than setting it directly. If it fails validation for any autoscaler it would be set to false and subsequent updates could not set it to true until the generation is bumped. There would be a race where, if the first autoscaler marks it as valid and the second does not it could still be temporarily respected by the autoscaler that validated it, but I think that's fine. We could get the same semantics without the race using validating admission policies.

The other solution would be to have a validated condition per autoscaler (e.g. ValidatedCAS, ValidatedAWSKarpenter, ValidatedEKSKarpenter). Each autoscaler would only respect its own validation condition. It would be the responsibility of autoscaler maintainers to ensure keys don't conflict.

It will be used by the backend to enforce the quota

One higher-level question though - if this is being used to enforce the quota rather than an in-memory store, what's the advantage of this approach over VAPs? Users could edit the status directly in the same way that they could remove a VAP or webhook configuration. Would the autoscaler always revalidate the observedSpec before applying the limit? Using VAPs would remove these concerns since it ensures all autoscalers validations have passed before admission.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ to the notion of users running multiple autoscalers, we are seeing this as well with users wanting to use karpenter and cluster autoscaler in some topologies.

@norbertcyran norbertcyran Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cleanest solution from an API standpoint would be for each autoscaler to &= the validation condition for a given spec generation rather than setting it directly

It breaks idempotency of the reconciler though. Also as you mentioned, it introduces a race. I think using VAP/webhook are better options for this specific use case

Using condition per autoscaler pollutes the API IMO, though it would work, I think. Autoscalers could accept a config param like  --capacity-quota-valid-condition=cloud.google.com/cq-valid.

However, to keep it simple, I think I'd lean towards leaving the ownership of validation to a single, chosen autoscaler. In CAS, we implemented a Validator interface for custom CQ validation options:


Such a chosen autoscaler would need to implement all validations. All autoscalers would be required to enforce the quota, but only one would be validating it.

if this is being used to enforce the quota rather than an in-memory store, what's the advantage of this approach over VAPs?

I'd say it's a different use case, right now I'd keep the observed fields only to prevent races between enforcement and validation - i.e. the autoscaler would not use a new spec until the reconciler validates it. Their goal would be no longer to protect from manual status updates, since this can't really be fully protected without that in-memory store.

I think we should avoid changing the behavior of the quota depending of the autoscaler that processes it. IMO it can be slightly different from cluster to cluster, but within one cluster, that behavior should remain the same, despite the autoscaler looking at the quota. Having that said, we could drop the valid condition and observed fields and rely on VAPs/webhooks, though that leaves more space for races and bypassing the validation by the cluster admin

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for the complex setups with multiple node autoscalers and additional constraints on what CapacityQuota objects are allowed, this shouldn't be some consensus between multiple node autoscaler instances, but rather a single policy. The simplest approach would be to implement this on admission. In such case only the intersection of valid configurations would be allowed. Alternatively, cluster admin could decide on their own how to set up the admission. In my mental model, Valid means the object itself is well formed and any additional constraints are only a policy thing, to further restrict some configurations in a specific environment. I believe this is at least the case with Cluster Autoscaler - as far as I can tell, having a correct label selector is pretty much everything that is required. @jmdeal do you foresee any technical limitations that would prevent some versions of Karpenter from correctly handling some CapacityQuota configurations? Or is it only about policy?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cluster-autoscaler Issues or PRs related to the Cluster Autoscaler component cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants