Skip to content

Azure: optional ETag concurrency on VMSS capacity updates#9783

Open
mboersma wants to merge 5 commits into
kubernetes:masterfrom
mboersma:azure-vmss-etags
Open

Azure: optional ETag concurrency on VMSS capacity updates#9783
mboersma wants to merge 5 commits into
kubernetes:masterfrom
mboersma:azure-vmss-etags

Conversation

@mboersma

@mboersma mboersma commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind feature
/area provider/azure

What this PR does / why we need it:

Adds optional ETag-based optimistic concurrency to the Azure VMSS capacity update path. When enabled, the cached VMSS ETag is sent as If-Match on capacity PUTs, so a concurrent modification is rejected with HTTP 412 instead of silently overwritten. On 412 the autoscaler refreshes its caches and re-plans on the next loop. On by default (AZURE_ENABLE_VMSS_ETAG / enableVMSSEtag).

Which issue(s) this PR fixes:

Fixes #2581

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Azure: optional ETag optimistic concurrency on VMSS capacity updates

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. area/provider/azure Issues or PRs related to azure provider needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 9, 2026
@k8s-ci-robot k8s-ci-robot added area/cluster-autoscaler Issues or PRs related to the Cluster Autoscaler component cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 9, 2026
Comment thread cluster-autoscaler/cloudprovider/azure/azure_cache.go Outdated

// setScaleSet replaces the cached entry for a single VMSS, e.g. after a fresh GET.
// It copies the map before mutating it so readers that obtained the map via
// getScaleSets() are not exposed to a concurrent map write.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the ratio of reads/writes?

If # reads is orders of magnitude more common than writes, returning the internal map might be a good performance optimization - but if they're comparable, returning a clone via getScaleSets() might be advisable to prevent concurrency issues.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reads dominate because every reconcile loop reads the scale sets, while writes happen only on a GET refresh or right after a capacity update. So returning the live map and paying the copy cost on the rare write (copy-on-write in setScaleSet) is probably the cheaper tradeoff.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

EnableLabelPredictionsOnTemplate bool `json:"enableLabelPredictionsOnTemplate,omitempty" yaml:"enableLabelPredictionsOnTemplate,omitempty"`

// EnableVMSSEtag sends the cached VMSS ETag as If-Match on capacity-changing
// VMSS PUTs so concurrent modifications are rejected with 412 instead of overwritten.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a positive with very few downsides, so I'm curious why the default is off - is this just to ensure people opt-in, or is there more to it? Might this default to on in the future?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I was just playing it safe, since it changes behavior on the capacity-update path. We can make it on by default or just do away with the flag if you think that's preferable.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would lean towards on by default with a flag that allows it to be disabled if there's a problem, with the flag being removed in a later release.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, it's enabled by default now.

Comment thread cluster-autoscaler/cloudprovider/azure/azure_scale_set.go Outdated
Comment thread cluster-autoscaler/cloudprovider/azure/azure_scale_set.go Outdated
// A successful PUT changes the server-side ETag. Adopt the new one returned by
// the operation so a follow-up PUT before the next cache refresh still carries a
// valid If-Match rather than overwriting concurrent changes or hitting a 412.
if scaleSet.manager.config.EnableVMSSEtag && resp.Etag != nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should happen if EnableVMSSEtag is true but the returned Etag is nil?

I suspect this is in the category of "things that should (almost) never happen", but if it does, we currently retain the current ETag. Is this the right behaviour?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Azure should always return an ETag here, so this is the "shouldn't happen" path, but if it does, keeping the previous ETag seems like the safe choice.

The next PUT then sends a stale If-Match and gets a 412, which we already handle by invalidating and re-planning. The alternative (clearing it) would make the next PUT unconditional and risk silently overwriting a concurrent change, which is what this feature is meant to prevent.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

Comment thread cluster-autoscaler/cloudprovider/azure/azure_scale_set_test.go Outdated
Comment thread cluster-autoscaler/cloudprovider/azure/azure_scale_set_test.go
Comment thread cluster-autoscaler/cloudprovider/azure/azure_scale_set_test.go Outdated
Comment thread cluster-autoscaler/cloudprovider/azure/azure_scale_set.go Outdated
@mboersma

Copy link
Copy Markdown
Contributor Author

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jun 10, 2026
@mboersma

Copy link
Copy Markdown
Contributor Author

/test pull-autoscaling-e2e-gci-gce-ca-test

GCE flake as far as I can tell.

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mboersma, theunrepentantgeek
Once this PR has been reviewed and has the lgtm label, please assign rakechill for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mboersma

Copy link
Copy Markdown
Contributor Author

/retest

@mboersma

Copy link
Copy Markdown
Contributor Author

@jackfrancis @tallaxes please take a look when time permits.

@jackfrancis

Copy link
Copy Markdown
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 12, 2026
klog.Errorf("virtualMachineScaleSetsClient.BeginCreateOrUpdate for scale set %q failed: %+v", scaleSet.Name, err)
if isPreconditionFailedError(err) {
klog.V(2).Infof("VMSS %s update rejected by ETag precondition; invalidating cache to re-plan next loop", scaleSet.Name)
scaleSet.invalidateInstanceCache()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for safety if we are in this ETAG precondition failure condition we also need to invalidate the size "cache" as well.

// lastSizeRefresh is the time curSize was last refreshed from vmss.Sku.Capacity.
// Together with sizeRefreshPeriod, it is used to determine if it is time to refresh curSize.

Seems like we want to mark the ScaleSet as such: "as of now it is time to refresh curSize". We can either do something like this literally:

scaleSet.lastSizeRefresh = time.Now().Add(-1 * scaleSet.sizeRefreshPeriod)

Or create a new function to do it (we can't invoke invalidateLastSizeRefreshWithLock because we are in the flow of initCreateOrUpdate which already holds the sizeMutex lock.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot came up w/ this add'l UT that would that an out-of-band VMSS capacity update would slip through (we've invalidated the instance but the VMSS size reflects the pre-invalidated cache result):

func TestScaleSetETagPreconditionFailureInvalidatesTargetSize(t *testing.T) {
	ctrl := gomock.NewController(t)
	defer ctrl.Finish()

	manager := newTestAzureManager(t)
	manager.config.EnableVMSSEtag = true

	vmssName := "vmss-etag-stale"
	orchMode := armcompute.OrchestrationModeUniform

	manager.azureCache.setScaleSet(vmssName, &armcompute.VirtualMachineScaleSet{
		Name: ptr.To(vmssName),
		SKU:  &armcompute.SKU{Capacity: ptr.To[int64](3)},
		Properties: &armcompute.VirtualMachineScaleSetProperties{
			OrchestrationMode: &orchMode,
		},
		Etag: ptr.To(`W/"old"`),
	})

	mockDeleteClient := NewMockVMSSDeleteClient(ctrl)
	mockDeleteClient.EXPECT().
		BeginCreateOrUpdate(gomock.Any(), gomock.Any(), gomock.Any(), gomock.Any(), gomock.Any()).
		Return(nil, &azcore.ResponseError{StatusCode: http.StatusPreconditionFailed})
	manager.azClient.vmssClientForDelete = mockDeleteClient

	scaleSet := newTestScaleSet(manager, vmssName)
	scaleSet.sizeRefreshPeriod = manager.azureCache.refreshInterval

	err := scaleSet.IncreaseSize(1)
	assert.Error(t, err)

	manager.azureCache.setScaleSet(vmssName, &armcompute.VirtualMachineScaleSet{
		Name: ptr.To(vmssName),
		SKU:  &armcompute.SKU{Capacity: ptr.To[int64](5)},
		Properties: &armcompute.VirtualMachineScaleSetProperties{
			OrchestrationMode: &orchMode,
		},
		Etag: ptr.To(`W/"new"`),
	})

	target, err := scaleSet.TargetSize()
	assert.NoError(t, err)
	assert.Equal(t, 5, target)
}

klog.Errorf("virtualMachineScaleSetsClient.BeginCreateOrUpdate for scale set %q failed: %+v", scaleSet.Name, err)
if isPreconditionFailedError(err) {
klog.V(2).Infof("VMSS %s update rejected by ETag precondition; invalidating cache to re-plan next loop", scaleSet.Name)
scaleSet.invalidateInstanceCache()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot came up w/ this add'l UT that would that an out-of-band VMSS capacity update would slip through (we've invalidated the instance but the VMSS size reflects the pre-invalidated cache result):

func TestScaleSetETagPreconditionFailureInvalidatesTargetSize(t *testing.T) {
	ctrl := gomock.NewController(t)
	defer ctrl.Finish()

	manager := newTestAzureManager(t)
	manager.config.EnableVMSSEtag = true

	vmssName := "vmss-etag-stale"
	orchMode := armcompute.OrchestrationModeUniform

	manager.azureCache.setScaleSet(vmssName, &armcompute.VirtualMachineScaleSet{
		Name: ptr.To(vmssName),
		SKU:  &armcompute.SKU{Capacity: ptr.To[int64](3)},
		Properties: &armcompute.VirtualMachineScaleSetProperties{
			OrchestrationMode: &orchMode,
		},
		Etag: ptr.To(`W/"old"`),
	})

	mockDeleteClient := NewMockVMSSDeleteClient(ctrl)
	mockDeleteClient.EXPECT().
		BeginCreateOrUpdate(gomock.Any(), gomock.Any(), gomock.Any(), gomock.Any(), gomock.Any()).
		Return(nil, &azcore.ResponseError{StatusCode: http.StatusPreconditionFailed})
	manager.azClient.vmssClientForDelete = mockDeleteClient

	scaleSet := newTestScaleSet(manager, vmssName)
	scaleSet.sizeRefreshPeriod = manager.azureCache.refreshInterval

	err := scaleSet.IncreaseSize(1)
	assert.Error(t, err)

	manager.azureCache.setScaleSet(vmssName, &armcompute.VirtualMachineScaleSet{
		Name: ptr.To(vmssName),
		SKU:  &armcompute.SKU{Capacity: ptr.To[int64](5)},
		Properties: &armcompute.VirtualMachineScaleSetProperties{
			OrchestrationMode: &orchMode,
		},
		Etag: ptr.To(`W/"new"`),
	})

	target, err := scaleSet.TargetSize()
	assert.NoError(t, err)
	assert.Equal(t, 5, target)
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cluster-autoscaler Issues or PRs related to the Cluster Autoscaler component area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add etags check for CA API calls

4 participants