feat(helm): add support for vpa #4210


Status: Open · wants to merge 1 commit into main
Conversation

@t3mi (Contributor) commented Aug 22, 2024

What this PR does / why we need it:
This PR adds support for the Vertical Pod Autoscaler, so users do not have to worry about the controller's resource usage and can let the VPA do its job.


If applicable:

  • this PR contains documentation
  • this PR contains tests
  • this PR contains YAML Samples

@matthchr (Member) left a comment:


I'm curious: Did you see an actual need for VPA-ing the ASO pod? Do you just VPA everything in your clusters?

I'm most interested in if this is useful to you because you want VPA to scale the pod down (default requests are too high) or you want VPA to scale the pod up (default requests are too low, or limits too low).

```yaml
controlledResources:
  {{- toYaml . | nindent 8 }}
{{- end }}
{{- if .Values.verticalPodAutoscaler.controlledValues }}
```
@matthchr (Member) commented:

minor: Can .Values.verticalPodAutoscaler.controlledValues ever be unspecified in the values.yaml? It seems like probably not? Is the if-guard here really needed then?

@t3mi (Contributor, Author) replied:

Yeah, it could. The default would then be the RequestsAndLimits value.
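
For context, a minimal sketch of how these optional values might sit in the chart's values.yaml; the key names mirror the template snippet above, but the `enabled` flag and the defaults shown here are assumptions, not the chart's actual contents:

```yaml
verticalPodAutoscaler:
  enabled: false          # assumed toggle for rendering the VPA resource
  # Optional: which resources VPA may adjust; omit to use VPA's default.
  controlledResources:
    - cpu
    - memory
  # Optional: if left unset, VPA itself defaults to RequestsAndLimits.
  controlledValues: RequestsAndLimits
```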

@t3mi (Contributor, Author) commented Aug 28, 2024

> I'm curious: Did you see an actual need for VPA-ing the ASO pod?

Yes. We have multiple clusters with different operator load: some teams use ASO heavily, while others have only a few resources to try it out. In the end we need a cost-effective solution for "middleware" without the toil of picking correct resource values per cluster/team.

> Do you just VPA everything in your clusters?

We're trying to use it for most components, wherever the charts support it.

> I'm most interested in if this is useful to you because you want VPA to scale the pod down (default requests are too high) or you want VPA to scale the pod up (default requests are too low, or limits too low).

It depends on the application. For ASO it would be beneficial in both directions, since it's not that critical and can tolerate a restart.

@matthchr (Member) commented:

Sorry this has been lingering - will review it shortly!

@matthchr (Member) commented:

Synced with @theunrepentantgeek and @super-harsh on this topic and we have concerns that running VPA on ASO as it stands today may cause more harm than good. The reasoning is:

  1. ASO runs a single replica by default, and when that replica is down webhooks won't work which causes a degraded user experience in-cluster. I've got an item to improve this here: Change default pod count to 2 #4215
  2. VPA restarts the pod when it scales, but ASO re-syncs with Azure and re-reconciles resources after a restart. This means that VPA-ing ASO may optimize your cluster memory/CPU usage at the expense of sending more requests to Azure, which may cause subscription throttling or other problems (depending on how many resources you have in the cluster). A restart here or there is fine, but regular restarts could be problematic. We need to understand how frequent VPA restarts can be to better assess the impact/risk on Azure subscription throttling.
  3. We haven't seen any evidence that the fixed limits we have defined for ASO are especially problematic. If you have specific data on average CPU/memory you'd like to share with us I'd love to see it. In particular, it seems unlikely that VPA would scale ASO out past the default limits, which are quite generous for how much CPU/memory the pod actually uses.

With the above said, our current stance on this PR is to hold off on merging it at least for now. We'll re-evaluate in a few weeks when I get #4215 done and possibly have more time to play around with VPA configurations.

In the meantime, if you want to run VPA with ASO, do so with caution. You should be able to do so even without Helm chart integration, because all you need to do is create the autoscaling.k8s.io/v1 resources.
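
As an illustration, a minimal sketch of such a standalone resource; the target deployment name and namespace here are assumptions based on the chart's defaults, and updateMode: "Off" (recommendation-only) is shown to sidestep the restart concerns above:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: aso-controller-manager
  namespace: azureserviceoperator-system  # assumed install namespace
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: azureserviceoperator-controller-manager  # assumed default name
  updatePolicy:
    # "Off" only publishes recommendations; "Auto" lets VPA evict and
    # resize the pod, which triggers the re-sync behavior noted above.
    updateMode: "Off"
```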

@antonioquintavalle1A (Contributor) commented:
@t3mi, have you done any benchmarking that would help model CPU/memory consumption for a given number of CRs controlled by ASO?
On top of that, have you observed any performance degradation of the K8s API server / etcd components as the number of managed CRs grows?

@t3mi (Contributor, Author) commented Mar 12, 2025

@antonioquintavalle1A sorry, I missed the notification. We haven't done any benchmarking. We faced etcd availability issues, but that was before the introduction of the selective crdPattern feature. After we limited the CRDs to a subset of the approved and used resources, we had no issues with ASO, except for some edge cases not related to performance.
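
For reference, limiting the installed CRDs that way goes through the chart's crdPattern value; a sketch, where the specific API groups listed are illustrative only:

```yaml
# values.yaml: only install CRDs for the API groups actually in use.
crdPattern: "resources.azure.com/*;containerservice.azure.com/*"
```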

@jlhuilier-1a (Contributor) commented Jul 25, 2025

Hello, we are using ASO with VPA and we ran into problems with GOMEMLIMIT (https://github.com/Azure/azure-service-operator/blob/main/v2/charts/azure-service-operator/templates/apps_v1_deployment_azureserviceoperator-controller-manager.yaml#L73), since this setting also caps the memory Go can use. Without dynamically increasing it, I guess it must either be disabled when VPA is active or be set from the request/limit value via the Downward API.
Maybe overall it would be good to integrate https://github.com/KimMachineGun/automemlimit to derive the setting automatically from the cgroup limits.
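
A minimal sketch of the Downward API approach, assuming the controller container is named manager; resourceFieldRef renders the limit as a plain byte count, which GOMEMLIMIT accepts, and it is re-resolved whenever the pod is recreated after a VPA resize:

```yaml
env:
  - name: GOMEMLIMIT
    valueFrom:
      resourceFieldRef:
        containerName: manager   # assumed container name in the deployment
        resource: limits.memory  # tracks whatever memory limit VPA sets
```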

The symptoms we saw on our side when upgrading to an ASO version with this setting were higher memory usage (GOMEMLIMIT was set to 400MiB, but the pod was consuming 1.2GiB) and a huge increase in CPU (x600) from trying to garbage-collect all the time.

But we do have optimisation problems with ASO. We see huge memory consumption (~2GB) that is not due to the ASO resources themselves (around 10 resources), but rather to the watches set up by the operator. I suspect the Secret caching is generating this crazy amount of memory and should be optimised/reviewed.

@matthchr (Member) commented:

> But we do have optimisation problems with ASO. We see huge memory consumption (~2GB) that is not due to the ASO resources themselves (around 10 resources), but rather to the watches set up by the operator. I suspect the Secret caching is generating this crazy amount of memory and should be optimised/reviewed.

Do you have a lot of secrets in the cluster?

Can you file an issue for this? (Or maybe two issues: one for high memory usage possibly due to Secret/ConfigMap caching, the other for allowing GOMEMLIMIT to work with VPA.)

@matthchr added the waiting-on-user-response label (Waiting on more information from the original user before progressing) on Jul 28, 2025.
Labels: needs-triage 🔍 · waiting-on-user-response