feat(helm): add support for vpa #4210


Status: Open · wants to merge 1 commit into main
Conversation

@t3mi (Contributor) commented Aug 22, 2024

What this PR does / why we need it:
This PR adds support for the Vertical Pod Autoscaler, so users do not have to worry about the controller's resource usage and can let the VPA do its job.


If applicable:

  • this PR contains documentation
  • this PR contains tests
  • this PR contains YAML Samples

@matthchr (Member) left a comment:


I'm curious: Did you see an actual need for VPA-ing the ASO pod? Do you just VPA everything in your clusters?

I'm most interested in if this is useful to you because you want VPA to scale the pod down (default requests are too high) or you want VPA to scale the pod up (default requests are too low, or limits too low).

```yaml
controlledResources:
  {{- toYaml . | nindent 8 }}
{{- end }}
{{- if .Values.verticalPodAutoscaler.controlledValues }}
```
@matthchr (Member) commented:

minor: Can .Values.verticalPodAutoscaler.controlledValues ever be unspecified in the values.yaml? It seems like probably not? Is the if-guard here really needed then?

@t3mi (Contributor, Author) replied:

Yeah, it could. The default would then be the RequestsAndLimits value.
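
For context, a minimal sketch of how these optional values might sit in the chart's values.yaml; the key names mirror the template snippet above, but the `enabled` flag and the defaults shown here are assumptions, not the chart's actual contents:

```yaml
verticalPodAutoscaler:
  enabled: false          # assumed toggle for rendering the VPA resource
  # Optional: which resources VPA may adjust; omit to use VPA's default.
  controlledResources:
    - cpu
    - memory
  # Optional: if left unset, VPA itself defaults to RequestsAndLimits.
  controlledValues: RequestsAndLimits
```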

@t3mi (Contributor, Author) commented Aug 28, 2024

> I'm curious: Did you see an actual need for VPA-ing the ASO pod?

Yes. We have multiple clusters with different operator load: some teams use ASO heavily, while others have only a few resources to try it out. In the end we need a cost-effective solution for "middleware" without the toil of picking correct resource values per cluster/team.

> Do you just VPA everything in your clusters?

We're trying to use it for most components, wherever the charts support it.

> I'm most interested in if this is useful to you because you want VPA to scale the pod down (default requests are too high) or you want VPA to scale the pod up (default requests are too low, or limits too low).

It depends on the application. For ASO it would be beneficial in both directions, since it's not that critical and can tolerate a restart.

@matthchr (Member) commented:

Sorry this has been lingering - will review it shortly!

@matthchr (Member) commented:

Synced with @theunrepentantgeek and @super-harsh on this topic and we have concerns that running VPA on ASO as it stands today may cause more harm than good. The reasoning is:

  1. ASO runs a single replica by default, and when that replica is down webhooks won't work which causes a degraded user experience in-cluster. I've got an item to improve this here: Change default pod count to 2 #4215
  2. VPA restarts the pod when it scales, but ASO re-syncs with Azure and re-reconciles resources after a restart. This means that VPA-ing ASO may optimize your cluster memory/CPU usage at the expense of sending more requests to Azure, which may cause subscription throttling or other problems (depending on how many resources you have in the cluster). A restart here or there is fine, but regular restarts could be problematic. We need to understand how frequent VPA restarts can be to better assess the impact/risk on Azure subscription throttling.
  3. We haven't seen any evidence that the fixed limits we have defined for ASO are especially problematic. If you have specific data on average CPU/memory you'd like to share with us I'd love to see it. In particular, it seems unlikely that VPA would scale ASO out past the default limits, which are quite generous for how much CPU/memory the pod actually uses.

With the above said, our current stance on this PR is to hold off on merging it at least for now. We'll re-evaluate in a few weeks when I get #4215 done and possibly have more time to play around with VPA configurations.

In the meantime, if you want to run VPA with ASO, do so with caution. You should be able to do so even without Helm chart integration, because all you need to do is create the autoscaling.k8s.io/v1 resources.
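
As an illustration, a minimal sketch of such a standalone resource; the target deployment name and namespace here are assumptions based on the chart's defaults, and updateMode: "Off" (recommendation-only) is shown to sidestep the restart concerns above:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: aso-controller-manager
  namespace: azureserviceoperator-system  # assumed install namespace
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: azureserviceoperator-controller-manager  # assumed default name
  updatePolicy:
    # "Off" only publishes recommendations; "Auto" lets VPA evict and
    # resize the pod, which triggers the re-sync behavior noted above.
    updateMode: "Off"
```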

@antonioquintavalle1A (Contributor) commented:
@t3mi, have you done any benchmarking that would help model CPU/memory consumption for a given number of CRs controlled by ASO?
On top of that, have you observed any performance degradation of the K8s API server / etcd components as the number of managed CRs grows?

@t3mi (Contributor, Author) commented Mar 12, 2025

@antonioquintavalle1A sorry, I missed the notification. We haven't done any benchmarking. We faced etcd availability issues, but that was before the introduction of the selective crdPattern feature. After we limited the CRDs to a subset of the approved and used resources, we had no issues with ASO, except for some edge cases not related to performance.
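
For reference, limiting the installed CRDs that way goes through the chart's crdPattern value; a sketch, where the specific API groups listed are illustrative only:

```yaml
# values.yaml: only install CRDs for the API groups actually in use.
crdPattern: "resources.azure.com/*;containerservice.azure.com/*"
```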

@jlhuilier-1a (Contributor) commented Jul 25, 2025

Hello, we are using ASO with VPA and we ran into problems with GOMEMLIMIT (https://github.com/Azure/azure-service-operator/blob/main/v2/charts/azure-service-operator/templates/apps_v1_deployment_azureserviceoperator-controller-manager.yaml#L73), since this setting also caps the memory Go can use. Without dynamically increasing it, I guess it must either be disabled when VPA is active or be set from the request/limit value via the Downward API.
Maybe overall it would be good to integrate https://github.com/KimMachineGun/automemlimit to derive the setting automatically from the cgroup limits.
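
A minimal sketch of the Downward API approach, assuming the controller container is named manager; resourceFieldRef renders the limit as a plain byte count, which GOMEMLIMIT accepts, and it is re-resolved whenever the pod is recreated after a VPA resize:

```yaml
env:
  - name: GOMEMLIMIT
    valueFrom:
      resourceFieldRef:
        containerName: manager   # assumed container name in the deployment
        resource: limits.memory  # tracks whatever memory limit VPA sets
```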

The symptoms we saw on our side when upgrading to an ASO version with this setting were higher memory usage (GOMEMLIMIT was set to 400MiB, but the pod was consuming 1.2GiB) and a huge increase in CPU (x600) from trying to garbage-collect all the time.

But we do have optimisation problems with ASO. We see huge memory consumption (~2GB) that is not due to the ASO resources themselves (around 10 resources), but rather to the watches set up by the operator. I suspect the Secret caching is generating this crazy amount of memory and should be optimised/reviewed.

@matthchr (Member) commented:

> But we do have optimisation problems with ASO. We see huge memory consumption (~2GB) that is not due to the ASO resources themselves (around 10 resources), but rather to the watches set up by the operator. I suspect the Secret caching is generating this crazy amount of memory and should be optimised/reviewed.

Do you have a lot of secrets in the cluster?

Can you file an issue for this? (Or maybe two issues: one for high memory usage possibly due to Secret/ConfigMap caching, the other for allowing GOMEMLIMIT to work with VPA.)

@matthchr added the waiting-on-user-response label (Waiting on more information from the original user before progressing) on Jul 28, 2025.
Labels: needs-triage 🔍 · waiting-on-user-response