
Remove dependency on the runner's volume #244


Draft: nikola-jokic wants to merge 9 commits into main

Conversation

nikola-jokic (Collaborator)

No description provided.

@austinpray-mixpanel left a comment

Before this gets too far: can you address the concerns with this approach that I raised over in #160 (comment)?

zarko-a commented Aug 18, 2025

> Before this gets too far: can you address the concerns with this approach that I raised over in #160 (comment)?

I'll take a stab at responding as I'd really like to get this feature out as soon as possible :)

Cloning a volume via your cloud provider's API and then mounting it inside K8s is FAR more complicated than doing a simple copy via the exec API. My understanding is that the runner copies only the job "spec" (for lack of a better word), and maybe the nodejs binary, to what used to be a shared volume. Although maybe node is actually copied from the init container; I don't have the full picture of Nikola's implementation yet. In any case, the payload is relatively small, and I don't see why the copy shouldn't be reliable. Doing a whole PV clone for <100MB of files seems like huge overkill. Potentially heavy operations like repo cloning actually happen in the workflow pod and wouldn't be copied using the kube exec API.
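
To make this concrete, here's a minimal sketch of the kind of copy I mean, assuming @kubernetes/client-node (its Cp helper streams a tar archive over the exec API, the same mechanism kubectl cp uses). The namespace, pod, container, and paths below are made up, not taken from Nikola's implementation:

```ts
import { KubeConfig, Cp } from '@kubernetes/client-node'

async function main(): Promise<void> {
  // Load credentials from the environment (in-cluster service account
  // or the local kubeconfig).
  const kc = new KubeConfig()
  kc.loadFromDefault()

  // Cp streams a tar archive over the exec API, so no shared volume
  // is required between the runner and the workflow pod.
  const cp = new Cp(kc)

  // Hypothetical names: copy a small job-spec file from the runner's
  // filesystem into the workflow pod.
  await cp.cpToPod(
    'runners',                      // namespace (assumption)
    'workflow-pod',                 // target pod (assumption)
    'job',                          // target container (assumption)
    '/home/runner/_work/spec.json', // local source path (assumption)
    '/__w'                          // destination dir in the pod (assumption)
  )
}

main().catch(err => {
  console.error('exec-based copy failed:', err)
  process.exit(1)
})
```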

Most importantly, the runner container hooks are written to be pretty generic and not to prefer one cloud provider over another.
And be careful what you wish for: even if they decided to implement something like you're suggesting, GCP/GKE would likely be the last one to get support for it. AWS and Azure are both bigger, and I'm sure GH has more customers on those two clouds than on GCP.

@austinpray-mixpanel

Hey @zarko-a! Yeah, thanks for braining this out with me.

My main concern was:

> I have significant doubts that this will be a stable approach. At scale we observe even trivial use cases for the exec API (like exec into a pod and check for the existence of a file on a cron) fail for all sorts of reasons.

To expand on that:

  • Anecdotally, the exec API is super flaky. We experienced lots of random connection issues and timeouts when we issued execs hundreds of times per day as part of our deploy workflows. This is anecdotal on Kube 1.25-1.27, though; we removed the exec API stuff from the hot path around the time 1.27 was released.
    • I'm happy to burn some $$$ if we want to stress test this, e.g. spin up hundreds of pod pairs and use this code to copy files between the pods.
  • Logically, the exec API depends on control plane uptime, which is not 100%.
    • For instance, in GKE land the control plane has a 99.5% and 99.95% monthly uptime SLA for zonal and regional clusters respectively (roughly 3.6 hours and 22 minutes of allowed downtime per month). Intentional control plane upgrades and other things like that could also cause API downtime, which would fail worker setup.

👉 So at minimum I would expect this implementation to assume these execs will fail or be interrupted, and to heavily integrate backoff retries or something like that (rough sketch below).
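
For illustration only, something roughly like this; copyToPod is a made-up stand-in for whatever performs the exec copy, and the attempt count and delays are placeholder values, not a real configuration:

```ts
// Hypothetical sketch: wrap a fallible exec-based copy in retries with
// exponential backoff and full jitter.
declare function copyToPod(srcPath: string, destDir: string): Promise<void>

async function withBackoff<T>(
  op: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op()
    } catch (err) {
      lastError = err
      // Full jitter: sleep a random duration in [0, base * 2^attempt) ms,
      // so concurrent retries don't stampede the API server.
      const delayMs = Math.random() * baseDelayMs * 2 ** attempt
      await new Promise(resolve => setTimeout(resolve, delayMs))
    }
  }
  throw lastError
}

// Usage: treat every exec as interruptible and retry the whole copy.
await withBackoff(() => copyToPod('/home/runner/_work/spec.json', '/__w'))
```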

> GCP/GKE would likely be the last one to get support for it. AWS and Azure are both bigger, and I'm sure GH has more customers on those two clouds than on GCP.

Well, yeah: if there were an ADR out for cloud-specific providers, my team would for sure contribute a GCP one in short order.
