
Remove dependency on the runner's volume #244


Draft: nikola-jokic wants to merge 9 commits into main

Conversation

nikola-jokic (Collaborator)

No description provided.

@austinpray-mixpanel left a comment

Before this gets too far: can you address the concerns with this approach that I raised over in #160 (comment)?

zarko-a commented Aug 18, 2025

> Before this gets too far: can you address the concerns with this approach that I raised over in #160 (comment)?

I'll take a stab at responding as I'd really like to get this feature out as soon as possible :)

Cloning a volume via your cloud provider's API and then mounting it inside K8s is FAR more complicated than doing a simple copy via the exec API. My understanding is that the runner copies only the job "spec" (for lack of a better word), and maybe the nodejs binary, to what used to be a shared volume. Although maybe node is actually copied from the init container; I don't have the full picture of Nikola's implementation yet. In any case, the payload is relatively small, and I don't see why the copy shouldn't be reliable. Doing a whole PV clone for <100MB of files seems like huge overkill. Potentially heavy operations like repo cloning actually happen in the workflow pod and wouldn't be copied using the kube exec API.
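
To make this concrete, here's a minimal sketch of the kind of copy I mean, assuming @kubernetes/client-node (its Cp helper streams a tar archive over the exec API, the same mechanism kubectl cp uses). The namespace, pod, container, and paths below are made up, not taken from Nikola's implementation:

```ts
import { KubeConfig, Cp } from '@kubernetes/client-node'

async function main(): Promise<void> {
  // Load credentials from the environment (in-cluster service account
  // or the local kubeconfig).
  const kc = new KubeConfig()
  kc.loadFromDefault()

  // Cp streams a tar archive over the exec API, so no shared volume
  // is required between the runner and the workflow pod.
  const cp = new Cp(kc)

  // Hypothetical names: copy a small job-spec file from the runner's
  // filesystem into the workflow pod.
  await cp.cpToPod(
    'runners',                      // namespace (assumption)
    'workflow-pod',                 // target pod (assumption)
    'job',                          // target container (assumption)
    '/home/runner/_work/spec.json', // local source path (assumption)
    '/__w'                          // destination dir in the pod (assumption)
  )
}

main().catch(err => {
  console.error('exec-based copy failed:', err)
  process.exit(1)
})
```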

Most importantly, the runner container hooks are written to be pretty generic and not to prefer one cloud provider over another.
And be careful what you wish for: even if they decided to implement something like you're suggesting, GCP/GKE would likely be the last one to get support for it. AWS and Azure are both bigger, and I'm sure GH has more customers on those two clouds than on GCP.

@austinpray-mixpanel

Hey @zarko-a! Yeah, thanks for braining this out with me.

My main concern was:

> I have significant doubts that this will be a stable approach. At scale we observe even trivial use cases for the exec API (like exec into a pod and check for the existence of a file on a cron) fail for all sorts of reasons.

To expand on that:

  • Anecdotally, the exec API is super flaky. We experienced lots of random connection issues and timeouts when we issued execs hundreds of times per day as part of our deploy workflows. This is anecdotal on Kube 1.25-1.27, though; we removed the exec API stuff from the hot path around the time 1.27 was released.
    • I'm happy to burn some $$$ if we want to stress test this, e.g. spin up hundreds of pod pairs and use this code to copy files between the pods.
  • Logically, the exec API depends on control plane uptime, which is not 100%.
    • For instance, in GKE land the control plane has a 99.5% and 99.95% monthly uptime SLA for zonal and regional clusters respectively (roughly 3.6 hours and 22 minutes of allowed downtime per month). Intentional control plane upgrades and other things like that could also cause API downtime, which would fail worker setup.

👉 So at minimum I would expect this implementation to assume these execs will fail or be interrupted, and to heavily integrate backoff retries or something like that (rough sketch below).
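
For illustration only, something roughly like this; copyToPod is a made-up stand-in for whatever performs the exec copy, and the attempt count and delays are placeholder values, not a real configuration:

```ts
// Hypothetical sketch: wrap a fallible exec-based copy in retries with
// exponential backoff and full jitter.
declare function copyToPod(srcPath: string, destDir: string): Promise<void>

async function withBackoff<T>(
  op: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op()
    } catch (err) {
      lastError = err
      // Full jitter: sleep a random duration in [0, base * 2^attempt) ms,
      // so concurrent retries don't stampede the API server.
      const delayMs = Math.random() * baseDelayMs * 2 ** attempt
      await new Promise(resolve => setTimeout(resolve, delayMs))
    }
  }
  throw lastError
}

// Usage: treat every exec as interruptible and retry the whole copy.
await withBackoff(() => copyToPod('/home/runner/_work/spec.json', '/__w'))
```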

> GCP/GKE would likely be the last one to get support for it. AWS and Azure are both bigger, and I'm sure GH has more customers on those two clouds than on GCP.

Well, yeah: if there were an ADR out for cloud-specific providers, my team would for sure contribute a GCP one in short order.
