stevefan1999-personal
@stevefan1999-personal stevefan1999-personal commented Aug 8, 2025

Overview

This PR addresses and closes #4239 by fixing a critical initialization order issue with the Agones SDK init container that causes deadlocks in certain deployment configurations.

Problem Description

Discovery Context

I discovered this issue while working with a custom init container setup for our game server deployments. Our initialization workflow includes an init container responsible for:

  • Fetching Game Server Login Tokens (GSLT) from Steam's API
  • Retrieving and configuring game server port allocations
  • Generating additional game server metadata and configuration
  • Setting up DDoS proxy protection layers
  • Pre-warming game mod resources and checking for compatibility
  • Preparing token allocation for player authentication

This initialization process has a critical dependency: it requires the Agones SDK to be fully initialized and accessible before it can proceed with port allocation and server registration tasks.

The Deadlock Scenario

Kubernetes init containers execute sequentially in the order they are defined in the pod specification. Each init container must complete successfully before the next one begins execution. This sequential execution pattern is where our problem manifests:

  1. Current Behavior: The Agones SDK init container is being appended to the end of the init containers array
  2. Our Custom Init Container: Positioned earlier in the sequence, waiting for Agones SDK availability
  3. Result: A circular dependency deadlock where:
    • Our init container cannot proceed without the Agones SDK being available
    • The Agones SDK init container cannot start because it's waiting for our init container to complete
    • The pod enters a permanent initialization state and never reaches the ready condition
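The ordering problem above can be modelled in a few lines of Go. This is only a sketch under stated assumptions: `initContainer`, `Needs`, `firstBlocked`, and the container names are hypothetical, and the model assumes that a container which has started stays reachable for the ones after it (as a sidecar-style init container would).

```go
package main

import "fmt"

// initContainer is a hypothetical model of an init container.
// Needs lists the containers that must already be running before
// this one can complete.
type initContainer struct {
	Name  string
	Needs []string
}

// firstBlocked walks the containers strictly in order and returns the
// name of the first one that depends on a container that has not
// started yet. It returns "" when every container can complete.
func firstBlocked(order []initContainer) string {
	started := map[string]bool{}
	for _, c := range order {
		for _, dep := range c.Needs {
			if !started[dep] {
				return c.Name // stuck: everything after it never starts
			}
		}
		started[c.Name] = true
	}
	return ""
}

func main() {
	custom := initContainer{Name: "custom-init", Needs: []string{"agones-sdk"}}
	sdk := initContainer{Name: "agones-sdk"}

	// SDK appended last: the custom container blocks forever.
	fmt.Printf("appended:  blocked=%q\n", firstBlocked([]initContainer{custom, sdk}))
	// SDK placed first: nothing blocks.
	fmt.Printf("prepended: blocked=%q\n", firstBlocked([]initContainer{sdk, custom}))
}
```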

Root Cause Analysis

The core issue stems from the container injection order. When Agones adds the SDK init container to the final pod specification, it appends it to the existing init containers array rather than prepending it. This breaks the assumption that the Agones SDK will be available for other init containers that depend on it.

Solution

This PR modifies the container injection logic to ensure the Agones SDK init container is placed at the beginning of the init containers array, guaranteeing it initializes before any user-defined init containers that may depend on its functionality.
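As a rough illustration of the change (not the actual Agones code), the sketch below prepends the sidecar list to the user-defined init containers. The `container` type is a hypothetical stand-in for the Kubernetes container type, and `prependSidecars` and all container names are made up for the example.

```go
package main

import (
	"fmt"
	"slices"
)

// container is a hypothetical stand-in for a Kubernetes container spec;
// only the name matters for ordering.
type container struct {
	Name string
}

// prependSidecars returns a new slice with the sidecars placed before the
// user-defined init containers. Neither input slice is modified.
func prependSidecars(sidecars, initContainers []container) []container {
	return slices.Concat(sidecars, initContainers)
}

// names extracts the container names, in order, for printing.
func names(cs []container) []string {
	out := make([]string, 0, len(cs))
	for _, c := range cs {
		out = append(out, c.Name)
	}
	return out
}

func main() {
	user := []container{{Name: "fetch-gslt"}, {Name: "warm-mods"}}
	sdk := []container{{Name: "sdk-sidecar"}} // illustrative name only

	fmt.Println(names(prependSidecars(sdk, user)))
}
```

Building a fresh slice keeps the original `initContainers` value untouched, which avoids surprises if the caller still holds a reference to it.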

Impact & Benefits

  • Fixes a Deadlock: Resolves deadlock issues for users with SDK-dependent init containers
  • Maintains Backward Compatibility: Existing deployments without init container dependencies continue to work as expected
  • Enables Advanced Initialization Patterns: Allows for more complex initialization workflows that can leverage Agones SDK capabilities during the init phase
  • Improves Developer Experience: Removes the need for workarounds or manual container ordering adjustments

Testing Performed

  • Verified init container execution order with multiple init containers present
  • Tested SDK availability from subsequent init containers
  • Confirmed no regression in standard deployment scenarios without init container dependencies
  • Validated the fix resolves the deadlock condition described in #4239 (Sidecar Agones API server should be the first init container)

@stevefan1999-personal
Author

stevefan1999-personal commented Aug 8, 2025

@Sivasankaran25 @0xaravindh sorry for the ping, but is this syntax legal in Go? I see that it takes varargs, but I'm not sure whether the ... unpack operator works for slices.

Another way to do it:

import "slices"

pod.Spec.InitContainers = slices.Concat(sidecars, pod.Spec.InitContainers)

but this requires Go 1.22+
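For comparison, here is a small standalone sketch of both forms on plain string slices. Passing a slice followed by `...` as the variadic argument of `append` is valid Go. The helper names `prependWithAppend` and `prependWithConcat` are made up for this example and are not part of the PR.

```go
package main

import (
	"fmt"
	"slices"
)

// prependWithAppend joins first and rest using append and the ... operator.
// It copies into a fresh slice so that append never writes into the
// backing array of first.
func prependWithAppend(first, rest []string) []string {
	out := make([]string, 0, len(first)+len(rest))
	out = append(out, first...)
	return append(out, rest...)
}

// prependWithConcat does the same with slices.Concat (Go 1.22+),
// which always returns a newly allocated slice.
func prependWithConcat(first, rest []string) []string {
	return slices.Concat(first, rest)
}

func main() {
	sidecars := []string{"sdk"}
	existing := []string{"init-a", "init-b"}

	fmt.Println(prependWithAppend(sidecars, existing))
	fmt.Println(prependWithConcat(sidecars, existing))
}
```

The shorter `append(first, rest...)` also compiles, but it may reuse spare capacity in `first`, so the copy above is the safer variant when `first` is shared.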

@aravindhkm

@stevefan1999-personal yes, since we're already using Go 1.24, it's safe to use slices.Concat(). No issues there.

@markmandel
Collaborator

Oh that's a really good catch!

My only extra thought - might be worth adding a unit test to make sure that the sdk-server sidecar is always first, and we don't ever have a regression.

No strong opinions on which slice syntax is used 😄
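A regression check along the lines suggested above might look roughly like this. It is a sketch only: `injectSidecar` is a hypothetical stand-in for the real injection code, `container` is a simplified type, and the name `sdk-server` is used purely for illustration. In the real codebase this would presumably live in a `_test.go` file and use the `testing` package.

```go
package main

import (
	"errors"
	"fmt"
)

// container is a simplified stand-in for a Kubernetes container spec.
type container struct{ Name string }

// injectSidecar is a hypothetical stand-in for the code under test:
// it places the sidecar ahead of any user-defined init containers.
func injectSidecar(sidecar container, user []container) []container {
	out := make([]container, 0, len(user)+1)
	out = append(out, sidecar)
	return append(out, user...)
}

// checkSidecarFirst returns an error unless the first init container
// carries the expected sidecar name.
func checkSidecarFirst(cs []container, sidecarName string) error {
	if len(cs) == 0 {
		return errors.New("no init containers present")
	}
	if cs[0].Name != sidecarName {
		return fmt.Errorf("expected %q first, got %q", sidecarName, cs[0].Name)
	}
	return nil
}

func main() {
	// Table of user-defined init container lists, including the empty case.
	cases := [][]container{
		nil,
		{{Name: "a"}},
		{{Name: "a"}, {Name: "b"}},
	}
	for _, user := range cases {
		got := injectSidecar(container{Name: "sdk-server"}, user)
		if err := checkSidecarFirst(got, "sdk-server"); err != nil {
			fmt.Println("FAIL:", err)
			return
		}
	}
	fmt.Println("ok")
}
```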

@markmandel
Collaborator

/gcbrun

@agones-bot
Collaborator

Build Failed 😭

Build Id: b191037d-1958-41e8-9068-ab10a026abe9

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@0xaravindh
Member

/gcbrun

@markmandel
Collaborator

/gcbrun

submit-e2e-test-cloud-build is still flaky.

@agones-bot
Collaborator

Build Failed 😭

Build Id: 2d520d12-cba1-4178-9e1c-96c819c2e196

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel
Collaborator

/gcbrun

@agones-bot
Collaborator

Build Failed 😭

Build Id: ca564f19-822c-4573-a93e-5eb0dc0ae8ec

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel
Collaborator

Ignore the errors for now - CI is pretty borked

@lacroixthomas
Collaborator

/gcbrun

@markmandel
Collaborator

I'd still love a unit test here so we never regress this change.

@agones-bot
Collaborator

Build Failed 😭

Build Id: 404943de-1fd1-4011-a130-0bea2e3f4863

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

Development

Successfully merging this pull request may close these issues.

Sidecar Agones API server should be the first init container