feat: Added details of EntityOperator in Stretch clusters #22

Open: wants to merge 6 commits into base: main

aswinayyolath
Owner

Added details of EntityOperator in Stretch clusters

Signed-off-by: Aswin A <aswin6303@gmail.com>

@neeraj-laad neeraj-laad left a comment

Most of this documents our understanding of how the EO works, and that is fine. Anything that is specific to the EO in a stretch cluster should be documented in a clear, concise way in the proposal. In general, instead of raising an issue, we should document the solution.

Addressed review comments

Signed-off-by: Aswin A <aswin6303@gmail.com>
@aswinayyolath
Owner Author

I've updated the PR. Could you please review it again? @neeraj-laad, I've also added you as a collaborator and sent you an invitation.

@aswinayyolath
Owner Author

Please feel free to edit directly if you think something needs changing.

Overall, I don't see any major impact on existing client connections even if the central cluster hosting the KafkaUser and KafkaTopic CRs goes down.

The Entity Operator becomes unavailable when the central cluster goes down. However, this does not directly impact existing Kafka clients because:

- Kafka clients do not interact with the Entity Operator at runtime.
- User authentication still works as long as secrets (TLS/SCRAM) were distributed to all clusters.

Can you explain why we need to add "as long as secrets (TLS/SCRAM) were distributed to all clusters"?

I would expect Kafka to store user credentials in ZK / controllers and not rely on topic operator implementation.


- Kafka clients do not interact with the Entity Operator at runtime.
- User authentication still works as long as secrets (TLS/SCRAM) were distributed to all clusters.
- Topics and ACLs remain intact but cannot be updated or created until the central cluster recovers.

Again, topic and user management using custom resources is blocked, but users can still create topics and users in Kafka as normal.

Owner Author

@aswinayyolath aswinayyolath Mar 26, 2025

I'm not sure what the confusion is here, but I'll share my understanding. I'm not claiming to be 100% accurate and I'm happy to be corrected, but we need a handle on this before we revise or add anything in the proposal related to the EO.

Kafka clients don't directly interact with the Entity Operator's components (Topic and User Operators) during their normal message operations. The EO is not even mandatory as far as Strimzi is concerned, although it is highly recommended. Existing topics and ACLs will remain operational on the Kafka brokers in the member clusters.

However, it's crucial to understand the limitations and the broader impact of the Central cluster failure in this scenario, especially when considering the use of KafkaUser and KafkaTopic custom resources for management.

While clients might continue to send and receive messages for a period, the absence of the Entity Operator in the Central cluster severely restricts the managed control plane for your Kafka environment. This leads to the following critical implications:

  1. Although existing authenticated clients might continue, the ability to create new users, reset passwords, or manage user credentials through the KafkaUser CRs is entirely blocked. This will eventually lead to authentication issues as existing credentials expire or need to be updated.

  2. Any changes to authorization rules defined in KafkaUser CRs cannot be applied. The access control is frozen at the state it was in before the Central cluster failure.

  3. New topics cannot be created, and the configuration of existing topics cannot be modified using the KafkaTopic CRs. This limits the ability of applications to scale, adapt to new requirements, or address issues related to topic configuration.

  4. The core benefit of using Strimzi and Kubernetes for managing Kafka (declarative configuration, automation, reconciliation) is significantly impaired. Relying on direct Kafka API interactions for user and topic management bypasses this managed approach and can lead to inconsistencies and operational challenges.

Therefore, while the immediate impact might not be a complete outage for all connected clients, the failure of the Central cluster, and consequently the Entity Operator, creates a significant degradation in the manageability, security, and adaptability of the stretched Kafka cluster.

Comment on lines 190 to 193
- Kafka brokers in surviving clusters rely on existing secrets for authentication.
- If KafkaUser secrets were only stored in the central cluster and not replicated, brokers in other clusters will be unable to authenticate client requests.
- New client connections will fail since brokers cannot verify credentials.
- Existing client connections may remain active if they were authenticated before the central cluster failure, but they will eventually be disconnected when session timeouts occur.

Are you sure of this behaviour? Managing user credentials and ACLs is a Kafka capability and should not be specific to the use of the Entity Operator, so I was not expecting these restrictions.

Owner Author

Kafka itself has its own authentication and authorization capabilities. My intention in this section was to highlight the impact on clients within the context of a Strimzi deployment that utilizes the Entity Operator for managing users and their credentials through Kubernetes Secrets. The restriction isn't on Kafka's core functionality, but on the availability of the managed credentials (e.g., TLS certificates stored in Kubernetes Secrets) that the User Operator in the Central cluster typically manages and might not be automatically replicated across all clusters in a stretched setup. Ensuring proper secret replication across clusters is crucial for maintaining authentication in such scenarios.

Owner Author

Kafka itself enforces authentication if credentials are stored inside Kafka (like SCRAM-SHA passwords). But if authentication relies on Kubernetes Secrets (like TLS certs), and those Secrets were not replicated, authentication will fail, as far as I know.

Owner Author

You’re correct that Kafka itself is responsible for enforcing authentication and ACLs, regardless of the Entity Operator’s availability. However, whether authentication continues to work after a central cluster failure depends on how credentials were managed and distributed:

SCRAM credentials are stored within Kafka’s metadata log (KRaft). Since these credentials are part of Kafka’s internal state, authentication will continue to work as long as the metadata remains accessible in the surviving clusters.

TLS certificates are typically managed via Kubernetes Secrets when using the KafkaUser CR. If these Secrets were only stored in the central cluster and not replicated to the member clusters, brokers in surviving clusters will be unable to verify new client connections. In this case, authentication failures will occur for new client connections, even though existing sessions may persist until session timeouts.

Kafka itself enforces ACLs stored in its metadata. Existing ACLs will still be applied, but new ACLs cannot be created or updated until the central cluster and the Entity Operator are restored.
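
As a rough illustration, the reasoning in this comment can be written as a small decision function. This is only a sketch of the claims as stated here, not verified Kafka or Strimzi behaviour; the `Scenario` fields and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    mechanism: str             # "scram" or "tls"
    metadata_accessible: bool  # KRaft metadata still reachable in the surviving clusters
    secret_replicated: bool    # KafkaUser Secret was copied to the member clusters
    new_connection: bool       # True for a new client connection, False for an existing session


def auth_expected_to_work(s: Scenario) -> bool:
    """Restates the comment's claims; an assumption, not confirmed broker behaviour."""
    if s.mechanism == "scram":
        # SCRAM credentials live in Kafka's own metadata.
        return s.metadata_accessible
    if s.mechanism == "tls":
        if not s.new_connection:
            # Existing sessions may persist until they time out.
            return True
        # Per the comment, new connections depend on the Secret being replicated.
        return s.secret_replicated
    raise ValueError(f"unknown mechanism: {s.mechanism}")
```

For example, `auth_expected_to_work(Scenario("tls", True, False, True))` returns `False`, which is the case the reviewer is questioning.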

@aswinayyolath aswinayyolath force-pushed the entity-operator branch 2 times, most recently from 1a56ca0 to 44645ef Compare March 26, 2025 07:37
Addressed Review comments

Signed-off-by: Aswin A <aswin6303@gmail.com>
Added more clarity

Signed-off-by: Aswin A <aswin6303@gmail.com>
Collaborator

@rohan-anil-kumar rohan-anil-kumar left a comment

LGTM

Added details of internal, external access

Signed-off-by: Aswin A <aswin6303@gmail.com>

**Impact on Clients:**

The fact that the User Operator manages credentials and ACLs through Kafka's standard mechanisms means that the availability of the User Operator is crucial for:

Is this section only for the User Operator, or for both the UO and TO?

Comment on lines +196 to +200
- Kafka brokers in the surviving member clusters rely on the configured authentication mechanisms and the presence of valid credentials for client authentication.
- If TLS certificate Secrets are not replicated across all clusters, new client deployments and credential updates will fail. However, existing clients with valid TLS certificates will continue functioning until their certificates expire or require rotation. The duration for which they can operate depends entirely on the expiration date set when the certificates were issued.
- If SCRAM credentials have been successfully replicated across the Kafka brokers, existing clients should be able to continue authenticating, even if the Central cluster is down. The issue is with new client deployments or credential updates.
- Existing client connections that were authenticated before the central cluster failure remain active for a period, but their continued operation depends on multiple factors. If a Kafka broker restarts, clients must reconnect and re-authenticate, which could fail if they rely on new credentials from an unavailable Entity Operator. Additionally, on SASL listeners, the broker's `connections.max.reauth.ms` setting determines whether and how often connected clients must re-authenticate; the consumer setting `session.timeout.ms` governs consumer group liveness, not authentication.
- Crucially, the management of credentials (e.g., rotation) through the User Operator will be unavailable.
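
The certificate-lifetime point above comes down to simple date arithmetic. A minimal sketch, assuming a hypothetical issue date, outage date, and a 365-day validity period (all illustrative values, not taken from the proposal):

```python
from datetime import datetime, timedelta, timezone


def remaining_validity(issued_at: datetime, validity_days: int, now: datetime) -> timedelta:
    """Time left before a client certificate expires (negative if already expired)."""
    return issued_at + timedelta(days=validity_days) - now


# Hypothetical values for illustration only.
issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
outage_start = datetime(2025, 3, 28, tzinfo=timezone.utc)

left = remaining_validity(issued, validity_days=365, now=outage_start)
print(left.days)  # 279
```

Under these assumed dates, a client holding that certificate could keep authenticating for 279 more days without any renewal from the User Operator.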

@neeraj-laad neeraj-laad Mar 28, 2025

I don't think this presents a clear picture, and a number of sentences use "might"; we should rewrite them to be more assertive.

IMO:

  • administrators will not be able to view, create, update, or delete Kafka users and their credentials via Kubernetes custom resources.
  • administrators will not be able to view, create, update, or delete Kafka topics and their configuration via Kubernetes custom resources.
  • existing users, topics and client applications using those credentials will continue to work as usual with no disruption as long as the credentials remain valid.
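
These three points, together with the earlier remark that topics and users can still be managed in Kafka as normal, can be sketched as a tiny availability model. The enum and function names are hypothetical and only restate the summary above:

```python
from enum import Enum


class ManagementPath(Enum):
    CUSTOM_RESOURCE = "KafkaUser/KafkaTopic custom resources"
    KAFKA_DIRECT = "Kafka directly"


def management_available(path: ManagementPath, central_cluster_up: bool) -> bool:
    """Custom-resource management needs the central cluster; direct Kafka management does not."""
    if path is ManagementPath.CUSTOM_RESOURCE:
        return central_cluster_up
    return True


def existing_clients_unaffected(credentials_valid: bool) -> bool:
    """Existing users, topics, and clients keep working while their credentials remain valid."""
    return credentials_valid
```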

If we really need to call out a plan B, we should provide steps for setting up the Entity Operator on a different cluster using custom resource definitions from a GitOps branch.


- The ACLs defined in KafkaUser CRs are configured on the Kafka brokers. These ACLs will generally remain in place.
- However, any new authorization rules or modifications to existing ones defined in KafkaUser CRs cannot be applied because the User Operator is down.
- TLS certificates used for authentication expire and rotate periodically. Without the User Operator, expired certificates cannot be renewed, leading to eventual authentication failures.

How is this different from Kafka broker certs in the absence of the central cluster?

Co-authored-by: Neeraj Laad <neeraj.laad@gmail.com>

3 participants