feat: Added details of EntityOperator in Stretch clusters #22
base: main
Added details of EntityOperator in Stretch clusters Signed-off-by: Aswin A <aswin6303@gmail.com>
Most of this documents our understanding of how the EO works, and that is fine. Anything that is specific to the EO in a stretch cluster should be documented in a clear, concise way in the proposal. In general, instead of raising an issue, we should document the solution.
Addressed review comments Signed-off-by: Aswin A <aswin6303@gmail.com>
I've updated the PR. Could you please review it again? @neeraj-laad, I've also added you as a collaborator and sent you an invitation.
Please feel free to edit directly if you think something needs to change. Overall, I don't see any major impact on existing connections/clients even if the Central cluster hosting the KafkaUser and KafkaTopic CRs goes down.
docs/entityoperator.md
Outdated
The Entity Operator becomes unavailable when the central cluster goes down. However, this does not impact existing Kafka clients directly because

- Kafka clients do not interact with the Entity Operator at runtime.
- User authentication still works as long as secrets (TLS/SCRAM) were distributed to all clusters.
Can you explain why we need to add "as long as secrets (TLS/SCRAM) were distributed to all clusters"?
I would expect Kafka to store user credentials in ZK / controllers and not rely on topic operator implementation.
docs/entityoperator.md
Outdated
- Kafka clients do not interact with the Entity Operator at runtime.
- User authentication still works as long as secrets (TLS/SCRAM) were distributed to all clusters.
- Topics and ACLs remain intact but cannot be updated or created until the central cluster recovers.
Again, topic and user management using custom resources is blocked, but users can still create topics/users in Kafka directly, as normal.
I'm not sure what the confusion here is, but I will try to share my understanding. I'm not claiming to be 100% accurate here, and I'm happy to be corrected, but we need to have a handle on this before we revise or add anything in the proposal related to the EO.
Kafka clients don't directly interact with the Entity Operator's components (Topic and User Operators) during their normal message operations. The EO is not even mandatory as far as Strimzi is concerned, though it is highly recommended. Existing topics and ACLs will remain operational on the Kafka brokers in the member clusters.
However, it's crucial to understand the limitations and the broader impact of the Central cluster failure in this scenario, especially when considering the use of KafkaUser and KafkaTopic custom resources for management.
While clients might continue to send and receive messages for a period, the absence of the Entity Operator in the Central cluster severely restricts the managed control plane for your Kafka environment. This leads to the following critical implications:
- Although existing authenticated clients might continue, the ability to create new users, reset passwords, or manage user credentials through the KafkaUser CRs is entirely blocked. This will eventually lead to authentication issues as existing credentials expire or need to be updated.
- Any changes to authorization rules defined in KafkaUser CRs cannot be applied. The access control is frozen at the state it was in before the Central cluster failure.
- New topics cannot be created, and the configuration of existing topics cannot be modified using the KafkaTopic CRs. This limits the ability of applications to scale, adapt to new requirements, or address issues related to topic configuration.
- The core benefit of using Strimzi and Kubernetes for managing Kafka (declarative configuration, automation, reconciliation) is significantly impaired. Relying on direct Kafka API interactions for user and topic management bypasses this managed approach and can lead to inconsistencies and operational challenges.
Therefore, while the immediate impact might not be a complete outage for all connected clients, the failure of the Central cluster, and consequently the Entity Operator, creates a significant degradation in the manageability, security, and adaptability of the stretched Kafka cluster.
docs/entityoperator.md
Outdated
- Kafka brokers in surviving clusters rely on existing secrets for authentication.
- If KafkaUser secrets were only stored in the central cluster and not replicated, brokers in other clusters will be unable to authenticate client requests.
- New client connections will fail since brokers cannot verify credentials.
- Existing client connections may remain active if they were authenticated before the central cluster failure, but they will eventually be disconnected when session timeouts occur.
Are you sure of this behaviour? Managing user credentials and ACLs is a Kafka capability and should not be specific to the use of the Entity Operator, so I was not expecting these restrictions.
Kafka itself has its own authentication and authorization capabilities. My intention in this section was to highlight the impact on clients within the context of a Strimzi deployment that utilizes the Entity Operator for managing users and their credentials through Kubernetes Secrets. The restriction isn't on Kafka's core functionality, but on the availability of the managed credentials (e.g., TLS certificates stored in Kubernetes Secrets) that the User Operator in the Central cluster typically manages and might not be automatically replicated across all clusters in a stretched setup. Ensuring proper secret replication across clusters is crucial for maintaining authentication in such scenarios.
Kafka itself enforces authentication if credentials are stored inside Kafka (like SCRAM-SHA passwords). But if authentication relies on Kubernetes Secrets (like TLS certs), and those Secrets were not replicated, authentication will fail, AFAIK.
You’re correct that Kafka itself is responsible for enforcing authentication and ACLs, regardless of the Entity Operator’s availability. However, whether authentication continues to work after a central cluster failure depends on how credentials were managed and distributed:
SCRAM credentials are stored within Kafka’s metadata log (KRaft). Since these credentials are part of Kafka’s internal state, authentication will continue to work as long as the metadata remains accessible in the surviving clusters.
TLS certificates are typically managed via Kubernetes Secrets when using the KafkaUser CR. If these Secrets were only stored in the central cluster and not replicated to the member clusters, brokers in surviving clusters will be unable to verify new client connections. In this case, authentication failures will occur for new client connections, even though existing sessions may persist until session timeouts.
Kafka itself enforces ACLs stored in its metadata. Existing ACLs will still be applied, but new ACLs cannot be created or updated until the central cluster and the Entity Operator are restored.
1a56ca0 to 44645ef
Addressed Review comments Signed-off-by: Aswin A <aswin6303@gmail.com>
44645ef to 2bf7791
Added more clarity Signed-off-by: Aswin A <aswin6303@gmail.com>
LGTM
Added details of internal, external access Signed-off-by: Aswin A <aswin6303@gmail.com>
**Impact on Clients:**

The fact that the User Operator manages credentials and ACLs through Kafka's standard mechanisms means that the availability of the User Operator is crucial for:
Is this section only for the User Operator, or for both the UO and the TO?
- Kafka brokers in the surviving member clusters rely on the configured authentication mechanisms and the presence of valid credentials for client authentication.
- If TLS certificates secrets are not replicated across all clusters, new client deployments and credential updates will fail. However, existing clients with valid TLS certificates will continue functioning until their certificates expire or require rotation. The duration for which they can operate depends entirely on the expiration date set when the certificates were issued.
- If SCRAM credentials have been successfully replicated across the Kafka brokers, existing clients should be able to continue authenticating, even if the Central cluster is down. The issue is with new client deployments or credential updates.
- Existing client connections that were authenticated before the central cluster failure might remain active for a period, but their continued operation depends on multiple factors. If a Kafka broker restarts, clients may need to re-authenticate, which could fail if they rely on new credentials from an unavailable Entity Operator. Additionally, the configured `session.timeout.ms` and Kafka’s reauthentication behavior may determine how long clients remain connected before being disconnected.
- Crucially, the management of credentials (e.g., rotation) through the User Operator will be unavailable.
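The excerpt above says that existing clients keep working until their TLS certificates expire, so the time a client can survive without the User Operator is bounded by the certificate's expiry date. A minimal sketch of that calculation follows; the function name `remaining_validity` and the dates are hypothetical and chosen only for illustration, and the sketch does not read a real certificate.

```python
from datetime import datetime, timedelta, timezone


def remaining_validity(not_after: datetime, now: datetime) -> timedelta:
    """Return how long a certificate stays valid, or zero if it has expired."""
    return max(not_after - now, timedelta(0))


# Hypothetical values: the moment the central cluster fails and the
# expiry date ("notAfter") recorded in the client certificate.
outage_start = datetime(2025, 1, 1, tzinfo=timezone.utc)
cert_not_after = datetime(2025, 3, 1, tzinfo=timezone.utc)

window = remaining_validity(cert_not_after, outage_start)
print(f"Client can keep authenticating for up to {window.days} days without renewal")
```

If the central cluster is restored before this window closes, the User Operator can renew the certificate and the client sees no disruption.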
I don't think this presents a clear picture, and there are a number of sentences with "might" which we should change to more assertive statements.
IMO:
- administrators will not be able to view, create, update, or delete Kafka users and their credentials via Kubernetes custom resources.
- administrators will not be able to view, create, update, or delete Kafka topics and their configuration via Kubernetes custom resources.
- existing users, topics and client applications using those credentials will continue to work as usual with no disruption as long as the credentials remain valid.
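The three statements above can be read as a small decision table. The sketch below encodes them as such; the `Operation` enum and the `is_available` function are hypothetical names introduced only for illustration, and they model the statements in this comment rather than any Strimzi API.

```python
from enum import Enum


class Operation(Enum):
    MANAGE_USERS_VIA_CR = "view/create/update/delete KafkaUser resources"
    MANAGE_TOPICS_VIA_CR = "view/create/update/delete KafkaTopic resources"
    EXISTING_CLIENT_TRAFFIC = "existing clients authenticate, produce and consume"


def is_available(op: Operation, central_cluster_up: bool,
                 credentials_valid: bool = True) -> bool:
    """Model of the statements above, not of Strimzi's implementation."""
    if op in (Operation.MANAGE_USERS_VIA_CR, Operation.MANAGE_TOPICS_VIA_CR):
        # Management through custom resources needs the central cluster.
        return central_cluster_up
    # Existing clients only depend on their credentials remaining valid.
    return credentials_valid


# Summarize what is possible while the central cluster is down.
for op in Operation:
    status = "available" if is_available(op, central_cluster_up=False) else "blocked"
    print(f"{op.value}: {status}")
```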
If we really need to call out a plan B, we should provide steps on how someone can set up the Entity Operator on a different cluster using custom resource definitions from a GitOps branch.
- The ACLs defined in KafkaUser CRs are configured on the Kafka brokers. These ACLs will generally remain in place.
- However, any new authorization rules or modifications to existing ones defined in KafkaUser CRs cannot be applied because the User Operator is down.
- TLS certificates used for authentication expire and rotate periodically. Without the User Operator, expired certificates cannot be renewed, leading to eventual authentication failures.
How is this different from Kafka broker certs in the absence of the central cluster?
Co-authored-by: Neeraj Laad <neeraj.laad@gmail.com>
Added details of EntityOperator in Stretch clusters