feat: Added details of EntityOperator in Stretch clusters #22

Open · wants to merge 6 commits into main
docs/.pages: 3 changes (2 additions, 1 deletion)
@@ -13,4 +13,5 @@ nav:
- Testing-cluster-failover.md
- Testing-failover-and-resiliency.md
- Testing-performance.md
-- Setting-up-Rack-Awareness-In-Stretch-Cluster.md
+- Setting-up-Rack-Awareness-In-Stretch-Cluster.md
+- EntityOperator.md
docs/entityoperator.md: 214 changes (214 additions, 0 deletions)
@@ -0,0 +1,214 @@
# Impact of Entity Operator Availability in a Stretch Kafka Cluster

The Entity Operator in Strimzi is responsible for managing Kafka users and topics. It automates the creation, configuration, and security settings of these entities, ensuring smooth integration with Kafka clusters deployed via Strimzi. This document explains how its availability affects topic and user management when deployed in a multi-cluster Kafka setup.

## Key Components of Entity Operator

The Entity Operator consists of two main sub-components:

### Topic Operator

- Watches for KafkaTopic CRs in Kubernetes.
- Automatically creates, updates, and deletes topics in Kafka based on KafkaTopic CR definitions.
- Keeps Kubernetes and Kafka topic configurations in sync.
- Ensures desired state consistency between Kubernetes and Kafka.

### User Operator

- Watches for KafkaUser CRs in Kubernetes.
- Manages security credentials (TLS certificates, SASL credentials).
- Ensures user permissions and authentication are correctly configured.

## Why is the Entity Operator Useful?

- Eliminates the need for manual topic and user management.
- Ensures Kafka users have appropriate authentication and authorization settings.
- Enables declarative management using Kubernetes CRs.
- Keeps configurations between Kubernetes and Kafka in sync.

## How Client Applications Use KafkaTopic and KafkaUser CRs in Strimzi

Client applications interact with Kafka topics and users in Strimzi using Kubernetes-native resources:

- KafkaTopic CRs define and manage Kafka topics.
- KafkaUser CRs define users and security credentials for authentication & authorization.

## How Applications Use KafkaTopic CRs

### Creating a Topic

Developers define a topic declaratively using a KafkaTopic CR. The Topic Operator ensures this topic is created in Kafka.

**Example KafkaTopic CR**

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-topic
  labels:
    strimzi.io/cluster: my-cluster # Must match the Kafka cluster name
spec:
  partitions: 3
  replicas: 2
  config:
    retention.ms: 86400000    # Data retention for 1 day
    segment.bytes: 1073741824 # 1GB segment size
```

**How clients use it**

Once the topic is created, client applications (producers & consumers) can publish and read messages from `my-topic` like any regular Kafka topic.
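
For a quick smoke test, the console clients shipped in the Strimzi Kafka image can be run as throwaway pods. A minimal sketch, assuming a plain (unauthenticated) listener on port 9092 and an image tag matching your Strimzi/Kafka version:

```bash
# Produce a few messages to my-topic
kubectl run kafka-producer -ti --rm=true --restart=Never \
  --image=quay.io/strimzi/kafka:latest-kafka-3.9.0 -- \
  bin/kafka-console-producer.sh \
    --bootstrap-server my-cluster-kafka-bootstrap:9092 --topic my-topic

# Read them back from the beginning
kubectl run kafka-consumer -ti --rm=true --restart=Never \
  --image=quay.io/strimzi/kafka:latest-kafka-3.9.0 -- \
  bin/kafka-console-consumer.sh \
    --bootstrap-server my-cluster-kafka-bootstrap:9092 --topic my-topic --from-beginning
```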

## How Applications Use KafkaUser CRs

### Creating a User for Authentication & Authorization

Client applications need a Kafka user to authenticate and communicate securely. A KafkaUser CR defines the user, authentication method (TLS/SCRAM-SHA), and permissions.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: my-app-user
  labels:
    strimzi.io/cluster: my-cluster # Must match the Kafka cluster name
spec:
  authentication:
    type: tls # TLS-based auth
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: my-topic
          patternType: literal
        operations:
          - Read
          - Write
```

**How clients use it**

### Authentication

- If `TLS` authentication is enabled, Strimzi will generate a secret containing the user's TLS certificates.
- If `SCRAM-SHA` authentication is enabled, Strimzi will generate a username and password in a Kubernetes secret.

### Authorization (ACLs)

- In the above example, the user `my-app-user` has Read and Write access to `my-topic`.

Clients will only be able to perform allowed operations.
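
To verify what the User Operator has actually applied, the ACLs can be listed with Kafka's own CLI. This is only a sketch: the broker pod name and the admin client properties file are assumptions, and on a secured listener `kafka-acls.sh` needs credentials that are allowed to describe ACLs.

```bash
# List the ACLs created for my-topic
# (/tmp/admin.properties is a hypothetical client config with admin credentials)
kubectl exec -it my-cluster-kafka-0 -- \
  bin/kafka-acls.sh \
    --bootstrap-server localhost:9092 \
    --command-config /tmp/admin.properties \
    --list --topic my-topic
```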

## How Clients Retrieve and Use Credentials

After creating a KafkaUser, Strimzi automatically generates a Kubernetes Secret with the credentials.

**Example**

```bash
kubectl get secret my-app-user -o yaml
```

It will contain:

#### For TLS authentication

- ca.crt (CA certificate)
- user.crt (Client certificate)
- user.key (Client private key)

#### For SCRAM-SHA authentication

- password (Base64-encoded password)
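
The values in the Secret are Base64-encoded and need to be decoded before use. A minimal sketch (which keys are present depends on the authentication type configured in the `KafkaUser`):

```bash
# SCRAM-SHA: decode the generated password
kubectl get secret my-app-user -o jsonpath='{.data.password}' | base64 -d

# TLS: extract the client certificate and private key
kubectl get secret my-app-user -o jsonpath='{.data.user\.crt}' | base64 -d > user.crt
kubectl get secret my-app-user -o jsonpath='{.data.user\.key}' | base64 -d > user.key
```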

### Using These Credentials in a Kafka Client

**Example**

#### Java Producer Example (TLS Authentication)


```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

Properties props = new Properties();
props.put("bootstrap.servers", "my-cluster-kafka-bootstrap:9093");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("security.protocol", "SSL");
// Keystore/truststore files and passwords come from the Strimzi-managed Secrets (see below); "password" is a placeholder
props.put("ssl.truststore.location", "/etc/secrets/ca.p12");
props.put("ssl.truststore.password", "password");
props.put("ssl.keystore.location", "/etc/secrets/user.p12");
props.put("ssl.keystore.password", "password");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
```
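
The `.p12` keystore and truststore referenced above can be extracted from the Strimzi-managed Secrets. As a sketch, assuming recent Strimzi versions where the `KafkaUser` Secret also contains `user.p12`/`user.password` and the cluster CA Secret contains `ca.p12`/`ca.password` (field names can differ between versions):

```bash
# Client keystore and its password from the KafkaUser Secret
kubectl get secret my-app-user -o jsonpath='{.data.user\.p12}' | base64 -d > user.p12
kubectl get secret my-app-user -o jsonpath='{.data.user\.password}' | base64 -d

# Truststore and its password from the cluster CA Secret
kubectl get secret my-cluster-cluster-ca-cert -o jsonpath='{.data.ca\.p12}' | base64 -d > ca.p12
kubectl get secret my-cluster-cluster-ca-cert -o jsonpath='{.data.ca\.password}' | base64 -d
```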

#### Java Consumer Example (SCRAM-SHA Authentication)

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "my-cluster-kafka-bootstrap:9093");
props.put("group.id", "my-app-group"); // example consumer group id
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("security.protocol", "SASL_SSL");
props.put("sasl.mechanism", "SCRAM-SHA-512");
// Username/password come from the KafkaUser Secret; SASL_SSL also needs the cluster CA in a truststore (omitted here)
props.put("sasl.jaas.config",
    "org.apache.kafka.common.security.scram.ScramLoginModule required username=\"my-app-user\" password=\"my-secret-password\";");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

```

### Summary of How Applications Use KafkaTopic & KafkaUser CRs

| Action | What Happens |
| -------- | ------- |
| Developer creates a `KafkaTopic` CR | Topic Operator creates & syncs the topic in Kafka |
| Developer creates a `KafkaUser` CR | User Operator creates the user & credentials |
| Application retrieves credentials from Kubernetes Secrets | Application mounts the Secrets for authentication |
| Application connects to Kafka using these credentials | Producer/Consumer communicates with Kafka |


## Impact of Central Cluster Failure on Kafka Clients in a Stretch Cluster

Consider a stretch Kafka deployment where:

✅ Kafka brokers and controllers are spread across multiple Kubernetes clusters.<br>
✅ The central cluster hosts all Kafka CRs, including Kafka, KafkaNodePool, KafkaUser, and KafkaTopic.<br>
✅ The Entity Operator (managing users & topics) runs in the central cluster.


## What About Entity Operator Functions?

The Entity Operator becomes unavailable when the central cluster goes down. However, this does not directly impact existing Kafka clients, because:

- Kafka clients do not interact with the Entity Operator at runtime.
- User authentication still works as long as secrets (TLS/SCRAM) were distributed to all clusters.

**Review comment:**

Can you explain why we need to add this: "as long as secrets (TLS/SCRAM) were distributed to all clusters"?

I would expect Kafka to store user credentials in ZK / the controllers and not rely on the Topic Operator implementation.

- Topics and ACLs remain intact but cannot be updated or created until the central cluster recovers.

**Review comment:**

Again, topic and user management using custom resources is blocked, but users can still create topics/users in Kafka directly as normal.

**@aswinayyolath (Owner, Author) commented on Mar 26, 2025:**

I'm not sure what the confusion here is, but I will try to share my understanding. I'm not claiming that I'll be 100% accurate here, and I'm happy to be corrected, but we need to have a handle on this before we revise or add anything in the proposal related to the EO.

Kafka clients don't directly interact with the Entity Operator's components (Topic and User Operators) during their normal message operations. The EO is not even mandatory as far as Strimzi is concerned, though it is highly recommended. Existing topics and ACLs will remain operational on the Kafka brokers in the member clusters.

However, it's crucial to understand the limitations and the broader impact of the Central cluster failure in this scenario, especially when considering the use of KafkaUser and KafkaTopic custom resources for management.

While clients might continue to send and receive messages for a period, the absence of the Entity Operator in the Central cluster severely restricts the managed control plane for your Kafka environment. This leads to the following critical implications:

  1. Although existing authenticated clients might continue, the ability to create new users, reset passwords, or manage user credentials through the KafkaUser CRs is entirely blocked. This will eventually lead to authentication issues as existing credentials expire or need to be updated.

  2. Any changes to authorization rules defined in KafkaUser CRs cannot be applied. The access control is frozen at the state it was in before the Central cluster failure.

  3. New topics cannot be created, and the configuration of existing topics cannot be modified using the KafkaTopic CRs. This limits the ability of applications to scale, adapt to new requirements, or address issues related to topic configuration.

  4. The core benefit of using Strimzi and Kubernetes for managing Kafka (declarative configuration, automation, reconciliation) is significantly impaired. Relying on direct Kafka API interactions for user and topic management bypasses this managed approach and can lead to inconsistencies and operational challenges.

Therefore, while the immediate impact might not be a complete outage for all connected clients, the failure of the Central cluster, and consequently the Entity Operator, creates a significant degradation in the manageability, security, and adaptability of the stretched Kafka cluster.


## What Happens If No Cluster Has KafkaUser and KafkaTopic CRs?

If the central cluster is the only one hosting KafkaUser and KafkaTopic CRs, then when it goes down:

1. User Authentication Risks

- Kafka brokers in surviving clusters rely on existing secrets for authentication.
- If KafkaUser secrets were only stored in the central cluster and not replicated, brokers in other clusters will be unable to authenticate client requests.
- New client connections will fail since brokers cannot verify credentials.
- Existing client connections may remain active if they were authenticated before the central cluster failure, but they will eventually be disconnected when session timeouts occur.

**Review comment:**

Are you sure of this behaviour? Managing user credentials and ACLs is a Kafka capability and should not be specific to the use of the Entity Operator. So I was not expecting these restrictions.

**@aswinayyolath (Owner, Author) replied:**

Kafka itself has its own authentication and authorization capabilities. My intention in this section was to highlight the impact on clients within the context of a Strimzi deployment that utilizes the Entity Operator for managing users and their credentials through Kubernetes Secrets. The restriction isn't on Kafka's core functionality, but on the availability of the managed credentials (e.g., TLS certificates stored in Kubernetes Secrets) that the User Operator in the Central cluster typically manages and might not be automatically replicated across all clusters in a stretched setup. Ensuring proper secret replication across clusters is crucial for maintaining authentication in such scenarios.

**@aswinayyolath (Owner, Author) replied:**

Kafka itself enforces authentication if credentials are stored inside Kafka (like SCRAM-SHA passwords). But if authentication relies on Kubernetes Secrets (like TLS certs), and those secrets were not replicated, authentication will fail AFAIK

**@aswinayyolath (Owner, Author) replied:**

You’re correct that Kafka itself is responsible for enforcing authentication and ACLs, regardless of the Entity Operator’s availability. However, whether authentication continues to work after a central cluster failure depends on how credentials were managed and distributed:

SCRAM credentials are stored within Kafka’s metadata log (KRaft). Since these credentials are part of Kafka’s internal state, authentication will continue to work as long as the metadata remains accessible in the surviving clusters.

TLS certificates are typically managed via Kubernetes Secrets when using the KafkaUser CR. If these Secrets were only stored in the central cluster and not replicated to the member clusters, brokers in surviving clusters will be unable to verify new client connections. In this case, authentication failures will occur for new client connections, even though existing sessions may persist until session timeouts.

Kafka itself enforces ACLs stored in its metadata. Existing ACLs will still be applied, but new ACLs cannot be created or updated until the central cluster and the Entity Operator are restored.


2. Topic Management Limitations

- Topics that were already created will continue to exist and function normally.
- Clients can still produce and consume messages only if they were already authenticated before the central cluster failure.
- No new topics can be created or updated since the KafkaTopic CRs and Entity Operator are unavailable.
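
Both behaviours can be spot-checked from a surviving member cluster. A minimal sketch, assuming a kubeconfig context named `member-cluster-1`, a broker pod named `my-cluster-kafka-0`, and a listener the broker can query locally (a secured listener would additionally need client credentials):

```bash
# Is the KafkaUser Secret present on the surviving cluster?
kubectl --context member-cluster-1 get secret my-app-user

# Does the topic still exist and serve metadata from the surviving brokers?
kubectl --context member-cluster-1 exec -it my-cluster-kafka-0 -- \
  bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my-topic
```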

### Mitigation Strategies

To ensure Kafka clients remain functional even when the central cluster goes down, we should implement the following best practices:

✅ Replicate KafkaUser secrets across all clusters where Kafka brokers exist.

- This ensures authentication remains functional even if the central cluster is unavailable.
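
One way to do this by hand is to export the Secret from the central cluster and re-create it in each member cluster. This is only a sketch: the kubeconfig context names are placeholders, `jq` is assumed to be available, and in practice a dedicated secret-replication mechanism or operator is preferable.

```bash
# Copy the KafkaUser Secret from the central cluster to a member cluster,
# stripping server-generated metadata so it can be re-created cleanly
kubectl --context central get secret my-app-user -o json \
  | jq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp, .metadata.ownerReferences, .metadata.managedFields)' \
  | kubectl --context member-cluster-1 apply -f -
```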

✅ Ensure Kafka brokers cache authentication data where possible (this needs verification).

- Some authentication mechanisms (like SCRAM) allow brokers to cache credentials temporarily.
- This can help avoid immediate authentication failures if the central cluster is temporarily down.

✅ Alternatively, we can explore options like the KafkaAccess Operator, which reduces dependency on a single cluster for authentication.