
Commit 224cb54

running at scale

1 parent aa8169e commit 224cb54

File tree

3 files changed: +204 -0 lines changed
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
:_content-type: SNIPPET

[NOTE]
====
You can find more details about importing untrusted TLS certificates in the link:https://eclipse.dev/che/docs/stable/administration-guide/importing-untrusted-tls-certificates/[official documentation].
====

modules/administration-guide/nav.adoc

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
.Administration Guide

* xref:security-best-practices.adoc[]
+* xref:running-at-scale.adoc[]
* xref:preparing-the-installation.adoc[]
** xref:supported-platforms.adoc[]
** xref:installing-the-chectl-management-tool.adoc[]
Lines changed: 197 additions & 0 deletions
@@ -0,0 +1,197 @@
:_content-type: CONCEPT
:description: Running {prod} at scale
:keywords: scale, infrastructure, workload, scalability, CDE, cloud
:navtitle: Running {prod} at scale
//:page-aliases:

[id="running-at-scale"]
= Running {prod} at scale

Even though link:https://kubernetes.io/[Kubernetes] has emerged as a powerful foundation for deploying and managing containerized workloads at scale, achieving scale with Cloud Development Environments (CDEs), particularly in the range of thousands of concurrent workspaces, presents significant challenges.

Such a scale imposes high infrastructure demands and introduces potential bottlenecks that can impact the performance and stability of the entire system. Addressing these challenges requires meticulous planning, strategic architectural choices, monitoring, and continuous optimization to ensure a seamless and efficient development experience for a large number of users.

[NOTE]
====
CDE workloads are complex to scale, mainly because the underlying IDE solutions, such as link:https://github.com/microsoft/vscode[Visual Studio Code - Open Source ("Code - OSS")] or link:https://www.jetbrains.com/remote-development/gateway/[JetBrains Gateway], are designed as single-user applications rather than multitenant services.
====

.Resource quantity and object maximums

While there is no strict limit on the number of resources in a {kubernetes} cluster,
there are certain link:https://kubernetes.io/docs/setup/best-practices/cluster-large/[considerations for large clusters] to keep in mind.

[NOTE]
====
Learn more about {kubernetes} scalability in the link:https://kubernetespodcast.com/episode/111-scalability/["Scalability, with Wojciech Tyczynski"] episode of the {kubernetes} Podcast.
====

link:https://www.redhat.com/en/technologies/cloud-computing/openshift[OpenShift Container Platform], which is a certified distribution of {kubernetes}, also provides a set of tested maximums for various resources, which can serve as an initial guideline for planning your environment:

.OpenShift Container Platform tested cluster maximums for various resources
[%header,format=csv]
|===
Resource type,Tested maximum
Number of nodes,2000
Number of pods,150000
Number of pods per node,2500
Number of namespaces,10000
Number of services,10000
Number of secrets,80000
Number of config maps,90000
|===

[NOTE]
====
You can find more details on the OpenShift Container Platform tested object maximums in the link:https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html/scalability_and_performance/planning-your-environment-according-to-object-maximums#planning-your-environment-according-to-object-maximums[official documentation].
====

For example, it is generally not recommended to have more than 10,000 namespaces due to potential performance and management overhead. In {prod}, each user is allocated a namespace. If you expect the user base to be large, consider spreading workloads across multiple "fit-for-purpose" clusters and potentially leveraging solutions for multi-cluster orchestration.

.Resource requirements

When deploying {prod} on {kubernetes}, it is crucial to accurately calculate the resource requirements and determine the memory and CPU/GPU needs of each CDE to size the cluster correctly. In general, the CDE size is limited by, and cannot be bigger than, the worker node size. The resource requirements for CDEs can vary significantly based on the specific workloads and configurations. For example, a simple CDE may require only a few hundred megabytes of memory, while a more complex one may need several gigabytes of memory and multiple CPU cores.

[NOTE]
====
You can find more details about calculating resource requirements in the link:https://eclipse.dev/che/docs/stable/administration-guide/calculating-che-resource-requirements/[official documentation].
====

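As an illustration, the following devfile snippet shows how per-CDE sizing is typically expressed through container resource requests and limits. It is a minimal sketch: the component name, image, and values are placeholders to adapt to your workloads.

[source,yaml]
----
schemaVersion: 2.2.0
metadata:
  name: sample-workspace
components:
  - name: tools                       # development tooling container of the CDE
    container:
      image: quay.io/devfile/universal-developer-image:ubi8-latest
      memoryRequest: 512Mi            # guaranteed memory for a simple CDE
      memoryLimit: 4Gi                # upper bound; must fit on a worker node
      cpuRequest: 500m
      cpuLimit: "2"
----
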
.Using etcd

The primary datastore of a {kubernetes} cluster is link:https://etcd.io/[etcd]. It holds the cluster state and configuration, including information about nodes, pods, services, and custom resources. As a distributed key-value store, etcd does not scale well past a certain threshold, and as the size of etcd grows, so does the load on the cluster, risking its stability.

[IMPORTANT]
====
The default etcd size is 2 GB, and the recommended maximum is 8 GB. Exceeding the maximum limit can make the {kubernetes} cluster unstable and unresponsive.
====

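For self-managed clusters bootstrapped with kubeadm, this storage limit maps to the etcd `--quota-backend-bytes` flag. The following sketch raises the quota to the recommended 8 GB maximum; it is an illustrative assumption for kubeadm-managed control planes only, and managed distributions such as OpenShift Container Platform tune etcd for you.

[source,yaml]
----
# kubeadm ClusterConfiguration fragment (illustrative)
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
etcd:
  local:
    extraArgs:
      quota-backend-bytes: "8589934592"   # 8 GiB backend quota for etcd
----
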
.Object size as a factor

Not only the overall number, but also the size of the objects stored in etcd is a critical factor that can significantly impact its performance and stability. Each object stored in etcd consumes space, and as the number of objects increases, the overall size of etcd grows too. The larger the object, the more space it takes in etcd. For example, etcd can be overloaded with just a few thousand relatively big {kubernetes} objects.

[IMPORTANT]
====
Even though the data stored in a `ConfigMap` cannot exceed 1 MiB by design, a few thousand relatively big `ConfigMap` objects can overload etcd storage.
====

In the context of {prod}, by default the operator creates and manages the `ca-certs-merged` `ConfigMap`, which contains the Certificate Authorities (CAs) bundle, in every user namespace. With a large number of TLS certificates in the cluster, this results in additional etcd usage.

To disable mounting the CA bundle using the `ConfigMap` under the `/etc/pki/ca-trust/extracted/pem` path, configure the `CheCluster` Custom Resource by setting the `disableWorkspaceCaBundleMount` property to `true`. With this configuration, only custom certificates will be mounted under the `/public-certs` path:

[source,yaml]
----
spec:
  devEnvironments:
    trustedCerts:
      disableWorkspaceCaBundleMount: true
----

include::example$snip_che-running-at-scale-untrusted-cert.adoc[]

.DevWorkspace objects

For large {kubernetes} deployments, particularly those involving a high number of custom resources such as `DevWorkspace` objects, which represent CDEs, etcd can become a significant performance bottleneck.

[IMPORTANT]
====
Based on load testing with 6,000 `DevWorkspace` objects, etcd storage consumption was approximately 2.5 GB.
====

Starting from link:https://github.com/devfile/devworkspace-operator[DevWorkspace Operator] version 0.34.0,
you can configure a pruner
that automatically cleans up `DevWorkspace` objects that were not in use for a certain period of time.
To set up the pruner, configure the `DevWorkspaceOperatorConfig` object as follows:

[source,yaml]
----
apiVersion: controller.devfile.io/v1alpha1
kind: DevWorkspaceOperatorConfig
metadata:
  name: devworkspace-operator-config
  namespace: crw
config:
  workspace:
    cleanupCronJob:
      enabled: true
      dryRun: false
      retainTime: 2592000 # 30 days (the default): workspaces not started for longer are marked for deletion
      schedule: "0 0 1 * *" # run the pruner once per month (the default schedule)
----

[NOTE]
====
You can find more details about the DevWorkspace Operator configuration in the link:https://github.com/devfile/devworkspace-operator/blob/main/docs/dwo-configuration.md[official documentation].
====

.OLMConfig

When an operator is installed by the link:https://olm.operatorframework.io/[Operator Lifecycle Manager (OLM)],
a stripped-down copy of its ClusterServiceVersion (CSV) is created in every namespace the operator is configured to watch.
These stripped-down CSVs are known as "Copied CSVs"
and communicate to users which controllers are actively reconciling resource events in a given namespace.
On especially large clusters, with namespaces and installed operators numbering in the hundreds or thousands,
Copied CSVs consume an untenable amount of resources, such as OLM's memory usage, cluster etcd limits,
and networking. To eliminate the CSVs copied to every namespace, configure the `OLMConfig` object accordingly:

[source,yaml]
----
apiVersion: operators.coreos.com/v1
kind: OLMConfig
metadata:
  name: cluster
spec:
  features:
    disableCopiedCSVs: true
----

[NOTE]
====
Additional information about the `disableCopiedCSVs` feature is available in its original link:https://github.com/operator-framework/enhancements/blob/master/enhancements/olm-toggle-copied-csvs.md[enhancement proposal].
====

The primary impact of the `disableCopiedCSVs` property on etcd is related to resource consumption. In clusters with a large number of namespaces and many cluster-wide Operators, the creation and maintenance of numerous Copied CSVs can lead to increased etcd storage usage and memory consumption. By disabling Copied CSVs, the amount of data stored in etcd is significantly reduced, which can help improve overall cluster performance and stability.

This is particularly important for large clusters where the number of namespaces and operators can quickly add up to a significant amount of data. Disabling Copied CSVs can help reduce the load on etcd, leading to improved performance and responsiveness of the cluster.
Additionally, it can help reduce the memory footprint of OLM, as it no longer needs to maintain and manage these additional resources.

[NOTE]
====
You can find more details about "Disabling Copied CSVs" in the link:https://olm.operatorframework.io/docs/advanced-tasks/configuring-olm/#disabling-copied-csvs[official documentation].
====

.Cluster autoscaling

Although cluster autoscaling is a powerful {kubernetes} feature, you cannot always fall back on it. You should always consider predictive scaling by analyzing load data on your environment to detect daily or weekly usage patterns. If your workloads follow a pattern and there are dramatic peaks throughout the day, you should consider provisioning worker nodes accordingly. For example, if you have a predictable load pattern where the number of workspaces increases during business hours and decreases during off-hours, you can use predictive scaling to adjust the number of worker nodes based on the expected load.
This can help ensure that you have enough resources available to handle the peak load while minimizing costs during off-peak hours.

[NOTE]
====
Consider leveraging open-source solutions such as link:https://karpenter.sh/[Karpenter] for configuration and lifecycle management of the worker nodes. Karpenter can dynamically provision and optimize worker nodes based on the specific requirements of the workloads, helping to improve resource utilization and reduce costs.
====

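As a concrete illustration, a minimal Karpenter `NodePool` sketch for provisioning worker nodes for CDE workloads might look as follows. It assumes Karpenter v1 on AWS; the pool name, node class, limits, and consolidation settings are illustrative and must be adapted to your environment.

[source,yaml]
----
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: cde-workers                    # illustrative pool dedicated to CDE workloads
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws       # assumes the AWS provider
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"                        # cap the total CPU the pool can provision
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30m              # scale down idle capacity during off-peak hours
----
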
.Multi-cluster

By design, {prod} is not multi-cluster aware, and you can only have one instance per cluster.
However,
you can run {prod} in a multi-cluster environment by deploying {prod} to each cluster
and using a load balancer or DNS-based routing to direct traffic to the appropriate instance based on the user's location or other criteria.
This approach can help
improve performance and reliability by distributing the workload across multiple clusters
and providing redundancy in case of cluster failures.

[NOTE]
====
The link:https://developers.redhat.com/developer-sandbox[Developer Sandbox] is a real-world example of running {prod} at scale across multiple clusters.
====

From the infrastructure perspective, the Developer Sandbox consists of multiple link:https://www.redhat.com/en/technologies/cloud-computing/openshift/aws[ROSA] clusters. On each cluster, the productized version of {prod} is installed and configured using link:https://argo-cd.readthedocs.io/en/stable/[Argo CD]. Since the user base is spread across multiple clusters, link:https://workspaces.openshift.com/[workspaces.openshift.com] is used as a single entry point to the productized {prod} instances. You can find implementation details about the multicluster redirector in the following link:https://github.com/codeready-toolchain/crw-multicluster-redirector[GitHub repository].

[IMPORTANT]
====
The multi-cluster architecture of link:https://workspaces.openshift.com/[workspaces.openshift.com] is part of the link:https://developers.redhat.com/developer-sandbox[Developer Sandbox].
It is a Developer Sandbox-specific solution
that cannot be reused as-is in other environments.
However, you can use it as a reference for implementing a similar solution well-tailored to your specific multi-cluster needs.
====
