
Commit 673567b

Add support for ParallelCluster 3.13.0 (#316)
Update xio and xwo controller and worker name tags. Update Exostellar documentation. Resolves #314
1 parent ea72172 commit 673567b

7 files changed (+133, -15 lines)

docs/debug.md

Lines changed: 22 additions & 0 deletions
@@ -2,6 +2,28 @@
 
 For ParallelCluster and Slurm issues, refer to the official [AWS ParallelCluster Troubleshooting documentation](https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html).
 
+## Config stack deploys, but ParallelCluster stack doesn't
+
+This happens when the lambda function that creates the cluster encounters an error.
+This is usually some kind of configuration error that is detected by ParallelCluster.
+
+* Open the CloudWatch console and go to the log groups
+* Find the log group named /aws/lambda/*-CreateParallelCluster
+* Look for the error
+
+## ParallelCluster stack creation fails
+
+### HeadNodeWaitCondition failed to create
+
+If the stack fails with an error like:
+
+```The following resource(s) failed to create
+[HeadNodeWaitCondition2025050101134602]```
+
+Connect to the head node and look in `/var/log/ansible.log` for errors.
+
+If it shows that it failed waiting for slurmctld to accept requests then check `/var/log/slurmctld.log` for errors.
+
 ## Slurm Head Node
 
 If slurm commands hang, then it's likely a problem with the Slurm controller.
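The lambda log lookup described in the added section can also be done from a terminal; a minimal sketch using the AWS CLI, assuming credentials and region are configured and using a hypothetical stack name prefix:

```
# Find the CreateParallelCluster lambda log group (stack name prefix is a placeholder)
aws logs describe-log-groups --log-group-name-prefix /aws/lambda/ \
    --query "logGroups[?contains(logGroupName, 'CreateParallelCluster')].logGroupName" --output text

# Search recent events for errors (log group name is a placeholder)
aws logs filter-log-events \
    --log-group-name /aws/lambda/my-cluster-CreateParallelCluster \
    --filter-pattern ERROR --max-items 20
```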

docs/deployment-prerequisites.md

Lines changed: 14 additions & 7 deletions
@@ -44,11 +44,11 @@ Simply install the newer version and then use it to create and activate a virtua
 ```
 $ python3 --version
 Python 3.6.8
-$ yum -y install python3.11
-$ python3.11 -m venv ~/.venv-python3.11
-$ source ~/.venv-python3.11/bin/activate
+$ yum -y install python3.12
+$ python3.12 -m venv ~/.venv-python3.12
+$ source ~/.venv-python3.12/bin/activate
 $ python3 --version
-Python 3.11.5
+Python 3.12.8
 ```
 
 ## Make sure required packages are installed
@@ -81,12 +81,12 @@ Follow the instructions for Python.
 
 [https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_prerequisites](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_prerequisites)
 
-Note that CDK requires a pretty new version of nodejs which you may have to download from, for example, [https://nodejs.org/dist/v16.13.1/node-v16.13.1-linux-x64.tar.xz](https://nodejs.org/dist/v16.13.1/node-v16.13.1-linux-x64.tar.xz)
+Note that CDK requires a recent version of nodejs, which you may have to download from, for example, [https://nodejs.org/dist/v20.19.0/node-v20.19.0-linux-x64.tar.xz](https://nodejs.org/dist/v20.19.0/node-v20.19.0-linux-x64.tar.xz)
 
 ```
 sudo yum -y install wget
-wget https://nodejs.org/dist/v16.13.1/node-v16.13.1-linux-x64.tar.xz
-tar -xf node-v16.13.1-linux-x64.tar.xz ~
+wget https://nodejs.org/dist/v20.19.0/node-v20.19.0-linux-x64.tar.xz
+tar -xf node-v20.19.0-linux-x64.tar.xz -C ~
 ```
 
 Add the nodejs bin directory to your path.
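A minimal sketch of putting that bin directory on PATH, assuming the tarball was extracted into your home directory as above:

```
# Assumes node was extracted to ~/node-v20.19.0-linux-x64
export PATH=~/node-v20.19.0-linux-x64/bin:$PATH
node --version   # should report v20.19.0

# Optionally persist it for future shells
echo 'export PATH=~/node-v20.19.0-linux-x64/bin:$PATH' >> ~/.bashrc
```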
@@ -144,6 +144,13 @@ Follow the directions in this [ParallelCluster tutorial to configure slurm accou
 The recommended Slurm architecture is to have a shared slurmdbd daemon that is used by all of the clusters.
 Starting in version 3.10.0, ParallelCluster supports specifying an external slurmdbd instance when you create a cluster and provides a CloudFormation template to create it.
 
+**Note**: The Slurm version used by slurmdbd must be greater than or equal to the version used by your clusters.
+If you have already deployed a slurmdbd instance then you will need to create a new slurmdbd
+instance with the latest version of ParallelCluster.
+Also note that Slurm only maintains backwards compatibility for the 2 previous major releases, so
+at some point you will need to upgrade your clusters to newer versions before you can use the latest version
+of ParallelCluster.
+
 Follow the directions in this [ParallelCluster tutorial to configure slurmdbd](https://docs.aws.amazon.com/parallelcluster/latest/ug/external-slurmdb-accounting.html#external-slurmdb-accounting-step1).
 This requires that you have already created the slurm database.
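A quick way to check the compatibility rule in the added note is to compare Slurm versions on the slurmdbd instance and on a cluster head node; a minimal sketch, with hypothetical host names:

```
# On the external slurmdbd instance (placeholder host name)
ssh slurmdbd-host 'slurmdbd -V'

# On a cluster head node (placeholder host name)
ssh head-node 'sinfo --version'

# The slurmdbd version must be >= every cluster's Slurm version,
# and Slurm only guarantees compatibility across two major releases.
```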

docs/exostellar-infrastructure-optimizer.md

Lines changed: 39 additions & 1 deletion
@@ -663,6 +663,21 @@ srun --pty -p xio-
 
 ## Debug
 
+### How to connect to EMS
+
+Use ssh to connect to the EMS using your EC2 keypair.
+
+* `ssh-add private-key.pem`
+* `ssh -A rocky@${EMS_IP_ADDRESS}`
+
+You can [install the aws-ssm-agent](https://docs.aws.amazon.com/systems-manager/latest/userguide/agent-install-rocky.html) so that you can connect from the EC2 console using SSM.
+
+### How to connect to Controller
+
+* First ssh to the EMS.
+* Get the IP address of the controller from the EC2 console.
+* As root, ssh to the controller.
+
 ### UpdateHeadNode resource failed
 
 If the UpdateHeadNode resource fails then it is usually because a task in the ansible script failed.
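The "How to connect" steps added above, end to end, look roughly like the following sketch; the IP addresses are placeholders and the `sudo -i` step is one assumed way of becoming root on the EMS:

```
# Load the EC2 key pair and connect to the EMS with agent forwarding
ssh-add private-key.pem
ssh -A rocky@203.0.113.10      # EMS_IP_ADDRESS placeholder

# From the EMS, become root and ssh to the controller
# (controller IP placeholder, taken from the EC2 console)
sudo -i
ssh 10.0.1.25
```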
@@ -676,6 +691,10 @@ When this happens the CloudFormation stack will usually be in UPDATE_ROLLBACK_FA
 Before you can update it again you will need to complete the rollback.
 Go to Stack Actions, select `Continue update rollback`, expand `Advanced troubleshooting`, check the UpdateHeadNode resource, and click `Continue update rollback`.
 
+The problem is usually that there is an XWO controller running that is preventing updates to
+the profile.
+Cancel any XWO jobs and terminate any running workers and controllers and verify that all of the XWO profiles are idle.
+
 ### XIO Controller not starting
 
 On EMS, check that a job is running to create the controller.
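The rollback and job-cancelling steps described above can also be driven from the command line; a minimal sketch, assuming a stack named `my-cluster` and an XIO partition named `xio-amd-64g-4c` (both placeholders):

```
# Finish the stuck rollback, skipping the failed UpdateHeadNode resource
aws cloudformation continue-update-rollback \
    --stack-name my-cluster \
    --resources-to-skip UpdateHeadNode

# Cancel any jobs still queued or running on the XIO partition (placeholder name)
squeue --partition xio-amd-64g-4c --noheader --format '%A' | xargs -r scancel
```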
@@ -686,7 +705,7 @@ On EMS, check the autoscaling log to see if there are errors starting the instan
 
 `less /var/log/slurm/autoscaling.log`
 
-EMS Slurm partions are at:
+EMS Slurm partitions are at:
 
 `/xcompute/slurm/bin/partitions.json`

@@ -696,4 +715,23 @@ They are derived from the partition and pool names.
 
 ### VM not starting on worker
 
+Connect to the controller instance and run the following command to get a list of worker instances and VMs.
+
+```
+xspot ps
+```
+
+Connect to the worker VM using the following command.
+
+```
+xspot console vm-abcd
+```
+
+This will show the console logs.
+If you configured the root password then you can log in as root to do further debugging.
+
 ### VM not starting Slurm job
+
+Connect to the VM as above.
+
+Check `/var/log/slurmd.log` for errors.
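Once logged in on the VM, a minimal sketch of checking that slurmd is healthy; this assumes a systemd-based image, so the exact service name may differ:

```
# Inside the worker VM, check the slurmd service and recent errors
systemctl status slurmd
grep -iE 'error|fail' /var/log/slurmd.log | tail -n 20
```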

docs/exostellar-workload-optimizer.md

Lines changed: 39 additions & 1 deletion
@@ -195,6 +195,21 @@ srun --pty -p xwo-amd-64g-4c hostname
 
 ## Debug
 
+### How to connect to EMS
+
+Use ssh to connect to the EMS using your EC2 keypair.
+
+* `ssh-add private-key.pem`
+* `ssh -A rocky@${EMS_IP_ADDRESS}`
+
+You can [install the aws-ssm-agent](https://docs.aws.amazon.com/systems-manager/latest/userguide/agent-install-rocky.html) so that you can connect from the EC2 console using SSM.
+
+### How to connect to Controller
+
+* First ssh to the EMS.
+* Get the IP address of the controller from the EC2 console.
+* As root, ssh to the controller.
+
 ### UpdateHeadNode resource failed
 
 If the UpdateHeadNode resource fails then it is usually because a task in the ansible script failed.
@@ -208,6 +223,10 @@ When this happens the CloudFormation stack will usually be in UPDATE_ROLLBACK_FA
 Before you can update it again you will need to complete the rollback.
 Go to Stack Actions, select `Continue update rollback`, expand `Advanced troubleshooting`, check the UpdateHeadNode resource, and click `Continue update rollback`.
 
+The problem is usually that there is an XWO controller running that is preventing updates to
+the profile.
+Cancel any XWO jobs and terminate any running workers and controllers and verify that all of the XWO profiles are idle.
+
 ### XWO Controller not starting
 
 If a controller doesn't start, then the first thing to check is to make sure that the
@@ -227,7 +246,7 @@ On EMS, check the autoscaling log to see if there are errors starting the instan
 
 `less /var/log/slurm/autoscaling.log`
 
-EMS Slurm partions are at:
+EMS Slurm partitions are at:
 
 `/xcompute/slurm/bin/partitions.json`
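A minimal sketch of inspecting those two files on the EMS; it assumes `jq` is installed there:

```
# Scan the EMS autoscaling log for recent errors
grep -iE 'error|fail' /var/log/slurm/autoscaling.log | tail -n 20

# List the partition and pool definitions the EMS knows about
jq . /xcompute/slurm/bin/partitions.json
```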

@@ -237,4 +256,23 @@ They are derived from the partition and pool names.
 
 ### VM not starting on worker
 
+Connect to the controller instance and run the following command to get a list of worker instances and VMs.
+
+```
+xspot ps
+```
+
+Connect to the worker VM using the following command.
+
+```
+xspot console vm-abcd
+```
+
+This will show the console logs.
+If you configured the root password then you can log in as root to do further debugging.
+
 ### VM not starting Slurm job
+
+Connect to the VM as above.
+
+Check `/var/log/slurmd.log` for errors.
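From the cluster head node you can also check whether Slurm ever registered the node before digging into the VM; a minimal sketch with a placeholder node name:

```
# Show why nodes are down or drained, if Slurm has marked them
sinfo -R

# Inspect the state and reason for a specific XWO node (placeholder name)
scontrol show node xwo-amd-64g-4c-1
```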

source/cdk/config_schema.py

Lines changed: 15 additions & 2 deletions
@@ -99,6 +99,9 @@
 # * Upgrade libjwt to version 1.17.0.
 # 3.12.0:
 # * OpenZFS security group requirements fixed.
+# 3.13.0:
+# * Upgrade Slurm to 24.05.07
+# * Upgrade to Python 3.12.8
 MIN_PARALLEL_CLUSTER_VERSION = parse_version('3.6.0')
 # Update source/resources/default_config.yml with latest version when this is updated.
 PARALLEL_CLUSTER_VERSIONS = [
@@ -117,18 +120,21 @@
     '3.11.0',
     '3.11.1',
     '3.12.0',
+    '3.13.0',
 ]
 PARALLEL_CLUSTER_ENROOT_VERSIONS = {
     # This can be found on the head node by running 'yum info enroot'
     '3.11.0': '3.4.1', # confirmed
     '3.11.1': '3.4.1', # confirmed
     '3.12.0': '3.4.1', # confirmed
+    '3.13.0': '3.4.1', # confirmed
 }
 PARALLEL_CLUSTER_PYXIS_VERSIONS = {
     # This can be found on the head node at /opt/parallelcluster/sources
     '3.11.0': '0.20.0', # confirmed
     '3.11.1': '0.20.0', # confirmed
     '3.12.0': '0.20.0', # confirmed
+    '3.13.0': '0.20.0', # confirmed
 }
 PARALLEL_CLUSTER_MUNGE_VERSIONS = {
     # This can be found on the head node at /opt/parallelcluster/sources
@@ -148,6 +154,7 @@
     '3.11.0': '0.5.16', # confirmed
     '3.11.1': '0.5.16', # confirmed
     '3.12.0': '0.5.16', # confirmed
+    '3.13.0': '0.5.16', # confirmed
 }
 PARALLEL_CLUSTER_PYTHON_VERSIONS = {
     # This can be found on the head node at /opt/parallelcluster/pyenv/versions
@@ -166,6 +173,7 @@
     '3.11.0': '3.9.20', # confirmed
     '3.11.1': '3.9.20', # confirmed
     '3.12.0': '3.9.20', # confirmed
+    '3.13.0': '3.12.0', # confirmed
 }
 PARALLEL_CLUSTER_SLURM_VERSIONS = {
     # This can be found on the head node at /etc/chef/local-mode-cache/cache/
@@ -184,6 +192,7 @@
     '3.11.0': '23.11.10', # confirmed
     '3.11.1': '23.11.10', # confirmed
     '3.12.0': '23.11.10', # confirmed
+    '3.13.0': '24.05.7', # confirmed
 }
 PARALLEL_CLUSTER_PC_SLURM_VERSIONS = {
     # This can be found on the head node at /etc/chef/local-mode-cache/cache/
@@ -202,6 +211,7 @@
     '3.11.0': '23-11-10-1', # confirmed
     '3.11.1': '23-11-10-1', # confirmed
     '3.12.0': '23-11-10-1', # confirmed
+    '3.13.0': '24-05-7-1', # confirmed
 }
 SLURM_REST_API_VERSIONS = {
     '23-02-2-1': '0.0.39',
@@ -213,6 +223,7 @@
     '23-11-4-1': '0.0.39',
     '23-11-7-1': '0.0.39',
     '23-11-10-1': '0.0.39',
+    '24-05-7-1': '0.0.39',
 }
 
 def get_parallel_cluster_version(config):
@@ -376,9 +387,11 @@ def PARALLEL_CLUSTER_REQUIRES_FSXZ_OUTBOUND_SG_RULES(parallel_cluster_version):
 
 # Controller needs at least 4 GB or will hit OOM
 
-DEFAULT_ARM_CONTROLLER_INSTANCE_TYPE = 'c6g.large'
+# Head node needs at least 13.8 GB
+DEFAULT_ARM_CONTROLLER_INSTANCE_TYPE = 'm6g.xlarge'
 
-DEFAULT_X86_CONTROLLER_INSTANCE_TYPE = 'c6a.large'
+# Head node needs at least 13.8 GB
+DEFAULT_X86_CONTROLLER_INSTANCE_TYPE = 'm6a.xlarge'
 
 def default_controller_instance_type(config):
     architecture = config['slurm']['ParallelClusterConfig'].get('Architecture', DEFAULT_ARCHITECTURE)
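The "# confirmed" comments in these tables point at where each bundled version can be verified; a minimal sketch of collecting them, assuming you are logged in to a 3.13.0 head node:

```
# Versions bundled by ParallelCluster, checked from a running head node
yum info enroot | grep Version              # enroot version
ls /opt/parallelcluster/sources             # pyxis and munge tarballs
ls /opt/parallelcluster/pyenv/versions      # Python version used by ParallelCluster
sinfo --version                             # Slurm version
```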

source/resources/playbooks/roles/exostellar_infrastructure_optimizer/files/opt/slurm/etc/exostellar/configure_xio.py

Lines changed: 2 additions & 2 deletions
@@ -202,7 +202,7 @@ def configure_profile(self, profile_config, template_profile_config):
         # Set profile specific fields from the config
         profile['ProfileName'] = profile_name
         profile['NodeGroupName'] = profile_name
-        name_tag = f"xspot-controller-{profile_name}"
+        name_tag = f"xio-controller-{profile_name}"
         name_tag_found = False
         for tag_dict in profile['Controller']['InstanceTags']:
             if tag_dict['Key'] == 'Name':
@@ -221,7 +221,7 @@ def configure_profile(self, profile_config, template_profile_config):
         for spot_fleet_type in profile_config['SpotFleetTypes']:
             profile['Worker']['SpotFleetTypes'].append(spot_fleet_type)
         name_tag_found = False
-        name_tag = f"xspot-worker-{profile_name}"
+        name_tag = f"xio-worker-{profile_name}"
         for tag_dict in profile['Worker']['InstanceTags']:
             if tag_dict['Key'] == 'Name':
                 name_tag_found = True

source/resources/playbooks/roles/exostellar_workload_optimizer/files/opt/slurm/etc/exostellar/configure_xwo.py

Lines changed: 2 additions & 2 deletions
@@ -201,7 +201,7 @@ def configure_profile(self, profile_name, profile_config, template_profile_confi
         # Set profile specific fields from the config
         profile['ProfileName'] = profile_name
         profile['NodeGroupName'] = profile_name
-        name_tag = f"xspot-controller-{profile_name}"
+        name_tag = f"xwo-controller-{profile_name}"
         name_tag_found = False
         for tag_dict in profile['Controller']['InstanceTags']:
             if tag_dict['Key'] == 'Name':
@@ -220,7 +220,7 @@ def configure_profile(self, profile_name, profile_config, template_profile_confi
         for spot_fleet_type in profile_config['SpotFleetTypes']:
             profile['Worker']['SpotFleetTypes'].append(spot_fleet_type)
         name_tag_found = False
-        name_tag = f"xspot-worker-{profile_name}"
+        name_tag = f"xwo-worker-{profile_name}"
         for tag_dict in profile['Worker']['InstanceTags']:
             if tag_dict['Key'] == 'Name':
                 name_tag_found = True
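The renamed Name tags from these two files can be checked after the next controller or worker launch; a minimal sketch using the AWS CLI, assuming credentials and a default region are configured:

```
# List XIO/XWO controller and worker instances by their new Name tags
aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=xio-controller-*,xio-worker-*,xwo-controller-*,xwo-worker-*" \
              "Name=instance-state-name,Values=running" \
    --query "Reservations[].Instances[].[InstanceId,Tags[?Key=='Name']|[0].Value]" \
    --output table
```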
