
Commit 673567b

Add support for ParallelCluster 3.13.0 (#316)
Update xio and xwo controller and worker name tags. Update Exostellar documentation. Resolves #314
1 parent ea72172 commit 673567b

7 files changed (+133, -15 lines)

docs/debug.md

Lines changed: 22 additions & 0 deletions
@@ -2,6 +2,28 @@
 
 For ParallelCluster and Slurm issues, refer to the official [AWS ParallelCluster Troubleshooting documentation](https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html).
 
+## Config stack deploys, but ParallelCluster stack doesn't
+
+This happens when the lambda function that creates the cluster encounters an error.
+This is usually some kind of configuration error that is detected by ParallelCluster.
+
+* Open the CloudWatch console and go to the log groups
+* Find the log group named /aws/lambda/*-CreateParallelCluster
+* Look for the error
+
+## ParallelCluster stack creation fails
+
+### HeadNodeWaitCondition failed to create
+
+If the stack fails with an error like:
+
+```The following resource(s) failed to create
+[HeadNodeWaitCondition2025050101134602]```
+
+Connect to the head node and look in `/var/log/ansible.log` for errors.
+
+If it shows that it failed waiting for slurmctld to accept requests then check `/var/log/slurmctld.log` for errors.
+
 ## Slurm Head Node
 
 If slurm commands hang, then it's likely a problem with the Slurm controller.
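The lambda log lookup described in the added section can also be done from a terminal; a minimal sketch using the AWS CLI, assuming credentials and region are configured and using a hypothetical stack name prefix:

```
# Find the CreateParallelCluster lambda log group (stack name prefix is a placeholder)
aws logs describe-log-groups --log-group-name-prefix /aws/lambda/ \
    --query "logGroups[?contains(logGroupName, 'CreateParallelCluster')].logGroupName" --output text

# Search recent events for errors (log group name is a placeholder)
aws logs filter-log-events \
    --log-group-name /aws/lambda/my-cluster-CreateParallelCluster \
    --filter-pattern ERROR --max-items 20
```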

docs/deployment-prerequisites.md

Lines changed: 14 additions & 7 deletions
@@ -44,11 +44,11 @@ Simply install the newer version and then use it to create and activate a virtua
 ```
 $ python3 --version
 Python 3.6.8
-$ yum -y install python3.11
-$ python3.11 -m venv ~/.venv-python3.11
-$ source ~/.venv-python3.11/bin/activate
+$ yum -y install python3.12
+$ python3.12 -m venv ~/.venv-python3.12
+$ source ~/.venv-python3.12/bin/activate
 $ python3 --version
-Python 3.11.5
+Python 3.12.8
 ```
 
 ## Make sure required packages are installed
@@ -81,12 +81,12 @@ Follow the instructions for Python.
 
 [https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_prerequisites](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_prerequisites)
 
-Note that CDK requires a pretty new version of nodejs which you may have to download from, for example, [https://nodejs.org/dist/v16.13.1/node-v16.13.1-linux-x64.tar.xz](https://nodejs.org/dist/v16.13.1/node-v16.13.1-linux-x64.tar.xz)
+Note that CDK requires a recent version of nodejs, which you may have to download from, for example, [https://nodejs.org/dist/v20.19.0/node-v20.19.0-linux-x64.tar.xz](https://nodejs.org/dist/v20.19.0/node-v20.19.0-linux-x64.tar.xz)
 
 ```
 sudo yum -y install wget
-wget https://nodejs.org/dist/v16.13.1/node-v16.13.1-linux-x64.tar.xz
-tar -xf node-v16.13.1-linux-x64.tar.xz ~
+wget https://nodejs.org/dist/v20.19.0/node-v20.19.0-linux-x64.tar.xz
+tar -xf node-v20.19.0-linux-x64.tar.xz -C ~
 ```
 
 Add the nodejs bin directory to your path.
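A minimal sketch of putting that bin directory on PATH, assuming the tarball was extracted into your home directory as above:

```
# Assumes node was extracted to ~/node-v20.19.0-linux-x64
export PATH=~/node-v20.19.0-linux-x64/bin:$PATH
node --version   # should report v20.19.0

# Optionally persist it for future shells
echo 'export PATH=~/node-v20.19.0-linux-x64/bin:$PATH' >> ~/.bashrc
```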
@@ -144,6 +144,13 @@ Follow the directions in this [ParallelCluster tutorial to configure slurm accou
 The recommended Slurm architecture is to have a shared slurmdbd daemon that is used by all of the clusters.
 Starting in version 3.10.0, ParallelCluster supports specifying an external slurmdbd instance when you create a cluster and provides a CloudFormation template to create it.
 
+**Note**: The Slurm version used by slurmdbd must be greater than or equal to the version used by your clusters.
+If you have already deployed a slurmdbd instance then you will need to create a new slurmdbd
+instance with the latest version of ParallelCluster.
+Also note that Slurm only maintains backwards compatibility for the 2 previous major releases, so
+at some point you will need to upgrade your clusters to newer versions before you can use the latest version
+of ParallelCluster.
+
 Follow the directions in this [ParallelCluster tutorial to configure slurmdbd](https://docs.aws.amazon.com/parallelcluster/latest/ug/external-slurmdb-accounting.html#external-slurmdb-accounting-step1).
 This requires that you have already created the slurm database.
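A quick way to check the compatibility rule in the added note is to compare Slurm versions on the slurmdbd instance and on a cluster head node; a minimal sketch, with hypothetical host names:

```
# On the external slurmdbd instance (placeholder host name)
ssh slurmdbd-host 'slurmdbd -V'

# On a cluster head node (placeholder host name)
ssh head-node 'sinfo --version'

# The slurmdbd version must be >= every cluster's Slurm version,
# and Slurm only guarantees compatibility across two major releases.
```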

docs/exostellar-infrastructure-optimizer.md

Lines changed: 39 additions & 1 deletion
@@ -663,6 +663,21 @@ srun --pty -p xio-
 
 ## Debug
 
+### How to connect to EMS
+
+Use ssh to connect to the EMS using your EC2 keypair.
+
+* `ssh-add private-key.pem`
+* `ssh -A rocky@${EMS_IP_ADDRESS}`
+
+You can [install the aws-ssm-agent](https://docs.aws.amazon.com/systems-manager/latest/userguide/agent-install-rocky.html) so that you can connect from the EC2 console using SSM.
+
+### How to connect to Controller
+
+* First ssh to the EMS.
+* Get the IP address of the controller from the EC2 console.
+* As root, ssh to the controller.
+
 ### UpdateHeadNode resource failed
 
 If the UpdateHeadNode resource fails then it is usually because a task in the ansible script failed.
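The "How to connect" steps added above, end to end, look roughly like the following sketch; the IP addresses are placeholders and the `sudo -i` step is one assumed way of becoming root on the EMS:

```
# Load the EC2 key pair and connect to the EMS with agent forwarding
ssh-add private-key.pem
ssh -A rocky@203.0.113.10      # EMS_IP_ADDRESS placeholder

# From the EMS, become root and ssh to the controller
# (controller IP placeholder, taken from the EC2 console)
sudo -i
ssh 10.0.1.25
```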
@@ -676,6 +691,10 @@ When this happens the CloudFormation stack will usually be in UPDATE_ROLLBACK_FA
 Before you can update it again you will need to complete the rollback.
 Go to Stack Actions, select `Continue update rollback`, expand `Advanced troubleshooting`, check the UpdateHeadNode resource, and click `Continue update rollback`.
 
+The problem is usually that there is an XWO controller running that is preventing updates to
+the profile.
+Cancel any XWO jobs and terminate any running workers and controllers and verify that all of the XWO profiles are idle.
+
 ### XIO Controller not starting
 
 On EMS, check that a job is running to create the controller.
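The rollback and job-cancelling steps described above can also be driven from the command line; a minimal sketch, assuming a stack named `my-cluster` and an XIO partition named `xio-amd-64g-4c` (both placeholders):

```
# Finish the stuck rollback, skipping the failed UpdateHeadNode resource
aws cloudformation continue-update-rollback \
    --stack-name my-cluster \
    --resources-to-skip UpdateHeadNode

# Cancel any jobs still queued or running on the XIO partition (placeholder name)
squeue --partition xio-amd-64g-4c --noheader --format '%A' | xargs -r scancel
```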
@@ -686,7 +705,7 @@ On EMS, check the autoscaling log to see if there are errors starting the instan
 
 `less /var/log/slurm/autoscaling.log`
 
-EMS Slurm partions are at:
+EMS Slurm partitions are at:
 
 `/xcompute/slurm/bin/partitions.json`

@@ -696,4 +715,23 @@ They are derived from the partition and pool names.
 
 ### VM not starting on worker
 
+Connect to the controller instance and run the following command to get a list of worker instances and VMs.
+
+```
+xspot ps
+```
+
+Connect to the worker VM using the following command.
+
+```
+xspot console vm-abcd
+```
+
+This will show the console logs.
+If you configured the root password then you can log in as root to do further debugging.
+
 ### VM not starting Slurm job
+
+Connect to the VM as above.
+
+Check `/var/log/slurmd.log` for errors.
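Once logged in on the VM, a minimal sketch of checking that slurmd is healthy; this assumes a systemd-based image, so the exact service name may differ:

```
# Inside the worker VM, check the slurmd service and recent errors
systemctl status slurmd
grep -iE 'error|fail' /var/log/slurmd.log | tail -n 20
```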

docs/exostellar-workload-optimizer.md

Lines changed: 39 additions & 1 deletion
@@ -195,6 +195,21 @@ srun --pty -p xwo-amd-64g-4c hostname
 
 ## Debug
 
+### How to connect to EMS
+
+Use ssh to connect to the EMS using your EC2 keypair.
+
+* `ssh-add private-key.pem`
+* `ssh -A rocky@${EMS_IP_ADDRESS}`
+
+You can [install the aws-ssm-agent](https://docs.aws.amazon.com/systems-manager/latest/userguide/agent-install-rocky.html) so that you can connect from the EC2 console using SSM.
+
+### How to connect to Controller
+
+* First ssh to the EMS.
+* Get the IP address of the controller from the EC2 console.
+* As root, ssh to the controller.
+
 ### UpdateHeadNode resource failed
 
 If the UpdateHeadNode resource fails then it is usually because a task in the ansible script failed.
@@ -208,6 +223,10 @@ When this happens the CloudFormation stack will usually be in UPDATE_ROLLBACK_FA
 Before you can update it again you will need to complete the rollback.
 Go to Stack Actions, select `Continue update rollback`, expand `Advanced troubleshooting`, check the UpdateHeadNode resource, and click `Continue update rollback`.
 
+The problem is usually that there is an XWO controller running that is preventing updates to
+the profile.
+Cancel any XWO jobs and terminate any running workers and controllers and verify that all of the XWO profiles are idle.
+
 ### XWO Controller not starting
 
 If a controller doesn't start, then the first thing to check is to make sure that the
@@ -227,7 +246,7 @@ On EMS, check the autoscaling log to see if there are errors starting the instan
 
 `less /var/log/slurm/autoscaling.log`
 
-EMS Slurm partions are at:
+EMS Slurm partitions are at:
 
 `/xcompute/slurm/bin/partitions.json`
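A minimal sketch of inspecting those two files on the EMS; it assumes `jq` is installed there:

```
# Scan the EMS autoscaling log for recent errors
grep -iE 'error|fail' /var/log/slurm/autoscaling.log | tail -n 20

# List the partition and pool definitions the EMS knows about
jq . /xcompute/slurm/bin/partitions.json
```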

@@ -237,4 +256,23 @@ They are derived from the partition and pool names.
 
 ### VM not starting on worker
 
+Connect to the controller instance and run the following command to get a list of worker instances and VMs.
+
+```
+xspot ps
+```
+
+Connect to the worker VM using the following command.
+
+```
+xspot console vm-abcd
+```
+
+This will show the console logs.
+If you configured the root password then you can log in as root to do further debugging.
+
 ### VM not starting Slurm job
+
+Connect to the VM as above.
+
+Check `/var/log/slurmd.log` for errors.
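From the cluster head node you can also check whether Slurm ever registered the node before digging into the VM; a minimal sketch with a placeholder node name:

```
# Show why nodes are down or drained, if Slurm has marked them
sinfo -R

# Inspect the state and reason for a specific XWO node (placeholder name)
scontrol show node xwo-amd-64g-4c-1
```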

source/cdk/config_schema.py

Lines changed: 15 additions & 2 deletions
@@ -99,6 +99,9 @@
 # * Upgrade libjwt to version 1.17.0.
 # 3.12.0:
 # * OpenZFS security group requirements fixed.
+# 3.13.0:
+# * Upgrade Slurm to 24.05.07
+# * Upgrade to Python 3.12.8
 MIN_PARALLEL_CLUSTER_VERSION = parse_version('3.6.0')
 # Update source/resources/default_config.yml with latest version when this is updated.
 PARALLEL_CLUSTER_VERSIONS = [
@@ -117,18 +120,21 @@
     '3.11.0',
     '3.11.1',
     '3.12.0',
+    '3.13.0',
 ]
 PARALLEL_CLUSTER_ENROOT_VERSIONS = {
     # This can be found on the head node by running 'yum info enroot'
     '3.11.0': '3.4.1', # confirmed
     '3.11.1': '3.4.1', # confirmed
     '3.12.0': '3.4.1', # confirmed
+    '3.13.0': '3.4.1', # confirmed
 }
 PARALLEL_CLUSTER_PYXIS_VERSIONS = {
     # This can be found on the head node at /opt/parallelcluster/sources
     '3.11.0': '0.20.0', # confirmed
     '3.11.1': '0.20.0', # confirmed
     '3.12.0': '0.20.0', # confirmed
+    '3.13.0': '0.20.0', # confirmed
 }
 PARALLEL_CLUSTER_MUNGE_VERSIONS = {
     # This can be found on the head node at /opt/parallelcluster/sources
@@ -148,6 +154,7 @@
     '3.11.0': '0.5.16', # confirmed
     '3.11.1': '0.5.16', # confirmed
     '3.12.0': '0.5.16', # confirmed
+    '3.13.0': '0.5.16', # confirmed
 }
 PARALLEL_CLUSTER_PYTHON_VERSIONS = {
     # This can be found on the head node at /opt/parallelcluster/pyenv/versions
@@ -166,6 +173,7 @@
     '3.11.0': '3.9.20', # confirmed
     '3.11.1': '3.9.20', # confirmed
     '3.12.0': '3.9.20', # confirmed
+    '3.13.0': '3.12.0', # confirmed
 }
 PARALLEL_CLUSTER_SLURM_VERSIONS = {
     # This can be found on the head node at /etc/chef/local-mode-cache/cache/
@@ -184,6 +192,7 @@
     '3.11.0': '23.11.10', # confirmed
     '3.11.1': '23.11.10', # confirmed
     '3.12.0': '23.11.10', # confirmed
+    '3.13.0': '24.05.7', # confirmed
 }
 PARALLEL_CLUSTER_PC_SLURM_VERSIONS = {
     # This can be found on the head node at /etc/chef/local-mode-cache/cache/
@@ -202,6 +211,7 @@
     '3.11.0': '23-11-10-1', # confirmed
     '3.11.1': '23-11-10-1', # confirmed
     '3.12.0': '23-11-10-1', # confirmed
+    '3.13.0': '24-05-7-1', # confirmed
 }
 SLURM_REST_API_VERSIONS = {
     '23-02-2-1': '0.0.39',
@@ -213,6 +223,7 @@
     '23-11-4-1': '0.0.39',
     '23-11-7-1': '0.0.39',
     '23-11-10-1': '0.0.39',
+    '24-05-7-1': '0.0.39',
 }
 
 def get_parallel_cluster_version(config):
@@ -376,9 +387,11 @@ def PARALLEL_CLUSTER_REQUIRES_FSXZ_OUTBOUND_SG_RULES(parallel_cluster_version):
 
 # Controller needs at least 4 GB or will hit OOM
 
-DEFAULT_ARM_CONTROLLER_INSTANCE_TYPE = 'c6g.large'
+# Head node needs at least 13.8 GB
+DEFAULT_ARM_CONTROLLER_INSTANCE_TYPE = 'm6g.xlarge'
 
-DEFAULT_X86_CONTROLLER_INSTANCE_TYPE = 'c6a.large'
+# Head node needs at least 13.8 GB
+DEFAULT_X86_CONTROLLER_INSTANCE_TYPE = 'm6a.xlarge'
 
 def default_controller_instance_type(config):
     architecture = config['slurm']['ParallelClusterConfig'].get('Architecture', DEFAULT_ARCHITECTURE)
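The "# confirmed" comments in these tables point at where each bundled version can be verified; a minimal sketch of collecting them, assuming you are logged in to a 3.13.0 head node:

```
# Versions bundled by ParallelCluster, checked from a running head node
yum info enroot | grep Version              # enroot version
ls /opt/parallelcluster/sources             # pyxis and munge tarballs
ls /opt/parallelcluster/pyenv/versions      # Python version used by ParallelCluster
sinfo --version                             # Slurm version
```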

source/resources/playbooks/roles/exostellar_infrastructure_optimizer/files/opt/slurm/etc/exostellar/configure_xio.py

Lines changed: 2 additions & 2 deletions
@@ -202,7 +202,7 @@ def configure_profile(self, profile_config, template_profile_config):
         # Set profile specific fields from the config
         profile['ProfileName'] = profile_name
         profile['NodeGroupName'] = profile_name
-        name_tag = f"xspot-controller-{profile_name}"
+        name_tag = f"xio-controller-{profile_name}"
         name_tag_found = False
         for tag_dict in profile['Controller']['InstanceTags']:
             if tag_dict['Key'] == 'Name':
@@ -221,7 +221,7 @@ def configure_profile(self, profile_config, template_profile_config):
         for spot_fleet_type in profile_config['SpotFleetTypes']:
             profile['Worker']['SpotFleetTypes'].append(spot_fleet_type)
         name_tag_found = False
-        name_tag = f"xspot-worker-{profile_name}"
+        name_tag = f"xio-worker-{profile_name}"
         for tag_dict in profile['Worker']['InstanceTags']:
             if tag_dict['Key'] == 'Name':
                 name_tag_found = True

source/resources/playbooks/roles/exostellar_workload_optimizer/files/opt/slurm/etc/exostellar/configure_xwo.py

Lines changed: 2 additions & 2 deletions
@@ -201,7 +201,7 @@ def configure_profile(self, profile_name, profile_config, template_profile_confi
         # Set profile specific fields from the config
         profile['ProfileName'] = profile_name
         profile['NodeGroupName'] = profile_name
-        name_tag = f"xspot-controller-{profile_name}"
+        name_tag = f"xwo-controller-{profile_name}"
         name_tag_found = False
         for tag_dict in profile['Controller']['InstanceTags']:
             if tag_dict['Key'] == 'Name':
@@ -220,7 +220,7 @@ def configure_profile(self, profile_name, profile_config, template_profile_confi
         for spot_fleet_type in profile_config['SpotFleetTypes']:
             profile['Worker']['SpotFleetTypes'].append(spot_fleet_type)
         name_tag_found = False
-        name_tag = f"xspot-worker-{profile_name}"
+        name_tag = f"xwo-worker-{profile_name}"
         for tag_dict in profile['Worker']['InstanceTags']:
             if tag_dict['Key'] == 'Name':
                 name_tag_found = True
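The renamed Name tags from these two files can be checked after the next controller or worker launch; a minimal sketch using the AWS CLI, assuming credentials and a default region are configured:

```
# List XIO/XWO controller and worker instances by their new Name tags
aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=xio-controller-*,xio-worker-*,xwo-controller-*,xwo-worker-*" \
              "Name=instance-state-name,Values=running" \
    --query "Reservations[].Instances[].[InstanceId,Tags[?Key=='Name']|[0].Value]" \
    --output table
```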
