A fully containerized Slurm cluster using Docker Compose - complete with controller, compute nodes, accounting (SlurmDBD + MariaDB), and REST API support

Slurm Cluster in Docker

This project sets up a complete Slurm cluster using Docker containers for local development, experimentation, and testing purposes.

This setup has been tested on Ubuntu 20.04.6.

Components

The project structure looks like this:

slurm-docker-cluster/
├── Dockerfile
├── entrypoint.sh
├── slurm.conf
├── slurmdbd.conf
├── munge.key
├── docker-compose.yml

The Slurm cluster consists of:

  • 1 controller node (slurmctld)
  • 5 compute nodes (slurmd)
  • 1 SlurmDBD node (slurmdbd)
  • 1 MariaDB node for accounting backend
  • 1 REST API node (slurmrestd) to interact with the cluster via REST

The /shared directory is a shared volume mounted across all nodes in the Slurm cluster. It is used to share configuration files, binaries, and other data that need to be accessible from multiple nodes.
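
Once the containers are up (see Build and Launch below), a quick way to confirm the volume really is shared is to write a file from one node and read it from another, using the container names that appear throughout this guide:

docker exec slurm-controller touch /shared/volume-check
docker exec compute1 ls -l /shared/volume-check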

Authentication

MUNGE is a lightweight authentication service used by Slurm to securely verify users across nodes. All nodes in the cluster need to share the same MUNGE key (usually at /etc/munge/munge.key). It ensures that jobs submitted from one node are trusted and accepted by the controller.

Install the munge package on the host:

sudo apt update
sudo apt install munge

Generate a munge key:

cd slurm-docker-cluster/
sudo ./create-munge-key

Copy the key to the current project directory:

sudo cp /etc/munge/munge.key ./munge.key

Set the correct ownership for munge.key:

sudo chown 999:999 munge.key
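
Once the cluster is running, you can verify that a MUNGE credential minted on one node decodes on another, for example by piping a credential from the controller into a compute node (container names as used later in this guide):

docker exec slurm-controller munge -n | docker exec -i compute1 unmunge

If the key is shared correctly, unmunge decodes the credential successfully.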

Build and Launch

Set the correct ownership and permissions for slurmdbd.conf:

sudo chown 999:999 slurmdbd.conf
sudo chmod 600 slurmdbd.conf

Build the Docker image:

docker build --build-arg SLURM_VERSION=24.11.3 -t slurm-base .

Start all the containers:

docker compose up -d
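
Confirm that all containers came up:

docker compose ps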

Open an interactive shell to the controller node:

docker exec -it slurm-controller bash

Display the current state of nodes and partitions in the cluster:

sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
debug*       up  1:00:00      2    idle   compute[1-2]
batch        up  1-00:00:00   2    idle   compute[3-4]
gpu          up  2-00:00:00   1    idle   compute5
all          up  infinite     5    idle   compute[1-5]
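
For more detail on a single node or partition, scontrol is useful, for example:

scontrol show node compute1
scontrol show partition debug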

Interactive Job

Open an interactive shell to the controller node:

docker exec -it slurm-controller bash

Request an interactive allocation:

salloc --partition=debug --nodes=2 --time=01:00:00

salloc: Granted job allocation 1
salloc: Nodes compute[1-2] are ready for job
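
A quick way to confirm which nodes are in the allocation (output order may vary):

srun hostname

compute1
compute2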

Create a Python script (hello.py):

nano hello.py
#!/usr/bin/env python3

import socket
print(f"Hello from {socket.gethostname()}")

This script prints the hostname of the node it's running on - a nice way to verify it's distributed correctly.

Make it executable:

chmod 755 hello.py

Distribute the script to all compute nodes using sbcast:

sbcast hello.py /tmp/hello.py

This sends your local hello.py file to /tmp/hello.py on both compute nodes, so each task can access it locally.

sbcast is much faster and more efficient than using scp or a shared filesystem for small files in a distributed job.
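
Before running it, you can confirm the broadcast reached every allocated node (this srun runs the check once per node):

srun ls -l /tmp/hello.py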

Run the script across all allocated nodes:

srun /tmp/hello.py

Hello from compute2
Hello from compute1

Batch Job

Create a job script hello_job.sh:

nano hello_job.sh
#!/bin/bash
#SBATCH --job-name=hello_job
#SBATCH --output=hello_output.txt
#SBATCH --ntasks=1
#SBATCH --time=00:01:00
#SBATCH --partition=debug

echo "Hello from $(hostname)"

A job script is a shell script whose #SBATCH directives tell Slurm what resources the batch job needs.

Submit it with sbatch:

sbatch hello_job.sh

Submitted batch job 25
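
While the job is still queued or running, you can watch it with squeue (a short job like this one may already have finished by the time you look):

squeue -j 25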

You can check your job status with:

sacct -j 25 --format=JobID,JobName,State,ExitCode

JobID           JobName      State ExitCode
------------ ---------- ---------- --------
25            hello_job  COMPLETED      0:0
25.batch          batch  COMPLETED      0:0

The output file is written by the node that executes the job.

Open an interactive shell to compute1:

docker exec -it compute1 bash

And check the output file:

ls -l /root

-rw-r--r-- 1 root root 20 Apr 11 23:21 hello_output.txt
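
And print its contents (the hostname in the greeting is whichever compute node actually ran the job):

cat /root/hello_output.txt

Hello from compute1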

Job Enforcement

By default, Slurm allocates resources like CPUs and memory based on job requests but does not strictly prevent a job from exceeding these limits. This means a job can potentially use more CPUs or memory than requested if the system allows it, which can impact other jobs on the same node.

To ensure strict enforcement, administrators must enable and configure Linux control groups (cgroups) via slurm.conf and cgroup.conf. This allows Slurm to constrain CPU usage through cpusets and enforce memory limits, terminating jobs that exceed their allocations.

To enable cgroups, we need to edit the slurm.conf and ensure the following line is present:

TaskPlugin=task/cgroup

Create a file called /etc/slurm/cgroup.conf on all nodes (controller and compute), with content like this:

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=no
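
After restarting the Slurm daemons so they pick up the new configuration, you can confirm the plugin is active by dumping the running configuration:

scontrol show config | grep -i TaskPlugin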

Let's walk through an example of how cgroups work in practice with Slurm.

Open an interactive shell to compute1:

docker exec -it compute1 bash

Install htop and stress packages:

apt update && apt install htop stress -y

Run a stress test in the background that spawns 4 CPU workers:

stress --cpu 4 --timeout 30 &

Open htop to confirm four CPUs are busy:

htop

Open an interactive shell to the controller:

docker exec -it slurm-controller bash

And invoke the same stress test, but through Slurm:

sbatch --cpus-per-task=1 --wrap="stress --cpu 4 --timeout 30"

Slurm, with cgroup enforcement enabled, does the following:

  • Allocates only 1 CPU core to the job.
  • Creates a cpuset cgroup that limits which CPU(s) the job can use.
  • Even though stress spawns 4 worker processes, the cpuset restricts all of them to that single core.

So in htop, you'll still see four stress processes, but together they consume only one core's worth of CPU.

The extra workers are throttled by the cgroup constraint, time-sharing the single allocated core.
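
You can also see the constraint from inside compute1 while the Slurm-launched stress job is running (assuming pgrep and taskset are available in the image):

taskset -cp $(pgrep -f 'stress --cpu' | head -n 1)

The reported affinity list should contain only the core granted by the cpuset.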

Slurm and MPI

MPI (Message Passing Interface) is a standardized and portable communication protocol used to program parallel applications that run across multiple nodes. It allows processes to communicate with one another by sending and receiving messages, making it ideal for high-performance computing (HPC) tasks.

Slurm can integrate with MPI to run distributed parallel applications across multiple nodes in a cluster. Slurm handles resource allocation and job scheduling, while MPI handles inter-process communication. The MpiDefault parameter in slurm.conf tells Slurm which MPI plugin to use by default when launching jobs with srun:

Value         Description
none          No special support for MPI; Slurm will not handle MPI-specific startup tasks.
openmpi       Legacy OpenMPI support (rarely needed with newer versions).
pmi2          Use the PMI2 interface (common with OpenMPI and MPICH).
hydra         For Intel MPI or MPICH with the Hydra process manager.
cray_shasta   Special plugin for Cray Shasta systems.
pmix          Use the PMIx interface (more scalable and modern).
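
For example, to make PMIx the default for this cluster, slurm.conf could contain the following line (assuming the Slurm build includes PMIx support; pmi2 is a common alternative):

MpiDefault=pmix

Individual jobs can still override the default with srun's --mpi option.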

OpenMPI is a popular open-source implementation of the MPI standard, providing tools and libraries that support a variety of platforms and interconnects. It is widely used in research and industry for building scalable applications that require efficient communication among distributed processes. Let's walk through an example to demonstrate how OpenMPI can be used within a Slurm-managed environment.

Open an interactive shell to the head node:

docker exec -it slurm-controller bash

From the debug partition, request two nodes for one hour:

salloc --partition=debug --nodes=2 --time=01:00:00 --job-name=mpi-testing

salloc: Granted job allocation 1
salloc: Nodes compute[1-2] are ready for job

Install OpenMPI packages on all reserved compute nodes:

srun bash -c 'apt-get update && apt install openmpi-bin openmpi-common libopenmpi-dev -y'

Open a shell on compute1:

srun --nodelist=compute1 --pty bash

Go to the shared folder that is accessible across the entire Slurm cluster:

cd /shared

Create a hello_mpi.c file:

nano hello_mpi.c

With this content:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    int node, total;

    MPI_Init(&argc, &argv);                  /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &node);    /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &total);   /* total number of ranks */
    printf("Hello World from Node %d of %d!\n", node, total);
    MPI_Finalize();                          /* shut MPI down cleanly */
    return 0;
}

Compile the MPI program:

mpicc hello_mpi.c -o hello_mpi

This produces an executable called hello_mpi.

Return to the controller node:

exit

Instead of using mpirun, Slurm recommends using srun to launch MPI programs. It enables better job tracking, process binding, and scalability through direct integration with process management interfaces like PMI and PMIx.
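
Before launching, you can check which MPI plugin types this Slurm build supports (and, if needed, pick one explicitly with srun --mpi=<type>):

srun --mpi=list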

Invoke the hello_mpi program:

srun /shared/hello_mpi

Hello World from Node 1 of 2!
Hello World from Node 0 of 2!

Slurm REST

We are exposing slurmrestd on port 6820, so REST requests should go to:

http://localhost:6820

We must generate a JWT token for the REST API:

docker exec -it slurmrestd bash
/usr/bin/scontrol token username=root lifespan=31536000

The lifespan is given in seconds; here we set it to one year:

365 days/year × 24 hours/day × 60 minutes/hour × 60 seconds/minute = 31,536,000 seconds

Then you can send a REST request from the host such as:

curl http://localhost:6820/slurm/v0.0.40/nodes \
-H "Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE3NzU3MDUwMDksImlhdCI6MTc0NDE2OTAwOSwic3VuIjoicm9vdCJ9.gI-Ij2ZIOYlm4mCoKZVYWExRKJc8G6sXJeiqxnXAkFk"
