
SLURM Setup Guide

This document covers the SLURM installation and configuration for shared use of the cluster workstation.


System Overview

| Component | Specification |
|-----------|---------------|
| CPU | AMD Ryzen Threadripper PRO 9985WX, 64 cores / 128 threads @ up to 5.5 GHz |
| RAM | 256 GB (8× 32 GB DIMMs) |
| GPU | NVIDIA RTX PRO 6000 Blackwell, ~96 GB VRAM |
| Storage | Samsung 990 PRO 4 TB NVMe |
| OS | Ubuntu 24.04 |
| Hostname | node01 |

Prerequisites

Before starting, verify your hardware is detected correctly:

# CPU info
lscpu

# Memory
free -h

# GPU
nvidia-smi

# All hardware
sudo lshw -short

Step 1: Install SLURM Packages

sudo apt update
sudo apt install slurm-wlm slurm-wlm-doc munge libmunge-dev

Step 2: Configure Munge Authentication

Munge provides authentication between SLURM components.

# Generate munge key
sudo /usr/sbin/mungekey

# Fix permissions
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key

# Enable and start munge
sudo systemctl enable munge
sudo systemctl start munge

# Verify munge is working
munge -n | unmunge

If successful, the output begins with a STATUS line reporting Success (0), confirming that the credential was encoded and decoded.
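The check above can also be scripted. The following is a minimal sketch, assuming the unmunge output contains a `STATUS:` line as described; `munge_status_ok` is a hypothetical helper name, not part of munge.

```shell
# Hypothetical helper: succeeds if unmunge-style output on stdin
# contains a STATUS line reporting "Success (0)".
munge_status_ok() {
    grep -Eq '^STATUS:[[:space:]]+Success \(0\)'
}

# Intended use on the node (not run here):
#   munge -n | unmunge | munge_status_ok && echo "munge OK"
```

Because the helper only reads standard input, it can be tried against saved output before being used on the node.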


Step 3: Create SLURM Configuration Files

3.1 Main Configuration (slurm.conf)

sudo vim /etc/slurm/slurm.conf

Paste the following configuration:

# ==============================================================================
# SLURM Configuration for Exxact Workstation
# ==============================================================================

# Cluster identification
ClusterName=node01

# Controller configuration
SlurmctldHost=node01
SlurmUser=slurm

# ------------------------------------------------------------------------------
# Paths and PIDs
# ------------------------------------------------------------------------------
SlurmctldPidFile=/run/slurmctld.pid
SlurmdPidFile=/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
StateSaveLocation=/var/spool/slurmctld

# ------------------------------------------------------------------------------
# Logging
# ------------------------------------------------------------------------------
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmctldDebug=info
SlurmdDebug=info

# ------------------------------------------------------------------------------
# Process Tracking and Task Management
# ------------------------------------------------------------------------------
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

# ------------------------------------------------------------------------------
# Scheduling
# ------------------------------------------------------------------------------
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# ------------------------------------------------------------------------------
# GPU Support
# ------------------------------------------------------------------------------
GresTypes=gpu

# ------------------------------------------------------------------------------
# Job Defaults and Limits
# ------------------------------------------------------------------------------
DefMemPerCPU=2000
MaxJobCount=5000
MaxArraySize=10000

# Reject jobs at submission if they exceed the limits of any requested partition
EnforcePartLimits=ALL

# ------------------------------------------------------------------------------
# Timeouts
# ------------------------------------------------------------------------------
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0

# ------------------------------------------------------------------------------
# Node Definition
# ------------------------------------------------------------------------------
# AMD Ryzen Threadripper PRO 9985WX (64 cores / 128 threads)
# 256 GB RAM
# 1x NVIDIA RTX PRO 6000 Blackwell (~96GB VRAM)
NodeName=node01 \
CPUs=128 \
Boards=1 \
SocketsPerBoard=1 \
CoresPerSocket=64 \
ThreadsPerCore=2 \
RealMemory=250000 \
Gres=gpu:rtx6000:1 \
State=UNKNOWN

# ------------------------------------------------------------------------------
# Partition
# ------------------------------------------------------------------------------
# Single partition - users should always specify --time for efficient scheduling
# Max 7 days, default 30 minutes if not specified
PartitionName=main \
Nodes=node01 \
Default=YES \
MaxTime=7-00:00:00 \
DefaultTime=00:30:00 \
State=UP

Key configuration notes:

  • SlurmUser=slurm: Required to avoid UID mismatch errors
  • RealMemory=250000: Specified in MB; leaves roughly 6 GB of the 256 GB total for OS overhead
  • DefaultTime=00:30:00: Jobs submitted without --time get 30 minutes (prevents jobs from running indefinitely)
  • MaxTime=7-00:00:00: Maximum job duration is 7 days
  • The backfill scheduler starts lower-priority jobs early when doing so will not delay higher-priority jobs, so short jobs with an accurate --time tend to start sooner
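Since RealMemory is given in megabytes, the value can be derived from the node's total memory minus a reserve for the OS. The sketch below shows that arithmetic only; `suggest_real_memory` is a hypothetical helper, and the 6000 MB reserve is an assumed policy taken from the note above, not a SLURM requirement.

```shell
# Hypothetical helper: convert MemTotal (kB) to MB and subtract a reserve (MB).
suggest_real_memory() {
    total_kb=$1
    reserve_mb=$2
    echo $(( total_kb / 1024 - reserve_mb ))
}

# Intended use on the node (not run here):
#   total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
#   suggest_real_memory "$total_kb" 6000
```

Running `slurmd -C` on the node prints the hardware that SLURM itself detects, which is a useful cross-check before settling on a value.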

3.2 GPU Configuration (gres.conf)

sudo vim /etc/slurm/gres.conf

Paste the following configuration:

# GPU: NVIDIA RTX PRO 6000 Blackwell
Name=gpu Type=rtx6000 File=/dev/nvidia0

:::tip Adding More GPUs

When adding more GPUs to the same node, add one entry per GPU to gres.conf:

# Example for 6 GPUs
Name=gpu Type=rtx6000 File=/dev/nvidia0
Name=gpu Type=rtx6000 File=/dev/nvidia1
Name=gpu Type=rtx6000 File=/dev/nvidia2
Name=gpu Type=rtx6000 File=/dev/nvidia3
Name=gpu Type=rtx6000 File=/dev/nvidia4
Name=gpu Type=rtx6000 File=/dev/nvidia5

Then update the node definition in slurm.conf:

Gres=gpu:rtx6000:6

:::
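The per-GPU entries follow a regular pattern, so they can be generated rather than typed. This is a minimal sketch, assuming the devices are numbered consecutively from /dev/nvidia0; `gen_gres_lines` is a hypothetical helper name.

```shell
# Hypothetical helper: print one gres.conf line per GPU device index,
# assuming devices are numbered /dev/nvidia0 .. /dev/nvidia(count-1).
gen_gres_lines() {
    gpu_type=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "Name=gpu Type=$gpu_type File=/dev/nvidia$i"
        i=$(( i + 1 ))
    done
}

# Example: print entries for two GPUs of type rtx6000
gen_gres_lines rtx6000 2
```

The output can be reviewed and then written to /etc/slurm/gres.conf. The Gres count in the node definition must still be updated by hand to match.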

3.3 Cgroup Configuration (cgroup.conf)

sudo vim /etc/slurm/cgroup.conf

Paste the following configuration:

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes

:::warning

Do NOT include CgroupAutomount=yes. This option is defunct in the SLURM version shipped with Ubuntu 24.04 and will cause errors.

:::


Step 4: Create Required Directories

sudo mkdir -p /var/spool/slurmd
sudo mkdir -p /var/spool/slurmctld
sudo mkdir -p /var/log/slurm

sudo chown slurm:slurm /var/spool/slurmd
sudo chown slurm:slurm /var/spool/slurmctld
sudo chown slurm:slurm /var/log/slurm

Step 5: Configure Hostname Resolution

Ensure the hostname resolves correctly:

hostname
getent hosts node01

If the hostname doesn't resolve, add it to /etc/hosts:

echo "127.0.1.1 node01" | sudo tee -a /etc/hosts

:::warning

The head node hostname must NOT resolve to 127.0.1.1 if you are running a multi-node cluster. Replace it with the real network IP instead. See Adding a Node for details.

:::
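Before adding nodes, it can help to check whether the hostname currently resolves to a loopback address. The following is a rough sketch; `is_loopback_ip` is a hypothetical helper that only inspects the address string.

```shell
# Hypothetical helper: succeeds if the given address is a loopback address
# (anything in 127.0.0.0/8, or the IPv6 loopback ::1).
is_loopback_ip() {
    case $1 in
        127.*|::1) return 0 ;;
        *)         return 1 ;;
    esac
}

# Intended use on the node (not run here):
#   ip=$(getent hosts node01 | awk '{print $1; exit}')
#   is_loopback_ip "$ip" && echo "node01 resolves to loopback ($ip)"
```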


Step 6: Start SLURM Services

# Enable and start controller
sudo systemctl enable slurmctld
sudo systemctl start slurmctld

# Enable and start compute daemon
sudo systemctl enable slurmd
sudo systemctl start slurmd

Step 7: Verify Installation

# Check node status
sinfo

# Expected output:
# PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
# main* up 7-00:00:00 1 idle node01

# Check detailed node info
scontrol show node node01

# Test a simple job
srun hostname

# Test GPU access
srun --gres=gpu:1 nvidia-smi
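For scripted checks, the node state can be read from machine-friendly sinfo output instead of the table above. This is a minimal sketch, assuming input in the two-column form produced by `sinfo -h -N -o '%N %T'` (node name, then state); `node_state` is a hypothetical helper name.

```shell
# Hypothetical helper: given "<node> <state>" lines on stdin,
# print the state of the named node (first match only).
node_state() {
    awk -v node="$1" '$1 == node { print $2; exit }'
}

# Intended use on the node (not run here):
#   sinfo -h -N -o '%N %T' | node_state node01
```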

Troubleshooting

Check Service Status and Logs

# Service status
sudo systemctl status slurmctld
sudo systemctl status slurmd
sudo systemctl status munge

# Recent logs
sudo tail -50 /var/log/slurm/slurmctld.log
sudo tail -50 /var/log/slurm/slurmd.log

Node is DOWN After Reboot

Symptom:

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
main* up 7-00:00:00 1 down node01

Cause: SLURM marks a node as DOWN after an unexpected reboot as a safety precaution. With the default ReturnToService=0 setting, the node stays DOWN until an administrator resumes it.

Solution: If the node is healthy, resume it:

sudo scontrol update nodename=node01 state=resume

Verify:

sinfo
srun hostname
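Before resuming a node, it is worth confirming why it was marked DOWN. `scontrol show node` prints a Reason= field for nodes in that state, and `sinfo -R` lists the reasons for all affected nodes. The sketch below extracts the field from `scontrol` output; `node_reason` is a hypothetical helper name.

```shell
# Hypothetical helper: print the value of the Reason= line from
# "scontrol show node" output supplied on stdin.
node_reason() {
    sed -n 's/^[[:space:]]*Reason=//p'
}

# Intended use on the node (not run here):
#   scontrol show node node01 | node_reason
```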

"can't stat gres.conf file /dev/nvidia0: No such file or directory"

Symptom (in /var/log/slurm/slurmd.log):

error: Waiting for gres.conf file /dev/nvidia0
fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory

Cause: slurmd started before the NVIDIA driver finished loading during boot.

Immediate fix:

sudo systemctl restart slurmd
sudo scontrol update nodename=node01 state=resume

Permanent fix: Add a systemd dependency so slurmd waits for the NVIDIA driver:

sudo systemctl edit slurmd

Add:

[Unit]
After=nvidia-persistenced.service
Wants=nvidia-persistenced.service
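If nvidia-persistenced is not installed or enabled on the system, a different approach is to wait for the device file itself. The following is a sketch of such a wait loop, not a tested drop-in; `wait_for_path` is a hypothetical helper, and the timeout value is left to the caller.

```shell
# Hypothetical helper: poll once per second until a path exists,
# giving up after the given number of seconds.
wait_for_path() {
    path=$1
    timeout=$2
    waited=0
    while [ ! -e "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$(( waited + 1 ))
    done
    return 0
}

# Intended use on the node (not run here):
#   wait_for_path /dev/nvidia0 60 && sudo systemctl restart slurmd
```

One possible use is from an ExecStartPre= step in the slurmd drop-in, but that wiring is an assumption and should be verified on the target system.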

"cred/munge: Unexpected uid"

Symptom:

error: cred/munge: Unexpected uid (64030) != Slurm uid (0)

Solution: Add SlurmUser=slurm to slurm.conf after SlurmctldHost.

"CgroupAutomount is defunct"

Symptom:

error: The option "CgroupAutomount" is defunct, please remove it from cgroup.conf.

Solution: Remove CgroupAutomount=yes from cgroup.conf.
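The removal can be done with a text filter rather than by hand. This is a minimal sketch, assuming the option appears on its own line; `strip_cgroup_automount` is a hypothetical helper name.

```shell
# Hypothetical helper: copy stdin to stdout, dropping any line that
# sets CgroupAutomount (leading whitespace allowed).
strip_cgroup_automount() {
    sed '/^[[:space:]]*CgroupAutomount[[:space:]]*=/d'
}

# Intended use on the node (not run here); review the result before installing it:
#   strip_cgroup_automount < /etc/slurm/cgroup.conf > /tmp/cgroup.conf.new
```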

"Header lengths are longer than data received"

Symptom:

srun: error: Task launch for StepId=X.0 failed on node: Header lengths are longer than data received

Solution: This error usually indicates a version mismatch or a cgroup issue. Verify that all components report the same version:

slurmctld -V
slurmd -V
srun -V
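The three version strings can be compared in one step. The sketch below assumes each command prints a single version line; `all_same` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds only if every argument is the same string.
all_same() {
    first=$1
    for v in "$@"; do
        [ "$v" = "$first" ] || return 1
    done
    return 0
}

# Intended use on the node (not run here):
#   all_same "$(slurmctld -V)" "$(slurmd -V)" "$(srun -V)" \
#       && echo "versions match" || echo "version mismatch"
```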

Hostname Not Resolving

Solution: Add to /etc/hosts:

echo "127.0.1.1 $(hostname)" | sudo tee -a /etc/hosts

Restart Services After Configuration Changes

sudo systemctl restart slurmctld
sudo systemctl restart slurmd

Quick Reference

| File | Location | Purpose |
|------|----------|---------|
| slurm.conf | /etc/slurm/slurm.conf | Main SLURM configuration |
| gres.conf | /etc/slurm/gres.conf | GPU resource definitions |
| cgroup.conf | /etc/slurm/cgroup.conf | Resource isolation settings |
| munge.key | /etc/munge/munge.key | Authentication key |

| Directory | Purpose |
|-----------|---------|
| /var/spool/slurmd | Slurmd spool directory |
| /var/spool/slurmctld | Controller state files |
| /var/log/slurm/ | SLURM log files |

# Check cluster status
sinfo

# Show node details
scontrol show node node01

# Show partition details
scontrol show partition main

# View running/pending jobs
squeue

# Restart after config changes
sudo systemctl restart slurmctld && sudo systemctl restart slurmd