SLURM Setup Guide

This document covers the SLURM installation and configuration for shared use of the cluster workstation.

System Overview

Component	Specification
CPU	AMD Ryzen Threadripper PRO 9985WX — 64 cores / 128 threads @ up to 5.5 GHz
RAM	256 GB (8× 32GB DIMMs)
GPU	2× NVIDIA RTX PRO 6000 Blackwell Max-Q — ~96 GB VRAM each (~192 GB total)
Storage	Samsung 990 PRO 4TB NVMe
OS	Ubuntu 24.04
Hostname	node01

Prerequisites

Before starting, verify your hardware is detected correctly:

# CPU info
lscpu

# Memory
free -h

# GPU
nvidia-smi

# All hardware
sudo lshw -short
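
The CPU figures reported here are the ones the node definition in slurm.conf needs later (sockets, cores per socket, threads per core). As a rough sketch, the helper below pulls those three fields out of `lscpu` output and multiplies them to get the CPU count. The name `cpu_topology` is hypothetical and not part of SLURM; it reads from stdin so it also works on saved output.

```shell
# Sketch: derive the CPU topology values used in the slurm.conf node definition.
# cpu_topology is a hypothetical helper name, not a SLURM tool.
cpu_topology() {
  awk -F: '
    /Socket\(s\)/          { sockets = $2 + 0 }
    /Core\(s\) per socket/ { cores   = $2 + 0 }
    /Thread\(s\) per core/ { threads = $2 + 0 }
    END {
      printf "SocketsPerBoard=%d CoresPerSocket=%d ThreadsPerCore=%d CPUs=%d\n",
             sockets, cores, threads, sockets * cores * threads
    }
  '
}

# Usage (LC_ALL=C keeps the field labels in English):
#   LC_ALL=C lscpu | cpu_topology
```

For 1 socket, 64 cores per socket and 2 threads per core, the computed CPUs value is 128, which should agree with the CPUs= entry you put in slurm.conf.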

Step 1: Install SLURM Packages

sudo apt update
sudo apt install slurm-wlm slurm-wlm-doc munge libmunge-dev

Step 2: Configure Munge Authentication

Munge provides authentication between SLURM components.

# Generate munge key
sudo /usr/sbin/mungekey

# Fix permissions
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key

# Enable and start munge
sudo systemctl enable munge
sudo systemctl start munge

# Verify munge is working
munge -n | unmunge

If munge is working, the output includes a STATUS: Success (0) line, showing that the credential was decoded.
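
If you want to check the round trip from a script rather than by eye, one option is to test for the STATUS line that `unmunge` prints. This is a minimal sketch; `munge_status_ok` is a hypothetical helper name, and it assumes the `STATUS: Success (0)` line format.

```shell
# Sketch: succeed only if unmunge output (on stdin) reports a decoded credential.
# munge_status_ok is a hypothetical helper name.
munge_status_ok() {
  grep -Eq '^STATUS:[[:space:]]+Success \(0\)'
}

# Usage (requires a running munged):
#   munge -n | unmunge | munge_status_ok && echo "munge OK"
```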

Step 3: Create SLURM Configuration Files

3.1 Main Configuration (`slurm.conf`)

sudo vim /etc/slurm/slurm.conf

Paste the following configuration:

# ==============================================================================
# SLURM Configuration for Exxact Workstation
# ==============================================================================

# Cluster identification
ClusterName=node01

# Controller configuration
SlurmctldHost=node01
SlurmUser=slurm

# ------------------------------------------------------------------------------
# Paths and PIDs
# ------------------------------------------------------------------------------
SlurmctldPidFile=/run/slurmctld.pid
SlurmdPidFile=/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
StateSaveLocation=/var/spool/slurmctld

# ------------------------------------------------------------------------------
# Logging
# ------------------------------------------------------------------------------
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmctldDebug=info
SlurmdDebug=info

# ------------------------------------------------------------------------------
# Process Tracking and Task Management
# ------------------------------------------------------------------------------
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

# ------------------------------------------------------------------------------
# Scheduling
# ------------------------------------------------------------------------------
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# ------------------------------------------------------------------------------
# GPU Support
# ------------------------------------------------------------------------------
GresTypes=gpu

# ------------------------------------------------------------------------------
# Job Defaults and Limits
# ------------------------------------------------------------------------------
DefMemPerCPU=2000
MaxJobCount=5000
MaxArraySize=10000

# Reject jobs at submission if they exceed the limits of any requested partition
EnforcePartLimits=ALL

# ------------------------------------------------------------------------------
# Timeouts
# ------------------------------------------------------------------------------
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0

# ------------------------------------------------------------------------------
# Node Definition
# ------------------------------------------------------------------------------
# AMD Ryzen Threadripper PRO 9985WX (64 cores / 128 threads)
# 256 GB RAM
# 1x NVIDIA RTX PRO 6000 Blackwell (~96GB VRAM)
NodeName=node01 \
      CPUs=128 \
      Boards=1 \
      SocketsPerBoard=1 \
      CoresPerSocket=64 \
      ThreadsPerCore=2 \
      RealMemory=250000 \
      Gres=gpu:rtx6000:1 \
      State=UNKNOWN

# ------------------------------------------------------------------------------
# Partition
# ------------------------------------------------------------------------------
# Single partition - users should always specify --time for efficient scheduling
# Max 7 days, default 30 minutes if not specified
PartitionName=main \
      Nodes=node01 \
      Default=YES \
      MaxTime=7-00:00:00 \
      DefaultTime=00:30:00 \
      State=UP

Key configuration notes:

SlurmUser=slurm — Required to avoid UID mismatch errors
RealMemory=250000 — Reserves ~6GB for OS overhead from 256GB total
DefaultTime=00:30:00 — Jobs without --time get 30 minutes (prevents infinite jobs)
MaxTime=7-00:00:00 — Maximum job duration is 7 days
The backfill scheduler starts lower-priority jobs early when that will not delay higher-priority jobs, so short jobs with an accurate --time can fill idle gaps
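
RealMemory is given in megabytes. As an illustration of the arithmetic behind the RealMemory note above, the sketch below takes MemTotal from a meminfo-style file and subtracts a reserve. The helper name `suggest_real_memory` and the default 6000 MB reserve are assumptions chosen to mirror that note. `slurmd -C` also prints the hardware configuration slurmd detects, which is a useful cross-check once SLURM is installed.

```shell
# Sketch: suggest a RealMemory value in MB as MemTotal minus a reserve.
# suggest_real_memory and the 6000 MB default reserve are assumptions.
suggest_real_memory() {
  reserve_mb="${1:-6000}"
  meminfo="${2:-/proc/meminfo}"
  # MemTotal is reported in kB; convert to MB before subtracting the reserve.
  awk -v reserve="$reserve_mb" \
    '/^MemTotal:/ { print int($2 / 1024) - reserve }' "$meminfo"
}

# Usage: suggest_real_memory          # 6000 MB reserve, reads /proc/meminfo
#        suggest_real_memory 8000     # larger reserve
```

The kernel usually reports somewhat less than the installed capacity, so the result may come out below 250000 on this machine. Setting RealMemory higher than what slurmd detects can leave the node in an invalid state, so err on the low side.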

Note: The slurm.conf snippet above reflects the original single-GPU setup. The node has since been expanded to 2 GPUs; see Adding a Second GPU for the updated Gres= line and full reconfiguration walkthrough.

3.2 GPU Configuration (`gres.conf`)

sudo vim /etc/slurm/gres.conf

# GPU: NVIDIA RTX PRO 6000 Blackwell
Name=gpu Type=rtx6000 File=/dev/nvidia0

Adding More GPUs

When adding more GPUs to the same node, add one gres.conf entry per GPU device file:

# Example for 6 GPUs
Name=gpu Type=rtx6000 File=/dev/nvidia0
Name=gpu Type=rtx6000 File=/dev/nvidia1
Name=gpu Type=rtx6000 File=/dev/nvidia2
Name=gpu Type=rtx6000 File=/dev/nvidia3
Name=gpu Type=rtx6000 File=/dev/nvidia4
Name=gpu Type=rtx6000 File=/dev/nvidia5

Then update the node definition in slurm.conf:

Gres=gpu:rtx6000:6

For a real-world walkthrough of this exact process — including a stale Gres-count node drain and how it was resolved, plus the matching Grafana dashboard updates — see Adding a Second GPU.
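
Because the gres.conf entries follow a fixed pattern, they can be generated rather than typed. The sketch below prints one entry per GPU, assuming the devices are numbered consecutively from /dev/nvidia0 and share the rtx6000 type name used above. `gres_lines` is a hypothetical helper name.

```shell
# Sketch: print gres.conf entries for N GPUs numbered from /dev/nvidia0.
# gres_lines is a hypothetical helper; the type name defaults to rtx6000.
gres_lines() {
  count="$1"
  gpu_type="${2:-rtx6000}"
  i=0
  while [ "$i" -lt "$count" ]; do
    printf 'Name=gpu Type=%s File=/dev/nvidia%d\n' "$gpu_type" "$i"
    i=$((i + 1))
  done
}

# Usage: gres_lines 2    # entries for /dev/nvidia0 and /dev/nvidia1
```

The count passed here must match the number at the end of the Gres= value in slurm.conf, for example Gres=gpu:rtx6000:2 for two GPUs.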

3.3 Cgroup Configuration (`cgroup.conf`)

sudo vim /etc/slurm/cgroup.conf

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes

Warning: Do NOT include CgroupAutomount=yes. This option is defunct in the SLURM version shipped with Ubuntu 24.04 and will cause errors.

Step 4: Create Required Directories

sudo mkdir -p /var/spool/slurmd
sudo mkdir -p /var/spool/slurmctld
sudo mkdir -p /var/log/slurm

sudo chown slurm:slurm /var/spool/slurmd
sudo chown slurm:slurm /var/spool/slurmctld
sudo chown slurm:slurm /var/log/slurm

Step 5: Configure Hostname Resolution

Ensure the hostname resolves correctly:

hostname
getent hosts node01

If the hostname doesn't resolve, add it to /etc/hosts:

echo "127.0.1.1 node01" | sudo tee -a /etc/hosts

Warning: The head node hostname must NOT resolve to 127.0.1.1 if you are running a multi-node cluster. Replace it with the real network IP instead. See Adding a Node for details.
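
To see which case applies on a given machine, you can inspect the address that `getent hosts` returns. A minimal sketch, with `resolves_to_loopback` as a hypothetical helper name, that only looks at the first address returned:

```shell
# Sketch: succeed if the first address in `getent hosts` output (stdin) is loopback.
# resolves_to_loopback is a hypothetical helper name.
resolves_to_loopback() {
  awk 'NR == 1 { loopback = ($1 ~ /^127\./ || $1 == "::1") }
       END     { exit (loopback ? 0 : 1) }'
}

# Usage:
#   getent hosts node01 | resolves_to_loopback \
#     && echo "loopback address: acceptable for a single node only"
```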

Step 6: Start SLURM Services

# Enable and start controller
sudo systemctl enable slurmctld
sudo systemctl start slurmctld

# Enable and start compute daemon
sudo systemctl enable slurmd
sudo systemctl start slurmd

Step 7: Verify Installation

# Check node status
sinfo

# Expected output:
# PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
# main*     up    7-00:00:00    1  idle node01

# Check detailed node info
scontrol show node node01

# Test a simple job
srun hostname

# Test GPU access
srun --gres=gpu:1 nvidia-smi
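
It can also help to confirm that the controller registered the GPU count you configured. The sketch below extracts the Gres value from `scontrol show node` output, assuming it appears on its own `Gres=` line. `node_gres` is a hypothetical helper name.

```shell
# Sketch: print the Gres value from `scontrol show node` output on stdin.
# node_gres is a hypothetical helper name.
node_gres() {
  sed -n 's/^[[:space:]]*Gres=\([^[:space:]]*\).*$/\1/p'
}

# Usage: scontrol show node node01 | node_gres
```

If the printed value does not match the Gres= entry in slurm.conf, the daemons are probably still running with an older configuration.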

Troubleshooting

Check Service Status and Logs

# Service status
sudo systemctl status slurmctld
sudo systemctl status slurmd
sudo systemctl status munge

# Recent logs
sudo tail -50 /var/log/slurm/slurmctld.log
sudo tail -50 /var/log/slurm/slurmd.log

Node is DOWN After Reboot

Symptom:

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
main*     up    7-00:00:00    1  down node01

Cause: SLURM automatically marks nodes as DOWN after unexpected reboots as a safety precaution.

Solution: If the node is healthy, resume it:

sudo scontrol update nodename=node01 state=resume

Verify:

sinfo
srun hostname
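
If you script this recovery, it is safer to resume the node only when it is actually reported as down. The following is a sketch rather than a recommended policy: `resume_if_down` is a hypothetical helper, and it trusts that you have already confirmed the node is healthy.

```shell
# Sketch: resume a node only if sinfo reports it as down.
# resume_if_down is a hypothetical helper name.
resume_if_down() {
  node="$1"
  # -h: no header, -N: one line per node, -n: limit to this node, %T: state
  state=$(sinfo -h -N -n "$node" -o '%T' | head -n 1)
  case "$state" in
    down*) sudo scontrol update nodename="$node" state=resume ;;
    *)     echo "$node is ${state:-unknown}; nothing to do" ;;
  esac
}

# Usage: resume_if_down node01
```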

"can't stat gres.conf file /dev/nvidia0: No such file or directory"

Symptom (in /var/log/slurm/slurmd.log):

error: Waiting for gres.conf file /dev/nvidia0
fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory

Cause: slurmd started before the NVIDIA driver finished loading during boot.

Immediate fix:

sudo systemctl restart slurmd
sudo scontrol update nodename=node01 state=resume

Permanent fix: Add a systemd dependency so slurmd waits for the NVIDIA driver:

sudo systemctl edit slurmd

Add:

[Unit]
After=nvidia-persistenced.service
Wants=nvidia-persistenced.service
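
After saving the drop-in, you can check that systemd picked up the ordering. The sketch below inspects the `After=` property reported by `systemctl show`. `ordered_after_nvidia` is a hypothetical helper name.

```shell
# Sketch: succeed if `systemctl show -p After` output (stdin) lists
# nvidia-persistenced.service. ordered_after_nvidia is a hypothetical name.
ordered_after_nvidia() {
  # Split "After=a.service b.service ..." into one unit name per line.
  tr ' =' '\n\n' | grep -qxF 'nvidia-persistenced.service'
}

# Usage:
#   systemctl show -p After slurmd | ordered_after_nvidia && echo "ordering in place"
```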

"cred/munge: Unexpected uid"

Symptom:

error: cred/munge: Unexpected uid (64030) != Slurm uid (0)

Solution: Add SlurmUser=slurm to slurm.conf after SlurmctldHost.

"CgroupAutomount is defunct"

Symptom:

error: The option "CgroupAutomount" is defunct, please remove it from cgroup.conf.

Solution: Remove CgroupAutomount=yes from cgroup.conf.
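
A quick way to confirm whether the option is still present is to search the file for it. This is a minimal sketch with a hypothetical helper name, `has_cgroup_automount`; it ignores commented-out lines.

```shell
# Sketch: succeed if the given cgroup.conf still sets CgroupAutomount.
# has_cgroup_automount is a hypothetical helper name.
has_cgroup_automount() {
  grep -Eiq '^[[:space:]]*CgroupAutomount[[:space:]]*=' \
    "${1:-/etc/slurm/cgroup.conf}"
}

# Usage:
#   has_cgroup_automount && echo "remove CgroupAutomount from cgroup.conf"
```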

"Header lengths are longer than data received"

Symptom:

srun: error: Task launch for StepId=X.0 failed on node: Header lengths are longer than data received

Cause: Usually a version mismatch between SLURM components or a cgroup problem.

Solution: Verify that all components report the same version:

slurmctld -V
slurmd -V
srun -V
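
Comparing the three version strings can be automated by counting distinct lines. A small sketch, assuming each command prints a single version line; `versions_match` is a hypothetical helper name.

```shell
# Sketch: succeed if every line on stdin is identical (and there is at least one).
# versions_match is a hypothetical helper name.
versions_match() {
  distinct=$(sort -u | wc -l | tr -d ' ')
  [ "$distinct" -eq 1 ]
}

# Usage:
#   { slurmctld -V; slurmd -V; srun -V; } | versions_match \
#     && echo "versions agree" || echo "version mismatch"
```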

Hostname Not Resolving

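Solution: Map the hostname to a local address in /etc/hosts (single-node setups only; on a multi-node cluster use the real network IP, as noted in Step 5):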

echo "127.0.1.1 $(hostname)" | sudo tee -a /etc/hosts

Restart Services After Configuration Changes

sudo systemctl restart slurmctld
sudo systemctl restart slurmd

Quick Reference

File	Location	Purpose
`slurm.conf`	`/etc/slurm/slurm.conf`	Main SLURM configuration
`gres.conf`	`/etc/slurm/gres.conf`	GPU resource definitions
`cgroup.conf`	`/etc/slurm/cgroup.conf`	Resource isolation settings
`munge.key`	`/etc/munge/munge.key`	Authentication key

Directory	Purpose
`/var/spool/slurmd`	Slurmd spool directory
`/var/spool/slurmctld`	Controller state files
`/var/log/slurm/`	SLURM log files

# Check cluster status
sinfo

# Show node details
scontrol show node node01

# Show partition details
scontrol show partition main

# View running/pending jobs
squeue

# Restart after config changes
sudo systemctl restart slurmctld && sudo systemctl restart slurmd
