Adding a Compute Node to the SLURM Cluster

This document provides a step-by-step guide for adding a new compute node to the existing SLURM cluster.

Architecture Overview

┌────────────────────────────┐     ┌────────────────────────────┐
│      Control Node          │     │      New Compute Node      │
│                            │     │                            │
│  slurmctld (controller)    │<───>│  slurmd (compute)          │
│  slurmd (compute)          │     │                            │
│  munge (auth)              │     │  munge (auth)              │
└────────────────────────────┘     └────────────────────────────┘
      Users SSH here                  Jobs scheduled here
      to submit jobs                  by the controller

The control node runs both slurmctld (controller) and slurmd (compute). New nodes run only slurmd.

File	Same on All Nodes?
`slurm.conf`	Yes — must be identical
`cgroup.conf`	Yes — must be identical
`gres.conf`	No — different per node (matches local GPU)

Step 1: Set Hostname

On the new node:

sudo hostnamectl set-hostname <new_hostname>

Step 2: Network Configuration

Both machines must resolve each other by hostname to real network IPs (not loopback).

Update `/etc/hosts` on BOTH Machines

sudo vim /etc/hosts

Add entries for all cluster nodes:

192.168.220.75  node01  # control node
192.168.220.76  node02  # compute node

warning

The control node hostname (node01) must NOT resolve to 127.0.1.1. Replace any 127.0.1.1 node01 line with the real network IP.

Verify Connectivity from BOTH Sides

ping node01
ping node02
ssh node01
ssh node02
hostname -f

Step 3: Match User Accounts

SLURM requires identical UIDs and GIDs on all nodes. Every user that submits or runs jobs must exist on both machines with the same numeric IDs.

Check Existing UIDs on the Control Node

for u in user1 user2 user3 slurm; do
    id $u 2>/dev/null
done

Create Groups on the New Node

sudo groupadd -g 1005 <group1>
sudo groupadd -g 1006 <group2>

Create Users on the New Node

sudo useradd -u 1001 -g 1006 -m -s /bin/bash -c "User One" user1
sudo useradd -u 1002 -g 1006 -m -s /bin/bash -c "User Two" user2
sudo useradd -u 1003 -g 1005 -m -s /bin/bash -c "User Three" user3

Create the SLURM System User

sudo groupadd -g 64030 slurm
sudo useradd -u 64030 -g 64030 -r -s /usr/sbin/nologin slurm

note

The useradd warning about UID exceeding SYS_UID_MAX is harmless — ignore it.

Handle UID Conflicts

If an existing local user on the new node has a UID in the 1001–1009 range, move them first:

# Example: move user 'localuser' from UID 1001 to 2001
sudo usermod -u 2001 localuser
sudo groupmod -g 2001 localuser
sudo find / -uid 1001 -exec chown 2001 {} + 2>/dev/null
sudo find / -gid 1001 -exec chgrp 2001 {} + 2>/dev/null

Verify — Run on BOTH Machines and Compare

for u in user1 user2 user3 slurm; do
    id $u 2>/dev/null
done

Output must be identical (UIDs, primary GIDs). Secondary groups like sudo, docker can differ.

note

The steps above cover bulk-creating all existing users when first provisioning a new node. For the ongoing workflow of adding a single new user to an already-running two-node cluster, see User Management — Syncing User Accounts Between node01 and node02.

Step 4: Install SLURM on the New Node

sudo apt update
sudo apt install -y slurmd slurm-client munge

warning

Do NOT install slurmctld — only the control node needs the controller.

Step 5: Sync Munge Key

SLURM uses Munge for authentication. Both nodes must share the same key.

Export Key from Control Node

# On control node
sudo cat /etc/munge/munge.key | base64

Import Key on New Node

# On new node
sudo systemctl stop munge

echo "<paste_base64_key_as_one_line>" | base64 -d | sudo tee /etc/munge/munge.key > /dev/null

sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
sudo systemctl enable --now munge

Test

# From new node
munge -n | ssh node01 unmunge

Expected output: STATUS: Success (0)

Step 6: Update `slurm.conf`

Edit /etc/slurm/slurm.conf on the control node to add the new node definition.

Gather Specs from the New Node First

lscpu | grep -E "^CPU\(s\)|Core|Thread|Socket|Model name"
free -h | grep Mem
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader

Add the New Node Definition

# Node 2: node02 (compute only)
# AMD Ryzen 9 9950X3D (16 cores / 32 threads)
# 128 GB RAM
# 1x NVIDIA RTX 5000 Ada Generation (~32GB VRAM)
NodeName=node02 \
      CPUs=32 \
      Boards=1 \
      SocketsPerBoard=1 \
      CoresPerSocket=16 \
      ThreadsPerCore=2 \
      RealMemory=120000 \
      Gres=gpu:rtx5000:1,shard:2 \
      State=UNKNOWN

Update the Partition to Include the New Node

PartitionName=main \
      Nodes=node01,node02 \
      Default=YES \
      MaxTime=7-00:00:00 \
      DefaultTime=00:30:00 \
      State=UP

Update GPU Support Section

GresTypes=gpu,shard

Intel Hybrid CPU Note

If the new node uses an Intel hybrid architecture (P-cores + E-cores), slurmd -C may auto-detect fewer CPUs than available. Add this line under TaskPlugin:

SlurmdParameters=config_overrides

This tells slurmd to trust slurm.conf values instead of auto-detection.

Step 7: Create `gres.conf` on the New Node

Each node has its own gres.conf matching its local GPU. Create /etc/slurm/gres.conf on the new node:

# Full GPU - exclusive access
Name=gpu Type=rtx5000 File=/dev/nvidia0

# GPU Shards - shared access
# 2 shards available, each represents ~16GB VRAM (32GB / 2)
Name=shard Count=2 File=/dev/nvidia0

note

Adjust the GPU type, shard count, and VRAM per shard based on the actual GPU installed on the new node.

Step 8: Copy Configs to the New Node

slurm.conf and cgroup.conf must be identical on all nodes.

# From control node
scp /etc/slurm/slurm.conf <new_node>:/tmp/
scp /etc/slurm/cgroup.conf <new_node>:/tmp/

On the new node:

sudo cp /tmp/slurm.conf /etc/slurm/slurm.conf
sudo cp /tmp/cgroup.conf /etc/slurm/cgroup.conf

Verify cgroup.conf exists and matches:

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes

Step 9: Create Required Directories on the New Node

sudo mkdir -p /var/spool/slurmd /var/log/slurm
sudo chown slurm:slurm /var/spool/slurmd /var/log/slurm

Step 10: Start Services

Start the new node first, then restart the control node:

# On new node
sudo systemctl enable --now slurmd

# On control node
sudo systemctl daemon-reload
sudo systemctl restart slurmctld
sudo systemctl restart slurmd

Step 11: Verify the Cluster

# Check both nodes are visible
sinfo -N -l

# Test job on each node
srun -w node01 hostname
srun -w node02 hostname

# Test GPU on new node
srun -w node02 --gres=gpu:1 nvidia-smi

# Test shard on new node
srun -w node02 --gres=shard:1 nvidia-smi

# Let SLURM pick the node
srun --gres=shard:1 hostname

Expected sinfo output:

NODELIST  NODES  PARTITION  STATE  CPUS  S:C:T    MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
node01    1      main*      idle   128   1:64:2   250000  0         1       (null)    none
node02    1      main*      idle   32    1:16:2   120000  0         1       (null)    none

Removing a Compute Node

Stop `slurmd` on the Compute Node

sudo systemctl stop slurmd
sudo systemctl disable slurmd

Update `slurm.conf` on the Control Node

Remove the node's NodeName=... block
Remove the node from the Nodes= list in the partition
Copy updated slurm.conf to all remaining nodes

Restart SLURM on the Control Node

sudo systemctl daemon-reload
sudo systemctl restart slurmctld
sudo systemctl restart slurmd

Clear Stale Node State (if Grafana/sinfo Still Shows Old Node)

sudo systemctl stop slurmctld
sudo rm /var/spool/slurmctld/node_state
sudo systemctl start slurmctld

Fully Remove SLURM from the Compute Node (Optional)

sudo apt remove --purge slurmd slurm-client munge
sudo rm -rf /etc/slurm /var/spool/slurmd /var/log/slurm /etc/munge

Troubleshooting

`slurmd` Fails to Start — "DNS SRV lookup failed"

The slurm.conf file is missing or misnamed on the compute node. Verify:

ls -la /etc/slurm/slurm.conf

GPU Shows "No devices were found"

Verify nvidia-smi works outside SLURM
Check gres.conf matches the actual GPU device (/dev/nvidia0)
Check if slurmd -C detects fewer CPUs than configured (Intel hybrid CPU issue) — add SlurmdParameters=config_overrides to slurm.conf

`slurmd` Detects Wrong CPU Count

Common with Intel hybrid architectures (P-cores + E-cores). Add to slurm.conf:

SlurmdParameters=config_overrides

Munge Authentication Fails

Ensure both nodes have the same munge key:

# On control node
sudo md5sum /etc/munge/munge.key

# On compute node
sudo md5sum /etc/munge/munge.key

If they differ, re-copy the key from the control node.

Node Shows as "down" in `sinfo`

# Check why
scontrol show node <nodename> | grep Reason

# Resume the node
sudo scontrol update NodeName=<nodename> State=resume

Architecture Overview​

Step 1: Set Hostname​

Step 2: Network Configuration​

Update /etc/hosts on BOTH Machines​

Verify Connectivity from BOTH Sides​

Step 3: Match User Accounts​

Check Existing UIDs on the Control Node​

Create Groups on the New Node​

Create Users on the New Node​

Create the SLURM System User​

Handle UID Conflicts​

Verify — Run on BOTH Machines and Compare​

Step 4: Install SLURM on the New Node​

Step 5: Sync Munge Key​

Export Key from Control Node​

Import Key on New Node​

Test​

Step 6: Update slurm.conf​

Gather Specs from the New Node First​

Add the New Node Definition​

Update the Partition to Include the New Node​

Update GPU Support Section​

Intel Hybrid CPU Note​

Step 7: Create gres.conf on the New Node​

Step 8: Copy Configs to the New Node​

Step 9: Create Required Directories on the New Node​

Step 10: Start Services​

Step 11: Verify the Cluster​

Removing a Compute Node​

Stop slurmd on the Compute Node​

Update slurm.conf on the Control Node​

Restart SLURM on the Control Node​

Clear Stale Node State (if Grafana/sinfo Still Shows Old Node)​

Fully Remove SLURM from the Compute Node (Optional)​

Troubleshooting​

slurmd Fails to Start — "DNS SRV lookup failed"​

GPU Shows "No devices were found"​

slurmd Detects Wrong CPU Count​

Munge Authentication Fails​

Node Shows as "down" in sinfo​

Architecture Overview

Step 1: Set Hostname

Step 2: Network Configuration

Update `/etc/hosts` on BOTH Machines

Verify Connectivity from BOTH Sides

Step 3: Match User Accounts

Check Existing UIDs on the Control Node

Create Groups on the New Node

Create Users on the New Node

Create the SLURM System User

Handle UID Conflicts

Verify — Run on BOTH Machines and Compare

Step 4: Install SLURM on the New Node

Step 5: Sync Munge Key

Export Key from Control Node

Import Key on New Node

Test

Step 6: Update `slurm.conf`

Gather Specs from the New Node First

Add the New Node Definition

Update the Partition to Include the New Node

Update GPU Support Section

Intel Hybrid CPU Note

Step 7: Create `gres.conf` on the New Node

Step 8: Copy Configs to the New Node

Step 9: Create Required Directories on the New Node

Step 10: Start Services

Step 11: Verify the Cluster

Removing a Compute Node

Stop `slurmd` on the Compute Node

Update `slurm.conf` on the Control Node

Restart SLURM on the Control Node

Clear Stale Node State (if Grafana/sinfo Still Shows Old Node)

Fully Remove SLURM from the Compute Node (Optional)

Troubleshooting

`slurmd` Fails to Start — "DNS SRV lookup failed"

GPU Shows "No devices were found"

`slurmd` Detects Wrong CPU Count

Munge Authentication Fails

Node Shows as "down" in `sinfo`