Skip to main content

SLURM User Guide

This guide covers how to use SLURM to submit and manage compute jobs on the cluster.


Cluster Overview

The cluster consists of two nodes:

ResourceControl Node (node01)Compute Node (node02)
CPUThreadripper PRO 9985WX — 128 threadsRyzen 9 9950X3D — 32 threads
RAM256 GB128 GB
GPURTX PRO 6000 (~96 GB VRAM)RTX 5000 Ada (~32 GB VRAM)
Shards8 shards (~12 GB each)2 shards (~16 GB each)
Feature taglargesmall
SettingValue
Max job time7 days
Default job time30 minutes

GPU Allocation Policy

The GPU can be allocated in two ways:

AllocationCommandUse Case
Shared (default)--gres=shard:NDevelopment, inference, small training jobs
Exclusive--gres=gpu:1Large training jobs requiring full GPU

Shards per Node

NodeGPUShardsVRAM per Shard
node01RTX PRO 60008 shards~12 GB each
node02RTX 5000 Ada2 shards~16 GB each
warning

SLURM does not enforce VRAM limits per shard. If you exceed your allocation, your job may crash or affect other users. Be a good citizen!

GPU Request Guidelines

node01 (RTX PRO 6000 — 96GB):

VRAM NeededRequest
< 12 GB--gres=shard:1
12–24 GB--gres=shard:2
24–48 GB--gres=shard:4
48–96 GB--gres=gpu:1 (full GPU)

node02 (RTX 5000 Ada — 32GB):

VRAM NeededRequest
< 16 GB--gres=shard:1
16–32 GB--gres=gpu:1 (full GPU)

Node Features and Constraints

Each node is tagged with a feature label to help target jobs:

NodeFeatureUse For
node01largeJobs needing more VRAM, CPU threads, or RAM
node02smallLighter inference, testing, smaller training runs

By default, SLURM auto-schedules jobs across both nodes based on your resource request. Use --constraint only when your job genuinely requires a specific node.

# Run on the larger node (node01)
#SBATCH --constraint=large

# Run on the smaller node (node02)
#SBATCH --constraint=small
tip

For most jobs, omit --constraint entirely and let SLURM decide. If you request --gres=shard:4, SLURM will automatically avoid node02 (which only has 2 shards) — no constraint needed.


srun vs sbatch

CommandUse CaseBehavior
srunInteractive jobs, quick testsBlocks terminal until job completes
sbatchProduction jobs, long runsSubmits and returns immediately

srun — Interactive Jobs

Run commands directly on the cluster. Your terminal waits for the job to finish.

Basic Usage

# Run a simple command
srun hostname

# Run a Python script
srun python train.py

# Start an interactive shell
srun --pty bash

Requesting Resources

# Request 4 CPUs and 16GB memory for 30 minutes
srun --cpus-per-task=4 --mem=16G --time=00:30:00 python train.py

# Request shared GPU (2 shards, ~24GB VRAM)
srun --gres=shard:2 nvidia-smi

# Request full GPU (exclusive access)
srun --gres=gpu:1 nvidia-smi

# Combine CPU, memory, shared GPU, and time
srun --cpus-per-task=8 --mem=32G --gres=shard:2 --time=02:00:00 python train.py

Interactive GPU Session

# Get a shell with shared GPU access (2 shards) for 2 hours
srun --gres=shard:2 --mem=32G --time=02:00:00 --pty bash

# Now you can run commands interactively
nvidia-smi
python train.py
exit # release resources when done

# For large jobs needing full GPU
srun --gres=gpu:1 --mem=64G --time=04:00:00 --pty bash

sbatch — Batch Jobs

Submit jobs that run in the background. Use this for production workloads.

Basic Job Script

Create a file called job.sh:

#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --time=04:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

# Your commands here
echo "Job started on $(hostname)"
python train.py
echo "Job finished"

Submit it:

sbatch job.sh

GPU Job Script (Shared GPU)

For most GPU jobs, use shards:

#!/bin/bash
#SBATCH --job-name=gpu_training
#SBATCH --output=logs/%x_%j.log
#SBATCH --error=logs/%x_%j.err
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --gres=shard:2

# Load environment
source ~/.bashrc
conda activate myenv

# Run training
python train.py --epochs 100 --batch-size 32
echo "Training complete"

GPU Job Script (Full GPU)

For large models requiring full GPU:

#!/bin/bash
#SBATCH --job-name=large_training
#SBATCH --output=logs/%x_%j.log
#SBATCH --error=logs/%x_%j.err
#SBATCH --time=48:00:00
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G
#SBATCH --constraint=large
#SBATCH --gres=gpu:1

# Load environment
source ~/.bashrc
conda activate myenv

# Run large model training
python train.py --model large --batch-size 128
echo "Training complete"

Submit:

mkdir -p logs # create logs directory first
sbatch job.sh

Passing Arguments to Job Scripts

#!/bin/bash
#SBATCH --job-name=experiment
#SBATCH --output=logs/%x_%j.log
#SBATCH --time=04:00:00
#SBATCH --gres=gpu:1

# $1, $2, etc. are command-line arguments
python train.py --lr $1 --epochs $2

Submit with arguments:

sbatch job.sh 0.001 50

Common #SBATCH Options

OptionDescriptionExample
--job-nameJob name (shows in queue)--job-name=training
--outputStdout file (%j=job ID, %x=job name)--output=logs/%x_%j.log
--errorStderr file--error=logs/%x_%j.err
--timeTime limit (HH:MM:SS or D-HH:MM:SS)--time=04:00:00
--cpus-per-taskNumber of CPU threads--cpus-per-task=8
--memTotal memory--mem=32G
--gres=shard:NShared GPU (N shards)--gres=shard:2
--gres=gpu:1Full GPU (exclusive access)--gres=gpu:1
--constraintTarget node by capability--constraint=large

Monitoring Jobs

Check Queue Status

# View all jobs
squeue

# View only your jobs
squeue -u $USER

# Detailed job info
squeue -l

Check Cluster Status

# Node availability
sinfo

# Detailed view with node list
sinfo -N -l

# Detailed node info (includes Feature tags)
scontrol show node node01
scontrol show node node02

Check Job Details

# While job is running or pending
scontrol show job <job_id>

# After job completes (accounting info)
# Note: This feature is currently turned off. Please request if you really need it.
sacct -j <job_id> --format=JobID,JobName,Elapsed,State,MaxRSS,MaxVMSize

Cancel Jobs

# Cancel a specific job
scancel <job_id>

# Cancel all your jobs
scancel -u $USER

# Cancel all pending jobs
scancel -u $USER --state=pending

Job Arrays

Run the same script with different parameters:

#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --output=logs/array_%A_%a.log
#SBATCH --array=1-10
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=4

# SLURM_ARRAY_TASK_ID contains the array index (1, 2, 3, ... 10)
echo "Running task $SLURM_ARRAY_TASK_ID"
python experiment.py --seed $SLURM_ARRAY_TASK_ID

Submit:

sbatch array_job.sh # submits 10 jobs

Useful array patterns:

#SBATCH --array=1-100 # 1 to 100
#SBATCH --array=1-100%10 # 1 to 100, max 10 running at once
#SBATCH --array=1,3,5,7 # specific values
#SBATCH --array=1-10:2 # 1,3,5,7,9 (step of 2)

Targeting Specific Nodes

By default, SLURM automatically schedules jobs to the best available node. Use --constraint to target a specific node capability.

# Auto-schedule — SLURM decides (preferred for most jobs)
srun --gres=shard:2 --time=02:00:00 --pty bash

# Target the larger node (node01) by capability
srun --constraint=large --gres=shard:2 --time=02:00:00 --pty bash

# Target the smaller node (node02) by capability
srun --constraint=small --gres=shard:1 --time=02:00:00 --pty bash

In a batch script:

#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --constraint=large # target node01 by capability
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

python train.py

Using --nodelist / -w (Admin/Debug Use Only)

Use --nodelist only when you need to pin to a specific hostname, such as for debugging a node-specific issue.

# Force job on node01 by hostname
srun -w node01 --gres=shard:2 --time=02:00:00 --pty bash

# Force job on node02 by hostname
srun -w node02 --gres=shard:1 --time=02:00:00 --pty bash
warning

Avoid hard-coding --nodelist=node01 in production scripts. If a node is replaced or renamed, your scripts will break. Use --constraint instead.


Best Practices

Use Shards by Default

# Good — use shards for most work
#SBATCH --gres=shard:2

# Only for large models needing >48GB VRAM on node01
#SBATCH --gres=gpu:1

Let SLURM Schedule Unless You Have a Reason

# Good — SLURM picks the best available node
#SBATCH --gres=shard:2

# Only add constraint if your job truly needs the larger node
#SBATCH --constraint=large
#SBATCH --gres=shard:4

Always Specify --time

Helps the scheduler run shorter jobs sooner:

# Good — scheduler knows job length
srun --time=00:30:00 python quick_test.py

# Less optimal — defaults to 30 minutes even if job takes 5 minutes
srun python quick_test.py

Request Only What You Need

Over-requesting blocks resources from others:

# Bad — requesting all resources
#SBATCH --cpus-per-task=128
#SBATCH --mem=250G
#SBATCH --gres=gpu:1

# Good — request what you actually use
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --gres=shard:2

Test Interactively First

# Get interactive session with shared GPU
srun --gres=shard:2 --time=00:30:00 --pty bash

# Test your code works
python train.py --epochs 1

# Exit and submit real job
exit
sbatch job.sh

Example: PyTorch Training Job

Standard Training (Shared GPU, Auto-Scheduled)

#!/bin/bash
#SBATCH --job-name=pytorch_train
#SBATCH --output=logs/%x_%j.log
#SBATCH --error=logs/%x_%j.err
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --gres=shard:2

# Print job info
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $SLURM_NODELIST"
echo "GPUs: $CUDA_VISIBLE_DEVICES"
echo "Start time: $(date)"

# Setup environment
source ~/.bashrc
conda activate pytorch

# Run training
python train.py \
--model resnet50 \
--epochs 100 \
--batch-size 64 \
--learning-rate 0.001 \
--output-dir results/$SLURM_JOB_ID

echo "End time: $(date)"

Large Model Training (Full GPU, Large Node Required)

#!/bin/bash
#SBATCH --job-name=large_model_train
#SBATCH --output=logs/%x_%j.log
#SBATCH --error=logs/%x_%j.err
#SBATCH --time=72:00:00
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G
#SBATCH --constraint=large
#SBATCH --gres=gpu:1

# Print job info
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $SLURM_NODELIST"
echo "GPUs: $CUDA_VISIBLE_DEVICES"
echo "Start time: $(date)"

# Setup environment
source ~/.bashrc
conda activate pytorch

# Run large model training
python train.py \
--model vit_large \
--epochs 50 \
--batch-size 256 \
--learning-rate 0.0001 \
--output-dir results/$SLURM_JOB_ID

echo "End time: $(date)"

Quick Reference

# Interactive session with shared GPU (recommended)
srun --gres=shard:2 --mem=32G --time=02:00:00 --pty bash

# Interactive session with full GPU (large jobs only)
srun --gres=gpu:1 --mem=64G --time=04:00:00 --pty bash

# Interactive session on the larger node specifically
srun --constraint=large --gres=shard:2 --time=02:00:00 --pty bash

# Submit batch job
sbatch job.sh

# Check your jobs
squeue -u $USER

# Check node and GPU availability
sinfo -N -l

# Cancel a job
scancel <job_id>

# Sync files to node02 before running there
rsync -avz ~/myproject node02:~/

# View job output in real-time
tail -f logs/my_job_12345.log

Getting Help

# Manual pages
man srun
man sbatch
man squeue

# Quick help
srun --help
sbatch --help