More on SLURM

MonARCH uses the SLURM scheduler for running jobs. The home page for SLURM is http://slurm.schedmd.com/, and it is used in many computing systems, such as MASSIVE and VLSCI. SLURM is an open-source workload manager designed for Linux clusters of all sizes. It provides three key functions.

  1. It allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work.

  2. It provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes.

  3. It arbitrates contention for resources by managing a queue of pending work.

The following material explains how users can use SLURM. At the bottom of the page there is a section comparing SLURM with PBS and SGE. Note that some SLURM terms have meanings that differ from similar terms in other batch or resource schedulers.

SLURM: Glossary

Below is a summary of some SLURM concepts.

Task
    A task under SLURM is a synonym for a process, and is often the number of MPI processes that are required.

Success
    A job that completes and terminates well (with exit status 0). Cancelled jobs are not considered successful.

Socket
    A socket contains one processor.

Resource
    A mix of CPUs, memory and time.

Processor
    A processor contains one or more cores.

Partition
    SLURM groups nodes into sets called partitions. Jobs are submitted to a partition to run. In other batch systems the term queue is used.

Node
    A node contains one or more sockets.

Failure
    Anything that lacks success.

CPU
    The smallest physical consumable. For multi-core machines this is a core; for multi-core machines with hyper-threading enabled it is a hardware thread.

Core
    A CPU core.

Batch job
    A chain of commands in a script file.

Account
    The entity to which used resources are charged. This field is not currently used on the MonARCH cluster.

SLURM: Useful Commands

Job submission
    sbatch jobScript
    SLURM directives in the job script can also be set by command-line options to sbatch.

Check queue
    squeue (or the aliases sq and SQ)
    You can also examine individual jobs, e.g. squeue -j 792412

Check cluster status
    show_cluster
    A nicely printed description of the current state of the machines in our cluster, built on top of the sinfo command.

Delete an existing job
    scancel jobID

Show job information
    scontrol show job jobID
    Also try show_job for nicely formatted output.

Suspend a job
    scontrol suspend jobID

Resume a job
    scontrol resume jobID

Delete parts of a job array
    scancel jobID_[5-10]

SLURM: More on Shell Commands

Users submit jobs to MonARCH using SLURM commands called from the Unix shell (such as bash or csh). Typically a user creates a batch submission script that specifies what computing resources they want from the cluster, as well as the commands to execute when the job runs. They then use sbatch <filename> to submit the job. Users can kill, pause and interrogate the jobs they are running. Here is a list of common commands:
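For example, a minimal submit-and-monitor session might look like this (the script name and job ID are illustrative):

sbatch myJob.sh
Submitted batch job 792412
squeue -j 792412        # check the job's state
scancel 792412          # remove the job if it is no longer needed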

sbatch

sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

sbatch [options] job.script
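Command-line options override the matching #SBATCH directives inside the script, so you can, for example, resubmit the same script with a different time limit and job name (values illustrative):

sbatch --time=02:00:00 --job-name=rerun job.script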

scancel

scancel deletes a job from the queue, or stops it running.

scancel jobID1 jobID2
scancel --name=[job name]
scancel --user=[user]

sinfo

sinfo reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options.

sinfo [options]
Example:
 sinfo
 PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
 comp*        up 7-00:00:00      2    mix gp[00-01]
 comp*        up 7-00:00:00     23   idle gp[02-05],hc00,hs[00-05],mi[00-11]

squeue

squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.

#Print information only about job step 65552.1:
squeue --steps 65552.1
STEPID     NAME PARTITION    USER    TIME_USE      NODELIST(REASON)
65552.1    test2     debug   alice        12:49 dev[1-4]

We have also set up aliases for squeue that print more information.

sq
    squeue -u <userid>

SQ
    squeue -o "%.18i %.8P %.6a %.15j %.8u %8Q %.8T %.10M %.4c %.4C %.12l %.12L %.6D %.16S %.16V %R"

SQ prints more information on the jobs, for all users, and can be used like:

SQ -u myUserName

Some squeue options of interest; see man squeue for more information.

--array                     Job arrays are displayed one element per line
--jobs=JobList              Comma-separated list of job IDs to display
--long                      Display output in long format
--name=NameList             Filter results based on job name
--partition=PartitionList   Comma-separated list of partitions to display
--user=User                 Display results based on the listed user name

scontrol

scontrol reports or modifies details of a currently running job. (Use sacct to view details of finished jobs.)

scontrol show job 71701        # report details of the job whose jobID is 71701
scontrol show jobid -dd 71701  # report more detail on this job, including the submission script
scontrol hold 71701            # hold a job, preventing it from being scheduled for execution
scontrol release 71701         # release a job that was previously held manually
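scontrol can also modify attributes of a pending job. For example, to shorten the requested time limit of the (illustrative) job above:

scontrol update JobId=71701 TimeLimit=01:00:00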

sinteractive

It is possible to run a job as an interactive session using sinteractive. The command blocks until the session is scheduled to run, and then the user is logged into the compute node. Exiting the shell (or logging out) ends the session and the user is returned to the original node.

sinteractive
Waiting for JOBID 25075 to start
Warning: Permanently added 'm2001,172.19.1.1' (RSA) to the list of known hosts.
$ hostname
m2001
$ exit
[screen is terminating]
Connection to m2001 closed.

sacct

The command sacct shows metrics from past jobs.

sacct -l -j jobID
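For a more selective report on a finished job, sacct's --format option chooses which columns to display (the job ID and field list here are illustrative):

sacct -j 792412 --format=JobID,JobName,Partition,Elapsed,MaxRSS,State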

sstat

The command sstat shows metrics from currently running jobs when given a job number. Note that you need to launch your job steps with srun for this information to be collected.
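A sketch of a typical query (the job ID and field list are illustrative):

sstat -j 792412 --format=JobID,AveCPU,MaxRSS,MaxVMSize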

Help on shell commands

Users have several ways of getting information on shell commands.

  • The commands have man pages (via the Unix manual), e.g. man sbatch

  • The commands have built-in help options. e.g.

sbatch --help
sbatch --usage
  • There are online manuals and information pages

Most commands have options in two formats:

  • single letter, e.g. -N 1

  • verbose, e.g. --nodes=1

Note the double dash (--) in the verbose format. A non-zero exit code indicates failure in a command.
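For example, these two commands request the same resources (script name illustrative):

sbatch -N 2 -n 32 -t 01:00:00 job.script
sbatch --nodes=2 --ntasks=32 --time=01:00:00 job.script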

Some default behaviours:

  • SLURM processes launched with srun are not run under a shell, so none of the following are executed:
    • ~/.profile

    • ~/.bashrc

    • ~/.login

  • SLURM exports the user environment by default (use --export=NONE to disable this)

  • SLURM runs in the current directory (no need to cd $PBS_O_WORKDIR)

  • SLURM combines stdout and stderr and writes them directly to a file (and the naming convention differs from PBS). If the stdout/stderr file already exists it will be appended to, not overwritten; see the sketch after this list.

  • SLURM is case insensitive (e.g. project names are lower case)
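If you prefer separate, per-job output and error files, you can name them explicitly; a minimal sketch (file names illustrative; %j expands to the job number):

#SBATCH --output=myJob-%j.out
#SBATCH --error=myJob-%j.err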

Batch Scripts

A job script has a header section which specifies the resources that are required to run the job as well as the commands that must be executed. An example script is shown below.

#!/bin/env bash

#SBATCH --job-name=example
#SBATCH --time=01:00:00
#SBATCH --ntasks=32
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=1
#SBATCH --mem=2000


module load intel
uname -a
srun uname -a

Here are some of the SLURM directives you can use in a batch script; man sbatch will give you more information.

--job-name=[job name]
    The job name for the allocation; defaults to the script name.

--partition=[partition name]
    Request an allocation on the specified partition. If not specified, jobs are submitted to the default partition.

--time=[time spec]
    The total walltime for the job allocation.

--array=[job spec]
    Submit a job array with the defined indices.

--dependency=[dependency list]
    Specify a job dependency (see the example after this table).

--nodes=[total nodes]
    Specify the total number of nodes.

--ntasks=[total tasks]
    Specify the total number of tasks.

--ntasks-per-node=[ntasks]
    Specify the number of tasks per node.

--cpus-per-task=[ncpus]
    Specify the number of CPUs per task.

--ntasks-per-core=[ntasks]
    Specify the number of tasks per CPU core.

--export=[variable|ALL|NONE]
    Specify which environment variables to export. NOTE: SLURM copies the entire environment from the shell where the job is submitted. This may break batch scripts that require a different environment than, say, a login environment. To guard against this, --export=NONE can be specified for each batch script.
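As a sketch of --dependency in practice (script names illustrative): --parsable makes sbatch print only the job ID, which can then be fed into a dependent submission that starts only if the first job finishes successfully.

jid=$(sbatch --parsable first_step.sh)
sbatch --dependency=afterok:${jid} second_step.sh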

Comparing PBS/SGE instructions with SLURM

This section provides a brief comparison of PBS, SGE and SLURM input parameters; the PBS examples are taken from MASSIVE. Please note that in some cases there is no direct equivalent between the different systems.

Basic job commands

Give the job a name.
    PBS:   #PBS -N JobName
    SGE:   #$ -N JobName
    SLURM: #SBATCH --job-name=JobName
           or
           #SBATCH -J JobName
           # Note: the job name appears in the queue list, but is not used to name
           # the output files (the opposite behaviour to PBS and SGE).

Redirect standard output of the job to this file.
    PBS:   #PBS -o path
    SGE:   #$ -o path
    SLURM: #SBATCH --output=path/file-%j.ext1
           #SBATCH -o path/file-%j.ext1
           # Note: %j is replaced by the job number.

Redirect standard error of the job to this file.
    PBS:   #PBS -e path
    SGE:   #$ -e path
    SLURM: #SBATCH --error=path/file-%j.ext2
           #SBATCH -e path/file-%j.ext2
           # Note: %j is replaced by the job number.

Commands to specify accounts, queues and working directories.

Account to charge quota (if so set up).
    PBS:   #PBS -A AccountName
    SLURM: #SBATCH --account=AccountName
           #SBATCH -A AccountName

Walltime.
    PBS:   #PBS -l walltime=2:23:59:59
    SGE:   #$ -l h_rt=hh:mm:ss, e.g. #$ -l h_rt=96:00:00
    SLURM: #SBATCH --time=2-23:59:59
           #SBATCH -t 2-23:59:59
           # Note the '-' between day(s) and hours for SLURM.

Change to the directory that the script was launched from.
    PBS:   cd $PBS_O_WORKDIR
    SGE:   #$ -cwd
    SLURM: This is the default for SLURM.

Specify a queue (partition).
    PBS:   #PBS -q batch
    SGE:   #$ -q QueueName
    SLURM: #SBATCH --partition=main
           #SBATCH -p main
           # In SLURM a queue is called a partition; jobs go to the default
           # partition (marked with * in sinfo output) if none is specified.

Instructions to request nodes, sockets and cores.

The number of compute cores to ask for.
    PBS:   #PBS -l nodes=1:ppn=12
           # Asking for 12 CPU cores, which is all the cores on a MASSIVE node.
           # You could put "nodes=1" for a single-CPU-core job, or "nodes=1:ppn=4"
           # to get four CPU cores on the one node (typically for multithreaded,
           # SMP or OpenMP jobs).
    SGE:   #$ -pe smp 12
           #$ -pe orte_adv 12
           # MCC SGE did not implement running jobs across machines, due to
           # limitations of the interconnection hardware.
    SLURM: #SBATCH --nodes=1 --ntasks=12
           or
           #SBATCH -N1 -n12
           # --ntasks is not used in isolation but combined with other options
           # such as --nodes=1.

The number of tasks per socket.
    SLURM: --ntasks-per-socket=[ntasks]
           # Request that at most ntasks be invoked on each socket. Meant to be
           # used with the --ntasks option. Related to --ntasks-per-node, except
           # at the socket level instead of the node level.

Cores per task (for use with OpenMP).
    SLURM: --cpus-per-task=ncpus or -c ncpus
           # Request that ncpus be allocated per process. The default is one CPU
           # per process.

Specify per-core memory.
    PBS:   #PBS -l pmem=4000MB
           # Specifies how much memory you need per CPU core (1000MB if not specified).
    SGE:   No equivalent; SGE uses memory per process.
    SLURM: --mem-per-cpu=24576 (per CPU core) or --mem=24576 (per node)
           # The SLURM default unit is MB.

Commands to notify the user of job progress.

Send email notification when the job fails.
    PBS:   #PBS -m a
    SGE:   #$ -m a
    SLURM: #SBATCH --mail-type=FAIL

Send email notification when the job begins.
    PBS:   #PBS -m b
    SGE:   #$ -m b
    SLURM: #SBATCH --mail-type=BEGIN

Send email notification when the job ends.
    PBS:   #PBS -m e
    SGE:   #$ -m e
    SLURM: #SBATCH --mail-type=END

Email address to send the information to.
    PBS:   #PBS -M name@email.address
    SGE:   #$ -M name@email.address
    SLURM: #SBATCH --mail-user=name@email.address

Advanced SLURM

This section focuses on how to specify the different CPU resources you need. See below for a block diagram of a typical compute node. This consists of a motherboard with two CPU sockets, and in each socket is an 8-core CPU. (Note that not all machines are like this, of course: the motherboard may have more CPU sockets, and there may be more cores in each processor.)

A process (or task) is identified by a thick red line. If a core is used by a thread or process, it is coloured black. So in the diagram below we have one process that has one thread.

../../_images/Numa_1_core_1_task.jpg

In the following example, there is one process that is using two threads. These threads have been allocated to two different sockets.

../../_images/Numa_2_cores_2_sockets.jpg

Serial Example

This section describes how to specify a serial job - namely one that uses only one core on one processor.

../../_images/Numa_1_core_1_task.jpg
#!/bin/env bash

#SBATCH  --ntasks=1
#SBATCH  --ntasks-per-node=1
#SBATCH  --cpus-per-task=1
serialProg.exe

OpenMP

OpenMP is a thread-based technology for shared-memory systems, i.e. an OpenMP program runs on only one computer. It is characterized by having one process running multiple threads.

SLURM does not set OMP_NUM_THREADS in the environment of a job. Users should set it manually in their batch scripts, normally to the same value as --cpus-per-task; see the sketch below.
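A minimal sketch that derives the thread count from SLURM's own environment (SLURM_CPUS_PER_TASK is set when --cpus-per-task is given; the fallback value of 1 is an assumption for safety):

# In your batch script, after the #SBATCH directives:
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}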

To compile your code with OpenMP:

Compiler   Option to use when compiling
gcc        -fopenmp
icc        -openmp
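For example, a hypothetical source file omp_hello.c could be built into the executable used below with:

gcc -fopenmp -o openmp_program.exe omp_hello.c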

Note that the scripts below explicitly state the NUMA (https://en.wikipedia.org/wiki/Non-uniform_memory_access) configuration of the processes. You need not specify all of these directives if this is not important to you.

8 Cores on 1 Socket

Suppose we had a single process that we wanted to run on all the cores on one socket.

../../_images/Numa_1_process_1_socket_8_cores.jpg
#!/bin/env bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-socket=1


# Set OMP_NUM_THREADS to the same value as   --cpus-per-task=8
export OMP_NUM_THREADS=8
./openmp_program.exe

8 Cores spread across 2 Sockets

Now suppose we wanted to run the same job but with four cores per socket.

../../_images/Numa_1_process_2_sockets_8_cores.jpg
#!/bin/env bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --cores-per-socket=4

# Set OMP_NUM_THREADS to the same value as   --cpus-per-task=8
export OMP_NUM_THREADS=8
./openmp_program.exe

MPI Examples

This section contains a number of MPI examples to illustrate their usage with SLURM. MPI processes can be started with the srun command, or with the traditional mpirun or mpiexec. There is no need to specify the number of processes to run (-np) as this is automatically read from SLURM environment variables.

For MonARCH-only:

  • If your MonARCH MPI job spans more than one node, the flags --mca btl_tcp_if_exclude virbr0 are needed with mpirun to ensure the correct network interface is chosen.

  • Instead of hard-coding the number of processes into the mpirun command line, e.g. 32, you may want to use a SLURM environment variable, SLURM_NTASKS, to make the script more generic. (Note that if you use threading or OpenMP you will need to modify this parameter. See below.)

Note that the scripts below explicitly state the NUMA (https://en.wikipedia.org/wiki/Non-uniform_memory_access) configuration of the processes. You need not specify all of these directives if this is not important to you.

Four processes in one socket

Suppose we want to use 4 cores in one socket for our four MPI processes, with each process running on one core.

../../_images/Numa_4_processes_1_socket.jpg
#!/bin/env bash
#SBATCH  --ntasks=4
#SBATCH  --ntasks-per-node=4
#SBATCH  --cpus-per-task=1
#SBATCH  --ntasks-per-socket=4

#  you can also explicitly state
#  #SBATCH --nodes=1
#  but this is implied already from the parameters we have specified.

module load openmpi
srun myMpiProg.exe                # or mpirun myMpiProg.exe

or

FLAGS="--mca btl_tcp_if_exclude virbr0"
mpirun -n  $SLURM_NTASKS $FLAGS myMpiProg.exe

Four processes spread across 2 sockets

Suppose we want to use 4 cores spread across two sockets for our four MPI processes, with each process running on one core. (This configuration may minimize memory IO contention.)

../../_images/Numa_4_processes_2_socket.jpg
#!/bin/env bash
#SBATCH  --ntasks=4
#SBATCH  --ntasks-per-socket=2
#SBATCH  --cpus-per-task=1

#  you can also explicitly state
#  #SBATCH --nodes=1
#  but this is implied already from the parameters we have specified.

module load openmpi
srun myMpiProg.exe                # or mpirun myMpiProg.exe

or

FLAGS="--mca btl_tcp_if_exclude virbr0"
mpirun -n  $SLURM_NTASKS $FLAGS myMpiProg.exe

Using all the cores in one computer

../../_images/Numa_16_processes_2_socket.jpg
#!/bin/env bash
#SBATCH  --ntasks=16
#SBATCH  --ntasks-per-socket=8
#SBATCH  --cpus-per-task=1

#  you can also explicitly state
#  #SBATCH --nodes=1
#  but this is implied already from the parameters we have specified.


module  load openmpi
srun myMpiProg.exe                # or mpirun myMpiProg.exe

or

FLAGS="--mca btl_tcp_if_exclude virbr0"
mpirun -n  $SLURM_NTASKS $FLAGS myMpiProg.exe

Using all the cores in 2 computers

../../_images/Numa_2_full_servers.jpg
#!/bin/env bash
#SBATCH  --ntasks=32
#SBATCH  --cpus-per-task=1

#  you can also explicitly state
#  #SBATCH --nodes=2
#  but this is implied already from the parameters we have specified.

module load openmpi
srun myMpiProg.exe

or

FLAGS="--mca btl_tcp_if_exclude virbr0"
mpirun -n  $SLURM_NTASKS $FLAGS myMpiProg.exe

Hybrid OpenMP/MPI Jobs

It is possible to run MPI tasks which are in turn multi-threaded, e.g. with OpenMP. Here are some examples to assist you.

Two nodes with 1 MPI process per node and 16 OpenMP threads each

../../_images/Numa_hybrid_2servers_2_process_allcores.jpg
#!/bin/env bash
#SBATCH --job-name=MPIOpenMP
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --time=00:05:00

export OMP_NUM_THREADS=16
srun ./HybridMpiOpenMP.exe

Using GPUs

This is how to invoke GPUs on SLURM on MonARCH.

One GPU and one core.

../../_images/Numa_1_core_1_gpu.jpg
#!/bin/env bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-socket=1
#SBATCH --gres=gpu:K80:1

./gpu_program.exe

Two GPUs and all cores.

../../_images/Numa_all_cores_2_gpu.jpg
#!/bin/env bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:K80:2

./gpu_program.exe

SLURM: Array Jobs

It is possible to submit array jobs, which are useful for running parametric tasks. The directive

#SBATCH --array=1-300

specifies that 300 jobs are submitted to the queue, and each one has a unique identifier stored in the environment variable SLURM_ARRAY_TASK_ID (in this case ranging from 1 to 300).

#!/bin/env bash
#SBATCH --job-name=sample_array
#SBATCH --time=10:00:00
#SBATCH --mem=4000
# make 300 different jobs from this one script!
#SBATCH --array=1-300
#SBATCH --output=job.out

module load modulefile
myExe.exe ${SLURM_ARRAY_TASK_ID} # equivalent to SGE's ${SGE_TASK_ID}
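Note that with a fixed --output name, as above, every array task appends to the same file; the filename patterns %A (the master job ID) and %a (the task index) give each task its own file. A common pattern, sketched here with a hypothetical params.txt, is to use the task ID to select one line of parameters per task:

#SBATCH --output=job-%A_%a.out

# pick line number SLURM_ARRAY_TASK_ID out of params.txt
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)
myExe.exe ${PARAMS}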

There are pre-configured limits on how many array tasks you can submit in a single request.

scontrol show config | grep MaxArraySize
MaxArraySize            = 5000

So on MonARCH, you can only submit 5000 array tasks per submission. This parameter also constrains the maximum index value to 5000.

If you want to exceed the index limit, you can use arithmetic inside your SLURM submission script. For example, to shift the index so that you scan from 5001 to 10000:

#SBATCH --array=1-5000

x=5000
new_index=$((${SLURM_ARRAY_TASK_ID} + $x))
#new_index will then range from 5001 to 10000

SLURM: Email Notification

SLURM can email users with information on their job. This is enabled with the following flags.

#!/bin/env bash
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --mail-user=researcher@monash.edu

See man sbatch for all options associated with the email alerts.