More on SLURM

MonARCH uses the SLURM scheduler for running jobs. The home page for SLURM is http://slurm.schedmd.com/, and it is used in many computing systems, such as MASSIVE and VLSCI. SLURM is an open-source workload manager designed for Linux clusters of all sizes. It provides three key functions.

  1. It allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work.

  2. It provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes.

  3. It arbitrates contention for resources by managing a queue of pending work.

The following material explains how users can use SLURM. At the bottom of the page there is a section comparing SLURM with PBS and SGE. Note that some SLURM terms have meanings that differ from similar terms in other batch or resource schedulers.

SLURM: Glossary

Below is a summary of some SLURM concepts.

Task
    A task under SLURM is a synonym for a process, and is often the number of MPI processes that are required.

Success
    A job that completes and terminates well (with exit status 0). Cancelled jobs are not considered successful.

Socket
    A socket contains one processor.

Resource
    A mix of CPUs, memory and time.

Processor
    A processor contains one or more cores.

Partition
    SLURM groups nodes into sets called partitions. Jobs are submitted to a partition to run. In other batch systems the term queue is used.

Node
    A node contains one or more sockets.

Failure
    Anything that lacks success.

CPU
    The smallest physical consumable. For multi-core machines this is a core; for multi-core machines with hyper-threading enabled it is a hardware thread.

Core
    A CPU core.

Batch job
    A chain of commands in a script file.

Account
    The entity to which used resources are charged. This field is not currently used on the MonARCH cluster.

SLURM: Useful Commands

Job submission
    sbatch jobScript
    SLURM directives in the job script can also be set by command-line options to sbatch.

Check queue
    squeue (or the aliases sq and SQ)
    You can also examine individual jobs, e.g. squeue -j 792412

Check cluster status
    show_cluster
    A nicely printed description of the current state of the machines in our cluster, built on top of the sinfo command.

Delete an existing job
    scancel jobID

Show job information
    scontrol show job jobID
    Also try show_job for nicely formatted output.

Suspend a job
    scontrol suspend jobID

Resume a job
    scontrol resume jobID

Delete parts of a job array
    scancel jobID_[5-10]

SLURM: More on Shell Commands

Users submit jobs to MonARCH using SLURM commands called from the Unix shell (such as bash or csh). Typically a user creates a batch submission script that specifies what computing resources they want from the cluster, as well as the commands to execute when the job runs. They then use sbatch <filename> to submit the job. Users can kill, pause and interrogate the jobs they are running. Here is a list of common commands:
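For example, a minimal submit-and-monitor session might look like this (the script name and job ID are illustrative):

sbatch myJob.sh
Submitted batch job 792412
squeue -j 792412        # check the job's state
scancel 792412          # remove the job if it is no longer needed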

sbatch

sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

sbatch [options] job.script
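Command-line options override the matching #SBATCH directives inside the script, so you can, for example, resubmit the same script with a different time limit and job name (values illustrative):

sbatch --time=02:00:00 --job-name=rerun job.script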

scancel

scancel deletes a job from the queue, or stops it running.

scancel jobID1 jobID2
scancel --name=[job name]
scancel --user=[user]

sinfo

sinfo reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options.

sinfo [options]
Example:
 sinfo
 PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
 comp*        up 7-00:00:00      2    mix gp[00-01]
 comp*        up 7-00:00:00     23   idle gp[02-05],hc00,hs[00-05],mi[00-11]

squeue

squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.

#Print information only about job step 65552.1:
squeue --steps 65552.1
STEPID     NAME PARTITION    USER    TIME_USE      NODELIST(REASON)
65552.1    test2     debug   alice        12:49 dev[1-4]

We have also set up aliases for squeue that print more information.

sq
    squeue -u <userid>

SQ
    squeue -o "%.18i %.8P %.6a %.15j %.8u %8Q %.8T %.10M %.4c %.4C %.12l %.12L %.6D %.16S %.16V %R"

SQ prints more information on the jobs, for all users, and can be used like:

SQ -u myUserName

Some squeue options of interest; see man squeue for more information.

--array                     Job arrays are displayed one element per line
--jobs=JobList              Comma-separated list of job IDs to display
--long                      Display output in long format
--name=NameList             Filter results based on job name
--partition=PartitionList   Comma-separated list of partitions to display
--user=User                 Display results based on the listed user name

scontrol

scontrol reports or modifies details of a currently running job. (Use sacct to view details of finished jobs.)

scontrol show job 71701        # report details of the job whose jobID is 71701
scontrol show jobid -dd 71701  # report more detail on this job, including the submission script
scontrol hold 71701            # hold a job, preventing it from being scheduled for execution
scontrol release 71701         # release a job that was previously held manually
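scontrol can also modify attributes of a pending job. For example, to shorten the requested time limit of the (illustrative) job above:

scontrol update JobId=71701 TimeLimit=01:00:00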

sinteractive

It is possible to run a job as an interactive session using sinteractive. The command blocks until the session is scheduled to run, and then the user is logged into the compute node. Exiting the shell (or logging out) ends the session and the user is returned to the original node.

sinteractive
Waiting for JOBID 25075 to start
Warning: Permanently added 'm2001,172.19.1.1' (RSA) to the list of known hosts.
$ hostname
m2001
$ exit
[screen is terminating]
Connection to m2001 closed.

sacct

The command sacct shows metrics from past jobs.

sacct -l -j jobID
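For a more selective report on a finished job, sacct's --format option chooses which columns to display (the job ID and field list here are illustrative):

sacct -j 792412 --format=JobID,JobName,Partition,Elapsed,MaxRSS,State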

sstat

The command sstat shows metrics from currently running jobs when given a job number. Note that you need to launch your job steps with srun for this information to be collected.
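A sketch of a typical query (the job ID and field list are illustrative):

sstat -j 792412 --format=JobID,AveCPU,MaxRSS,MaxVMSize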

Help on shell commands

Users have several ways of getting information on shell commands.

  • The commands have man pages (via the Unix manual), e.g. man sbatch

  • The commands have built-in help options. e.g.

sbatch --help
sbatch --usage
  • There are online manuals and information pages

Most commands have options in two formats:

  • single letter, e.g. -N 1

  • verbose, e.g. --nodes=1

Note the double dash (--) in the verbose format. A non-zero exit code indicates failure in a command.
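For example, these two commands request the same resources (script name illustrative):

sbatch -N 2 -n 32 -t 01:00:00 job.script
sbatch --nodes=2 --ntasks=32 --time=01:00:00 job.script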

Some default behaviours:

  • SLURM processes launched with srun are not run under a shell, so none of the following are executed:
    • ~/.profile

    • ~/.bashrc

    • ~/.login

  • SLURM exports the user environment by default (use --export=NONE to disable this)

  • SLURM runs in the current directory (no need to cd $PBS_O_WORKDIR)

  • SLURM combines stdout and stderr and writes them directly to a file (and the naming convention differs from PBS). If the stdout/stderr file already exists it will be appended to, not overwritten; see the sketch after this list.

  • SLURM is case insensitive (e.g. project names are lower case)
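If you prefer separate, per-job output and error files, you can name them explicitly; a minimal sketch (file names illustrative; %j expands to the job number):

#SBATCH --output=myJob-%j.out
#SBATCH --error=myJob-%j.err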

Batch Scripts

A job script has a header section which specifies the resources that are required to run the job as well as the commands that must be executed. An example script is shown below.

#!/bin/env bash

#SBATCH --job-name=example
#SBATCH --time=01:00:00
#SBATCH --ntasks=32
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=1
#SBATCH --mem=2000


module load intel
uname -a
srun uname -a

Here are some of the SLURM directives you can use in a batch script; man sbatch will give you more information.

--job-name=[job name]
    The job name for the allocation; defaults to the script name.

--partition=[partition name]
    Request an allocation on the specified partition. If not specified, jobs are submitted to the default partition.

--time=[time spec]
    The total walltime for the job allocation.

--array=[job spec]
    Submit a job array with the defined indices.

--dependency=[dependency list]
    Specify a job dependency (see the example after this table).

--nodes=[total nodes]
    Specify the total number of nodes.

--ntasks=[total tasks]
    Specify the total number of tasks.

--ntasks-per-node=[ntasks]
    Specify the number of tasks per node.

--cpus-per-task=[ncpus]
    Specify the number of CPUs per task.

--ntasks-per-core=[ntasks]
    Specify the number of tasks per CPU core.

--export=[variable|ALL|NONE]
    Specify which environment variables to export. NOTE: SLURM copies the entire environment from the shell where the job is submitted. This may break batch scripts that require a different environment than, say, a login environment. To guard against this, --export=NONE can be specified for each batch script.
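As a sketch of --dependency in practice (script names illustrative): --parsable makes sbatch print only the job ID, which can then be fed into a dependent submission that starts only if the first job finishes successfully.

jid=$(sbatch --parsable first_step.sh)
sbatch --dependency=afterok:${jid} second_step.sh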

Comparing PBS/SGE instructions with SLURM

This section provides a brief comparison of PBS, SGE and SLURM input parameters; the PBS examples are taken from MASSIVE. Please note that in some cases there is no direct equivalent between the different systems.

Basic job commands

Give the job a name.
    PBS:   #PBS -N JobName
    SGE:   #$ -N JobName
    SLURM: #SBATCH --job-name=JobName
           or
           #SBATCH -J JobName
           # Note: the job name appears in the queue list, but is not used to name
           # the output files (the opposite behaviour to PBS and SGE).

Redirect standard output of the job to this file.
    PBS:   #PBS -o path
    SGE:   #$ -o path
    SLURM: #SBATCH --output=path/file-%j.ext1
           #SBATCH -o path/file-%j.ext1
           # Note: %j is replaced by the job number.

Redirect standard error of the job to this file.
    PBS:   #PBS -e path
    SGE:   #$ -e path
    SLURM: #SBATCH --error=path/file-%j.ext2
           #SBATCH -e path/file-%j.ext2
           # Note: %j is replaced by the job number.

Commands to specify accounts, queues and working directories.

Account to charge quota (if so set up).
    PBS:   #PBS -A AccountName
    SLURM: #SBATCH --account=AccountName
           #SBATCH -A AccountName

Walltime.
    PBS:   #PBS -l walltime=2:23:59:59
    SGE:   #$ -l h_rt=hh:mm:ss, e.g. #$ -l h_rt=96:00:00
    SLURM: #SBATCH --time=2-23:59:59
           #SBATCH -t 2-23:59:59
           # Note the '-' between day(s) and hours for SLURM.

Change to the directory that the script was launched from.
    PBS:   cd $PBS_O_WORKDIR
    SGE:   #$ -cwd
    SLURM: This is the default for SLURM.

Specify a queue (partition).
    PBS:   #PBS -q batch
    SGE:   #$ -q QueueName
    SLURM: #SBATCH --partition=main
           #SBATCH -p main
           # In SLURM a queue is called a partition; jobs go to the default
           # partition (marked with * in sinfo output) if none is specified.

Instructions to request nodes, sockets and cores.

The number of compute cores to ask for.
    PBS:   #PBS -l nodes=1:ppn=12
           # Asking for 12 CPU cores, which is all the cores on a MASSIVE node.
           # You could put "nodes=1" for a single-CPU-core job, or "nodes=1:ppn=4"
           # to get four CPU cores on the one node (typically for multithreaded,
           # SMP or OpenMP jobs).
    SGE:   #$ -pe smp 12
           #$ -pe orte_adv 12
           # MCC SGE did not implement running jobs across machines, due to
           # limitations of the interconnection hardware.
    SLURM: #SBATCH --nodes=1 --ntasks=12
           or
           #SBATCH -N1 -n12
           # --ntasks is not used in isolation but combined with other options
           # such as --nodes=1.

The number of tasks per socket.
    SLURM: --ntasks-per-socket=[ntasks]
           # Request that at most ntasks be invoked on each socket. Meant to be
           # used with the --ntasks option. Related to --ntasks-per-node, except
           # at the socket level instead of the node level.

Cores per task (for use with OpenMP).
    SLURM: --cpus-per-task=ncpus or -c ncpus
           # Request that ncpus be allocated per process. The default is one CPU
           # per process.

Specify per-core memory.
    PBS:   #PBS -l pmem=4000MB
           # Specifies how much memory you need per CPU core (1000MB if not specified).
    SGE:   No equivalent; SGE uses memory per process.
    SLURM: --mem-per-cpu=24576 (per CPU core) or --mem=24576 (per node)
           # The SLURM default unit is MB.

Commands to notify the user of job progress.

Send email notification when the job fails.
    PBS:   #PBS -m a
    SGE:   #$ -m a
    SLURM: #SBATCH --mail-type=FAIL

Send email notification when the job begins.
    PBS:   #PBS -m b
    SGE:   #$ -m b
    SLURM: #SBATCH --mail-type=BEGIN

Send email notification when the job ends.
    PBS:   #PBS -m e
    SGE:   #$ -m e
    SLURM: #SBATCH --mail-type=END

Email address to send the information to.
    PBS:   #PBS -M name@email.address
    SGE:   #$ -M name@email.address
    SLURM: #SBATCH --mail-user=name@email.address

Advanced SLURM

This section focuses on how to specify the different CPU resources you need. See below for a block diagram of a typical compute node. This consists of a motherboard with two CPU sockets, and in each socket is an 8-core CPU. (Note that not all machines are like this, of course: the motherboard may have more CPU sockets, and there may be more cores in each processor.)

A process (or task) is identified by a thick red line. If a core is used by a thread or process, it is coloured black. So in the diagram below we have one process that has one thread.

../../_images/Numa_1_core_1_task.jpg

In the following example, there is one process that is using two threads. These threads have been allocated to two different sockets.

../../_images/Numa_2_cores_2_sockets.jpg

Serial Example

This section describes how to specify a serial job - namely one that uses only one core on one processor.

../../_images/Numa_1_core_1_task.jpg
#!/bin/env bash

#SBATCH  --ntasks=1
#SBATCH  --ntasks-per-node=1
#SBATCH  --cpus-per-task=1
serialProg.exe

OpenMP

OpenMP is a thread-based technology for shared-memory systems, i.e. an OpenMP program runs on only one computer. It is characterized by having one process running multiple threads.

SLURM does not set OMP_NUM_THREADS in the environment of a job. Users should set it manually in their batch scripts, normally to the same value as --cpus-per-task; see the sketch below.
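A minimal sketch that derives the thread count from SLURM's own environment (SLURM_CPUS_PER_TASK is set when --cpus-per-task is given; the fallback value of 1 is an assumption for safety):

# In your batch script, after the #SBATCH directives:
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}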

To compile your code with OpenMP:

Compiler   Option to use when compiling
gcc        -fopenmp
icc        -openmp
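For example, a hypothetical source file omp_hello.c could be built into the executable used below with:

gcc -fopenmp -o openmp_program.exe omp_hello.c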

Note that the scripts below explicitly state the NUMA (https://en.wikipedia.org/wiki/Non-uniform_memory_access) configuration of the processes. You need not specify all of these directives if this is not important to you.

8 Cores on 1 Socket

Suppose we had a single process that we wanted to run on all the cores on one socket.

../../_images/Numa_1_process_1_socket_8_cores.jpg
#!/bin/env bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-socket=1


# Set OMP_NUM_THREADS to the same value as   --cpus-per-task=8
export OMP_NUM_THREADS=8
./openmp_program.exe

8 Cores spread across 2 Sockets

Now suppose we wanted to run the same job but with four cores per socket.

../../_images/Numa_1_process_2_sockets_8_cores.jpg
#!/bin/env bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --cores-per-socket=4

# Set OMP_NUM_THREADS to the same value as   --cpus-per-task=8
export OMP_NUM_THREADS=8
./openmp_program.exe

MPI Examples

This section contains a number of MPI examples to illustrate their usage with SLURM. MPI processes can be started with the srun command, or with the traditional mpirun or mpiexec. There is no need to specify the number of processes to run (-np) as this is automatically read from SLURM environment variables.

For MonARCH-only:

  • If your MonARCH MPI job spans more than one node, the flags --mca btl_tcp_if_exclude virbr0 are needed with mpirun to ensure the correct network interface is chosen.

  • Instead of hard-coding the number of processes into the mpirun command line, e.g. 32, you may want to use a SLURM environment variable, SLURM_NTASKS, to make the script more generic. (Note that if you use threading or OpenMP you will need to modify this parameter. See below.)

Note that the scripts below explicitly state the NUMA (https://en.wikipedia.org/wiki/Non-uniform_memory_access) configuration of the processes. You need not specify all of these directives if this is not important to you.

Four processes in one socket

Suppose we want to use 4 cores in one socket for our four MPI processes, with each process running on one core.

../../_images/Numa_4_processes_1_socket.jpg
#!/bin/env bash
#SBATCH  --ntasks=4
#SBATCH  --ntasks-per-node=4
#SBATCH  --cpus-per-task=1
#SBATCH  --ntasks-per-socket=4

#  you can also explicitly state
#  #SBATCH --nodes=1
#  but this is implied already from the parameters we have specified.

module load openmpi
srun myMpiProg.exe                # or mpirun myMpiProg.exe

or

FLAGS="--mca btl_tcp_if_exclude virbr0"
mpirun -n  $SLURM_NTASKS $FLAGS myMpiProg.exe

Four processes spread across 2 sockets

Suppose we want to use 4 cores spread across two sockets for our four MPI processes, with each process running on one core. (This configuration may minimize memory IO contention.)

../../_images/Numa_4_processes_2_socket.jpg
#!/bin/env bash
#SBATCH  --ntasks=4
#SBATCH  --ntasks-per-socket=2
#SBATCH  --cpus-per-task=1

#  you can also explicitly state
#  #SBATCH --nodes=1
#  but this is implied already from the parameters we have specified.

module load openmpi
srun myMpiProg.exe                # or mpirun myMpiProg.exe

or

FLAGS="--mca btl_tcp_if_exclude virbr0"
mpirun -n  $SLURM_NTASKS $FLAGS myMpiProg.exe

Using all the cores in one computer

../../_images/Numa_16_processes_2_socket.jpg
#!/bin/env bash
#SBATCH  --ntasks=16
#SBATCH  --ntasks-per-socket=8
#SBATCH  --cpus-per-task=1

#  you can also explicitly state
#  #SBATCH --nodes=1
#  but this is implied already from the parameters we have specified.


module  load openmpi
srun myMpiProg.exe                # or mpirun myMpiProg.exe

or

FLAGS="--mca btl_tcp_if_exclude virbr0"
mpirun -n  $SLURM_NTASKS $FLAGS myMpiProg.exe

Using all the cores in 2 computers

../../_images/Numa_2_full_servers.jpg
#!/bin/env bash
#SBATCH  --ntasks=32
#SBATCH  --cpus-per-task=1

#  you can also explicitly state
#  #SBATCH --nodes=2
#  but this is implied already from the parameters we have specified.

module load openmpi
srun myMpiProg.exe

or

FLAGS="--mca btl_tcp_if_exclude virbr0"
mpirun -n  $SLURM_NTASKS $FLAGS myMpiProg.exe

Hybrid OpenMP/MPI Jobs

It is possible to run MPI tasks which are in turn multi-threaded, e.g. with OpenMP. Here are some examples to assist you.

Two nodes with 1 MPI process per node and 16 OpenMP threads each

../../_images/Numa_hybrid_2servers_2_process_allcores.jpg
#!/bin/env bash
#SBATCH --job-name=MPIOpenMP
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --time=00:05:00

export OMP_NUM_THREADS=16
srun ./HybridMpiOpenMP.exe

Using GPUs

This is how to invoke GPUs on SLURM on MonARCH.

One GPU and one core.

../../_images/Numa_1_core_1_gpu.jpg
#!/bin/env bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-socket=1
#SBATCH --gres=gpu:K80:1

./gpu_program.exe

Two GPUs and all cores.

../../_images/Numa_all_cores_2_gpu.jpg
#!/bin/env bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:K80:2

./gpu_program.exe

SLURM: Array Jobs

It is possible to submit array jobs, which are useful for running parametric tasks. The directive

#SBATCH --array=1-300

specifies that 300 jobs are submitted to the queue, and each one has a unique identifier stored in the environment variable SLURM_ARRAY_TASK_ID (in this case ranging from 1 to 300).

#!/bin/env bash
#SBATCH --job-name=sample_array
#SBATCH --time=10:00:00
#SBATCH --mem=4000
# make 300 different jobs from this one script!
#SBATCH --array=1-300
#SBATCH --output=job.out

module load modulefile
myExe.exe ${SLURM_ARRAY_TASK_ID} # equivalent to SGE's ${SGE_TASK_ID}
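Note that with a fixed --output name, as above, every array task appends to the same file; the filename patterns %A (the master job ID) and %a (the task index) give each task its own file. A common pattern, sketched here with a hypothetical params.txt, is to use the task ID to select one line of parameters per task:

#SBATCH --output=job-%A_%a.out

# pick line number SLURM_ARRAY_TASK_ID out of params.txt
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)
myExe.exe ${PARAMS}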

There are pre-configured limits on how many array tasks you can submit in a single request.

scontrol show config | grep MaxArraySize
MaxArraySize            = 5000

So on MonARCH, you can only submit 5000 array tasks per submission. This parameter also constrains the maximum index value to 5000.

If you want to exceed the index limit, you can use arithmetic inside your SLURM submission script. For example, to shift the index so that you scan from 5001 to 10000:

#SBATCH --array=1-5000

x=5000
new_index=$((${SLURM_ARRAY_TASK_ID} + $x))
#new_index will then range from 5001 to 10000

SLURM: Email Notification

SLURM can email users with information on their job. This is enabled with the following flags.

#!/bin/env bash
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --mail-user=researcher@monash.edu

See man sbatch for all options associated with the email alerts.