Default Values For Selecting Hardware

A number of SLURM mechanisms are available to select different hardware:

  • partitions

  • QOS

  • gres

  • constraint

Not all of these mechanisms need to be specified when submitting; they are listed here only for completeness. Users have a default partition and QOS, so they do not need to specify them when submitting jobs. However, doing so does no harm, and may remind them of their values.

Name       Default Value
partition  comp
QOS        normal
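As a sketch, the defaults listed above could be stated explicitly in a submission script; the directives below simply restate the default partition and QOS, so they change nothing but make the job's settings visible (the time limit is illustrative only).

```shell
#!/bin/bash
# Explicitly request the default partition and QOS.
# This is harmless, since these are already the defaults.
#SBATCH --partition=comp
#SBATCH --qos=normal
# An illustrative one-hour walltime request.
#SBATCH --time=01:00:00
```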

The default values are likely to change over time as we add new hardware and optimize the system. Current values can be found with the commands below. For more information on their output, please use the manual pages.

  • man scontrol

  • man sacctmgr

Command                       Values
scontrol show partitions      Lists all our partitions; currently short, comp and gpu
scontrol show partition comp  Detailed information on the comp partition, including
                              Maximum Wall Time (7 days) and Default Memory per CPU (4096M)
sacctmgr show qos normal \
  format="Name,MaxWall,MaxCPUSPerUser,MaxTresPerUser%20"
                              Name    MaxWall     MaxCPUsPU  MaxTRESPU
                              normal  7-00:00:00  65         cpu=65,gres/gpu=3

Please note that MonARCH uses the QOS to control how much of the cluster a single user can use. For the QOS normal, a user has:

  • a maximum of 65 CPUs (cores)

  • a maximum of 3 GPU cards

  • a maximum wall time of 7 days
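As a sketch, a single job at these per-user limits might be requested as follows; the task layout is illustrative only, and in practice a request just below the CPU cap often packs onto nodes more cleanly than the full 65.

```shell
#!/bin/bash
# Request (close to) the full per-user allowance under QOS normal:
# up to 65 CPUs and a 7-day walltime. 64 tasks is an illustrative
# choice that stays within the 65-CPU cap.
#SBATCH --ntasks=64
#SBATCH --time=7-00:00:00
```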

Partitions Available

MonARCH hardware is split into several partitions.

The default partition for all submitted jobs is:

  • comp for compute nodes

Other partitions include:

  • short for jobs with a walltime < 1 day. These jobs run only on older (MonV1) hardware.

  • gpu for the GPU nodes

Example: to run a job of less than one hour in the short partition, put this in your SLURM submission script.

#SBATCH --partition=short

Selecting a particular CPU Type

The hardware available consists of several sorts of nodes; all nodes have hyper-threading turned off.

  • mi* nodes are 36 core Xeon-Gold-6150 @ 2.70GHz servers with 158893MB usable memory

  • gp* nodes are 28 core Xeon-E5-2680-v4 @ 2.40GHz servers with 241660MB usable memory. Each gp server has two P100 GPU cards.

  • mk* nodes are 48 core Xeon-Platinum-8260 @ 2.4GHz servers with 342000M usable memory.

  • md* nodes are 48 core Xeon-Gold-5220R @ 2.20GHz servers with 735000M usable memory. Each server has two processors with 24 cores each.

  • hm00. This single node is a 36 core Xeon-Gold-6150 @ 2.7GHz server with 1.4TB usable memory.

Sometimes users may want to constrain themselves to a particular CPU type, e.g. for timing reasons. In that case, they need to specify the CPU type with a constraint flag in the SLURM submission script. The CPU type of a particular node can be viewed by running:

scontrol show node <nodename>

and then looking for the Features field.

Examples:

# this directive restricts the job to nodes with Xeon-Gold-6150 processors (the mi* nodes)
#SBATCH --constraint=Xeon-Gold-6150

This feature should only be used if you must have a particular processor. Jobs will schedule faster if you do not use it.

Selecting a particular server

Users can restrict their jobs to a particular server if they wish.

Example: Only run jobs on server ge00

#SBATCH --nodelist=ge00

Selecting a GPU Node

To request one or more GPU cards, you need to specify:

  • the gpu partition

  • the number and type of GPU in a gres statement. Your running program will only be allowed access to the number of cards that you specify.

You should not use the constraint feature described above.

# these directives request one P100 card on a node (a gp* machine)
#SBATCH --partition=gpu
#SBATCH --gres=gpu:P100:1
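Putting the pieces together, a minimal GPU job script might look like the sketch below; the job name and script name are placeholders, and nvidia-smi stands in for your actual GPU program.

```shell
#!/bin/bash
# Minimal GPU job sketch: one task, one P100 card, one hour.
#SBATCH --job-name=gpu-test
#SBATCH --partition=gpu
#SBATCH --gres=gpu:P100:1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# SLURM exposes only the cards granted by the gres request above;
# nvidia-smi is a placeholder for your real GPU workload.
nvidia-smi
```

Submit the script with sbatch, e.g. `sbatch gpu-test.sh` (the filename is a placeholder).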