Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

Prerequisites

System Architecture

  • Slurm Master Head (slurmctld):
    • test-slurm-master
  • Slurm Compute Node (slurmd):
    • test-slurm-node1
    • test-slurm-node2
  • Slurm DataBase Daemon (slurmdbd)
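
All nodes should be able to resolve each other by hostname. A minimal /etc/hosts sketch is shown below; the addresses are illustrative placeholders, not the ones used in the original setup.

192.168.1.10   test-slurm-master
192.168.1.11   test-slurm-node1
192.168.1.12   test-slurm-node2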

Basic System Configuration

# disable SELinux (takes effect after the reboot below)
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
# stop and disable the firewall
systemctl disable --now firewalld.service
# enable the EPEL repository
yum install epel-release -y
 reboot
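
After the reboot, it is worth confirming that both changes took effect; a quick check (assuming the defaults set above) could look like this:

# should print "Disabled"
getenforce
# should print "inactive"
systemctl is-active firewalld.service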

Configure NTP

yum install ntp chrony -y
# edit /etc/chrony.conf and point it at the local NTP server
server ntp-server-1 iburst
# start the service
systemctl enable --now chronyd.service
systemctl restart chronyd.service
# check the NTP status
ntpstat
chronyc sources
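
For reference, a minimal /etc/chrony.conf could look like the sketch below; apart from the server line these are the stock EL7 defaults and may differ on your system.

# use the local NTP server instead of the default pool servers
server ntp-server-1 iburst
# stock defaults
driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync
logdir /var/log/chrony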

Configure LDAP

All services in the Slurm cluster must see consistent uid/gid values. There are two ways to achieve this:

  1. Create local users/groups with identical uids and gids on every server in the cluster
  2. Have every server in the cluster resolve user ids from a central LDAP authentication server

389ds is recommended as the LDAP authentication server.

The configuration of 389ds and sssd is beyond the scope of this post.

yum install sssd openldap-clients nfs-utils autofs nfs4-acl-tools -y
systemctl enable --now autofs sssd
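
Whichever method is used, verify that accounts resolve to the same uid/gid on every node. A quick check over ssh (testuser is a hypothetical account name):

# compare uid/gid resolution across the cluster
for host in test-slurm-master test-slurm-node1 test-slurm-node2; do
    echo -n "$host: "; ssh "$host" id testuser
done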

Install and Configure munge

yum install munge munge-libs munge-devel -y
# on master head node
/usr/sbin/create-munge-key -f
chown munge: /etc/munge/munge.key
chmod 0400 /etc/munge/munge.key
# send this key to all compute nodes:
scp /etc/munge/munge.key root@test-slurm-node1:/etc/munge
scp /etc/munge/munge.key root@test-slurm-node2:/etc/munge
# on all compute nodes:
chown -R munge: /etc/munge /var/log/munge

# enable munge service on master head node and all compute nodes
systemctl enable --now munge.service

# verify the munge service from the master head node
munge -n
munge -n | unmunge
munge -n | ssh test-slurm-node1 unmunge
remunge
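
The same credential round-trip can be run against every compute node in one loop; a small sketch:

# encode a credential locally and decode it on each compute node;
# every line should report STATUS: Success
for host in test-slurm-node1 test-slurm-node2; do
    munge -n | ssh "$host" unmunge | grep STATUS
done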

Build and Install Slurm

Install build dependencies

yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad munge-devel mariadb-devel gtk2-devel perl perl-ExtUtils-MakeMaker http-parser-devel json-c-devel -y
yum install rpm-build -y

Edit the rpmbuild macros

cat ~/.rpmmacros
%_without_debug		"--enable-debug"
%_with_slurmrestd	"--enable-slurmrestd"

Build the Slurm RPMs

rpmbuild -ta slurm-version.tar.bz2
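
If the source tarball is not already present, it can be downloaded from SchedMD's download area first, for example (substitute the desired release for <version>):

wget https://download.schedmd.com/slurm/slurm-<version>.tar.bz2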

Install Slurm

# on master head node
yum localinstall ~/rpmbuild/RPMS/x86_64/*.rpm
# on compute nodes
cd ~/rpmbuild/RPMS/x86_64/
yum localinstall slurm-version.rpm slurm-perlapi-version.rpm slurm-slurmd-version.rpm

Install and Configure MariaDB

MariaDB can be installed on the master head node or on a dedicated database server.

yum install mariadb-server mariadb-devel -y
systemctl enable --now mariadb
mysql_secure_installation
mysql -u root -p
# In MariaDB:
MariaDB[(none)]> CREATE DATABASE slurm_acct_db;
MariaDB[(none)]> GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' IDENTIFIED BY '1234' WITH GRANT OPTION;
MariaDB[(none)]> FLUSH PRIVILEGES;
MariaDB[(none)]> SHOW VARIABLES LIKE 'have_innodb';
MariaDB[(none)]> quit;

Verify the database configuration

mysql -u slurm -p

Enter the password set above (1234). In MariaDB:

MariaDB[(none)]> show grants;
MariaDB[(none)]> quit;

Create the file /etc/my.cnf.d/innodb.cnf with the following content:

[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900

Clean up the old InnoDB log files (required after changing innodb_log_file_size)

systemctl stop mariadb
mv /var/lib/mysql/ib_logfile? /tmp/
systemctl start mariadb

You can check the current setting in MariaDB like so:

MariaDB[(none)]> SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

The file /etc/slurm/slurmdbd.conf should contain roughly the following:

AuthType=auth/munge
DbdAddr=localhost
DbdHost=localhost
DbdPort=6819
SlurmUser=slurm
LogFile=/var/log/slurmdbd.log
StorageType=accounting_storage/mysql
StorageUser=slurm
StoragePass=1234
StorageLoc=slurm_acct_db

Fix ownership and permissions

chown slurm: /etc/slurm/slurmdbd.conf
chmod 0600 /etc/slurm/slurmdbd.conf
touch /var/log/slurmdbd.log
chown slurm: /var/log/slurmdbd.log

Test-run slurmdbd

slurmdbd -D -vvv
# stop the foreground test with Ctrl-C, then on the master head node:
systemctl enable --now slurmdbd

Slurm configuration file

The slurm.conf configuration file must be identical across the entire cluster.

On a compute node, the hardware configuration can be inspected with the following command:

slurmd -C
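
The first line of its output can be pasted almost directly into slurm.conf. On one of the test nodes it would look roughly like this (the values are illustrative and depend on the actual hardware):

NodeName=test-slurm-node1 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=1 ThreadsPerCore=2 RealMemory=7821
UpTime=1-02:34:56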

The official Slurm configuration generator (configurator.html) can also be used to produce this file.

A reference slurm.conf is shown below:

#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=clustername
ControlMachine=test-slurm-master
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#ProctrackType=proctrack/cgroup
#PluginDir=
#FirstJobId=
#ReturnToService=0
ReturnToService=1
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
TaskPlugin=task/affinity
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=120
SlurmdTimeout=120
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#JobCompType=jobcomp/none
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=test-slurm-master
AccountingStoragePort=6819
AccountingStoreJobComment=YES
#JobCompType=jobcomp/slurmdbd
#JobAcctGatherFrequency=30
#JobAcctGatherType=jobacct_gather/linux

#
# COMPUTE NODES
#NodeName=linux[1-2] Procs=1 State=UNKNOWN
NodeName=test-slurm-node1 NodeAddr=x.x.x.x CPUs=4 Sockets=2 ThreadsPerCore=2 State=UNKNOWN
NodeName=test-slurm-node2 NodeAddr=x.x.x.x CPUs=4 Sockets=2 ThreadsPerCore=2 State=UNKNOWN
PartitionName=production Nodes=ALL Default=YES MaxTime=INFINITE State=UP
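
The spool and log locations referenced above may not exist by default. A sketch of preparing them and keeping slurm.conf in sync (paths taken from the slurm.conf above):

# on the master head node
mkdir -p /var/spool/slurm/ctld
chown slurm: /var/spool/slurm/ctld
touch /var/log/slurmctld.log
chown slurm: /var/log/slurmctld.log
# on every compute node
mkdir -p /var/spool/slurm/d
touch /var/log/slurmd.log
# distribute the same slurm.conf to all nodes
scp /etc/slurm/slurm.conf root@test-slurm-node1:/etc/slurm/
scp /etc/slurm/slurm.conf root@test-slurm-node2:/etc/slurm/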

Enable the services at boot

# on master head node
systemctl enable --now slurmctld.service
# on compute nodes
systemctl enable --now slurmd.service
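
With accounting enabled, the cluster also needs to be registered in the accounting database under the ClusterName used in slurm.conf; a sketch:

# register the cluster (the name must match ClusterName in slurm.conf)
sacctmgr add cluster clustername
# confirm the controller and compute nodes are visible
sinfo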

Slurm test commands

# show all compute nodes
scontrol show nodes
# update a node's state
scontrol update nodename=node1 state=resume
# run a quick test job across two nodes
srun -N2 hostname
# show jobs
scontrol show jobs
# submit a batch script
sbatch -n16 script-file
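
A minimal batch script for the sbatch example above might look like this (job name, task count and time limit are arbitrary; the partition matches the slurm.conf above):

#!/bin/bash
#SBATCH --job-name=test-job
#SBATCH --partition=production
#SBATCH --ntasks=16
#SBATCH --time=00:10:00
# print the hostname of every allocated task
srun hostname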

EXAMPLES

-> sacctmgr create cluster tux

-> sacctmgr create account name=apollo description='Apollo Project' organization=trustnetic

-> sacctmgr add user name=ithelpdesk adminlevel=admin account=apollo

-> sacctmgr show account -s

-> sacctmgr show user -s

-> scontrol show job jobid (display all of a job's characteristics)

-> scontrol -d show job jobid (display all of a job's characteristics, including the batch script)

-> scontrol update JobID=jobid Account=science (change the job's account to the science account)

-> scontrol update JobID=jobid Partition=apollo (change the job's queue to the apollo queue)

-> scontrol hold jobid

-> scontrol release jobid

-> scancel jobid

-> scancel -s signal jobid

-> sacct -j jobid --long

-> sacct -j jobid -o JobID,JobName,AllocCPUS

-> sshare

-> sacctmgr show user user_name WithAssoc

-> scontrol reconfigure
Run this on the master node after modifying slurm.conf and distributing it to all compute and login nodes; it makes the daemons reread the configuration.

Slurm Commands

  • sacct: display accounting data for all jobs and job steps in the Slurm database
  • sacctmgr: display and modify Slurm account information
  • salloc: request an interactive job allocation
  • sattach: attach to a running job step
  • sbatch: submit a batch script to Slurm
  • scancel: cancel a job or job step or signal a running job or job step
  • scontrol: display (and modify when permitted) the status of Slurm entities. Entities include: jobs, job steps, nodes, partitions, reservations, etc.
  • sdiag: display scheduling statistics and timing parameters
  • sinfo: display node partition (queue) summary information
  • sprio: display the factors that comprise a job’s scheduling priority
  • squeue: display the jobs in the scheduling queues, one job per line
  • sreport: generate canned reports from job accounting data and machine utilization statistics
  • srun: launch one or more tasks of an application across requested resources
  • sshare: display the shares and usage for each charge account and user
  • sstat: display process statistics of a running job step
  • sview: a graphical tool for displaying jobs, partitions, reservations, and Blue Gene blocks
  • smap: graphically view information about Slurm jobs, partitions, and configuration parameters

For example, to list all nodes with detailed information:

    sinfo -N -l

Job States

The basic job states are these:

  • Pending - the job is in the queue, waiting to be scheduled
  • Held - the job was submitted, but was put in the held state (ineligible to run)
  • Running - the job has been granted an allocation. If it’s a batch job, the batch script has been run
  • Complete - the job has completed successfully
  • Timeout - the job was terminated for running longer than its wall clock limit
  • Preempted - the running job was terminated to reassign its resources to a higher QoS job
  • Failed - the job terminated with a non-zero status
  • Node Fail - the job terminated after a compute node reported a problem

For the complete list, see the “JOB STATE CODES” section under the squeue man page.
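
The state codes appear in the ST column of squeue output. For example, to list only pending and running jobs for the current user:

squeue -u $USER -t PENDING,RUNNING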

Related links:

  • https://hpc.llnl.gov/banks-jobs/running-jobs/slurm-user-manual

  • https://wiki.fysik.dtu.dk/niflheim/Slurm_installation

  • https://docs.rc.fas.harvard.edu/kb/convenient-slurm-commands/

  • https://www.brightcomputing.com/blog/bid/174099/slurm-101-basic-slurm-usage-for-linux-clusters

  • https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html

  • http://www.top500.org/ (Top500 supercomputers)

This article is licensed under the Creative Commons Attribution 4.0 International License (CC-BY 4.0). Please credit the source when reposting: https://snowfrs.com/2020/10/24/slurm-install.html. Corrections of any inaccuracies or ambiguities are welcome.