Slurm Introduction
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
Preparation
System Architecture
- Slurm Master Head (slurmctld):
  - test-slurm-master
- Slurm Compute Node (slurmd):
  - test-slurm-node1
  - test-slurm-node2
- Slurm DataBase Daemon (slurmdbd)
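All three hosts should be able to resolve each other's names. Purely for illustration (the addresses below are placeholders, not values from the original setup), /etc/hosts on every machine could look like:
# /etc/hosts (sketch; replace the addresses with your own)
192.168.1.10  test-slurm-master
192.168.1.11  test-slurm-node1
192.168.1.12  test-slurm-node2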
Basic System Configuration
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
systemctl disable --now firewalld.service
yum install epel-release -y
reboot
Configure NTP
yum install ntp chrony -y
# Edit /etc/chrony.conf and point it at your NTP server
server ntp-server-1 iburst
# Start the service
systemctl enable --now chronyd.service
systemctl restart chronyd.service
# Check the NTP status
ntpstat
chronyc sources
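For reference, a minimal /etc/chrony.conf built around the server line above might look like the sketch below; ntp-server-1 is a placeholder for your site's NTP server and the remaining directives are the usual distribution defaults.
# /etc/chrony.conf (sketch)
server ntp-server-1 iburst
driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync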
Configure LDAP
All servers in the Slurm cluster must see consistent uid and gid values.
There are two ways to achieve this:
- create local users/groups with identical uid/gid on every server in the cluster
- have every server fetch user id information from a central LDAP authentication server
389ds is recommended as the LDAP authentication server.
The configuration of 389ds and sssd is not covered here.
yum install sssd openldap-clients nfs-utils autofs nfs4-acl-tools -y
systemctl enable --now autofs sssd
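Whichever approach is used, verify that a given account resolves to the same uid/gid on every host; testuser below is a hypothetical account name.
# Run on each host (or over ssh from the master); the output must be identical everywhere
id testuser
getent passwd testuser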
Install and Configure munge
yum install munge munge-libs munge-devel -y
# on master head node
/usr/sbin/create-munge-key -f
chown munge: /etc/munge/munge.key
chmod 0400 /etc/munge/munge.key
# send this key to all compute nodes:
scp /etc/munge/munge.key root@test-slurm-node1:/etc/munge
scp /etc/munge/munge.key root@test-slurm-node2:/etc/munge
# on all compute nodes:
chown -R munge: /etc/munge /var/log/munge
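If you prefer to drive these per-node steps from the master head node, the following loop is an equivalent sketch (node names as above; the chmod is an addition that tightens the key permissions on the compute nodes):
for host in test-slurm-node1 test-slurm-node2; do
  ssh root@$host "chown -R munge: /etc/munge /var/log/munge && chmod 0400 /etc/munge/munge.key"
done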
# enable munge service on master head node and all compute nodes
systemctl enable --now munge.service
# verify the munge service from **master** head node
munge -n
munge -n | unmunge
munge -n | ssh test-slurm-node1 unmunge
remunge
Build and Install Slurm
Install the build dependencies
yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad munge-devel mariadb-devel gtk2-devel perl perl-ExtUtils-MakeMaker http-parser-devel json-c-devel -y
yum install rpm-build -y
Modify the rpmbuild macros
cat ~/.rpmmacros
%_without_debug "--enable-debug"
%_with_slurmrestd "--enable-slurmrestd"
Build the Slurm RPMs
rpmbuild -ta slurm-version.tar.bz2
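slurm-version.tar.bz2 is a placeholder for the release tarball. As an illustration only (20.02.5 is an assumed version; substitute whichever release you actually deploy), the full sequence is roughly:
# Download a release from https://download.schedmd.com/slurm/ and build it
wget https://download.schedmd.com/slurm/slurm-20.02.5.tar.bz2
rpmbuild -ta slurm-20.02.5.tar.bz2
# The resulting packages are written to ~/rpmbuild/RPMS/x86_64/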
Install the Slurm RPMs
# on master head node
yum localinstall ~/rpmbuild/RPMS/x86_64/*.rpm
# on compute nodes
cd ~/rpmbuild/RPMS/x86_64/
yum localinstall slurm-version.rpm slurm-perlapi-version.rpm slurm-slurmd-version.rpm
Install and Configure MariaDB
MariaDB can be installed on the master head node or on a dedicated server.
yum install mariadb-server mariadb-devel
systemctl enable --now mariadb
mysql_secure_installation
mysql -u root -p
#In MariaDB:
MariaDB[(none)]> GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' IDENTIFIED BY '1234' with grant option;
MariaDB[(none)]> SHOW VARIABLES LIKE 'have_innodb';
MariaDB[(none)]> FLUSH PRIVILEGES;
MariaDB[(none)]> CREATE DATABASE slurm_acct_db;
MariaDB[(none)]> quit;
Verify the database configuration
mysql -u slurm -p
Enter the password set above (1234). In MariaDB:
MariaDB[(none)]> show grants;
MariaDB[(none)]> quit;
Create the file /etc/my.cnf.d/innodb.cnf with the following content:
[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
Clean up the old InnoDB log files (so the new innodb_log_file_size takes effect)
systemctl stop mariadb
mv /var/lib/mysql/ib_logfile? /tmp/
systemctl start mariadb
You can check the current setting in MariaDB like so:
MariaDB[(none)]> SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
The file /etc/slurm/slurmdbd.conf should contain roughly the following (AuthType, SlurmUser, LogFile, StorageType and StorageUser are standard settings not shown in the original excerpt, added here for completeness):
AuthType=auth/munge
DbdAddr=localhost
DbdHost=localhost
DbdPort=6819
SlurmUser=slurm
LogFile=/var/log/slurmdbd.log
StorageType=accounting_storage/mysql
StorageUser=slurm
StoragePass=1234
StorageLoc=slurm_acct_db
Set ownership and permissions
chown slurm: /etc/slurm/slurmdbd.conf
chmod 0600 /etc/slurm/slurmdbd.conf
touch /var/log/slurmdbd.log
chown slurm: /var/log/slurmdbd.log
Test-run slurmdbd
slurmdbd -D -vvv
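# Stop it with Ctrl-C once it starts without errors, then enable the service below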
# on master head node
systemctl enable --now slurmdbd
Slurm configuration file
The slurm.conf configuration file must be identical across the entire cluster.
On a compute node, you can print the detected hardware configuration (CPUs, sockets, cores, threads, memory) in slurm.conf syntax with:
slurmd -C
The official Slurm configuration generator (configurator.html) can be used to generate the file.
An example slurm.conf is shown below for reference:
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=clustername
ControlMachine=test-slurm-master
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#ProctrackType=proctrack/cgroup
#PluginDir=
#FirstJobId=
#ReturnToService=0
ReturnToService=1
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
TaskPlugin=task/affinity
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=120
SlurmdTimeout=120
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#JobCompType=jobcomp/none
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=test-slurm-master
AccountingStoragePort=6819
AccountingStoreJobComment=YES
#JobCompType=jobcomp/slurmdbd
#JobAcctGatherFrequency=30
#JobAcctGatherType=jobacct_gather/linux
#
# COMPUTE NODES
#NodeName=linux[1-2] Procs=1 State=UNKNOWN
NodeName=test-slurm-node1 NodeAddr=x.x.x.x CPUs=4 Sockets=2 ThreadsPerCore=2 State=UNKNOWN
NodeName=test-slurm-node2 NodeAddr=x.x.x.x CPUs=4 Sockets=2 ThreadsPerCore=2 State=UNKNOWN
PartitionName=production Nodes=ALL Default=YES MaxTime=INFINITE State=UP
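Note that StateSaveLocation and SlurmdSpoolDir in the example point at directories that do not exist by default; the commands below (an addition to the original steps, matching the paths above) create them along with the log files.
# On the master head node
mkdir -p /var/spool/slurm/ctld
chown slurm: /var/spool/slurm/ctld
touch /var/log/slurmctld.log
chown slurm: /var/log/slurmctld.log
# On each compute node
mkdir -p /var/spool/slurm/d
chown slurm: /var/spool/slurm/d
touch /var/log/slurmd.log
chown slurm: /var/log/slurmd.log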
Enable the services at boot
# on master head node
systemctl enable --now slurmctld.service
# on compute nodes
systemctl enable --now slurmd.service
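Because accounting via slurmdbd is enabled in slurm.conf, the cluster also has to be registered once in the accounting database; a minimal sketch using the ClusterName from the example configuration:
sacctmgr add cluster clustername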
Slurm test commands
# Show all compute nodes
scontrol show nodes
# Update a node's state
scontrol update nodename=node1 state=resume
# Test job execution
srun -N2 hostname
# Show jobs
scontrol show jobs
# Submit a batch script
sbatch -n16 script-file
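script-file above is a placeholder. A minimal batch script, here given the hypothetical name test.sh and using the production partition from the example slurm.conf, could look like:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=production
#SBATCH --ntasks=16
#SBATCH --time=00:10:00
# Print the hostname once per allocated task
srun hostname
Submit it with sbatch test.sh and watch its progress with squeue.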
EXAMPLES
-> sacctmgr create cluster tux
-> sacctmgr create account name=apollo description='Apollo Project' organization=trustnetic
-> sacctmgr add user name=ithelpdesk adminlevel=admin account=apollo
-> sacctmgr show account -s
-> sacctmgr show user -s
-> scontrol show job jobid (display all of a job's characteristics)
-> scontrol -d show job jobid (display all of a job's characteristics, including the batch script)
-> scontrol update JobID=jobid Account=science (change the job's account to the science account)
-> scontrol update JobID=jobid Partition=apollo (change the job's queue to the apollo queue)
-> scontrol hold jobid
-> scontrol release jobid
-> scancel jobid
-> scancel -s signal jobid
-> sacct -j jobid --long
-> sacct -j jobid -o JobID,JobName,AllocCPUS
-> sshare
-> sacctmgr show user user_name WithAssoc
-> scontrol reconfigure
After changing slurm.conf, distribute it to all compute and login nodes, then **run this on the master node to make the daemons reread the configuration**.
Slurm Commands
- sacct: display accounting data for all jobs and job steps in the Slurm database
- sacctmgr: display and modify Slurm account information
- salloc: request an interactive job allocation
- sattach: attach to a running job step
- sbatch: submit a batch script to Slurm
- scancel: cancel a job or job step or signal a running job or job step
- scontrol: display (and modify when permitted) the status of Slurm entities. Entities include: jobs, job steps, nodes, partitions, reservations, etc.
- sdiag: display scheduling statistics and timing parameters
- sinfo: display node partition (queue) summary information
- sprio: display the factors that comprise a job's scheduling priority
- squeue: display the jobs in the scheduling queues, one job per line
- sreport: generate canned reports from job accounting data and machine utilization statistics
- srun: launch one or more tasks of an application across requested resources
- sshare: display the shares and usage for each charge account and user
- sstat: display process statistics of a running job step
- sview: a graphical tool for displaying jobs, partitions, reservations, and Blue Gene blocks
- smap: graphically view information about Slurm jobs, partitions, and set configurations parameters
A handy variant that lists every node in long format:
sinfo -N -l
Job States
The basic job states are these:
- Pending - the job is in the queue, waiting to be scheduled
- Held - the job was submitted, but was put in the held state (ineligible to run)
- Running - the job has been granted an allocation. If it's a batch job, the batch script has been run
- Complete - the job has completed successfully
- Timeout - the job was terminated for running longer than its wall clock limit
- Preempted - the running job was terminated to reassign its resources to a higher QoS job
- Failed - the job terminated with a non-zero status
- Node Fail - the job terminated after a compute node reported a problem
For the complete list, see the "JOB STATE CODES" section under the squeue man page.
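For example, pending and running jobs (state codes PD and R) can be listed with:
squeue -t PD,R -l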
Related links:
- https://hpc.llnl.gov/banks-jobs/running-jobs/slurm-user-manual
- https://wiki.fysik.dtu.dk/niflheim/Slurm_installation
- https://docs.rc.fas.harvard.edu/kb/convenient-slurm-commands/
- https://www.brightcomputing.com/blog/bid/174099/slurm-101-basic-slurm-usage-for-linux-clusters
- https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html
- http://www.top500.org/ (Top500 supercomputers)
This article is licensed under the Creative Commons Attribution 4.0 International License (CC-BY 4.0). When republishing, please credit the source: https://snowfrs.com/2020/10/24/slurm-install.html. Readers are welcome to check the cited references and to point out any inaccuracies or ambiguities.