HPC Batching System

Batch processing for "loosely coupled" compute clusters, eg LSF, PBS/TORQUE, SGE.
The traditional supercomputer seems as rare as dinosaurs; even supercomputing centers run a batch submission system like GE or SLURM or some such.
"Cluster" tends to refer to highly similar systems set up with the intention of using them as a single system. "Grid" tends to cover a wider variety of systems (eg workstation queues).

Note that instead of using a batch system, a number of tools exist to carry out the embarrassingly parallel task of running the same program many times over, in parallel, against a large number of input files (eg GNU parallel, also mentioned in the Slurm array job section below).
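For example, GNU parallel can fan the same program out over many input files; a minimal sketch, with made-up program and file names:

  # run ./myprog once per input file, up to 16 at a time ("myprog" and input/*.txt are placeholders)
  ls input/*.txt | parallel -j 16 ./myprog {}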

Batch System 101

Think of old-days mainframes: the user submits a job, it runs non-interactively, and the output is collected at the end.
The sys admin has to worry about scheduling, fair sharing, etc. Things the OS used to do are now exposed to the sys admin, who may need to tweak them. Deja vu? Time to dig out that Operating Systems book from college :)

DanT Grid Engine for Dummies

HPC batch systems are increasingly adopting containers. For a feature comparison table of Shifter vs Singularity, and a list of relevant articles, see Docker vs Singularity vs Shifter vs Univa Grid Engine Container Edition.

Command Summary

Below is a quick comparison of commands between several popular batch/queue management systems. A "Rosetta Stone" of Batch/Queue systems on Unix/Linux :)
Task                    LSF             PBS/Torque      SGE                     SLURM
--------------------    --------------- --------------- ----------------------- ------------------------------
Job submission          bsub            qsub            qsub                    sbatch
Job monitor             bjobs           qstat           qstat -j JOBID          squeue -j JOBID
                                                        qstat -u USERNAME       squeue -u USERNAME
Job removal             bkill           qdel            qdel                    scancel

Job control             bstop                           qhold                   scontrol hold
                        bresume                         qrls                    scontrol release
                        bpeek                                                   sattach

Interactive/real time   bsub -I         qsub -I         qrsh / qlogin           srun (also salloc)

Check nodes             bhosts          pbsnodes -a     qhost / qstat -f        sinfo -N / scontrol show nodes
                        lshosts
                        lsload

List avail queues                                       qconf -sql              scontrol show partition
Queue status            bqueues         qstat -q        qstat -g c              sinfo
Queue behaviour         bparams                                                 scontrol show partition / smap

User accounting         bacct                                                   sacct

Config                                                  qconf                   scontrol

GUI                                                     qmon                    sview

Other intro references:
  • Rosetta Stone of workload manager/scheduler commands (PDF; covers PBS, Slurm, LSF, SGE, LoadLeveler)

  • Flux - https://github.com/flux-framework/flux-core
  • Intro to Slurm as presented in LBNL LabTech 2018 by yours truly :).

    SLURM

    Two-page Slurm cheat sheet (alt cache)

    Slurm 101

    SLURM = Simple Linux Utility for Resource Management. But it isn't so simple anymore, and it has been rebranded as "Slurm", like the drink in Futurama :)

    Job Submission

    sbatch	myscript	# submit script to run in batch mode 
    
    salloc			# run interactive job, asking scheduler to allocate certain resources to it
    salloc bash		# get a bash shell in interactive mode with the requested resources
    
    srun			    # typically for MPI jobs: create a job allocation and launch a job step.
    srun --pty -p lr3           -A scs --qos=lr_normal    -N 1      -t 0:59:00     /bin/bash  # interactive bash shell for 59 min on 1 Node.
    srun --pty -p savio2_gpu    -A scs --qos savio_normal -N 1 -n 1 -t 60 --gres=gpu:1  bash  # requesting gpu node as gres
    srun --pty -p savio3_2080ti -A scs --qos savio_normal -n 2 -t 60  hostname
    
    
    sattach			# connect to the stdin, stdout and stderr of a job or job step
    
    
    
    Options common to most/all job submission commands:
    
    sbatch/salloc/srun
    	--ntasks=100	# set number of tasks to 100 (-n 100)
    	--nodes=x --ntasks-per-node=y	# less recommended way to run large MPI job
    	--time=10	# time limit in minutes (wall clock after job starts)   (also help scheduler do backfill)
    	--dependency=afterok:JOBID	# job dependency: like a hold until the parent job has finished OK before starting this one.
    
    
    	--account=NAME	# which account to use for the job
    	--qos=NAME	# what QoS to run (may affect cost, preemption, etc)
    	--label		# prepend output with task number
    	--nodes=2	# request 2 nodes for this job (-N 2)
    	--exclusive	# have exclusive use of node when running 
    
    	--multi-prog slurmjob.conf	# run a job where different tasks run different programs, as specified in the config file given as the arg
    					# (see example conf sketch below, after this option list)
    					# task ID starts at 0
    					# %t = task ID
    					# %o = offset within task ID range
    
    	--licenses=foo:2,bar:4		# allocate license with the job.
    	-J jobname
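    
    A minimal sketch of a --multi-prog config file (the program names are made up); each line is "task-range  program  [args]":
    
    # slurmjob.conf
    0      ./controller
    1-3    ./worker --rank %t        # %t expands to the task ID
    
    srun --ntasks=4 --multi-prog slurmjob.conf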
    
    
    example job submissions with constraints (features in slurm.conf)
    
    srun --pty --partition=savio3_2080ti --account=scs --qos=geforce_2080ti3_normal -n 2 --gres=gpu:GTX2080TI:1 --constraint=ssd --time=00:30:00 bash
    srun --pty --partition=savio3_2080ti --account=scs --qos=titan_2080ti3_normal -n 2 --gres=gpu:TITAN:1       --constraint=8rtx --time=00:30:00 bash
    
    
    Example with Slurm array jobs: UMBC. Though if input files are not named sequentially, there may not be an easy way to match them, and gnu-parallel with a task file as input would be more flexible, albeit a bit more complicated (see the sketch below).
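    
    A minimal sbatch array job sketch along those lines (partition/account/time and the file names myprog / my_file_list.txt are made-up placeholders):
    
    #!/bin/bash
    #SBATCH --job-name=arraytest
    #SBATCH --partition=savio3          # placeholder; use what sacctmgr shows for your user
    #SBATCH --account=scs               # placeholder
    #SBATCH --time=0:30:00
    #SBATCH --array=1-100               # task IDs 1..100, one task per input line
    
    # pick the Nth line of the (hypothetical) file list, same idea as the SGE task array example later on this page
    INFILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" my_file_list.txt)
    ./myprog < "$INFILE"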

    What/Where

    Often, a number of parameters are required to submit jobs, eg: what account am I in? What partitions are there? What QoS are there?
    
    sacctmgr show associations -p user=tin                  # see what associations a specified user has been granted, -p = "parsable"
    sacctmgr show assoc user=tin    format=acc,part,qos        # list account, partition, QoS for the specified USERNAME
    sacctmgr show assoc user=tin    format=qos%58,acc%16,user%12,part%18,share%5,priority%16    # share=1 just means fair share, it is a weight factor for job priority
    sacctmgr show assoc user=tin -p format=acc,user,part,share,priority,qos  | sed "s/|/\t/g"
    
    scontrol show partition					# list "queues"
    
    sacctmgr show qos -p format=name		    	# list QoS
    sacctmgr show qos    format=Name%24,Priority%8,Preempt%18,PreemptMode,GrpTRES%22,MinTRES%26 # savio
    sacctmgr show qos    format=name%26,Priority%8,Preempt%18,PreemptMode,MaxWall,MaxTRESPU,MaxSubmitPU,GrpTRES%22,MinTRES%22,UsageFactor,GraceTime,Flags%4 
    sacctmgr show qos    format=name%26,Priority%8,Preempt%18,PreemptMode,MaxWall,GrpTRES%22,MaxTRES%22,MinTRES%26,MaxTRESPU,MaxSubmitPU
    
    
    

    Job Control

    scancel						# send signal to cancel job or job steps (individual task array job)
    	JOBID					# cancel a job (if task array, ??all job steps)
    	JOBID.TASKID				# cancel only specific job step
    	--user=USERNAME --state=pending		# kill all pending jobs for named user
    
    sbcast				# "broadcast": xfer file to compute nodes running the job, using hierarchical communication
    				# may or may not be faster than a shared FS.
    
    squeue --start -j JOBID	# scheduler returns an estimate of when the job would start running 
    				# may return N/A when --time not specified for other jobs
    
    
    srun_cr				# run with BLCR (Berkeley Lab Checkpoint/Restart) support 
    strigger			# run arbitrary script in response to system events, eg node dies, daemon stops, job reaches time limit, etc
    				# persistent triggers will remain after event occurs
    	-c ?			# create trigger 
    	--destroy
    	--list			# list triggers ?
    
    

    System Info

    sinfo			# status of node, queue, etc
    	--Node		# output info in node-centric view (group similar nodes together)
    	-p PARTITION 	# info on specific partition (queue) 
    	--long		# add info on memory, tmp, reason, etc
    
    sinfo -R -S %E		# output sorted/grouped by Reason 
    sinfo -R -S %H		# output sorted by timestamp 
    sinfo -R -S %n		# output sorted by nodelist
    
    alias sinfo-N='sinfo --Node --format "%N %14P %.8t %E"' # better sinfo --Node; incl idle  ##slurm
        # -N is node centric, ie one node per line, has to be first arg
        # -p PARTNAME  # can add this after aliased command instead of using grep for specific queue
    
    alias sinfo-f='sinfo --Node  --format "%N %.8t %16E %f"' # Node centric info, with slurm feature 
        # %f = feature eg savio3_m1536, which is 16x more ram than savio3_m96
        # feature name change to ponder: 4x1080ti, 2xV100, 4x2080ti, 8xTitanRTX  (do we always get Ti version?) but may not want to change savio2, maybe only add to it
    
    alias sinfo-R='sinfo -R -S %E --format="%9u %19H %6t %N %E"'   # -Sorted by rEason (oper input reason=...) ##slurm
        # %E is comment/reason, unrestricted in length.  
        # once -R is used, it takes precedence over -N, but this output is good for sorting by symptoms
    
    
    
    
    squeue			# status of job and job steps, TIME means job running time so far
    	-s		# info about job steps
    	-u USERNAME	# job info of specified user
    	-t all		# all states (but recent jobs only; historical info is in the accounting records only)
    	-i60		# iterate output every 60 sec
    
    squeue -p PARTITION --sort N -l 	# show nodes of a partition sequentially, with long  format (elapsed time and time limit)
    
    squeue -o "%N %l"	# specify output format %N = node, %l = time left
    			# default output:      "%.18i %.9P %.8j      %.8u %.2t %.10M      %.6D %R"
    			# def for -l, --long:  "%.18i %.9P %.8j      %.8u %.8T %.10M %.9l %.6D %R"   
    			# def for -s, --steps: "%.15i      %.8j %.9P %.8u      %.9M       %N"
    
    squeue --sort N -o "%.18i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R"     
    squeue --sort N -o "%R %.6D %.14P  %.18i %.8j %.14u %.8T %.10M %.9l" 
    # %R is node name, but could be multiple nodes, or absent if not RUNNING yet.
    # %j is job name
    # %u is username
    # %T is state info (RUNNING, PENDING, etc)
    # %l = time limit
    
    smap			# topology view of system, job and job steps.  (new tool that combine sinfo and squeue data)
    sview			# new GUI:   view + update on sys, job, step, partition, reservation, topology
    
    scontrol		# admin cli: view + update on sys, job,       partition, reservation
    			# dumps all info out, not particularly user friendly
    scontrol show reservations	# see job/resource reservations
    scontrol show partition		# view partition/queue info
    scontrol show nodes		# view nodes info (as specified to slurm.conf)
    scontrol show config		# view config of whole system (eg when asking mailing list for help)
    scontrol update PartitionName=debug MaxTime=60	# change the MaxTime limit to 60 min for the queue named "debug"
    scontrol show jobid -dd 1234567                 # see job detail, including end time
    scontrol update   jobid=1234567 TimeLimit=5-12	# change the time limit of a specific job to 5.5 days (job runtime extension; note this does not add time to the existing limit, nor set a new finish time, but changes the original wall clock limit, so be sure it is longer than the previous setting, eg the default of 3 days for some queues)
    
    sacct				# view accounting by individual job and job steps (more in troubleshooting below)
    
    sacctmgr show events start=2018-01-01 node=n0270.mako0          # history of sinfo events (added by scontrol) but rEason is truncated :(
    
    
    sstat -j		# on currently running job/job step (more info avail than sacct which is historical)
    	-u USERNAME
    	-p PARTITIONNAME	
    
    sprio			# view job PRIOrity and how factors were weighted
    sshare			# view fair-SHARE info in hierarchical order
    sdiag			# stat and scheduling info (exec time, queue length).  SLURM 2.4+
    
    sreport			# usage reports by cluster, partition, user, account
    			# presents info in a more condensed fashion
    
    

    Env Var

    SLURM reads environment variables to help determine its behavior.
    CLI arguments override env var definitions.

    See:
  • UMBC for most used vars
  • schedmd reference
    SQUEUE_STATE=all		# tell squeue cmd to display job in all state (incl COMPLETED, CANCELLED)
    SBATCH_PARTITION=default_q	# specify the given partition as the default queue
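    
    A quick sketch of the precedence claim above:
    
    export SQUEUE_STATE=all
    squeue -u $USER                 # env var in effect: shows jobs in all states, incl COMPLETED/CANCELLED
    squeue -u $USER -t RUNNING      # the command line flag overrides the env var for this invocation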
    

    SLURM Job / Queue State

    squeue
    Node state code
    
    State with asterisk (*) (eg idle*, draining*) means node is not reachable.
    
    state as reported by sinfo
    down*	Down, * = unreachable
    down?	Down, ? = node reserved
    drng	draining - ie jobs still running, but no new jobs will land on the node
    drain	drained  - ie all jobs have been cleared
    comp	completing??
    idle	online, waiting for action
    alloc	allocated, ie running jobs
    mix	some cpus running jobs, but more resources avail to run more jobs (non-exclusive mode)
    plnd	?? planned?  supposed to be transient?
    
    
    Node state code symbol suffix
    
    $   reserved, maintenance
    *   down, unreachable
    ~   powered off
    !,% powering down
    #   being powered up
    @,^ rebooting
    -   planned by the backfill scheduler for a higher priority job
    
    
    JOB REASON CODES
    Cleaning
    NodeDown 
    
    
    JOB STATE CODES - reported by squeue
    PD	PENDING
    R	RUNNING
    CG	COMPLETING
    CD	COMPLETED
    F	FAILED
    NF	NODE_FAIL
    OOM	OUT_OF_MEMORY
    ST	STOPPED
    TO	TIMEOUT	-- typically time limit exceeded, at the QoS for the partition, user or group level.
    
    
    
    ref: sinfo state code

    SLURM Troubleshooting

    sinfo  | grep n0033.savio2	# what's a node state
    squeue | grep n0164.savio2	# whether the node running any job
    
    
    sacctmgr show associations	# list all associations.  think of them as acls.  
    				# it is the union of all entities in each association entry that governs whether user X has access to resource Y (nodes operating under a certain qos)
    sacctmgr show associations -p user=sn	# associations for the specified user
    					# -p uses | to separate columns, reducing space usage.
    
    sacctmgr show qos	-p	# show all avail Quality of Service entities 
    sacctmgr show qos    format=Name%32,Priority%8,Preempt%18,PreemptMode,UsageFactor,GrpTRES%22,MaxTRES%22,MinTRES%22
    
    # Example qos:
    Name                     |Priority|GraceTime|Preempt      |PreemptMode|||UsageFactor|GrpTRES||||||MaxTRES||||||||||MinTRES|
    genomicdata_savio3_bigmem| 1000000| 00:00:00|savio_lowprio|cluster    |||1.000000   |node=4 ||||||node=4 ||||||||||cpu=1  |
    genomicdata_savio3_xlmem | 1000000| 00:00:00|savio_lowprio|cluster    |||1.000000   |node=1 ||||||node=1 ||||||||||cpu=1|
    
    
    squeue				# full detail of current job queue (all user, all jobs, on all partitions/queues)
    
    squeue -p stanium 		# queue for the named partition
    				# State: (R)unning, PenDing.
    				# Reason: Resources = waiting for resources, eg a free node
    				# 	  Priority = other higher priority jobs are scheduled ahead of this job
    
    sinfo -t I 			# list staTe Idle nodes 
    sinfo -t I -p stanium		# Idle nodes in stanium partition
    
    squeue --start			# estimated start time of each queued job 
    				# N/A if time cannot be estimated (eg non time bound jobs exist)
    
    
    scontrol show reservation 	# show reservations
    				# reservations are typically maintenance windows placed by admins
    				# but sys admins can also create reservations to let users run certain jobs in the reservation window.
    				# jobs will not be started if they would run into such a reservation window
    				# a running job will NOT be killed by slurm if it runs into the reservation
    				# listed users of the reservation can submit jobs flagged to use the reservation
    
    
    scontrol create reservation starttime=2018-01-02T16:15 duration=60 user=tin flags=maint,IGNORE_JOBS nodes=n0142.cf1,n0143.cf1 reservation=rn_666
    scontrol create reservation starttime=2018-01-02T16:15 duration=10-0:0:0 user=tin flags=maint nodes=n0142.cf1,n0143.cf1 reservation=rn_66r76
    scontrol create reservation start=2018-01-10T12:25 end=2018-01-13 user=tin,meli flags=maint nodes=n0018.cf1,n0019.cf1 reservation=resv777
    # create reservation, admin priv may be req
    # duration is number of minutes
    # duration number of 10-0:0 is 10 days, 0 hours, 0 min.  can be abbreviated to 10-0 or even 10-
    # reservation=rn_666 is optional param to give reservation a name, 
    # if not specified, slurm will generate a name for it
    # nodes=lx[00-19] to specify a range should be doable, but so far have not been able to use n00[00-19].cf1.  slurm too old?
    # can have overlapping reservations (eg the same set of nodes reserved under different names across overlapping time windows)
    # presumably the scheduler then accepts jobs per reservation and schedules them as a resource sharing exercise.
    
    scontrol update reservation=rn_666 EndTime=2020-02-03T11:30
    # change end time of reservation
    
    scontrol delete reservation=rn_666
    
    
    sbatch --reservation=rn_666 ...	# submit a job run under reservation
    
    
    Job & Scheduling
    
    sacct				# view accounting by individual job and job steps 
    sacct -j 11200077 		# info about specific job
    sacct -j 11200077 --long -p 	# long view, pipe-separated columns
    
    sacct --helpformat		# list fields that can be displayed
    				# --long and --format are now mutually exclusive
    
    # PD = PENDING, see what resources has been requested
    sacct -u USERNAME -s PD  -p         --format jobid,jobname,acc,qos,state,Timelimit,ReqCPUS,ReqNodes,ReqMem,ReqGRES,ReqTres 	
    
    # timelimit will give idea how long job will run
    sacct -s running  -r QNAME          --format jobid,timelimit,start,elapsed,state,user,acc,jobname				
    
    # when will a job start
    sacct -u USERNAME --partition QNAME --format jobid,jobname,acc,qos,state,DerivedExitCode,Eligible,Start,Elapsed,Timelimit
    
    # NNodes = Allocated or Requested Nodes, depending on job state
    sacct -u USERNAME -r QNAME -p       --format jobid,jobname,acc,qos,state,AllocNodes,AllocTRES,Elapsed,Timelimit,NNodes		
    
    # no data on these max ...
    sacct -X -r QNAME -o jobid,User,Acc,qos,state,MaxDiskWriteNode,MaxDiskReadNode,MaxPagesNode,MaxDiskWriteTask,MaxVMSize,MaxRSS
    
    # MaxRSS will be collected when slurm.conf has clause 'JobAcctGatherType=jobacct_gather/cgroup' 
    # ref: https://slurm.schedmd.com/cgroups.html
    # seff script likely depend on this as well 
    export SACCT_FORMAT="JobID%20,JobName,User,Partition,NodeList,Elapsed,State,ExitCode,MaxRSS,AllocTRES%32,MaxVMSize"
    sacct -j $JOBID		# would report max memory usage under MaxRSS, when configured
    
    # what node a specific job ran on:
    sacct -o job,user,jobname,NodeList,AllocNode,AllocTRES,DerivedExitCode,state,CPUTime,Elapsed  -j 10660573,10666276,10666277,10666278
    
    
    
    # example of querying jobs ran in a certain time frame, in a specific partition (-r or --partition= )
    sacct -p -X -a -r savio3_gpu  -o job,start,end,partition,user,account,jobname,AllocNode,AllocTRES,DerivedExitCode,state --start=2021-12-28T23:59:59 --end=2023-01-01T23:59:59
    
    # find jobs on a specific node.   
    # -N == --nodelist   man page says can take node range, but n0[258-261] didn't work
    sacct -p  -a --nodelist=n0176.savio3  -o job,start,end,partition,user,account,jobname,AllocNode,AllocTRES,DerivedExitCode,state --start=2022-10-11T23:59:59 --end=2023-01-31T23:59:59
    
    # find jobs on a specific node with some sort of fail state
    sacct -p  -a -s F,NF,CA,S,PR,TO,BF --nodelist=n0176.savio3  -o job,start,end,partition,user,account,jobname,AllocNode,AllocTRES,DerivedExitCode,state --start=2022-10-11T23:59:59 --end=2023-01-31T23:59:59
    
    
    
    # all users, failure states, in a certain time frame (hung job not listed here; end time required or get no output):
    sacct -a -X -s F,NF -r lr5  -S 2019-12-16  -E 2019-12-18
    
    # Node Failure, job would have died (-a = all users, may req root):
    sacct -a -X -s NF -r lr5  -o job,start,end,partition,user,jobname,AllocNode,AllocTRES,DerivedExitCode,state,CPUTime,Elapsed -S 2017-12-20  -E 2018-01-01
    
    # many problem states (but may not be comprehensive), for all users 
    sacct -a  -X -s F,NF,CA,S,PR,TO,BF -r lr5  -o job,start,end,partition,user,account,jobname,AllocNode,AllocTRES,NNodes,DerivedExitCode,state -S 2017-12-25  -E 2017-12-31
    
    
    # get an idea of failed/timeout jobs and nodes they ran on
    # SystemComment,Comment are usually blank
    # Reason no longer avail?
    # but it may not be the last job that ran on the node; other jobs could have subsequently run on the node...
    # specifying End time required even when not asked for it to be in output list (else get empty list)
    sacct -S 2019-12-21 -E 2020-01-01 -a -X -s F,NF,TO -p -o state,job,start,Elapsed,ExitCode,jobname,user,AllocNode,NodeList,SystemComment,Comment | sed 's/|/\t/g' | less
    
    
    
    # %n expand width of specific field.  submit time is when slurm create all the job log entries
    sacct -j 7441362 -S 2021-02-01 -E 2021-02-17 --format submit,start,end,Elapsed,jobid%18,state%16,ExitCode,user,account,partition,qos%15,timelimit,node%35
    
    -a = all
    -j or -u would limit to job id or user
    -X = don't include subjobs (default would include them)
    slurm exit states are known to be all over the place and may not represent the true reason why a job ended.
    
    # what job ran on a given node
    ?? sacct -N n0[000-300].lr4
    ?? get empty list :(
    
    

    Slurm Admin stuff

    Managing node state
    sinfo -R									# see what nodes are down
    scontrol update node=n0001.savio1,n0002.savio2 state=down reason="bad ib" 	# manually down a node 
    scontrol update node=n0001.savio1,n0002.savio2 state=resume 		 	# revive a node
    
    
    Managing user association
    
    sacctmgr add  = sacctmgr create
    sacctmgr list = sacctmgr show
    
    # "create association" is
    # really updating user entry, adding parition it can use, qos, etc
    sacctmgr add user tin Account=scs  Partition=alice0 qos=condo_gpu3_normal,savio_lowprio  # if no qos is specified, some default will be created?  doesn't tend to have a desirable result
    sacctmgr delete user tin Account=scs Partition=savio3,savio3_bigmem
    sacctmgr add user tin Account=scs  Partition=savio3,savio3_bigmem qos=normal,savio_debug,savio_lowprio,savio_normal
    
    for User in "scoggins kmuriki wfeinstein"; do 
    	sacctmgr add user $User Account=scs  Partition=savio3,savio3_bigmem qos=normal,savio_debug,savio_lowprio,savio_normal
    done
    
    
    # if changes are needed, "sacctmgr modify user" doesn't seem to be very fine-grained.  
    # maybe better off removing the very specific association and recreating it,
    # but be careful to specify all params of the delete lest it delete more assoc entries than desired.
    sudo sacctmgr delete user where user=nat_ford account=co_dw partition=savio3_gpu
    sudo sacctmgr add user nat_ford account=co_dw partition=savio3_gpu QOS=dw_gpu3_normal,savio_lowprio DefaultQos=dw_gpu3_normal
    
    

    Slurm QoS config

    Use the sacctmgr command, not editing slurm.conf.
    Tracking RAM usage in addition to CPU (htc partition with per-core scheduling):
    sacctmgr create qos       name=tintest_htc4_normal     priority=10      preempt=savio_lowprio preemptmode=cluster UsageFactor=1.000000 GrpTRES='cpu=224,Mem=2060480' MinTRES='cpu=1'
    
    # change existing QoS:
    sacctmgr modify qos where name=tintest_htc4_normal set priority=1000000 preempt=savio_lowprio preemptmode=cluster UsageFactor=1.000000 GrpTRES='cpu=224,Mem=2060480' MinTRES='cpu=1'
    sacctmgr remove qos where name=tintest_htc4_normal
    
    the Mem= value should be a multiple of (number of nodes) * (the RealMemory clause in slurm.conf), in MB.  eg Mem=2060480 above would correspond to 4 nodes with RealMemory=515120.
    
    to remove a Min requirement on any specific TRES, set its value to -1.  eg:
    sacctmgr modify qos where name=gpu3_normal set MinTRES='cpu=-1'
    
    
    Tracking GPU (GRES) is a bit trickier, especially when there are multiple types:
    slurm.conf:
    change
    AccountingStorageTRES=Gres/gpu
    to
    AccountingStorageTRES=Gres/gpu,license/2080,license/3080,license/rtx,license/v100
    
    then add to the qos so that select users can use such a gpu "license"
    
    
    sacctmgr add qos savio_lowprio 
    sacctmgr add qos geforce_2080ti2_normal Preempt=savio_lowprio GrpTRES="cpu=32,gres/gpu:GTX2080TI=16" MinTRES="cpu=2,gres/gpu=1" Priority=1000000  # did not enforce job land on gtx2080ti 
    sacctmgr add qos geforce_2080ti2_normal Preempt=savio_lowprio GrpTRES="cpu=32,gres/gpu:GTX2080TI=16" MinTRES="cpu=2,gres/gpu:gtx2080ti=1" Priority=1000000  # seems to enforce
    
    sacctmgr add qos titan_2080ti3_normal Preempt=savio_lowprio GrpTRES="cpu=32,gres/gpu:TITAN=16" MinTRES="cpu=2,gres/gpu:titan=1" Priority=1000000  # seems to enforce, but must specify this MinTRES
    sacctmgr modify qos titan_2080ti3_normal set GrpTRES="cpu=32,gres/gpu:TITAN=16" MinTRES="cpu=2,gres/gpu:titan=1" Priority=1000000  # seems to enforce, but must specify this MinTRES
    
    sacctmgr add qos titanB_2080ti3_normal Preempt=savio_lowprio GrpTRES="cpu=32,gres/gpu:TITAN=16" MinTRES="cpu=2" Priority=1000000  # no MinTRES for gres
    sacctmgr add qos titanC_2080ti3_normal Preempt=savio_lowprio GrpTRES="cpu=32,gres/gpu:TITAN=16" Priority=1000000  # no MinTRES
    
    sacctmgr modify user tin set qos+=titanC_2080ti3_normal
    
    
    test with (repeat to see if jobs land elsewhere; may have to create congestion):
    srun --pty --partition=savio3_2080ti --account=scs --qos=geforce_2080ti3_normal -n 2 --gres=gpu:GTX2080TI:1 --time=00:30:00 bash
    srun --pty --partition=savio3_2080ti --account=scs --qos=titan_2080ti3_normal -n 2 --gres=gpu:TITAN:1 --time=00:30:00 bash
    
    but if the subtype is not specified, the job will get stuck as PENDING with reason QOSMinGRES
    
     
    
    per https://slurm.schedmd.com/resource_limits.html
    bottom section on "Specific Limits Over GRES": 
    a slurm design limitation means it can't enforce that subtypes are used.  A lua plugin script could help,
    but that creates forced fragmentation and won't let users run on varying gpu types.
    
    An alternative is to use licenses to give users preferential access...
    
    sacctmgr add qos lic_titan_normal Preempt=savio_lowprio GrpTRES="cpu=192,gres/gpu:TITAN=96,license/titan=96" MinTRES="cpu=2,gres/gpu=1,license/titan=1" Priority=1000000
    sacctmgr add qos lic_titan_normal Preempt=savio_lowprio GrpTRES="cpu=192,gres/gpu:TITAN=96,license/titan=96" MinTRES="cpu=2,license/titan=1" Priority=1000000
    sacctmgr add qos lic_titan_normal Preempt=savio_lowprio GrpTRES="cpu=192,license/titan=96" MinTRES="cpu2,gres/gpu=1" Priority=1000000
    
    sacctmgr add qos lic_naomi_normal Preempt=savio_lowprio GrpTRES="cpu=8,gres/gpu:3080=8,license/3080=8" MinTRES="cpu=1,license/3080=1" Priority=1000000
    
    hmm... the partitions need to be updated to say they require the license TRES...
    
    
    Other places to restrict user access...
    
    The slurm.conf PartitionName definition has AllowGroups, AllowAccounts, AllowQos, which govern who can use the partition.
    Partition QOS.  
    
    (A similar thing may exist for the Node definition, or is that only for "FrontEnd" nodes?  The doc is not clear.)
    
    sacctmgr show tres
    
    
    

    MPI job in slurm

    See schedmd mpi_guide.html
    Some use mpirun; others use the SLURM env vars to see what resources are allocated to the job.
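
    A minimal sketch of an MPI batch script using srun as the launcher (the binary name my_mpi_prog is made up; whether to use srun or mpirun depends on how the MPI library was built, per the mpi_guide above):
    
    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=16
    #SBATCH --time=01:00:00
    
    srun ./my_mpi_prog          # srun launches one task per allocated slot; mpirun-based launchers read SLURM_* env vars instead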

    Slurm Daemons



    slurmctld 	# cluster daemon, one per cluster; optional backup daemon for HA
    		# manages resources and the job queue
    slurmd		# node daemon, one per compute node
    		# launches and manages tasks
    		# optionally, the node daemons can form a hierarchical tree (multi-level), where
    		# leaf nodes do not connect directly to slurmctld, for better scalability.
    slurmd -C 	# print out RealMemory, cpu/core config, etc needed for slurm.conf
    
    
    slurm*d -c	# clear previous state, purge all job, step, partition state
    	-D	# NOT daemon mode, logs written to console
    	-v	# verbose
    	-vvv	# very very verbose
    
    ??
    	-p debug	# partition debug
    
    
    slurmd -C	# print node config (number of cpu, mem, disk space) and exit
    		# output can be fed into slurm config file
    
    
    slurmstepd	# slurm daemon to shepherd a job step
    		# when job step finishes, this process terminates
    
    
    slurmdbd	# slurm db daemon, interfaces between the slurm admin tools and MySQL.
    mysqld		# db hosting accounting and limit info, jobs, etc.
    
    Add slurm/install_linux/bin to PATH.
    For testing, open several windows and run the daemons with -Dcvv arguments. Each additional -v roughly doubles the log messages.
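    For example (a sketch of the testing setup described above):
    
    slurmctld -Dcvv     # terminal 1: controller in the foreground (-D), clear previous state (-c), verbose
    slurmd    -Dcvv     # terminal 2: node daemon in the foreground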

    SLURM DB

    The DB powering slurm settings is quite complex, but very powerful. It also keeps user accounting info.
    sacctmgr		# add/delete clusters, accounts, users
    			# get/set resource limits, fair-share, max job count, max job size, QoS, etc
    			# based on hierarchical banks
    			# many slurm db updates are done via this cmd
    			# updates are pushed out live to slurmd (or slurmctld?)
    			# hierarchical banks: Divisions, each with different groups, which in turn have their own set of users.
    
    Root = ? whole site, supercluster spanning all partitions.  
    Division = Partitions?  (eg lr5, dirac1, cf1, etc)    QOS??
    Group = association level.  eg Group Beta gets GrpCPUs, GrpJobs, GrpMemory.
    User = specific user account.
    
    They mention Hierarchical; does this mean associations can be nested?  Never seen an example of that, maybe not.
    
    

    Slurm Hierarchical bank example from SUG 2012 (page 7, 13)
    Slurm Hierarchical bank example diagram
    The order in which limits are enforced is:
    1. Partition QOS limit
    2. Job QOS limit
    3. User association
    4. Account association(s), ascending the hierarchy
    5. Root/Cluster association
    6. Partition limit
    ref: ceci-hpc

    Admin Config

    1. Install RPM to all nodes
    2. Must configure node grouping in partition
    3. Most other config param has usable defaults
    4. Web GUI tool in doc/html/configurator.html
  5. Use MUNGE for authentication. Each node needs to have the MUNGE auth key configured and a munge daemon running.
    6. testsuite/expect has ~300 test programs to validate installation. edit globals.local to select/config tests.

    slurm/install_linux/etc/slurm.conf has a default template that is almost usable.
    DB and accounting are not configured in this default template.
    Need to change the 2 user names and HOSTNAME. The user is typically root, but for testing a personal account can be used.
    See the slurm.conf man page for details.

    
    ControlMachine 			# use output from hostname -s 
    NodeName			# in node 	spec section
    Nodes				# in partition  spec section
    SlurmUser 			# slurmctld user.  use output from id -un
    SlurmdUser 			# slurmd    user. 
    ...
    
    
    Can try this for rolling reboot:
    scontrol reboot asap nextstate=resume reason="reboot to upgrade node" n0052.hpc0,n0064.hpc0,n0065.hpc0,n0066.hpc0,n0067.hpc0,n0070.hpc0,n0071.hpc0,n0074.hpc0,n0076.hpc0,n0081.hpc0,n0083.hpc0,n0084.hpc0,n0085.hpc0,n0087.hpc0
    
    Ref: Large cluster admin guide

    cgroup & slurm

    
    grep -r [0-9] /cgroup/cpuset/slurm
    
    /cgroup/cpuset/slurm/uid_43143/job_13272521/step_batch/cpuset.mem_hardwall:0
    /cgroup/cpuset/slurm/uid_43143/job_13272521/cpuset.mem_hardwall:0
    /cgroup/cpuset/slurm/uid_43143/cpuset.mem_hardwall:0
    /cgroup/cpuset/slurm/cpuset.mem_hardwall:0
    
    grep  [0-9] /cgroup/memory/slurm/*
    /cgroup/memory/slurm/memory.limit_in_bytes:9223372036854771712
    /cgroup/memory/slurm/memory.max_usage_in_bytes:2097704960             # 2G
    
    /cgroup/memory/slurm/uid_43143/memory.max_usage_in_bytes:1216872448   # 1.1G
    /cgroup/memory/slurm/uid_43143/job_13272521/memory.max_usage_in_bytes:1216872448
    
    
    # note: very often cgroup filesystems are mounted at /sys/fs/cgroup rather than /cgroup,
    # eg  mount -t cgroup ... /sys/fs/cgroup
    
    Slurm processes are echoed into the cgroup tasks file, eg:
    tin      15537     1  0 15:05 ?        00:00:07 /global/software/sl-7.x86_64/modules/langs/java/1.8.0_121/bin/java -cp /global/scratch/tin/spark/spark-test.13272521/conf/:/global/software/sl-7.x86_64/modules/java/1.8.0_121/spark/2.1.0/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host n0093.lr3 --port 7077 --webui-port 8080
    
    grep -r 15537  *
    
    cpuset/slurm/uid_43143/job_13272521/step_batch/cgroup.procs:15537
    cpuset/slurm/uid_43143/job_13272521/step_batch/tasks:15537
    
    memory/slurm/uid_43143/job_13272521/step_batch/cgroup.procs:15537
    memory/slurm/uid_43143/job_13272521/step_batch/tasks:15537
    
    freezer/slurm/uid_43143/job_13272521/step_batch/cgroup.procs:15537
    freezer/slurm/uid_43143/job_13272521/step_batch/tasks:15537
    
    
    
    
    

    slurm Assistant

    .bashrc functions and alias
            alias sevents="sacctmgr show events start=2018-01-01T00:00"  # node=n0270.mako0 # history of sinfo events (added by scontrol)
            alias assoc="sacctmgr show associations -p"
    

    SLURM Ref

    SLURM TRES Ref

    SGE/UGE

    SGE = Sun Grid Engine. Son of a Grid Engine
    OGE = Open Grid Engine. And NOT Oracle GE !! They eventually sold all GE stuff to Univa anyway.
    UGE = Univa Grid Engine.
    GE = Grid Engine.

    User Howto

    Find out what queue is avail:
    qconf -sql 
    
    Submit a job to a specific queue:
    qsub sge-script.sh -q 10GBe.q		# eg run only in a queue named for nodes with 10 gigE NIC
    
    qlogin -q interactive.q			# often an interactive queue may have more relaxed load avg req than same host on default.q or all.q
    					# but limited slots to avoid heavy processing in them
    					# yet granting a prompt for lightweight testing env use even when cluster is fully busy
    
    
    qsub -R y -pe smp 8 ./sge-script.sh     # -R y : Reservation yes... avoid starvation in a busy cluster
                                            # -pe smp 8 : request the smp Parallel Environment with 8 slots (cores)
    
    
    
    
    
    
    qrsh
    
    
    ### examples of qsub with request for extra resources ###
    
    # Find out pe env, eg smp, mpi, orte (openmpi, most widely used mpi) 
    qconf -spl
    
    # Describe detail of a pe env: 
    qconf -sp mpi
    qconf -sp smp
    
    
    # Submit a job to a queue with certain cores numbers
    qsub sge-script.sh -pe smp 8		# this req 8 core from machine providing smp parallel env.
    
    qsub sge-script.sh -q default.q@compute-4-1	# submit to a specific, named node/host.    also see -l h=...
    
    
    # an example that takes lots of parameters
    # -l h_rt specifies a hard runtime limit for the job (lets the scheduler know the ceiling run time of the job)
    # -l m_mem_free specifies the free memory needed on a node where the job can run.  
    #               seems to be mem per core, so if 8GB is specified with -pe smp 8, it would need a system with 64GB free RAM
    # -pe smp 8  specifies the symmetric multiprocessing (smp) Parallel Environment, 8 cores
    # -binding linear:8  allows multiple threads of the process to use multiple cores (some uge or cgroup configs seem to default to binding to a single core even when pe smp 8 is specified).  
    #    incidentally, load average counts runnable jobs.  so even if 6 threads are bound to a single core, 
    #    load avg may be at 6 while most other cores sit idle, as visible with htop.
    qsub -M me@example.com -m abe -cwd -V -r y -b y -l h_rt=86400,m_mem_free=6G -pe smp 8 -binding linear:8 -q default.q -N jobTitle  binaryCmd -a -b -c input.file 
    
    
    
    
    
    
    # Find out what host group are avail, eg host maybe grouped by cpu cores , memory amount.
    qconf -shgrpl
    
    # Find out which specific nodes are in a given host group:
    qconf -shgrp @hex
    
    qsub sge-script.sh -q default.q@@hex	# @@groupname submits to a group, and the group can be defined to be number of cores.  not recommended.
    
    
    # Change the queue a job is in, thereby altering where subsequent job is run (eg in array job)
    qalter jobid -q default.q@@hex
    
    
    

    Basic User Commands

    
    qsub sge-script.sh	# submit job (must be a script)
    			# std out and err will be captured to files named after JOBid.
    			# default dir is $HOME, unless -cwd is used.
    			
    qsub -cwd -o ./sge.out -e ./sge.err  myscript.sh
    			# -cwd = use current working dir (eg when $HOME not usable on some node)
    			# -o and -e will append to files if they already exist.
    
    qsub	-S /bin/bash	# tell SGE/qsub to parse the script file using the specified shell
    	-q queue-name	# submit the job to the named queue.
    			# eg default@@10gb  would select the default queue, restricted to hosts in the @10gb host group
    	-r y		# rerun job (yes/no).  If the job aborts abnormally (eg the execution host crashes), rerun the job.
    	-R y		# use Reservation. useful in busy cluster and req > 1 core.  
    			# hog system till all cores become avail.  
    			# less likely to get starvation, but less efficient.  
    
    	-N job-name	# specify a name for the job, visible in qstat, output filename, etc.  not too important.
    
        -pe smp 4		# request 4 cpu cores for the job 
        -binding linear:4	# when core binding is defaulted to on, this allow smp process to be able to use the desired number of core
        -l ...		# specify resources shown in qconf -sc
        -l h_rt=3600	# hard time limit of 3600 seconds (after that process get a kill -9)
        -l m_mem_free=4GB	# ask for node with 4GB of free memory.
    			# if also req smp, memory req is per cpu core
    
    	
    
    qsub 	-t 1-5		# create array job with task id ranging from 1 to 5.
    			# note: all parameters of qsub can be placed in script 
    			# eg,  place #$ -t 1-5 in script rather than specify it in qsub command line 
    
    
    qstat 	
    	-j		# show overall scheduler info
    	-j JOBID	# status of a job.  There is a brief entry about possible errors.
    	-u USERNAME	# list of all pending jobs for the specified user
    	-u \*		# list all jobs from all users
    	-ext		# extended overview list, show cpu/mem usage
    	-g c		# queues and their load summary
    	-pri		# priority details of queues
    	-urg		# urgency  details
    	-explain A	# show reason for Alarm (also E(rror), c(onfig), a(larm) states)
    	-s r		# list running jobs
    	-s p		# list pending jobs
    
    	
    Reduce the priority of a job:
    qalter JOBID -p -500
    	# negative values DEcrease priority, lowest is -1024
    	# positive values INcrease priority, only admins can do this.  max is 1024
    	# It really just affects pending jobs (eg state qw).  won't preempt running jobs.
    qmod -u USERNAME -p -100
    	# DEcrease job priority of all jobs for the specified user 
    	# (those that have been qsub, not future jobs)
    
    
    
    
    qdel JOBID		# remove the specified job
    qdel -f JOBID		# forcefully delete a job, for those that sometimes get stuck in "dr" state where qdel returns "job already in deletion"
    			# would likely need to go to the node hosting the job and remove $execd_spool_dir/.../active_jobs/JOBID 
    			# else may see lots of complaints about the zombie job in the qmaster messages file
    
    
    
    

    Job STATE abbreviations

    They are in the output of qstat:
    	E	= Error
    	qw	= queued, waiting
    	r	= running
    	dr	= running job marked for deletion (often seen when the host is no longer responding)
    	s	= suspended
    	t 	= transferring (job being delivered to the exec host).  jobs that exceeded the h_rt threshold while sge_execd had problems contacting the scheduler have also ended up in this state.
    	ts	= transferring, suspended ??
    	h	= hold	(eg qhold, or a job dependency tree holding it)
    
    
    qstat -f  shows host info.  The last column (6) is state; when there is a problem, it has things like au or E in it.
    The au state means alarm, unreachable.  Typically the host is transiently down or sge_execd is not running.  
    It clears automatically when sge can talk to the node again.
    E is a hard Error state; even after a reboot it will not go away.  
    
    qmod -cq  default.q@HOSTNAME 	# clear Error state, do this after sys admin knows problem has been addressed.
    qmod -e   default.q@HOSTNAME	# enable node, ie, remove the disabled state.  should go away automatically if reboot is clean.
    
    

    Here is a quick way to list hosts in au and/or E state:
    qstat -f | grep au$								# will get most of the nodes with alarm state, but misses adu, auE...
    qstat -f | awk '$6 ~ /[a-zA-Z]/ {print $0}'                                     # some host have error state in column 6, display only such host
    qstat -f | awk '$6 ~ /[a-zA-Z]/  &&  $1 ~ /default.q@compute/  {print $0}'      # add an additional test for nodes in a specific queue
    
    alias qchk="qstat -f | awk '\$6 ~ /[a-zA-Z]/ {print \$0}'"				# this escape seq works for alias
    alias qchk="qstat -f | awk '\$6 ~ /[a-zA-Z]/  &&  \$1 ~ /default.q@/  {print \$0}'"	# this works too
    
    (o state for a host: when the host has been removed from @allhosts and is no longer allocatable in any queue)

    Basic+ User Commands

    qmon			# GUI mon 
    
    
    qsub -q default.q@compute-4-1 $SGE_ROOT/examples/jobs/sleeper.sh 120
    			# run job on a specific host/node, using SGE example script that sleeps for 120 sec.
    			 
    
    qsub -l h=compute-5-16 $SGE_ROOT/examples/jobs/sleeper.sh 120 	
    			# run job on a specific host/node, alt format
    qsub -l h=compute-[567]-* $SGE_ROOT/examples/jobs/sleeper.sh 120 	
    			# run on a set of nodes.  Not really recommended; using a resource or hostgroup would be better.
    
    
    qconf 
    	-sc		# show complex eg license, load_avg
    	-ss		# see which machine is listed as submit host
    	-sel		# show execution hosts
    	-sql 		# see list of queues   (but more useful to do qstat -g c)
    
    
    
    qhost			# see list of hosts (those that can run jobs)
    			# 	their arch, state, etc info also
          -F		# Full details
    qhost -F -h compute-7-1 # Full detail of a specific node, include topology, slot limit, etc.
    
    

    qstat -f

    qstat	-f		# show host state  (contrast with qhost)
    			# column 6 contains state info.  Abbreviations used (not readily in the man page):
    			# upper case = permanent error state
    
    			a  Load threshold alarm
    			o  Orphaned
    			A  Suspend threshold alarm
    			C  Suspended by calendar
    			D  Disabled by calendar
    
    			c  Configuration ambiguous
    			d  Disabled
    			s  Suspended
    			u  Unknown
    			E  Error
    
    qstat -f | grep d$	# disabled nodes, though they sometimes appear as adu
    qstat -f | awk '$6~/[cdsuE]/ && $3!~/^[0]/'
    			# node with problematic state and job running (indicated in column 3) 
    
    
    Sample submit script
    
    #!/bin/bash
    
    #$ -S /bin/bash
    #$ -cwd
    #$ -N sge-job-name
    #$ -e ./testdir/sge.err.txt
    #$ -m beas
    #$ -M tin@grumpyxmas.com
    
    ### #$ -q short_jobs.q
    
    ### -m defines when email is sent: (b)egin, (e)nd, (a)bort, (s)uspend
    ### -M tin@grumpyxmas.com works after specifying -m (w/o -m, no mail is sent at all)
    ### if -M is not specified, email will be sent to username@mycluster.local
    ### such an address will work if the mailer supports it 
    ### (eg postfix's recipient_canonical has a mapping from
    ### @mycluster.local to @mycompany.com)
    
    
    ## bash commands for SGE script:
    
    echo "hostname is:"
    hostname
    
    
    This is a good ref for sge task array job: http://talby.rcs.manchester.ac.uk/~ri/_linux_and_hpc_lib/sge_array.html
    
    #!/bin/bash
    
    #$ -t 1-100
    
    #  -t will create a task array, in this eg, elements 1 to 100, inclusive
    #  The following example extracts one line out of a text file, the line numbered according to the SGE task array element number,
    #  and passes that single line as input to a program to process it:
    
    INFILE=`awk "NR==$SGE_TASK_ID" my_file_list.text`
          # instead of awk, can use sed:   sed -n "${SGE_TASK_ID}p" my_file_list.text
    ./myprog < $INFILE
    
    #
    # these variables are usable inside the task array job:
    # $SGE_TASK_ID
    # $SGE_TASK_FIRST
    # $SGE_TASK_LAST
    # $SGE_TASK_STEPSIZE, eg -t 1-100:20 produces 5 tasks (IDs 1,21,41,61,81) with a step size of 20
    #
    # JOB_ID
    # JOB_NAME
    # SGE_STDOUT_PATH
    #
    
    For other variables available inside an SGE script, see the qsub man page
    
    #!/bin/bash
    
    #$ -V	# inherit user's shell env variables (eg PATH, LD_LIBRARY_PATH, etc)
    #$ -j y	# yes to combine std out and std err
    
    

    Admin Commands

    
    qalter ...		# modify a job
    qalter -clear ...	# reset all elements of the job to initial default status... 
    qmod   -cj JOBID	# clear an error that is holding a job from progressing
    qhold      JOBID	# place a job on hold  (aka qalter -hold_jid JOBID)
    			# for a qw job, hold prevents it from becoming r state
    			# for a running job, probably the equivalent of kill -STOP on the process: frees up cpu, but not memory or licenses.
    qrls   -hn JOBID 	# release a job that is on hold (ie resume)
    
    qacct  -j  JOBID	# accounting info for jobs
    qacct -j		# see all jobs that SGE processed, all historical data
    qacct -o USERNAME	# see summary of CPU time user's job has taken
    
    
    qrsh -l mem_total=32G -l arch=x86 -q interactive		# obtain an interactive shell on a node with the specified characteristics
    
    qconf			# SGE config
    	-s...		# show subcommand set
    	-a...		# add  subcommand set
    	-d...		# delete
    	-m...		# modify
    
    qconf -sq [QueueName]	# show how the queue is defined (s_rt, h_rt are soft/hard time limit on how long jobs can run).
    qconf -sc		# show complexes
    qconf -spl		# show pe
    qconf -sel		# list nodes in the grid
    qconf -se HOSTNAME	# see node detail
    qconf -sconf		# SGE conf
    qconf -ssconf		# scheduler conf
    
    qconf -shgrpl		# list all groups
    qconf -shgrp @allhosts	# list nodes in the named group "@allhosts"
          -shgrp_tree grp	# list nodes one per line
    
    qconf -su site-local	# see who is in the group "site-local"
    qconf -au site-local sn	# add user sn to the group called "site-local"
    
    qconf -sq highpri.q	# see characteristics of the high priority queue, eg which groups are allowed to submit jobs to it.
    			# hipri 	adds urgency of 1000
    			# express pri 	adds urgency of 2000
    			# each urgency of 1000 adds a 0.1 to the priority value shown in qstat.
    			# default job is 0.5, hipri job would be 0.6
    
    
    qconf -spl		# show PE env 
    
    qconf -sconf	# show global config 
    qconf -ssconf	# show scheduler config (eg load sharing weight, etc)
    
    Enable fair sharing (default in GE 6.0 Enterprise):
    qconf -mconf 	# global GE config
    	enforce_user		auto
    	auto_user_fshare	100
    qconf -msconf	# scheduler config
    	weight_tickets_functional 10000
    
    	# fair share only schedules jobs within their 0.x level
    	# each urgency of 1000 adds 0.1 to the priority value, 
    	# thus a high priority job will not be affected by the fair share of default jobs:
    	# any 0.6xx level job will run to completion before any 0.5xx job is scheduled to run.
    
    	# eg use: declare a project priority so any job submitted with the project code gets higher priority and is not penalized by previous runs and fair share...
    	
    
    qconf -tsm	# show scheduler info with an explanation of the historical usage weight 
    		# very useful in share-tree scheduling 
    		# (which reduces priority based on a user's historical half-life usage of the cluster)
    		# straightforward functional share ignores history (less explaining to users)
    		# it is a one-time dump at the scheduler's next execution
    		# the log should be stored in    $sge_root/$cell/common/schedd_runlog  
    
    
    qping ...
    
    tentakel "ping -c 1 192.168.1.1" 	
    	# run ping against the head node IP on all nodes (setup by ROCKS)
    
    
    

    Disabling an Execution Host

    Method 1
    If you're running 6.1 or better, here's the best way.
    1. Create a new hostgroup called @disabled (qconf -ahgrp @disabled).
    2. Create a new resource quota set via "qconf -arqs limit hosts @disabled to slots=0" (check with qconf -srqs).
    3. Now, to disable a host, just add it to the host group ( qconf -aattr hostgroup hostlist compute-1-16 @disabled ).
    4. To reenable the host, remove it from the host group ( qconf -dattr hostgroup hostlist compute-1-16 @disabled ).
    5. Alternatively, can update disabled hosts by doing qconf -mhgrp @disabled, placing NONE in the hostlist when there are no more nodes in this disabled hostgroup.
    This process will stop new jobs from being scheduled to the machine and allow the currently running jobs to complete.
    1. To see which nodes are in this disabled host group: qconf -shgrp @disabled
    2. The disabled hosts rqs can be checked with qconf -srqs and should look like
        {
         name         disabled_hosts
         description  "Disabled Hosts via slots=0"
         enabled      TRUE
         limit        hosts {@disabled} to slots=0
        }
        
    Ref: http://superuser.com/questions/218072/how-do-i-temporarily-take-a-node-out-from-sge-sun-grid-engine
    Method 2
    qmod -d \*@worker-1-16 		# disable node worker-1-16 from accepting new jobs in any/all queues.
    qmod -d all.q@worker-1-16 	# disable node worker-1-16 from accepting new jobs in the all.q queue.
    qmod -d       worker-1-16 	# disable node worker-1-16; current jobs finish, no new jobs can be assigned.
    qstat -j 			# can check if nodes is disabled or full from scheduler consideration
    qstat -f 			# show host status.  
    qstat -f | grep d$   		# look for disabled node (after reboot, even if crashed by memory, sge/uge disable node)
    qmod -e  all.q@worker-1-16	# re-enable node
    qmod -cq default.q@worker-5-4	# remove the "E" state of the node.
    
    Method 3
    This will remove the host right away, probably killing jobs running on it. Thus, use only when host is idle. But it should be the surest way to remove the node.
    qconf -de compute-1-16		# -de = disable execution host (kill running jobs!)
    qconf -ae compute-1-16		# re-add it when maintenance is done.
    
    

    Pausing queue

    Pausing the queue means all existing jobs will continue to run, but jobs qsub'ed after the queue is disabled will just get queued up until it is re-enabled. ref: http://www.rocksclusters.org/roll-documentation/sge/4.2.1/managing-queues.html
    qmod -d default.q		# this actually place every node serving the queue in "d" state.  check with qstat -f
    qmod -e default.q
    
    

    Admin howto

    Add Submit Host
    
    See all exec node in cluster:
    qconf -sel
    
    Add submit host:
    qconf -as hostname
    	# note that reverse DNS lookup of the execution and submit hosts must resolve and match
    	# workstations can be submit hosts; they will not be running sgeexecd.  
    	# submit hosts don't need special software, but do need qsub to submit, 
    	# so often the sge client package is installed
    	# /etc/profile.d/settings.*sh will be added for configurations.  minimally:
    	#   SGE_CLUSTER_NAME=sge1
    	#   SGE_QMASTER_PORT=6444
    	#   optional?:
    	#   SGE_CELL=default
    	#   SGE_EXECD_PORT=6445
    	#   SGE_ROOT=/cm/shared/apps/sge/current
    	# submit hosts need to know who the master is,  
    	# ie /etc/hosts needs an entry for the cluster head node's external IP
    
    
    Add Execution Host (ie compute node)
    
    Add an execution host to the system, modeled after an existing node:
    qconf -ae TemplateHostName		
    # the TemplateHostName is an existing node which the new addition will be modeled after (gets its complex config, etc)
    # command will open in vi window. change name to new hostname, edit whatever complex etc.  save and exit.
    
    qconf -Ae  node-config.txt
    # -A means add node according to the spec from a file
    # the format of the file can be a cut-n-paste of what the vi window has when doing qconf -ae
    
    
    example of node-config.txt ::
    hostname              node-9-16.cm.cluster
    load_scaling          NONE
    complex_values        m_mem_free=98291.000000M,m_mem_free_n0=49152.000000M, \
                          m_mem_free_n1=49139.093750M,ngpus=0,slots=40
    user_lists            NONE
    xuser_lists           NONE
    projects              NONE
    xprojects             NONE
    usage_scaling         NONE
    report_variables      NONE
    ## above is cut-n-paste of qconf -ae node-1-1 + edit
    ## add as qconf -Ae node-config.txt
    
    A way to script adding a bunch of nodes at the same time: create a shell script with the following and run it (fix slots to match the number of cpu cores you have; memory should be fixed up by UGE):
    for X in $(seq 1 16 ) ; do
    
    	echo "
    	hostname              compute-2-$X
    	load_scaling          NONE
    	complex_values        m_mem_free=98291.000000M,m_mem_free_n0=49152.000000M, \
    	                      m_mem_free_n1=49139.093750M,ngpus=0,slots=28
    	user_lists            NONE
    	xuser_lists           NONE
    	projects              NONE
    	xprojects             NONE
    	usage_scaling         NONE
    	report_variables      NONE " > compute-2-$X.qc
    
    qconf -Ae compute-2-$X.qc
    rm compute-2-$X.qc
    
    done
    
    
    qhost -F gpu | grep -B1 ngpus=[1-9]		# list host with at least 1 GPU installed
    
    
    
    qconf -as hostname.fqdn
    # -as = add submit host
    # note that reverse DNS lookup of the execution and submit hosts must resolve and match
    
    
    # remove a workstation from a workstation.q
    # (needed before the execution host can be removed)
    qconf -mhgrp @wkstn
    # (open in vi env, edit as desired, moving whatever host that should not be in the grp anymore)
    
    # remove an execution host (think this will delete/kill any running jobs)
    qconf -de hostname
    
    
    # groups are members of a queue
    # hosts are members of a group?  or also of a queue?
    
    
    # node doesn't want to run any jobs?  Check execd:
    # log in to the node that isn't running jobs.
    # don't do this on the head node!!
    /etc/init.d/sgeexecd.clustername stop
    # if it doesn't stop, kill the process
    /etc/init.d/sgeexecd.clustername start
    
    
    # slots are defined in the queue, and also for each exec host.
    #
    # typically slots are defined based on cpu cores, and hosts are added to a group named after them, eg quad, hex
    qconf -mq default.q 
    # vi opens; edit the desired default slots, or the slots available on a given machine explicitly.
    # eg
    slots                 8,[node-4-9.cm.cluster=24], \
                            [node-4-10.cm.cluster=20],[node-4-11.cm.cluster=20], \
                            [node-1-3.cm.cluster=12]
    
    qconf -se node-4-10	# see slots and other complex that the named execution host has
    
    

    Cgroups

    cgroups_params controls the configuration of the cgroups feature. It needs to be set via qconf -mconf (UGE 8.1.7 and later); the config may need to be added by hand if the system was hot-upgraded.
      cgroups_params             cgroup_path=/cgroup cpuset=false m_mem_free_hard=true
    

    Places where (CPU) slots are defined

    (Not sure of order of precedence yet)

    jsv

    JSV = Job Submission Verifier:
    a script that runs to check whether a submitted job passes all the requirements.  It may add flags to the job.
    The system-wide JSV is specified in
       $SGE_ROOT/default/common/sge_request  eg
       /cm/shared/apps/sge/univa/default/common/sge_request
    and the jsv repository is typically at a centralized location, eg
       /cm/shared/apps/sge/current/util/resources/jsv/
    
    Users can specify a JSV on the command line, but unlike most other SGE/UGE flags, all JSV options are additive, 
    ie, they will ALL be executed.  Ditto for a jsv specified in ~/.sge_request.
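    
    A minimal client-side JSV sketch in bash, modeled on the example scripts shipped under util/resources/jsv/ (the "require h_rt" policy is made up):
    
    #!/bin/bash
    # pull in the JSV shell API shipped with SGE/UGE
    . ${SGE_ROOT}/util/resources/jsv/jsv_include.sh
    
    jsv_on_start()
    {
       return
    }
    
    jsv_on_verify()
    {
       # made-up policy: insist that every job requests a hard runtime limit (-l h_rt=...)
       if [ "$(jsv_sub_get_param l_hard h_rt)" = "" ]; then
          jsv_reject "please submit with -l h_rt=... so the scheduler can plan/backfill"
          return
       fi
       jsv_accept "job accepted"
       return
    }
    
    jsv_main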
    
    

    Troubleshooting

  • "cannot run in PE "smp" because it only offers 0 slots" in qstat -j JOBID output.
    Assuming there are indeed nodes with the number of cpu cores requested by -pe smp NN, this means no such node is currently available, so the job stays queued.

    Various Notes...

    Assistant

    .bashrc functions and alias
    # count host and cpu cores.   Note that core count from qstat -g c differs 
    qhostTot() { qhost | sed 's/G//g' | awk '/^compute/ {h+=1; c+=$3; m+=$8} END {print "host="h " core=" c " RAM=" m "G"}' ; }
    
    qww() { qstat -u \* | perl -lane 'print $_ if $F[4] eq "qw"' ; }	# display jobs in qw state 
    qr()  { qstat -u \* | perl -lane 'print $_ if $F[4] eq "r"'  ; }	# display jobs in r  state
    
    alias qchk='qstat -f | awk '\''$6 ~ /[a-zA-Z]/  &&  $1 ~ /default.q@/  {print $0}'\'''	# show unhealthy hosts
    
    

    Grid Engine Ref

    PBS / TORQUE

    PBS = Portable Batch System.
    Original open source PBS became OpenPBS. OpenPBS is quite old now. TORQUE is the newer development adapted from the original OpenPBS, and ClusterResources seems to be the main site putting all the glue together for it.
  • OpenPBS.org - points to Altair, who is still providing the original code for download. Last version = 2.3.16 on 2001/12/06. Some patches seem to be available up till 2004 from a site hosted by Argonne National Lab.
  • PBS Pro - Commercial version of PBS, supported by Altair (http://www.pbsgridworks.com)
  • Torque (Open Source) manual from Cluster Resources (WiKi).

    Two halves of the brain:
  • Scheduler: reads its script only when it starts up. Default = pbs_sched; can be changed to Maui or Moab for smarter scheduling.
  • Server: a live process; changes are made thru qmgr and take effect immediately.
    Both pbs_mom and pbs_server must be started by root.
    
    qsub command-script.sh		# run script, stdout and stderr automatically saved as script.oXX and script.eXX
    
    qmgr -c 'print server'		# Print Server config, including pbs version.
    qmgr -c 'p s'			# Most commands can be abbreviated to their first letter.
    
    qalter				# alter batch job, eg move job from one queue to another.
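    
    A minimal PBS/Torque submit script sketch (the resource values, queue name, and program name are made-up placeholders):
    
    #!/bin/bash
    #PBS -N myjob
    #PBS -q test_queue_tin          # eg the queue created in the section below
    #PBS -l nodes=1:ppn=4
    #PBS -l walltime=01:00:00
    #PBS -j oe                      # join stdout and stderr into one file
    
    cd $PBS_O_WORKDIR               # PBS starts the job in $HOME; cd back to the submission dir
    ./myprog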
    
    

    Creating New Queue

    Run qmgr command as root:
    qmgr -c "create queue test_queue_tin queue_type=execution"
    qmgr -c "set    queue test_queue_tin started = True"
    qmgr -c "set    queue test_queue_tin enabled = True"
    qmgr -c "set    queue test_queue_tin max_running  = 2" 
    qmgr -c "set    queue test_queue_tin max_user_run = 1" 
    qmgr -c "set    queue test_queue_tin resources_default.nice = 2" 
    

    Integration with MPI

    Integration with MPICH is somewhat documented. They recommend mpiexec (PBS version at http://www.osc.edu/~pw/mpiexec/index.php#Too_many_mpiexecs ). This allows PBS to have better control. Note that there are "too many mpiexecs", and this one is an add-on to the standard MPICH distribution.
    OpenMPI is touted as the best thing to use. OpenMPI automatically communicates with Torque/PBS Pro and retrieves host and np info. Jobs were to be submitted using mpirun rather than qsub... http://www.open-mpi.org/faq/?category=tm
    
    

    Deleting a Queue

    qmgr -c "delete queue test_queue_tin"
    

    Misc

    Prologue/Epilogue script - Run before/after a job (for all queues?)

    LSF

    Platform LSF (now owned by IBM) is an expensive HTC/HPC scheduler.
    OpenLava is an open source version of LSF.

    LSF User Guide (older version 4, but still usable, nice html format :) ) [local cache of Platform Computing doc]
    add pdf for quick ref guides ...
    
    
    
    
    
    
    
    bjobs -u all            # list all user jobs
    -w -u USERNAME  # wide format, w/o truncating fields.
    -l  JOBID       # long listing of a given job
     
    
    
    
    bhist -l -n 4           # list job history, going back 4 log files (if logs are available!)
    bhist -l                # show details of the job history,
    		# including how much time was spent in each state:
    		# Pending, UserSuspend, SysSuspend, etc
    
    bjobs -lp               # list all pending jobs and the reason why they are
    			# pending (what they are waiting for).
    
    bpeek            # take a peek at the current output of your job.
    		# LSF essentially captures the whole stdout of the job
    		# and displays it.  The whole buffer is displayed each time,
    		# until the job is dead
    
    
    
    
    
    bsub "cmd option"       # submit job to be run as batch process
    		# LSF will execute the command in a select remote host (LSF client).  
    		# It will duplicate the env, including
    		# the current working dir, env var, etc
    		# Obviously, depends on unanimous NFS setup on all host
    
    		# anyway to reduce priority w/o creating new queues?
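    
    Options can also be embedded in the submit script as #BSUB directives; a minimal sketch (queue and program names are made-up placeholders):
    
    #!/bin/bash
    #BSUB -J myjob                 # job name
    #BSUB -q normal                # queue name (placeholder)
    #BSUB -n 4                     # number of slots
    #BSUB -o myjob.%J.out          # %J = job ID
    #BSUB -e myjob.%J.err
    
    ./myprog
    
    # submit with:  bsub < myscript.sh    (note the redirect: #BSUB lines are parsed from stdin)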
    
    
    bstop 		# send a STOP signal to the job
    		# LSF essentially does kill -STOP on the dispatched
    		# process and all of its children
    		# ps -efl will show "T" for these processes.
    bstop 0			# stop all jobs owned by the current user
    
    bresume 		# send CONTinue signal to the job
    bresume 0 		# resume all jobs
    
    brequeue 	# Re-queue a job.  LSF essentially kills the job
    		# and re-submits it, which will likely run on a different host.
    		# Thus, it really depends on a process that can be restarted gracefully,
    		# eg, setiathome will restart by reading state saved in xml files.
    
    
    lsload			# show loads
    bqueues			# list all queues
    lsinfo
    bhosts
    bparams
    
    lshosts			# display LSF "server nodes" and clients
    lshosts -l | egrep '(ENABLED|HOST_NAME)'	
    		# show which hosts have checked out licenses
    		# the master node can be a proxy that doles out licenses to clients
    		# all licenses expire at midnight.
    					
    
    
    
    
    ENV setup

    Admin/Accounting

    
    bacct 			# display stats on finished LSF jobs.
    -u username
    
    bacct -u all -C 3/1,3/31		# display aggregate job stats for all users for the month of March.
    					# -C for completed (and exited) jobs.
    bacct -u all -S 2007/1/1,2007/3/31	# similar to above, but for Submitted jobs, with the year explicitly defined.
    
    



    HPC Provisioning System

    Links


    SGE:

    [Doc URL]
    tiny.cc/uge
    http://tin6150.github.io/psg/lsf.html New home page. Hosted at github and should always be up to date
    http://tin6150.s3-website-us-west-1.amazonaws.com/lsf.html (sporadically updated)
    http://grumpyxmas.medianewsonline.com/lsf.html Alt home, last updated Dec 2012
    (cc) 2017 Tin Ho. See main page for copyright info.


