This chapter describes the commands that report and change the status of your jobs.
The bjobs command reports the status of LSF Batch jobs.
% bjobs
JOBID USER  STAT  QUEUE    FROM_HOST EXEC_HOST JOB_NAME   SUBMIT_TIME
3926  user1 RUN   priority hostF     hostC     verilog    Oct 22 13:51
605   user1 SSUSP idle     hostQ     hostC     Test4      Oct 17 18:07
1480  user1 PEND  priority hostD               generator  Oct 19 18:13
7678  user1 PEND  priority hostD               verilog    Oct 28 13:08
7679  user1 PEND  priority hostA               coreHunter Oct 28 13:12
7680  user1 PEND  priority hostB               myjob      Oct 28 13:17
The -a option displays jobs that completed or exited in the recent past, along with pending and running jobs.
The -r option displays only running jobs.
The -u username option displays jobs submitted by other users. The special user name all displays jobs submitted by all users.
For example, to find out who is running jobs on which hosts, enter:
% bjobs -u all
You can also find jobs on specific queues or hosts, find jobs submitted by specific projects, and check the status of specific jobs using their job IDs or names. See the bjobs(1) manual page for more information.
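Because a pending job has an empty EXEC_HOST column, splitting the default bjobs listing on whitespace misaligns the fields. The following is a minimal Python sketch that slices each row at the column positions of the header instead. It assumes the listing is fixed-width; the column widths in FMT and the helper name parse_bjobs are illustrative and are not part of LSF.

```python
import re
from collections import Counter

# Assumed column widths, chosen only to mimic a fixed-width bjobs listing.
FMT = "{:<6}{:<6}{:<6}{:<9}{:<10}{:<10}{:<11}{}"
ROWS = [
    ("JOBID", "USER", "STAT", "QUEUE", "FROM_HOST", "EXEC_HOST",
     "JOB_NAME", "SUBMIT_TIME"),
    ("3926", "user1", "RUN", "priority", "hostF", "hostC",
     "verilog", "Oct 22 13:51"),
    ("605", "user1", "SSUSP", "idle", "hostQ", "hostC",
     "Test4", "Oct 17 18:07"),
    ("1480", "user1", "PEND", "priority", "hostD", "",
     "generator", "Oct 19 18:13"),
]
SAMPLE = "\n".join(FMT.format(*row) for row in ROWS)


def parse_bjobs(text):
    """Parse a fixed-width job listing into a list of dicts (hypothetical helper)."""
    lines = [line for line in text.splitlines() if line.strip()]
    # Record each header word together with the offset where it starts.
    cols = [(m.group(), m.start()) for m in re.finditer(r"\S+", lines[0])]
    jobs = []
    for line in lines[1:]:
        row = {}
        for i, (name, start) in enumerate(cols):
            # A field runs up to the start of the next column (or end of line).
            end = cols[i + 1][1] if i + 1 < len(cols) else None
            row[name] = line[start:end].strip()
        jobs.append(row)
    return jobs


jobs = parse_bjobs(SAMPLE)
by_status = Counter(job["STAT"] for job in jobs)
```

Slicing by header offset keeps the empty EXEC_HOST field of job 1480 empty rather than shifting the job name into it.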
When you submit a job to LSF Batch, it may be held in the queue before it starts running, and it may be suspended while running. The -p option to the bjobs command displays the reasons a job is pending. Because there can be more than one reason that a job is pending or suspended, all reasons that contribute to the pending or suspended state are reported. For example:
% bjobs -p
7678  user1 PEND  priority hostD               verilog    Oct 28 13:08
 Queue's resource requirements not satisfied:3 hosts;
 Unable to reach slave lsbatch server: 1 host;
 Not enough job slots: 1 host;
The pending reasons also report the number of hosts affected by each condition. To get the specific host names along with the pending reasons, use the -p and -l options to the bjobs command together. For example:
% bjobs -lp
Job Id <7678>, User <user1>, Project <default>, Status <PEND>, Queue <priority>,
                     Command <verilog>
Mon Oct 28 13:08:11: Submitted from host <hostD>, CWD <$HOME>, Requested
                     Resources <type==any && swp>35>;
PENDING REASONS:
Queue's resource requirements not satisfied: hostb, hostk, hostv;
Unable to reach slave lsbatch server: hostH;
Not enough job slots: hostF;
SCHEDULING PARAMETERS:
           r15s   r1m  r15m    ut    pg    io    ls    it   tmp   swp   mem
loadSched     -   0.7   1.0     -   4.0     -     -     -     -     -     -
loadStop      -   1.5   2.5     -   8.0     -     -     -     -     -     -
Note
In a large cluster (100-200 hosts), listing the host names with every pending reason can be too verbose. In that case, use the bjobs command with the -p option alone to see only the host counts.
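If you want to post-process the pending reasons, the host counts can be pulled out with a small script. The following Python sketch assumes the reason lines have the form shown in the bjobs -p example (a reason, a colon, a count, then "host" or "hosts" and a semicolon); the function name pending_reasons is hypothetical.

```python
import re

# Matches lines such as "Not enough job slots: 1 host;".
REASON = re.compile(r"^\s*(?P<reason>.+?):\s*(?P<count>\d+) hosts?;\s*$")

SAMPLE = """\
7678 user1 PEND priority hostD verilog Oct 28 13:08
Queue's resource requirements not satisfied:3 hosts;
Unable to reach slave lsbatch server: 1 host;
Not enough job slots: 1 host;
"""


def pending_reasons(text):
    """Map each pending reason to the number of hosts it applies to."""
    reasons = {}
    for line in text.splitlines():
        match = REASON.match(line)
        if match:  # the job summary line does not match and is skipped
            reasons[match.group("reason")] = int(match.group("count"))
    return reasons


reasons = pending_reasons(SAMPLE)
total_hosts = sum(reasons.values())
```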
The -l option to the bjobs command displays detailed information about job status and parameters, such as the job's current working directory, parameters specified when the job was submitted, and the time when the job started running.
% bjobs -l 7678
Job Id <7678>, User <user1>, Project <default>, Status <PEND>, Queue <priority>,
                     Command <verilog>
Mon Oct 28 13:08:11: Submitted from host <hostD>, CWD <$HOME>, Requested
                     Resources <type==any && swp>35>;
PENDING REASONS:
Queue's resource requirements not satisfied:3 hosts;
Unable to reach slave lsbatch server: 1 host;
Not enough job slots: 1 host;
SCHEDULING PARAMETERS:
           r15s   r1m  r15m    ut    pg    io    ls    it   tmp   swp   mem
loadSched     -   0.7   1.0     -   4.0     -     -     -     -     -     -
loadStop      -   1.5   2.5     -   8.0     -     -     -     -     -     -
The loadSched and loadStop thresholds displayed are those that apply to this job. If the job is pending, the thresholds are taken from the queue. If the job has been dispatched, each threshold is the more restrictive of the queue and execution host thresholds for that load index.
Scheduling is also affected by other queue constraints such as RES_REQ, STOP_COND, RESUME_COND, fairshare policy, and others.
The -s option displays the reasons a batch job was suspended. Because the load conditions are constantly changing, the reasons for suspension may be out of date. Once the job is suspended it does not resume execution until its scheduling conditions are met.
% bjobs -s
JOBID USER  STAT  QUEUE    FROM_HOST EXEC_HOST JOB_NAME   SUBMIT_TIME
605   user1 SSUSP idle     hostQ     hostC     Test4      Oct 17 18:07
 The host load exceeded the following threshold(s):
 Paging rate: pg;
 Idle time: it;
In the example above, the job was suspended because the paging rate and interactive idle time on the execution host went above the suspending threshold. Even though the paging rate may have dropped back below the scheduling threshold, the job may remain suspended because of another threshold. The job does not resume until all load indices are within their scheduling thresholds.
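The relationship between the two thresholds can be sketched as a small piece of Python. This is an illustration of the rule described above, not LSF's implementation: it uses the loadSched and loadStop values from the earlier bjobs -l example for r1m, r15m, and pg, and it assumes that for these three indices a larger value means a heavier load. The function names are hypothetical.

```python
# Threshold values taken from the earlier bjobs -l example output.
LOAD_SCHED = {"r1m": 0.7, "r15m": 1.0, "pg": 4.0}
LOAD_STOP = {"r1m": 1.5, "r15m": 2.5, "pg": 8.0}


def exceeded(load, thresholds):
    """Return the load indices whose current value is above its threshold."""
    return [name for name, limit in thresholds.items() if load[name] > limit]


def next_state(state, load):
    """Model suspension by loadStop and resumption by loadSched."""
    if state == "RUN" and exceeded(load, LOAD_STOP):
        return "SSUSP"  # any index above its suspending threshold
    if state == "SSUSP" and not exceeded(load, LOAD_SCHED):
        return "RUN"    # every index is back within its scheduling threshold
    return state
```

With these values, a suspended job stays in SSUSP when pg has dropped to 3.0 but r1m is still 0.9, because r1m remains above its scheduling threshold of 0.7.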
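Jobs submitted through the LSF Batch system have the resources they consume monitored while they are running. The -l option of the bjobs command displays the current resource usage of the job. This job-level information includes the CPU time used, the resident memory (MEM) and virtual memory (SWAP) usage, and the process group IDs and process IDs that belong to the job.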
The job-level resource usage information is updated at a maximum frequency of every SBD_SLEEP_TIME seconds (see 'The lsb.params File' of the LSF Administrator's Guide for the value of SBD_SLEEP_TIME). The update is done only if the value for the CPU time, resident memory usage, or virtual memory usage has changed by more than 10 percent from the previous update or if a new process or process group has been created.
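The update rule can be written out as a short Python sketch. It is only an illustration of the condition described above: the record layout, the function names, and the treatment of a previous value of zero (any non-zero new value counts as a change) are assumptions.

```python
def changed_by_more_than(old, new, fraction=0.10):
    """True if new differs from old by more than the given fraction of old."""
    if old == 0:
        return new != 0  # assumption: any growth from zero counts as a change
    return abs(new - old) / old > fraction


def should_report(prev, cur):
    """Decide whether a new resource usage update is worth sending."""
    new_pids = set(cur["pids"]) - set(prev["pids"])
    new_pgids = set(cur["pgids"]) - set(prev["pgids"])
    if new_pids or new_pgids:
        return True  # a new process or process group has been created
    return any(
        changed_by_more_than(prev[key], cur[key])
        for key in ("cpu", "mem", "swap")
    )


# Values borrowed from the bjobs -l 1531 example: 2 s CPU, 147 KB, 201 KB.
previous = {"cpu": 2.0, "mem": 147, "swap": 201,
            "pgids": [8920], "pids": [8920, 8921, 8922]}
small_change = dict(previous, mem=150)   # about 2 percent more memory
large_change = dict(previous, mem=170)   # about 16 percent more memory
new_process = dict(previous, pids=[8920, 8921, 8922, 8923])
```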
% bjobs -l 1531
Job Id <1531>, User <user1>, Project <default>, Status <RUN>, Queue <priority>
                     Command <example 200>
Fri Dec 27 13:04:14: Submitted from host <hostA>, CWD <$HOME>, Specified Hosts
                     <hostD>;
Fri Dec 27 13:04:19: Started on <hostD>, Execution Home </home/user1>,
                     Execution CWD </home/user1>;
Fri Dec 27 13:05:00: Resource usage collected.
                     The CPU time used is 2 seconds.
                     MEM: 147 Kbytes; SWAP: 201 Kbytes
                     PGID: 8920; PIDs: 8920 8921 8922
SCHEDULING PARAMETERS:
           r15s   r1m  r15m    ut    pg    io    ls    it   tmp   swp   mem
loadSched     -     -     -     -     -     -     -     -     -     -     -
loadStop      -     -     -     -     -     -     -     -     -     -     -
Sometimes you want to know what has happened to your job since it was submitted. The bhist command displays a summary of the pending, suspended and running time of batch jobs. The -l option of the bhist command prints the time information and a complete history of the scheduling events for each job.
% bhist -l 1531
Job Id <1531>, User <user1>, Project <default>, Command <example 200>
Fri Dec 27 13:04:14: Submitted from host <hostA> to Queue <priority>, CWD
                     <$HOME>, Specified Hosts <hostD>;
Fri Dec 27 13:04:19: Dispatched to <hostD>;
Fri Dec 27 13:04:19: Starting (Pid 8920);
Fri Dec 27 13:04:20: Running with execution home </home/user1>, Execution CWD
                     </home/user1>, Execution Pid <8920>;
Fri Dec 27 13:05:49: Suspended by the user or administrator;
Fri Dec 27 13:05:56: Suspended: Waiting for re-scheduling after being resumed
                     by user;
Fri Dec 27 13:05:57: Running;
Fri Dec 27 13:07:52: Done successfully. The CPU time used is 28.3 seconds.
Summary of time in seconds spent in various states by Fri Dec 27 13:07:52 1996
  PEND   PSUSP   RUN    USUSP   SSUSP   UNKWN   TOTAL
  5      0       205    7       1       0       218
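The summary line can be cross-checked against the event timestamps. The Python sketch below rebuilds the per-state totals from the events in the example; it assumes that run time is counted from the moment the job is dispatched (13:04:19), and the event list and function name are written by hand for illustration rather than parsed from bhist output.

```python
from datetime import datetime

# (time, state entered) pairs read off the bhist -l 1531 example.
EVENTS = [
    ("13:04:14", "PEND"),   # submitted
    ("13:04:19", "RUN"),    # dispatched and started
    ("13:05:49", "USUSP"),  # suspended by the user or administrator
    ("13:05:56", "SSUSP"),  # resumed, waiting for re-scheduling
    ("13:05:57", "RUN"),    # running again
    ("13:07:52", "DONE"),   # finished
]


def time_in_states(events):
    """Sum the seconds spent in each state between consecutive events."""
    parsed = [(datetime.strptime(t, "%H:%M:%S"), s) for t, s in events]
    totals = {}
    for (start, state), (end, _) in zip(parsed, parsed[1:]):
        seconds = int((end - start).total_seconds())
        totals[state] = totals.get(state, 0) + seconds
    return totals


totals = time_in_states(EVENTS)
grand_total = sum(totals.values())
```

Under that assumption the totals agree with the summary: 5 seconds pending, 205 running, 7 in USUSP, 1 in SSUSP, and 218 seconds overall.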
The -J job_name option of the bhist command displays the history of all LSF Batch jobs with the specified job name. Job names are assigned with the -J job_name option of the bsub command.
LSF keeps job history information after the job exits, so you can look at the history of jobs that completed in the past. The length of the history depends on how often the LSF administrator cleans up the log files.
By default, bhist displays job history only from the current event log file. The -n option to the bhist command lets you display the history of jobs that completed long ago and are no longer listed in the active event log.
The LSF Batch system periodically backs up and prunes the job history log. The -n num_logfiles option tells the bhist command to search through the specified number of log files instead of only searching the current log file. Log files are searched in reverse time order; for example, the command bhist -n 3 searches the current event log file and then the two most recent backup files.
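The search order can be pictured with a few lines of Python. This is only a model of the behaviour described above: the file names are hypothetical placeholders, the function name is invented, and the sketch handles only positive values of num_logfiles.

```python
def logs_to_search(current, backups_newest_first, num_logfiles=1):
    """Return the log files searched, in order, for a given -n value."""
    if num_logfiles < 1:
        # This sketch only models positive counts.
        raise ValueError("num_logfiles must be at least 1")
    # The current log always comes first, then the newest backups.
    return [current] + backups_newest_first[: num_logfiles - 1]


# Hypothetical file names, newest backup first.
BACKUPS = ["events.backup1", "events.backup2", "events.backup3"]
default_search = logs_to_search("events.current", BACKUPS)
three_logs = logs_to_search("events.current", BACKUPS, 3)
```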
The output from an LSF Batch job is normally not available until the job is finished. However, LSF Batch provides the bpeek command for you to look at the output the job has produced so far. By default, bpeek shows the output from the most recently submitted job; you can also select the job by queue or execution host, or specify the job ID or job name on the command line.
% bpeek 1234
<< output from stdout >>
Starting phase 1
Phase 1 done
Calculating new parameters
...
Only the job owner can use bpeek to see job output. The bpeek command will not work on a job running under a different user account.
You can use this command to check if your job is behaving as you expected and kill the job if it is running away or producing unusable results. This could save you time.
The bqueues and bhosts commands display the number of jobs in a queue or dispatched to a host. For more information on these commands see 'Batch Queues' and 'Batch Hosts'.
The bkill command cancels pending batch jobs and sends signals to running jobs. By default, bkill sends the SIGKILL signal to running jobs. For example, to kill job 3421, enter:
% bkill 3421
Job <3421> is being terminated
SIGINT and SIGTERM are sent 20 seconds before SIGKILL to give the job a chance to catch the signals and clean up. The signals are forwarded from mbatchd to sbatchd, and sbatchd waits for the job to exit before reporting the status. Because of this, bjobs may still report that the job is running for a few seconds.
To send an arbitrary signal to your job, use the -s option of the bkill command. You can specify either the signal name or the signal number. On most versions of UNIX, signal names and numbers are listed in the kill(1) or signal(2) manual page.
% bkill -s TSTP 3421
Job <3421> is being signaled
This example sends the TSTP signal to job 3421.
Note
Signal numbers are translated across platforms because different operating systems may number signals differently. The meaning of a signal number is interpreted on the machine from which the bkill command is issued. For example, if you send signal 18 from a SunOS 4.x host, it means SIGTSTP. If the job is running on an HP-UX host, where SIGTSTP is defined as signal number 25, signal 25 is sent to the job.
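The translation amounts to mapping the number to a signal name on the submitting platform and then mapping that name back to a number on the execution platform. The Python sketch below illustrates the idea with only the two values quoted in the note; the table layout and function name are assumptions, not the way LSF stores this information.

```python
# Only the values mentioned in the note; real tables would list every signal.
SIGNAL_NUMBERS = {
    "SunOS 4.x": {"SIGTSTP": 18},
    "HP-UX": {"SIGTSTP": 25},
}


def translate(number, from_platform, to_platform):
    """Translate a signal number between platforms via the signal name."""
    # Invert the submitting platform's table: number -> name.
    names = {num: name for name, num in SIGNAL_NUMBERS[from_platform].items()}
    name = names[number]
    return name, SIGNAL_NUMBERS[to_platform][name]


name, number_on_target = translate(18, "SunOS 4.x", "HP-UX")
```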
Only the owner of a batch job or an LSF administrator can send signals to a job.
The bstop and bresume commands are convenient aliases for bkill -s, sending the SIGSTOP/SIGTSTP and SIGCONT signals respectively.
Note
You cannot send arbitrary signals to a pending job; most signals are only valid for running jobs. However, LSF Batch does allow you to kill, suspend and resume pending jobs.
% bstop 3421
Job <3421> is being stopped
bstop sends the SIGSTOP signal to sequential jobs and SIGTSTP to parallel jobs. SIGTSTP is sent to a parallel job so the master process can trap the signal and pass it to all the slave processes running on other hosts.
To resume the same job, enter:
% bresume 3421
Job <3421> is being resumed
Suspending a job puts it into USUSP state if the job has already started, or into PSUSP state if the job is pending. Resuming a user-suspended job does not put the job into RUN state immediately. If the job was running before the suspension, bresume first puts it into SSUSP state, and the job then waits for sbatchd to schedule it according to the load conditions.
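These transitions can be summarized as two lookup tables. The Python sketch below is a simplified model of the states named above; the transition from PSUSP back to PEND on resume is an assumption, since the text only describes resuming a job that was running, and the function names merely echo the commands.

```python
# State reached when a job in the given state is suspended by the user.
SUSPEND = {"PEND": "PSUSP", "RUN": "USUSP"}

# State reached on resume; PSUSP -> PEND is an assumption of this sketch.
RESUME = {"PSUSP": "PEND", "USUSP": "SSUSP"}


def bstop(state):
    """Model the effect of bstop on a job state (unchanged if not applicable)."""
    return SUSPEND.get(state, state)


def bresume(state):
    """Model the effect of bresume; a started job waits in SSUSP afterwards."""
    return RESUME.get(state, state)


stopped = bstop("RUN")
resumed = bresume(stopped)
```

A job in SSUSP after bresume returns to RUN only when the load conditions allow, which matches the bhist example earlier in this chapter.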
The btop and bbot commands move pending jobs within a queue. btop moves jobs toward the top of the queue, so that they are dispatched before other pending jobs. bbot moves jobs toward the bottom of the queue so that they are dispatched later. The default behaviour is to move the job as close to the top or bottom of the queue as possible. By specifying a position on the command line, you can move a job to an arbitrary position relative to the top or bottom of the queue.
The btop and bbot commands do not allow users to move their own jobs ahead of those submitted by other users; only the dispatch order of the user's own jobs is changed. Only an LSF administrator can move one user's job ahead of another.
Note
The btop and bbot commands have no effect on the job dispatch order when fairshare policies are used.
% bjobs -u all
JOBID USER  STAT  QUEUE    FROM_HOST EXEC_HOST JOB_NAME   SUBMIT_TIME
5308  user2 RUN   normal   hostA     hostD     sleep 500  Oct 23 10:16
5309  user2 PEND  night    hostA               sleep 200  Oct 23 11:04
5310  user1 PEND  night    hostB               myjob      Oct 23 13:45
5311  user2 PEND  night    hostA               sleep 700  Oct 23 18:17
% btop 5311
Job <5311> has been moved to position 1 from top.
% bjobs -u all
JOBID USER  STAT  QUEUE    FROM_HOST EXEC_HOST JOB_NAME   SUBMIT_TIME
5308  user2 RUN   normal   hostA     hostD     sleep 500  Oct 23 10:16
5311  user2 PEND  night    hostA               sleep 700  Oct 23 18:17
5310  user1 PEND  night    hostB               myjob      Oct 23 13:45
5309  user2 PEND  night    hostA               sleep 200  Oct 23 11:04
Note that user1's job is still in the same position in the queue. user2 cannot use btop to get extra jobs at the top of the queue; when one of user2's jobs moves up the queue, the rest of user2's jobs move down.
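One way to picture this behaviour is that btop reorders only the queue slots already held by the job's owner. The Python sketch below models the pending jobs in the night queue from the example under that assumption; the function name and the interpretation of the position argument as a position among the owner's own jobs are illustrative, not a description of LSF internals.

```python
def btop(queue, job_id, position=1):
    """Move job_id toward the top, using only slots owned by the same user."""
    owner = next(user for jid, user in queue if jid == job_id)
    # Slots in the queue that currently hold this user's jobs.
    slots = [i for i, (_, user) in enumerate(queue) if user == owner]
    own_jobs = [queue[i] for i in slots]
    # Reorder the user's own jobs, then put them back into the same slots.
    own_jobs.remove((job_id, owner))
    own_jobs.insert(position - 1, (job_id, owner))
    result = list(queue)
    for slot, job in zip(slots, own_jobs):
        result[slot] = job
    return result


# Pending jobs in the night queue, in dispatch order, from the example.
night = [(5309, "user2"), (5310, "user1"), (5311, "user2")]
after = btop(night, 5311)
```

The result is 5311, 5310, 5309: user1's job keeps its slot while user2's two jobs swap places, as in the listing.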
The bswitch command switches pending and running jobs from queue to queue. This is useful if you submit a job to the wrong queue, or if the job is suspended because of the queue thresholds or run windows and you would like to resume the job.
% bswitch priority 5309
Job <5309> is switched to queue <priority>
% bjobs -u all
JOBID USER  STAT  QUEUE    FROM_HOST EXEC_HOST JOB_NAME   SUBMIT_TIME
5308  user2 RUN   normal   hostA     hostD     sleep 500  Oct 23 10:16
5309  user2 RUN   priority hostA     hostB     sleep 200  Oct 23 11:04
5311  user2 PEND  night    hostA               sleep 700  Oct 23 18:17
5310  user1 PEND  night    hostB               myjob      Oct 23 13:45
The parameters associated with a job can be modified after the job has been submitted. The bmodify command allows for changes to the parameters of already submitted jobs. The parameters of a job can be modified only if the job is in pending status.
The bmodify command takes the same options as the bsub command together with a job ID. (See 'Submitting Batch Jobs'.) The given options replace the existing options of the specified job.
To reset an option to its default value, append the n character to the option name. No option value should be specified when resetting an option. For example:
% bmodify -bn 123
Job 123 will be dispatched as soon as possible, ignoring any previously specified start time. The following example shows how bmodify is used to change the start time to 2 A.M.:
% bmodify -b 2:00
Note
The job command line itself and the environment variables present at submission time cannot be modified. In versions of LSF prior to V3.0, the shell option set with the -L argument also could not be modified.
Most of the operations discussed in this chapter can also be performed using the xlsbatch GUI. The main window of xlsbatch is shown in 'Figure 4. xlsbatch Main Window'.
You can view job details by first selecting a job and then clicking the 'Detail' button. The resulting popup window is shown in Figure 11. It gives you the same information as the bjobs -l command.
The 'History' button opens a popup window showing the job history that you would otherwise get from the bhist command.
To perform control actions on jobs, such as killing a job or suspending/resuming a job, simply select the job and then click on an action button.
You can also invoke the xbsub window from inside xlsbatch to submit new jobs. To modify a job parameter, select the job and click the 'Modify' button to open the job modification popup window. This window can also be invoked by running xbmodify from the command line. Figure 12 shows the xbmodify window. It is similar to the xbsub window except that the command line is read-only.
Copyright © 1994-1997 Platform Computing Corporation.
All rights reserved.