Chapter 7. Managing Jobs


LSF JobScheduler provides a single system image for your cluster, so you can use the whole cluster as if it were a single computer. After you have submitted jobs, you can view their status or manipulate them from anywhere in the cluster. This chapter demonstrates the job tracking and manipulation tools in JobScheduler.

Tracking Jobs

Checking Status of your Jobs

The status of a submitted job is one of the following:

PEND
The job is pending, that is, it has not yet been started.
RUN
The job is currently running.
PSUSP
The job has been suspended while it was pending.
USUSP
The job has been suspended while it was running.
SSUSP
The job has been suspended by the JobScheduler system while running.
DONE
The job has terminated with an exit status of zero (0).
EXIT
The job has terminated with a non-zero exit status.
UNKWN
The controlling daemons (mbatchd and sbatchd) have lost communication.
ZOMBI
A job will become ZOMBI if the job is killed by the bkill command while the current status is shown as UNKWN.

Job status is changed by:

See 'Job Status' for further information about job states.
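Scripts that react to job status often need only a coarse category rather than the exact state. The following is a minimal sketch that groups the status codes listed above. The function name status_group and the category names are hypothetical, and placing PSUSP under "suspended" is a choice made for this example, not something JobScheduler defines.

```shell
#!/bin/sh
# Map a JobScheduler status code to a coarse category.
# The categories are an assumption made for this sketch.
status_group() {
    case "$1" in
        PEND)              echo waiting ;;
        RUN)               echo running ;;
        PSUSP|USUSP|SSUSP) echo suspended ;;
        DONE|EXIT)         echo finished ;;
        UNKWN|ZOMBI)       echo unknown ;;
        *)                 echo "unrecognized status: $1" >&2; return 1 ;;
    esac
}

for s in PEND USUSP DONE; do
    printf '%s -> %s\n' "$s" "$(status_group "$s")"
done
# prints:
# PEND -> waiting
# USUSP -> suspended
# DONE -> finished
```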

Use the bjobs command to view the submitted jobs.

% bjobs
JOBID USER     STAT  QUEUE     FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
6848  user1    PEND  sysadm     hostA                   diskcheck  Dec 17 11:52
7142  user1    PEND  sysadm     hostA                   backup     Dec 21 15:45

By default bjobs will only display the jobs you submitted. Use the -u user_name option to view the jobs of other users. Use the reserved user name all to see the jobs of all the users.

% bjobs -u all
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
6745  user2    RUN   business   hostD      hostB        report     Dec 19 09:04
6916  user3    RUN   business   hostA      hostD        analyse    Dec 19 09:05
6848  user1    PEND  sysadm     hostA                   diskcheck  Dec 17 11:52
7142  user1    PEND  sysadm     hostA                   backup     Dec 21 15:45
7157  user4    PEND  night      hostA                   forecast   Dec 18 10:56
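A listing like this can be summarized with standard text tools. The sketch below counts jobs per status, assuming the column layout shown above (a header line first, STAT as the third field). The function name count_by_status is hypothetical.

```shell
#!/bin/sh
# Count jobs per status from bjobs-style output read on standard input.
# Assumes the first line is the header and STAT is the third field.
count_by_status() {
    awk 'NR > 1 && NF > 0 { count[$3]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Sample input copied from the "bjobs -u all" listing above.
count_by_status <<'EOF'
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
6745  user2    RUN   business   hostD      hostB        report     Dec 19 09:04
6916  user3    RUN   business   hostA      hostD        analyse    Dec 19 09:05
6848  user1    PEND  sysadm     hostA                   diskcheck  Dec 17 11:52
7142  user1    PEND  sysadm     hostA                   backup     Dec 21 15:45
7157  user4    PEND  night      hostA                   forecast   Dec 18 10:56
EOF
# prints:
# PEND 3
# RUN 2
```

On a live cluster the same function could read the command output directly, for example: bjobs -u all | count_by_status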

Sometimes you may have forgotten the detailed attributes of your jobs. Use the -l option to view everything about your jobs. You can also specify the jobID to view a particular job.

% bjobs -l 7142
Job Id <7142>, Job Name <backup>, User <user1>, Project <default>, Status <PEND
>, Queue <sysadm>, Command </var/adm/backup/bin/dumpit>
Sat Dec 21 15:04:34: Submitted from host <hostA>, CWD </var/adm>, Specified Hos
ts <hostD>, Dependency Condition(calendar(weekly)); 

PENDING REASONS:
Job dependency condition not satisfied;

SCHEDULING PARAMETERS:
          r15s   r1m    r15m     ut    pg   io     ls    it   tmp   swp    mem
loadSched   -     -      -        -    -     -     -      -    -     -      -
loadStop    -     -      -        -    -     -     -      -    -     -      -

The SCHEDULING PARAMETERS for a job come from the queue's SCHEDULING PARAMETERS, as described in 'Detailed Queue Information'.

Use the -s option to view the suspended jobs only, showing the reason why the jobs are suspended.

% bjobs -s
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1999  user1    PSUSP default    hostA                   jobA       Dec 10 15:33
The job was suspended by user or system admin while pending;

Use the -p option to view the pending jobs only. Along with the job information it also shows the reason why each job was not dispatched during the last dispatch turn.

% bjobs -p
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1999  user1    PSUSP default    hostA                   jobA       Dec 10 15:33
The job was suspended by user or system admin while pending;
5518  user1    PEND  default    hostA                   jobB       Dec 14 10:27
Job dependency condition not satisfied;
8056  user1    PEND  default    hostA                   jobB       Dec 20 11:41
Job dependency condition not satisfied;

If you want to know more details about why your jobs are pending, use both -p and -l options. You can use several options in combination to get more details from the system.
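Because bjobs -p prints each reason on the line after its job, the two can be joined into one line per job. This is a minimal sketch that assumes the layout shown in the listing above: job lines start with a numeric jobID and each reason line follows its job line. The function name pending_reasons is hypothetical.

```shell
#!/bin/sh
# Pair each job in "bjobs -p"-style output with the reason printed beneath it.
pending_reasons() {
    awk '
        $1 ~ /^[0-9]+$/           { id = $1; next }   # job line: remember the jobID
        $1 == "JOBID" || NF == 0  { next }            # skip header and blank lines
        id != ""                  { sub(/; *$/, ""); print id ": " $0 }
    '
}

# Sample input copied from the "bjobs -p" listing above.
pending_reasons <<'EOF'
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1999  user1    PSUSP default    hostA                   jobA       Dec 10 15:33
The job was suspended by user or system admin while pending;
5518  user1    PEND  default    hostA                   jobB       Dec 14 10:27
Job dependency condition not satisfied;
EOF
# prints:
# 1999: The job was suspended by user or system admin while pending
# 5518: Job dependency condition not satisfied
```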

Tracking a Job

You may need to know what has happened to your job since it was submitted. The bhist command displays a summary of the pending, suspended and running time of jobs. Use the -l option to print the time information and a complete history of the scheduling events for each run of your job.

% bhist -l

Job Id <7848>, Job Name <diskcheck>, User <user1>, Project <default>, Command <
                     find / -name core -atime +7 -exec rm {} \;>
Tue Dec 17 11:52:13: Submitted from host <hostA> to Queue <default>, CWD <$HOME
                     >, Dependency Condition <calendar(daily)>;
Sat Dec 21 07:00:12: Started on <hostA>, Pid <29027>;
Sat Dec 21 07:00:12: Running with execution home </home/user1>, Execution CWD <
                     /home/user1>;
Sat Dec 21 07:00:55: Done successfully. The CPU time used is 12.2 seconds;
Sun Dec 22 07:00:05: Started on <hostA>, Pid <986>;
Sun Dec 22 07:00:05: Running with execution home </home/user1>, Execution CWD <
                     /home/user1>;
Sun Dec 22 07:01:18: Done successfully. The CPU time used is 11.9 seconds;
Mon Dec 23 07:00:02: Started on <hostA>, Pid <2892>;
Mon Dec 23 07:00:02: Running with execution home </home/user1>, Execution CWD <
                     /home/user1>;
Mon Dec 23 07:01:13: Done successfully. The CPU time used is 10.5 seconds;
Tue Dec 24 07:00:10: Started on <hostA>, Pid <4905>;
Tue Dec 24 07:00:10: Running with execution home </home/user1>, Execution CWD <
                     /home/user1>;
Tue Dec 24 07:03:31: Done successfully. The CPU time used is 19.7 seconds;
Tue Dec 24 15:17:14: Delete requested by user or administrator <user1>;
Tue Dec 24 15:17:14: Exited. The CPU time used is 0.0 seconds.

Summary of time in seconds spent in various states by Tue Dec 24 15:17:14 1996
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  617057   0        44       0        0        0        617101
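The TOTAL column is the sum of the per-state columns, which a script can use as a sanity check when parsing the summary. The sketch below assumes the two-row layout shown above (a header row beginning with PEND, then one row of seconds, with TOTAL last). The function name check_summary is hypothetical.

```shell
#!/bin/sh
# Read a bhist-style summary and check that the per-state times add up to TOTAL.
check_summary() {
    awk '
        $1 == "PEND" { cols = NF; next }          # header row: remember column count
        cols && NF == cols {
            sum = 0
            for (i = 1; i < cols; i++) sum += $i  # every column except TOTAL
            verdict = (sum == $cols) ? "consistent" : "inconsistent"
            printf "sum=%d total=%d %s\n", sum, $cols, verdict
        }
    '
}

# Sample input copied from the summary above.
check_summary <<'EOF'
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  617057   0        44       0        0        0        617101
EOF
# prints: sum=617101 total=617101 consistent
```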

JobScheduler keeps job history information after the job completes a run, so you can look at the history of jobs that ran in the past. The length of the history depends on how often JobScheduler prunes event log files. The system automatically backs up and prunes the job history log when necessary.

By default, bhist only displays job history from the current event log file. The -n num_logfiles option tells the bhist command to search through the specified number of log files instead. Log files are searched from most recent to oldest, starting with the current event file and continuing through the backup files.

% bhist -n 3

The above command will read the current event file and then the two most recent backup files.

Modifying a Job

The bmodify command allows you to modify the options of a submitted job. Specify the option you want to change together with its new value, using the same option syntax as the bsub command; the new value overrides the old one.

% bmodify -w "calendar(complex)" 7848

To reset an option to its default value, use the option string followed by 'n'. No value should be specified when resetting an option.

% bmodify -wn 7848

Modifying option values will only affect future scheduling of the job. If the job is not dependent on a calendar and has already been started, then bmodify will not change any option values.

The -O option allows you to change the options of a calendar dependent job to affect only the next run of the job.

% bmodify -O -w "calendar(complex)" 7848

After a job has run once with the new values, the old values for the options are restored for subsequent runs. If a job already has been dispatched, then any option changes will take effect the next time the job is scheduled.

Note
All options specified at submission time may be changed except for the job command line and the environment variables.

Changing Queues

You can use the bmodify command to change queues.

% bmodify -q resubmit 7848

The bswitch command also switches one or more unfinished jobs from one queue to another.

% bswitch -J diskcheck resubmit
Job <7848> is switched to queue <resubmit>

Removing a Job

Use the bdel command to remove a calendar-driven job. This command removes a specific job associated with a calendar from the system. If the job is currently running, bdel kills the process before removing the job from the system.

% bdel 3456
Job <3456> is being deleted

You can specify a job by name using the -J option.

% bdel -J jobA
Job <3457> is being deleted

Use the bkill command on a job that depends on another job or on a file or external event.

% bkill 3467
Job <3467> is being terminated

You can use bkill to send an arbitrary signal to your job using the -s option. You can specify either the signal name or the signal number. On most versions of UNIX, signal names and numbers are listed in the kill(1) or signal(2) manual page.

% bkill -s SIGTSTP 3488
Job <3488> is being signalled

This example sends the SIGTSTP signal (terminal stop) to the job.

Note
Different operating systems use different numbering sequences for signals. Therefore, signal numbers are translated across platforms. The intended meaning of a signal is interpreted by the machine from which the bkill command is issued. For example, if you send signal 24 from a SUN Solaris host, it means SIGTSTP. If the job is running on an HP-UX server, SIGTSTP is defined as signal number 25, so signal 25 is sent to the job.
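The translation described in this note amounts to two lookups: number to name on the host where bkill is issued, then name to number on the execution host. The sketch below illustrates the idea only; it is not how JobScheduler implements it. The tables contain just the one mapping given in the note, and the function names and the platform labels solaris and hpux are hypothetical.

```shell
#!/bin/sh
# Illustrative lookup tables holding only the mapping from the note above:
# signal 24 on Solaris and signal 25 on HP-UX both mean SIGTSTP.
signal_name() {       # signal_name PLATFORM NUMBER  -> prints the signal name
    case "$1:$2" in
        solaris:24) echo SIGTSTP ;;
        hpux:25)    echo SIGTSTP ;;
        *)          return 1 ;;
    esac
}

signal_number() {     # signal_number PLATFORM NAME  -> prints the signal number
    case "$1:$2" in
        solaris:SIGTSTP) echo 24 ;;
        hpux:SIGTSTP)    echo 25 ;;
        *)               return 1 ;;
    esac
}

# Translate a number from the submitting platform to the execution platform
# by going through the signal name.
translate_signal() {  # translate_signal FROM TO NUMBER
    name=$(signal_name "$1" "$3") || return 1
    signal_number "$2" "$name"
}

translate_signal solaris hpux 24   # prints 25
```

Specifying the signal by name, as in the bkill -s SIGTSTP example above, avoids depending on any one platform's numbering.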

Using bkill on a job associated with calendars kills the current run, if the job has been started, and requeues the job. If the job is not currently running, bkill has no effect. To permanently remove the job, use bdel.

You can only delete or kill your own jobs. Only the JobScheduler administrators can operate on jobs submitted by other users.

Specifying Number of Executions

You can specify the number of times your job will execute. Submit the job, then use the -n num_runs option of the bdel command. After the job has run the specified number of times, it is deleted from the system.

% bsub -w "calendar(daily)" -J jobA command
Job <8087> is submitted to default queue <default>.
% bdel -n 5 -J jobA
Job <8087> will be deleted after running next 5 times

Suspending and Resuming Jobs

The bstop and bresume commands are convenient aliases for bkill -s, sending the SIGSTOP/SIGTSTP and SIGCONT signals respectively.

You cannot send arbitrary signals to a pending job; most signals are only valid for running jobs. However, you can send kill, suspend and resume signals to pending jobs.

% bstop -J diskcheck
Job <7848> is being stopped

bstop sends the SIGSTOP signal to sequential jobs and SIGTSTP to parallel jobs. SIGTSTP is sent to a parallel job so the master process can trap the signal and pass it to all the slave processes running on other hosts. Suspending causes your job to go into USUSP state if it has already started, or to go into PSUSP state if it is pending.

% bresume -J diskcheck
Job <7848> is being resumed

Resuming a user-suspended job does not immediately put your job into RUN state. The job must first satisfy its dependency conditions. bresume first puts your job into SSUSP state. The job can then be scheduled accordingly.

Note
Sending arbitrary signals to a job running on a Windows NT machine is not supported. You can only use the bstop and bresume commands on a job running on Windows NT.

Managing Related Jobs

After you have submitted a number of jobs all assigned the same job_name (see 'Grouping Related Jobs'), you can use that job_name to refer to the jobs as a group.

% bjobs -J job_group
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
8315  user1    PEND  default    hostA                   job_group  Dec 24 15:17
8316  user1    PEND  default    hostA                   job_group  Dec 24 15:18
8318  user1    RUN   default    hostA                   job_group  Dec 24 15:22

To switch all jobs in a group to another queue:

% bswitch -J job_group normal 0
Job <8315> is switched to queue <normal>
Job <8316> is switched to queue <normal>
Job <8318> is switched to queue <normal>

To suspend all jobs in a group:

% bstop -J job_group 0
Job <8315> is being stopped
Job <8316> is being stopped
Job <8318> is being stopped

To remove all jobs in a group:

% bdel -J job_group 0
Job <8315> is being deleted
Job <8316> is being deleted
Job <8318> is being deleted

To show the history of all jobs in a group:

% bhist -J job_group 0
Summary of time in seconds spent in various states:
JOBID USER    JOB_NAME   PEND    PSUSP   RUN     USUSP   SSUSP   UNKWN   TOTAL
8315  user1   *b_group   17      49      39      81      0       0       186
8316  user1   *b_group   84      50      0       0       0       0       134
8318  user1   *b_group   18      50      43      19      0       0       130

Note
For the bjobs command all jobs of the same name are considered. For the other commands, only the last submitted job of this name is considered by default, unless the special jobID 0 is specified. The bmodify command only operates on a single jobID.
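Since bmodify accepts only a single jobID, changing every job in a group means issuing one bmodify per job. The sketch below builds those commands from bjobs -J style output, assuming job lines start with a numeric jobID as in the listings above. It only prints the commands (a dry run) so they can be reviewed before use. The function name group_modify_commands is hypothetical, and option values containing spaces or shell metacharacters would need extra quoting.

```shell
#!/bin/sh
# Print one bmodify command per job found in bjobs-style output on stdin.
# Usage: group_modify_commands OPTION... < bjobs_output
group_modify_commands() {
    while read -r jobid rest; do
        case "$jobid" in
            ''|*[!0-9]*) continue ;;   # skip the header and any non-job lines
        esac
        printf '%s\n' "bmodify $* $jobid"
    done
}

# Sample input copied from the "bjobs -J job_group" listing above.
group_modify_commands -q normal <<'EOF'
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
8315  user1    PEND  default    hostA                   job_group  Dec 24 15:17
8316  user1    PEND  default    hostA                   job_group  Dec 24 15:18
8318  user1    RUN   default    hostA                   job_group  Dec 24 15:22
EOF
# prints:
# bmodify -q normal 8315
# bmodify -q normal 8316
# bmodify -q normal 8318
```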

Using JobScheduler GUI to Manage Jobs

You may prefer to use the JobScheduler GUI to manage your jobs. xlsbatch displays all JobScheduler entities such as jobs, queues, and hosts. It also allows you to manipulate jobs directly from the GUI. Figure 24 is the xlsbatch job information window.

Figure 24. xlsbatch Job Window

By selecting jobs in the job window, you can directly perform the job manipulations discussed in this chapter. For example, if you select a job and then click on the 'Detail' button, a pop-up window opens showing all details of the selected job, as shown in Figure 25.

Figure 26 shows the detailed job history window that opens when you click on the 'History' button.

Figure 25. Detailed Job Information Window

Figure 26. Detailed Job History Window

Detailed usage of xlsbatch is described in the on-line help.

If you want to modify a job, you can either click on the 'Modify' button or run the xbmodify GUI directly from the command line. A job modification window will pop up, as shown in Figure 27.

Figure 27. Job Modification Window

By clicking on the 'Job ID' button, you get another pop-up window listing all jobIDs for you to select from, as shown in Figure 28.

Figure 28. JobID Selection Window


doc@platform.com

Copyright © 1994-1997 Platform Computing Corporation.
All rights reserved.