This chapter shows how to use LSBLIB to access the services provided by LSF Batch and LSF JobScheduler. Since LSF Batch and LSF JobScheduler are built on top of the LSF Base, LSBLIB relies on services provided by LSLIB. Thus if you use LSBLIB functions, you must link your program with both LSLIB and LSBLIB.
LSF Batch and LSF JobScheduler services are mostly provided by mbatchd, except services for processing event and job log files which do not involve any daemons. LSBLIB is shared by both LSF Batch and LSF JobScheduler. The functions described for LSF Batch in this chapter also apply to LSF JobScheduler, unless explicitly indicated otherwise.
Before accessing any of the services provided by LSF Batch and LSF JobScheduler, an application must initialize LSBLIB. It does this by calling the following function:
int lsb_init(appname);
On success, it returns 0; otherwise, it returns -1 and sets lsberrno to indicate the error.
The parameter appname is used only if you want to log detailed messages about the transactions inside LSLIB for debugging purposes. The messages are logged only if LSB_CMD_LOG_MASK is defined as LOG_DEBUG1.
The messages will be logged in file LSF_LOGDIR/appname. If appname is NULL, the log file is LSF_LOGDIR/bcmd.
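The log file naming rule can be sketched as a small helper. This is only an illustration of the rule stated above: debug_log_path() is a hypothetical name, not an LSBLIB function, and the caller is assumed to already know the value of LSF_LOGDIR.

```c
#include <stdio.h>

/* Hypothetical helper, not part of LSBLIB: builds the name of the debug
 * log file described for lsb_init(appname).
 * Returns 0 on success, -1 if buf is too small for the result. */
int debug_log_path(char *buf, size_t len, const char *logdir, const char *appname)
{
    /* A NULL appname selects the default log file name "bcmd". */
    const char *name = (appname != NULL) ? appname : "bcmd";
    int n = snprintf(buf, len, "%s/%s", logdir, name);

    return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```

Passing NULL for appname yields a path ending in bcmd; any other string is used as the file name unchanged.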
Note
This function must be called before any other function in LSBLIB can be called.
LSF Batch queues hold the jobs in the LSF Batch system and set scheduling policies and limits on resource usage.
LSBLIB provides a function to get information about the queues in the LSF Batch system. This includes queue name, parameters, statistics, status, resource limits, scheduling policies and parameters, and users and hosts associated with the queue.
The example program under discussion in this section uses the following LSBLIB function to get the queue information:
struct queueInfoEnt *lsb_queueinfo(queues, numQueues, hostname, username, options)
On success, this function returns an array containing a queueInfoEnt structure (see below) for each queue of interest and sets *numQueues to the size of the array. On failure, it returns NULL and sets lsberrno to indicate the error. It has the following parameters:
char **queues;    /* An array containing names of queues of interest */
int *numQueues;   /* The number of names in queues */
char *hostname;   /* Only queues using hostname are of interest */
char *username;   /* Only queues enabled for user are of interest */
int options;      /* Reserved for future use; supply 0 */
To get information on all queues, set *numQueues to 0; *numQueues will be updated to the actual number of queues returned on a successful return.
If *numQueues is 1 and queues is NULL, information on the system default queue is returned.
If hostname is not NULL, only queues that use host hostname as a batch server host are returned. If username is not NULL, only queues to which user username is allowed to submit jobs are returned.
The queueInfoEnt structure is defined in lsbatch.h as:
struct queueInfoEnt {
    char *queue;            /* Name of the queue */
    char *description;      /* Description of the queue */
    int priority;           /* Priority of the queue */
    short nice;             /* Nice value at which jobs in the queue will be run */
    char *userList;         /* Users allowed to submit jobs to the queue */
    char *hostList;         /* Hosts to which jobs in the queue may be dispatched */
    int nIdx;               /* Size of the loadSched and loadStop arrays */
    float *loadSched;       /* Load thresholds that control scheduling of jobs from the queue */
    float *loadStop;        /* Load thresholds that control suspension of jobs from the queue */
    int userJobLimit;       /* Number of unfinished jobs a user can dispatch from the queue */
    int procJobLimit;       /* Number of unfinished jobs the queue can dispatch to a processor */
    char *windows;          /* Queue run window */
    int rLimits[LSF_RLIM_NLIMITS];  /* The per-process resource limits for jobs */
    char *hostSpec;         /* Obsolete. Use defaultHostSpec instead */
    int qAttrib;            /* Attributes of the queue */
    int qStatus;            /* Status of the queue */
    int maxJobs;            /* Job slot limit of the queue */
    int numJobs;            /* Total number of job slots required by all jobs */
    int numPEND;            /* Number of job slots needed by pending jobs */
    int numRUN;             /* Number of job slots used by running jobs */
    int numSSUSP;           /* Number of job slots used by system suspended jobs */
    int numUSUSP;           /* Number of job slots used by user suspended jobs */
    int mig;                /* Queue migration threshold in minutes */
    int schedDelay;         /* Schedule delay for new jobs */
    int acceptIntvl;        /* Minimum interval between two jobs dispatched to the same host */
    char *windowsD;         /* Queue dispatch window */
    char *nqsQueues;        /* A blank-separated list of NQS queue specifiers */
    char *userShares;       /* A blank-separated list of user shares */
    char *defaultHostSpec;  /* Value of DEFAULT_HOST_SPEC for the queue in lsb.queues */
    int procLimit;          /* Maximum number of job slots a job can take */
    char *admins;           /* Queue level administrators */
    char *preCmd;           /* Queue level pre-exec command */
    char *postCmd;          /* Queue's post-exec command */
    char *requeueEValues;   /* Queue's requeue exit status */
    int hostJobLimit;       /* Per host job slot limit */
    char *resReq;           /* Queue level resource requirement */
    int numRESERVE;         /* Reserved job slots for pending jobs */
    int slotHoldTime;       /* Time period for reserving job slots */
    char *sndJobsTo;        /* Remote queues to forward jobs to */
    char *rcvJobsFrom;      /* Remote queues which can forward to me */
    char *resumeCond;       /* Conditions to resume jobs */
    char *stopCond;         /* Conditions to suspend jobs */
    char *jobStarter;       /* Queue level job starter */
    char *suspendActCmd;    /* Action commands for SUSPEND */
    char *resumeActCmd;     /* Action commands for RESUME */
    char *terminateActCmd;  /* Action commands for TERMINATE */
    int sigMap[LSB_SIG_NUM];  /* Configurable signal mapping */
    char *preemption;       /* Preemption policy */
    int maxRschedTime;      /* Time period for remote cluster to schedule job */
};
The variable nIdx is the number of load threshold values for job scheduling. This is in fact the total number of load indices as returned by LIM. The parameters sndJobsTo, rcvJobsFrom, and maxRschedTime are only used with LSF MultiCluster.
For a complete description of the fields in the queueInfoEnt structure, see the lsb_queueinfo(3) man page.
The program below takes the first argument as a queue name and displays information about the named queue.
#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsbatch.h>

int main(int argc, char *argv[])
{
    struct queueInfoEnt *qInfo;
    int numQueues = 1;
    char *queue;

    if (argc != 2) {
        printf("Usage: %s queue_name\n", argv[0]);
        exit(-1);
    }
    queue = argv[1];

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init()");
        exit(-1);
    }

    /* ask for exactly one queue: the one named on the command line */
    qInfo = lsb_queueinfo(&queue, &numQueues, NULL, NULL, 0);
    if (qInfo == NULL) {
        lsb_perror("lsb_queueinfo()");
        exit(-1);
    }

    printf("Information about %s queue:\n", queue);
    printf("Description: %s\n", qInfo[0].description);
    printf("Priority: %d Nice: %d \n", qInfo[0].priority, qInfo[0].nice);

    printf("Maximum number of job slots:");
    if (qInfo[0].maxJobs < INFINIT_INT)
        printf("%5d\n", qInfo[0].maxJobs);
    else
        printf("%5s\n", "unlimited");    /* no limit configured */

    printf("Job slot statistics: PEND(%d) RUN(%d) SUSP(%d) TOTAL(%d).\n",
           qInfo[0].numPEND, qInfo[0].numRUN,
           qInfo[0].numSSUSP + qInfo[0].numUSUSP, qInfo[0].numJobs);
    exit(0);
}
The header file lsbatch.h must be included with every application that uses LSBLIB functions. Note that lsf.h does not have to be explicitly included in your program because lsbatch.h already has lsf.h included. The function lsb_perror() is used in much the same way ls_perror() is used to print error messages regarding function call failure. You could check lsberrno if you want to take different actions for different errors.
In the above program, INFINIT_INT is defined in lsf.h and is used to indicate that there is no limit set for maxJobs. This applies to all LSF API function calls. LSF will supply INFINIT_INT automatically whenever the value for the variable is either invalid (not available) or infinity. This value should be checked for all variables that are optional. For example, if you were to display the loadSched/loadStop values, an INFINIT_INT indicates that the threshold is not configured and is ignored.
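The INFINIT_INT check tends to recur for every optional limit, so it can be wrapped in a small formatting helper. The sketch below is illustrative only: format_limit() is a hypothetical name, and the fallback definition of INFINIT_INT is an assumption made so the fragment compiles without lsf.h; a real program would take the value from the header.

```c
#include <stdio.h>

/* INFINIT_INT normally comes from lsf.h. The fallback value below is an
 * assumption made only so that this sketch compiles on its own. */
#ifndef INFINIT_INT
#define INFINIT_INT 0x7fffffff
#endif

/* Hypothetical helper: writes value into buf as a decimal number, or
 * writes unset_text when the value is INFINIT_INT (no limit configured). */
void format_limit(char *buf, size_t len, int value, const char *unset_text)
{
    if (value < INFINIT_INT)
        snprintf(buf, len, "%d", value);
    else
        snprintf(buf, len, "%s", unset_text);
}
```

A caller might pass "unlimited" for a job slot limit and "-" for a per-user limit, matching the way the example programs in this chapter print unset values.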
Note
Like the data structures returned by LSLIB functions, a data structure returned by an LSBLIB function is dynamically allocated inside LSBLIB and is automatically freed the next time the same function is called. You should not attempt to free the space allocated by LSBLIB. If you need to keep this information across calls, make your own copy of the data structure.
The above program will produce output similar to the following:
Information about normal queue:
Description: For normal low priority jobs
Priority: 25 Nice: 20
Maximum number of job slots : 40
Job slot statistics: PEND( 5) RUN(12) SUSP(1) TOTAL(18)
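Because LSBLIB frees a returned structure the next time the same function is called, any field you want to keep must be copied, including the strings it points to. The sketch below shows one way to do that for a few fields. The struct queueSummary type and the helper names are hypothetical stand-ins so that the fragment compiles without lsbatch.h; a real program would copy from struct queueInfoEnt.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for the few queueInfoEnt fields kept here (hypothetical type). */
struct queueSummary {
    char *queue;
    char *description;
    int   priority;
};

/* Duplicates a string; a NULL input stays NULL. */
static char *dup_string(const char *s)
{
    char *copy;

    if (s == NULL)
        return NULL;
    copy = malloc(strlen(s) + 1);
    if (copy != NULL)
        strcpy(copy, s);
    return copy;
}

/* Deep-copies src into dst so that dst remains valid after the library
 * reuses its own storage. Returns 0 on success, -1 if memory runs out. */
int copy_queue_summary(struct queueSummary *dst, const struct queueSummary *src)
{
    dst->queue = dup_string(src->queue);
    dst->description = dup_string(src->description);
    dst->priority = src->priority;

    if ((src->queue != NULL && dst->queue == NULL) ||
        (src->description != NULL && dst->description == NULL)) {
        free(dst->queue);
        free(dst->description);
        dst->queue = dst->description = NULL;
        return -1;
    }
    return 0;
}

/* Releases the strings owned by a copy made with copy_queue_summary(). */
void free_queue_summary(struct queueSummary *q)
{
    free(q->queue);
    free(q->description);
    q->queue = q->description = NULL;
}
```

Only the copy is ever passed to free(); the structure returned by LSBLIB is left for the library to release.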
LSF Batch server hosts execute the jobs in the LSF Batch system.
LSBLIB provides a function to get information about the server hosts in the LSF Batch system. This includes both configured static information as well as dynamic information. Examples of host information include host name, status, job limits and statistics, dispatch windows and scheduling parameters.
The example program to be discussed in this section uses the following LSBLIB function:
struct hostInfoEnt *lsb_hostinfo(hosts, numHosts)
This function gets information about LSF Batch server hosts. On success, it returns an array of hostInfoEnt structures which hold the host information and sets *numHosts to the size of the array. On failure, it returns NULL and sets lsberrno to indicate the error. It has the following parameters:
char **hosts;    /* An array of names of hosts of interest */
int *numHosts;   /* The number of names in hosts */
To get information on all hosts, set *numHosts to 0; *numHosts will be set to the actual number of hostInfoEnt structures when this call returns successfully.
If *numHosts is 1 and hosts is NULL, information on the local host is returned.
The hostInfoEnt structure is defined in lsbatch.h as:
struct hostInfoEnt {
    char *host;         /* Name of the host */
    int hStatus;        /* Status of host. (see below) */
    int busySched;      /* Reason host will not schedule jobs */
    int busyStop;       /* Reason host has suspended jobs */
    float cpuFactor;    /* Host CPU factor, as returned by LIM */
    int nIdx;           /* Size of the loadSched and loadStop arrays, as returned from LIM */
    float *load;        /* Load LSF Batch used for scheduling batch jobs */
    float *loadSched;   /* Load thresholds that control scheduling of jobs on host */
    float *loadStop;    /* Load thresholds that control suspension of jobs on host */
    char *windows;      /* Host dispatch window */
    int userJobLimit;   /* Maximum number of jobs a user can run on host */
    int maxJobs;        /* Maximum number of jobs that host can process concurrently */
    int numJobs;        /* Number of jobs running or suspended on host */
    int numRUN;         /* Number of jobs running on host */
    int numSSUSP;       /* Number of jobs suspended by sbatchd on host */
    int numUSUSP;       /* Number of jobs suspended by a user on host */
    int mig;            /* Migration threshold for jobs on host */
    int attr;           /* Host attributes */
#define H_ATTR_CHKPNTABLE 0x1
#define H_ATTR_CHKPNT_COPY 0x2
    float *realLoad;    /* The load mbatchd obtained from LIM */
    int numRESERVE;     /* Num of slots reserved for pending jobs */
    int chkSig;         /* This variable is obsolete */
};
Note the difference between the host information returned by the LSLIB function ls_gethostinfo() and the host information returned by the LSBLIB function lsb_hostinfo(). The former returns general information about the hosts, whereas the latter returns LSF Batch specific information about hosts.
For a complete description of the fields in the hostInfoEnt structure, see the lsb_hostinfo(3) man page.
The example program below takes a host name as an argument and displays various information about the named host. It is a simplified version of the LSF Batch bhosts command:
#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsbatch.h>

int main(int argc, char *argv[])
{
    struct hostInfoEnt *hInfo;
    int numHosts = 1;
    char *hostname;

    if (argc != 2) {
        printf("Usage: %s hostname\n", argv[0]);
        exit(-1);
    }
    hostname = argv[1];

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    /* ask for exactly one host: the one named on the command line */
    hInfo = lsb_hostinfo(&hostname, &numHosts);
    if (hInfo == NULL) {
        lsb_perror("lsb_hostinfo");
        exit(-1);
    }

    printf("HOST_NAME STATUS JL/U NJOBS RUN SSUSP USUSP\n");
    printf("%-18.18s", hInfo->host);

    if (hInfo->hStatus & HOST_STAT_UNLICENSED) {
        printf(" %-9s\n", "unlicensed");
        exit(0);    /* don't print other info */
    } else if (hInfo->hStatus & HOST_STAT_UNAVAIL)
        printf(" %-9s", "unavail");
    else if (hInfo->hStatus & HOST_STAT_UNREACH)
        printf(" %-9s", "unreach");
    else if (hInfo->hStatus & (HOST_STAT_BUSY | HOST_STAT_WIND |
                               HOST_STAT_DISABLED | HOST_STAT_LOCKED |
                               HOST_STAT_FULL | HOST_STAT_NO_LIM))
        printf(" %-9s", "closed");
    else
        printf(" %-9s", "ok");

    if (hInfo->userJobLimit < INFINIT_INT)
        printf("%4d", hInfo->userJobLimit);
    else
        printf("%4s", "-");    /* no per-user limit configured */

    printf("%7d %4d %4d %4d\n", hInfo->numJobs, hInfo->numRUN,
           hInfo->numSSUSP, hInfo->numUSUSP);
    exit(0);
}
hStatus is the status of the host. It is the bitwise inclusive OR of some of the following constants defined in lsbatch.h:
If none of the above holds, hStatus is set to HOST_STAT_OK to indicate that the host is ready to accept and run jobs.
The constant INFINIT_INT defined in lsf.h is used to indicate that there is no limit set for userJobLimit.
The example output from the above program follows:
% a.out hostB
HOST_NAME STATUS JL/U NJOBS RUN SSUSP USUSP
hostB ok - 2 1 1 0
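The order in which the host example tests the hStatus bits matters, because several bits can be set at once. That precedence can be isolated in a helper such as the sketch below. host_status_text() is a hypothetical name, and the numeric values given for the HOST_STAT_* constants are placeholders (an assumption) so that the fragment compiles without lsbatch.h; a real program must use the definitions from the header.

```c
/* The HOST_STAT_* constants normally come from lsbatch.h. The values
 * below are placeholders, assumed only so the sketch compiles alone. */
#ifndef HOST_STAT_OK
#define HOST_STAT_OK         0x0
#define HOST_STAT_BUSY       0x01
#define HOST_STAT_WIND       0x02
#define HOST_STAT_DISABLED   0x04
#define HOST_STAT_LOCKED     0x08
#define HOST_STAT_FULL       0x10
#define HOST_STAT_UNREACH    0x20
#define HOST_STAT_UNAVAIL    0x40
#define HOST_STAT_UNLICENSED 0x80
#define HOST_STAT_NO_LIM     0x100
#endif

/* Hypothetical helper: maps an hStatus bit mask to the single word the
 * host example prints, testing the bits in the same order it does. */
const char *host_status_text(int hStatus)
{
    if (hStatus & HOST_STAT_UNLICENSED)
        return "unlicensed";
    if (hStatus & HOST_STAT_UNAVAIL)
        return "unavail";
    if (hStatus & HOST_STAT_UNREACH)
        return "unreach";
    if (hStatus & (HOST_STAT_BUSY | HOST_STAT_WIND | HOST_STAT_DISABLED |
                   HOST_STAT_LOCKED | HOST_STAT_FULL | HOST_STAT_NO_LIM))
        return "closed";
    return "ok";    /* no bits set: the host can accept and run jobs */
}
```

With this ordering, a host that is both busy and unavailable is reported as unavail rather than closed.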
Job submission and modification are the most common operations in the LSF Batch system. A user can submit a job to the system and then modify it as long as the job has not been started.
LSBLIB provides one function for job submission and one function for job modification:
int lsb_submit(jobSubReq, jobSubReply)
int lsb_modify(jobSubReq, jobSubReply, jobId)
On success, these calls return the job ID. Otherwise, -1 is returned with lsberrno set to indicate the error. These two functions are similar except that lsb_modify() modifies the parameters of an already submitted job.
Both of these functions take the same job request and reply structures as parameters:
struct submit *jobSubReq;          /* Job specifications */
struct submitReply *jobSubReply;   /* Results of job submission */
int jobId;                         /* Id of the job to modify (lsb_modify() only) */
The submit structure is defined in lsbatch.h as:
struct submit {
    int options;            /* Indicates which optional fields are present */
    char *jobName;          /* Job name (optional) */
    char *queue;            /* Submit the job to this queue (optional) */
    int numAskedHosts;      /* Size of askedHosts (optional) */
    char **askedHosts;      /* An array of names of candidate hosts (optional) */
    char *resReq;           /* Resource requirements of the job (optional) */
    int rLimits[LSF_RLIM_NLIMITS];  /* Limits on system resource use by all of the job's processes */
    char *hostSpec;         /* Host model used for scaling rLimits (optional) */
    int numProcessors;      /* Initial number of processors needed by the job */
    char *dependCond;       /* Job dependency condition (optional) */
    time_t beginTime;       /* Dispatch the job on or after beginTime */
    time_t termTime;        /* Job termination deadline */
    int sigValue;           /* This variable is obsolete */
    char *inFile;           /* Path name of the job's standard input file (optional) */
    char *outFile;          /* Path name of the job's standard output file (optional) */
    char *errFile;          /* Path name of the job's standard error output file (optional) */
    char *command;          /* Command line of the job */
    time_t chkpntPeriod;    /* Job is checkpointable with this period (optional) */
    char *chkpntDir;        /* Directory for this job's chk directory (optional) */
    int nxf;                /* Size of xf (optional) */
    struct xFile *xf;       /* An array of file transfer specifications (optional) */
    char *preExecCmd;       /* Job's pre-execution command (optional) */
    char *mailUser;         /* User E-mail address to which the job's output is mailed (optional) */
    int delOptions;         /* Bits to be removed from options (lsb_modify() only) */
    char *projectName;      /* Name of the job's project (optional) */
    int maxNumProcessors;   /* Requested maximum num of job slots for the job */
    char *loginShell;       /* Login shell to be used to re-initialize environment */
};
For a complete description of the fields in the submit structure, see the lsb_submit(3) man page.
The submitReply structure is defined in lsbatch.h as:
struct submitReply {
    char *queue;        /* The queue name the job was submitted to */
    int badJobId;       /* dependCond contains badJobId but there is no such job */
    char *badJobName;   /* dependCond contains badJobName but there is no such job */
    int badReqIndx;     /* Index of a host or resource limit that caused an error */
};
The last three variables in the submitReply structure are used only when a call to lsb_submit() or lsb_modify() fails.
For a complete description of the fields in the submitReply structure, see the lsb_submit(3) man page.
To submit a new job, fill out the submit structure and then call lsb_submit(). The LSF Batch system ignores the delOptions variable in an lsb_submit() call.
The example job submission program below takes the job command line as an argument and submits the job to the LSF Batch system. For simplicity, it assumes that the command does not have arguments.
#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsbatch.h>

int main(int argc, char *argv[])
{
    struct submit req;
    struct submitReply reply;
    int jobId;
    int i;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s command\n", argv[0]);
        exit(-1);
    }

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    /* no optional fields are present; set every mandatory field */
    req.options = 0;
    req.resReq = NULL;
    for (i = 0; i < LSF_RLIM_NLIMITS; i++)
        req.rLimits[i] = DEFAULT_NUMBER;    /* no resource limits */
    req.hostSpec = NULL;
    req.numProcessors = 1;
    req.beginTime = 0;                      /* start as soon as possible */
    req.termTime = 0;                       /* no termination deadline */
    req.command = argv[1];
    req.nxf = 0;
    req.delOptions = 0;

    jobId = lsb_submit(&req, &reply);
    if (jobId < 0) {
        switch (lsberrno) {
        case LSBE_QUEUE_USE:
        case LSBE_QUEUE_CLOSED:
            /* the error concerns the queue, so report its name */
            lsb_perror(reply.queue);
            exit(-1);
        default:
            lsb_perror(NULL);
            exit(-1);
        }
    }

    printf("Job <%d> is submitted to default queue <%s>.\n",
           jobId, reply.queue);
    exit(0);
}
The options field of the submit structure is the bitwise inclusive OR of some of the SUB_* flags defined in lsbatch.h. These flags serve two purposes. Some flags indicate which of the optional fields of the submit structure are present. Those that are not present have default values. Other flags indicate submission options. For a description of these flags, see lsb_submit(3).
Since options indicates which of the optional fields are meaningful, you do not need to initialize the optional fields that are not selected by options. All parameters that are not optional must be initialized properly.
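The pairing of an optional field with its flag in options can be sketched as follows. The struct submitSketch type and the helper names are hypothetical, and the numeric values given for SUB_QUEUE and SUB_OUT_FILE are placeholders (an assumption) so that the fragment compiles without lsbatch.h; a real program would use struct submit and the SUB_* definitions from the header.

```c
#include <stddef.h>

/* SUB_QUEUE and SUB_OUT_FILE are SUB_* flags normally defined in
 * lsbatch.h. The values below are placeholders for this sketch only. */
#ifndef SUB_QUEUE
#define SUB_QUEUE    0x02
#endif
#ifndef SUB_OUT_FILE
#define SUB_OUT_FILE 0x10
#endif

/* Stand-in for a few fields of struct submit (hypothetical type). */
struct submitSketch {
    int   options;
    char *queue;
    char *outFile;
    char *command;
};

/* Starts a request with no optional fields present. */
void init_request(struct submitSketch *req, char *command)
{
    req->options = 0;
    req->queue = NULL;
    req->outFile = NULL;
    req->command = command;
}

/* Sets the queue field and marks it as present in options. */
void request_queue(struct submitSketch *req, char *queue)
{
    req->queue = queue;
    req->options |= SUB_QUEUE;
}

/* Sets the output file field and marks it as present in options. */
void request_out_file(struct submitSketch *req, char *outFile)
{
    req->outFile = outFile;
    req->options |= SUB_OUT_FILE;
}
```

Setting the field and its flag in one place avoids the case where a field is filled in but silently ignored because its bit was never set.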
If the resReq field of the submit structure is NULL, LSBLIB will try to obtain resource requirements for command from the remote task list (see 'Getting Task Resource Requirements'). If the task does not appear in the remote task list, then NULL is passed to the LSF Batch system. mbatchd will then use the default resource requirements with the DFT_FROMTYPE option bit set when making an LSLIB call for host selection from LIM. See 'Handling Default Resource Requirements' for more information about default resource requirements.
The constant DEFAULT_NUMBER defined in lsbatch.h indicates that there is no limit on a resource.
The constants used to index the rLimits array of the submit structure are defined in lsf.h, and the resource limits currently supported by LSF Batch are listed in Table 3.
The hostSpec field of the submit structure specifies the host model to use for scaling rLimits[LSF_RLIMIT_CPU] and rLimits[LSF_RLIMIT_RUN] (see lsb_queueinfo(3)). If hostSpec is NULL, the local host's model is assumed.
If the beginTime field of the submit structure is 0, the job is started as soon as possible.
If the termTime field of the submit structure is 0, the job is allowed to run until it reaches a resource limit.
The above example checks the value of lsberrno when lsb_submit() fails. Different actions can be taken depending on the type of the error. All possible error numbers are defined in lsbatch.h. For example, error number LSBE_QUEUE_USE indicates that the user is not authorized to use the queue. The error number LSBE_QUEUE_CLOSED indicates that the queue is closed.
Since a queue name was not specified for the job, the job will be submitted to the default queue. The queue field of the submitReply structure contains the name of the queue to which the job was submitted.
The above program will produce output similar to the following:
Job <5602> is submitted to default queue <default>.
The output from the job will be mailed to the user because the program did not specify a file name for the outFile parameter in the submit structure.
To modify an already submitted job, fill out a new submit structure to override existing parameters, and set delOptions to remove option bits that were previously specified for the job. Essentially, modifying a submitted job is like re-submitting it, so the same program as above can be used to modify an existing job with minor changes. The one additional parameter that must be specified for job modification is the job ID.
Note
All applications that call lsb_submit() and lsb_modify() are subject to authentication constraints described in 'Authentication'.
LSBLIB provides functions to get status information about batch jobs. Since the number of jobs in the LSF Batch system could be on the order of many thousands, getting all this information in one message could potentially use a lot of memory space. LSBLIB allows the application to open a stream connection and then read the job records one by one. This way the memory space needed is always the size of one job record.
The function calls used to get job information are:
int lsb_openjobinfo(jobId, jobName, user, queue, host, options);
struct jobInfoEnt *lsb_readjobinfo(more);
void lsb_closejobinfo(void);
These functions are used to open a job information connection with mbatchd, read job records, and then close the job information connection.
The lsb_openjobinfo() function takes the following arguments:
int jobId;       /* Select job with the given job Id */
char *jobName;   /* Select job(s) with the given job name */
char *user;      /* Select job(s) submitted by the named user or user group */
char *queue;     /* Select job(s) submitted to the named queue */
char *host;      /* Select job(s) that are dispatched to the named host */
int options;     /* Selection flags constructed from the bits defined in lsbatch.h */
The options parameter contains additional job selection flags defined in lsbatch.h. These are:
If options is 0, then the default is CUR_JOB.
lsb_openjobinfo() returns the total number of matching job records in the connection. It returns -1 on failure and sets lsberrno to indicate the error.
lsb_readjobinfo() takes one argument:
int *more; /* If not NULL, contains the remaining number of jobs unread */
Either this parameter or the return value from the lsb_openjobinfo() can be used to keep track of the number of job records that can be returned from the connection. This parameter is updated each time lsb_readjobinfo() is called.
lsb_closejobinfo() should be called after receiving all job records in the connection.
Below is an example of a simplified bjobs command. This program displays all pending jobs belonging to all users.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <lsf/lsbatch.h>

int main(int argc, char *argv[])
{
    int options = PEND_JOB;
    char *user = "all";    /* match jobs for all users */
    struct jobInfoEnt *job;
    int more;

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    /* open the job information connection with mbatchd */
    if (lsb_openjobinfo(0, NULL, user, NULL, NULL, options) < 0) {
        lsb_perror("lsb_openjobinfo");
        exit(-1);
    }

    printf("All pending jobs submitted by all users:\n");
    for (;;) {
        /* read one job record; more is set to the number still unread */
        job = lsb_readjobinfo(&more);
        if (job == NULL) {
            lsb_perror("lsb_readjobinfo");
            exit(-1);
        }

        /* display the job */
        printf("%s:\nJob <%d> of user <%s>, submitted from host <%s>\n",
               ctime(&job->submitTime), job->jobId, job->user,
               job->fromHost);

        if (!more)
            break;
    }

    lsb_closejobinfo();
    exit(0);
}
If you want to print out the reasons why the job is still pending, you can use the function lsb_pendreason(). See lsb_pendreason(3) for details.
The above program will produce output similar to the following:
All pending jobs submitted by all users:
Mon Mar 1 10:34:04 EST 1996:
Job <123> of user <john>, submitted from host <orange>
Mon Mar 1 11:12:11 EST 1996:
Job <126> of user <john>, submitted from host <orange>
Mon Mar 1 14:11:34 EST 1996:
Job <163> of user <ken>, submitted from host <apple>
Mon Mar 1 15:00:56 EST 1996:
Job <199> of user <tim>, submitted from host <pear>
After a job has been submitted, it can be manipulated by users in different ways. It can be suspended, resumed, killed, or sent an arbitrary signal.
Note
All applications that manipulate jobs are subject to authentication provisions described in 'Authentication'.
Users can send signals to submitted jobs. If the job has not been started, you can send the KILL, TERM, INT, and STOP signals. These cause the job to be cancelled (KILL, TERM, or INT) or suspended (STOP). If the job has already started, any signal can be sent to the job.
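The rule for jobs that have not yet started can be written down as a lookup. The sketch below only restates the rule in code; the enum and function names are hypothetical and are not part of LSBLIB, and the signal constants are the POSIX ones from signal.h.

```c
#include <signal.h>

/* Hypothetical names: what a signal does to a job that has not started. */
enum pendingSignalEffect {
    PENDING_CANCEL,         /* the job is cancelled */
    PENDING_SUSPEND,        /* the job is suspended */
    PENDING_NOT_ALLOWED     /* the signal cannot be sent before the job starts */
};

/* Classifies sigValue according to the rule for jobs not yet started. */
enum pendingSignalEffect pending_signal_effect(int sigValue)
{
    switch (sigValue) {
    case SIGKILL:
    case SIGTERM:
    case SIGINT:
        return PENDING_CANCEL;
    case SIGSTOP:
        return PENDING_SUSPEND;
    default:
        return PENDING_NOT_ALLOWED;
    }
}
```

An application could use such a check to warn the user before sending a signal that a pending job cannot receive.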
The LSBLIB call to send a signal to a job is:
int lsb_signaljob(jobId, sigValue);
The jobId and sigValue parameters are self-explanatory.
The following example takes a job ID as the argument and sends a SIGSTOP signal to the job.
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <lsf/lsbatch.h>

int main(int argc, char *argv[])
{
    int jobId;

    if (argc != 2) {
        printf("Usage: %s jobId\n", argv[0]);
        exit(-1);
    }
    jobId = atoi(argv[1]);    /* the job ID is an integer, not a string */

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    if (lsb_signaljob(jobId, SIGSTOP) < 0) {
        lsb_perror("lsb_signaljob");
        exit(-1);
    }

    printf("Job %d is signaled\n", jobId);
    exit(0);
}
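Job IDs are integers (lsb_submit() returns the job ID as an int), so a job ID read from the command line has to be converted before it is passed to an LSBLIB call. A stricter conversion than atoi() can reject malformed input early. The helper below is a sketch under the assumption that valid job IDs are positive; parse_job_id() is a hypothetical name, not an LSBLIB function.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: converts text to a job ID.
 * Returns 0 and stores the ID in *jobId on success, or -1 if text is
 * empty, has trailing characters, is out of range, or is not positive
 * (the "positive" requirement is an assumption of this sketch). */
int parse_job_id(const char *text, int *jobId)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || *end != '\0')
        return -1;              /* no digits, or junk after the number */
    if (errno == ERANGE || value <= 0 || value > INT_MAX)
        return -1;              /* overflow or not a positive int */

    *jobId = (int) value;
    return 0;
}
```

A program would call parse_job_id(argv[1], &jobId) and print its usage message when the call returns -1.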
A job can be switched to a different queue after submission. This can be done even after the job has already started.
The LSBLIB function to switch a job from one queue to another is:
int lsb_switchjob(jobId, queue);
Below is an example program that switches a specified job to the specified new queue.
#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsbatch.h>

int main(int argc, char *argv[])
{
    int jobId;

    if (argc != 3) {
        printf("Usage: %s jobId new_queue\n", argv[0]);
        exit(-1);
    }
    jobId = atoi(argv[1]);    /* the job ID is an integer, not a string */

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    if (lsb_switchjob(jobId, argv[2]) < 0) {
        lsb_perror("lsb_switchjob");
        exit(-1);
    }

    printf("Job %d is switched to new queue <%s>\n", jobId, argv[2]);
    exit(0);
}
LSF Batch saves a lot of valuable information about the system and jobs. Such information is logged by mbatchd in files lsb.events and lsb.acct under the directory LSB_SHAREDIR/cluster/logdir, where LSB_SHAREDIR is defined in the lsf.conf file and cluster is the name of your LSF cluster.
mbatchd logs such information for several purposes. First, some of the events serve as a backup of mbatchd's memory, so that if mbatchd crashes, a newly started mbatchd can pick up all the critical information from the event file and restore the current state of LSF Batch. Second, the events can be used to produce historical information about the LSF Batch system and user jobs. Last, such information can be used to produce accounting or statistical reports.
CAUTION!
The lsb.events file contains critical user job information. It should never be modified by your program. Writing into this file may cause the loss of user jobs.
LSBLIB provides a function to read this information from these files into a well-defined data structure:
struct eventRec *lsb_geteventrec(log_fp, lineNum)
FILE *log_fp;    /* File handle for either an event log file or job log file */
int *lineNum;    /* Line number of the next event record */
The parameter log_fp is as returned by a successful fopen() call. On a successful return, the content of lineNum is modified to indicate the line number of the next event record in the log file. This value can then be used to report the line number when an error occurs while reading the log file. It should be initialized to 0 before lsb_geteventrec() is called for the first time.
This call returns the following data structure:
struct eventRec {
    char version[MAX_VERSION_LEN];   /* Version number of the mbatchd */
    int type;                        /* Type of the event */
    int eventTime;                   /* Event time stamp */
    union eventLog eventLog;         /* Event data */
};
The event type is used to determine the structure of the data in eventLog. LSBLIB remembers the storage allocated for the previously returned data structure and automatically frees it before returning the next event record.
lsb_geteventrec() returns NULL and sets lsberrno to LSBE_EOF when there are no more records in the event file.
Events are logged by mbatchd for many different purposes. There are job-related events and system-related events. Applications can choose to process certain events and ignore other events. For example, the bhist command processes job-related events only. The currently available event types are listed in Table 4.
Event type | Description |
---|---|
EVENT_JOB_NEW | New job event |
EVENT_JOB_START | mbatchd is trying to start a job |
EVENT_JOB_STATUS | Job status change event |
EVENT_JOB_SWITCH | Job switched to a new queue |
EVENT_JOB_MOVE | Job moved within a queue |
EVENT_QUEUE_CTRL | Queue status changed by LSF admin |
EVENT_HOST_CTRL | Host status changed by LSF admin |
EVENT_MBD_START | New mbatchd start event |
EVENT_MBD_DIE | mbatchd resign event |
EVENT_MBD_UNFULFILL | mbatchd has an action to be fulfilled |
EVENT_JOB_FINISH | Job has finished (logged in lsb.acct only) |
EVENT_LOAD_INDEX | Complete list of load index names |
EVENT_MIG | Job has migrated |
EVENT_PRE_EXEC_START | The pre-execution command started |
EVENT_JOB_ROUTE | The job has been routed to NQS |
EVENT_JOB_MODIFY | The job has been modified |
EVENT_JOB_SIGNAL | Job signal to be delivered |
EVENT_CAL_NEW | New calendar event [1] |
EVENT_CAL_MODIFY | Calendar modified [1] |
EVENT_CAL_DELETE | Calendar deleted [1] |
EVENT_JOB_FORWARD | Job forwarded to another cluster |
EVENT_JOB_ACCEPT | Job from a remote cluster dispatched |
EVENT_STATUS_ACK | Job status successfully sent to submission cluster |
EVENT_JOB_EXECUTE | Job started successfully |
EVENT_JOB_REQUEUE | Job is requeued |
EVENT_JOB_SIGACT | A signal action on a job has been initiated or finished |
EVENT_JOB_START_ACCEPT | Job accepted by sbatchd |

[1] Available only if the LSF JobScheduler component is enabled.
Note that the event type EVENT_JOB_FINISH is used by the lsb.acct file only and all other event types are used by the lsb.events file only. For detailed formats of these log files, see lsb.events(5) and lsb.acct(5).
Each event type corresponds to a different data structure in the union:
union eventLog {
    struct jobNewLog jobNewLog;                  /* EVENT_JOB_NEW */
    struct jobStartLog jobStartLog;              /* EVENT_JOB_START */
    struct jobStatusLog jobStatusLog;            /* EVENT_JOB_STATUS */
    struct jobSwitchLog jobSwitchLog;            /* EVENT_JOB_SWITCH */
    struct jobMoveLog jobMoveLog;                /* EVENT_JOB_MOVE */
    struct queueCtrlLog queueCtrlLog;            /* EVENT_QUEUE_CTRL */
    struct hostCtrlLog hostCtrlLog;              /* EVENT_HOST_CTRL */
    struct mbdStartLog mbdStartLog;              /* EVENT_MBD_START */
    struct mbdDieLog mbdDieLog;                  /* EVENT_MBD_DIE */
    struct unfulfillLog unfulfillLog;            /* EVENT_MBD_UNFULFILL */
    struct jobFinishLog jobFinishLog;            /* EVENT_JOB_FINISH */
    struct loadIndexLog loadIndexLog;            /* EVENT_LOAD_INDEX */
    struct migLog migLog;                        /* EVENT_MIG */
    struct calendarLog calendarLog;              /* Shared by all calendar events */
    struct jobForwardLog jobForwardLog;          /* EVENT_JOB_FORWARD */
    struct jobAcceptLog jobAcceptLog;            /* EVENT_JOB_ACCEPT */
    struct statusAckLog statusAckLog;            /* EVENT_STATUS_ACK */
    struct signalLog signalLog;                  /* EVENT_JOB_SIGNAL */
    struct jobExecuteLog jobExecuteLog;          /* EVENT_JOB_EXECUTE */
    struct jobRequeueLog jobRequeueLog;          /* EVENT_JOB_REQUEUE */
    struct sigactLog sigactLog;                  /* EVENT_JOB_SIGACT */
    struct jobStartAcceptLog jobStartAcceptLog;  /* EVENT_JOB_START_ACCEPT */
};
The detailed data structures in the above union are defined in lsbatch.h and described in lsb_geteventrec(3).
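Because the type field selects which member of the union is valid, code that handles an event record should read a member only after checking the type. The miniature below sketches that pattern with stand-in definitions: every name containing "Sketch" or "SKETCH" is hypothetical, and the type codes are arbitrary. A real program would switch on the EVENT_* constants and use the structures from lsbatch.h.

```c
#include <stdio.h>

/* Stand-in type codes and records (hypothetical, not from lsbatch.h). */
enum { SKETCH_JOB_NEW = 1, SKETCH_JOB_STATUS = 2 };

struct newLogSketch    { int jobId; const char *queue; };
struct statusLogSketch { int jobId; int jStatus; };

/* Mirrors the shape of eventRec: a type tag plus a union of records. */
struct eventSketch {
    int type;
    union {
        struct newLogSketch    newLog;
        struct statusLogSketch statusLog;
    } eventLog;
};

/* Writes a one-line description of ev into buf, reading only the union
 * member selected by ev->type. Returns 1 if the event was described,
 * 0 if its type is not handled (buf is then set to the empty string). */
int describe_event(const struct eventSketch *ev, char *buf, size_t len)
{
    switch (ev->type) {
    case SKETCH_JOB_NEW:
        snprintf(buf, len, "job %d submitted to queue %s",
                 ev->eventLog.newLog.jobId, ev->eventLog.newLog.queue);
        return 1;
    case SKETCH_JOB_STATUS:
        snprintf(buf, len, "job %d changed to status %d",
                 ev->eventLog.statusLog.jobId,
                 ev->eventLog.statusLog.jStatus);
        return 1;
    default:
        if (len > 0)
            buf[0] = '\0';      /* event types of no interest are skipped */
        return 0;
    }
}
```

Reading a member that does not correspond to the type tag yields meaningless data, which is why the check comes first.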
Below is an example program that takes a job name as an argument and displays a chronological history of all jobs matching the job name. This program assumes that the lsb.events file is in /local/lsf/work/cluster1/logdir:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <lsf/lsbatch.h>

#define MAX_MATCHES 1024    /* arbitrary limit chosen for this example */

static int matchIds[MAX_MATCHES];    /* IDs of jobs whose name matched */
static int numMatches = 0;

/* Return 1 if jobId belongs to a job whose name matched, 0 otherwise */
static int isMatch(int jobId)
{
    int i;

    for (i = 0; i < numMatches; i++)
        if (matchIds[i] == jobId)
            return 1;
    return 0;
}

int main(int argc, char *argv[])
{
    char *eventFile = "/local/lsf/work/cluster1/logdir/lsb.events";
    FILE *fp;
    struct eventRec *record;
    struct jobNewLog *newJob;
    struct jobStartLog *startJob;
    struct jobStatusLog *statusJob;
    int lineNum = 0;
    char *jobName;
    time_t when;
    int i;

    if (argc != 2) {
        printf("Usage: %s jobname\n", argv[0]);
        exit(-1);
    }
    jobName = argv[1];

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    fp = fopen(eventFile, "r");
    if (fp == NULL) {
        perror(eventFile);
        exit(-1);
    }

    for (;;) {
        record = lsb_geteventrec(fp, &lineNum);
        if (record == NULL) {
            if (lsberrno == LSBE_EOF)
                exit(0);
            lsb_perror("lsb_geteventrec");
            exit(-1);
        }
        when = (time_t) record->eventTime;    /* ctime() needs a time_t */

        switch (record->type) {
        case EVENT_JOB_NEW:
            /* only the new job record carries the job name, so remember
               the IDs of matching jobs for the later events */
            newJob = &(record->eventLog.jobNewLog);
            if (strcmp(newJob->jobName, jobName) != 0)
                continue;
            if (numMatches < MAX_MATCHES)
                matchIds[numMatches++] = newJob->jobId;
            printf("%s: job <%d> submitted by <%s> from <%s> to <%s> queue\n",
                   ctime(&when), newJob->jobId, newJob->userName,
                   newJob->fromHost, newJob->queue);
            continue;

        case EVENT_JOB_START:
            startJob = &(record->eventLog.jobStartLog);
            if (!isMatch(startJob->jobId))
                continue;
            printf("%s: job <%d> started on ", ctime(&when), startJob->jobId);
            for (i = 0; i < startJob->numExHosts; i++)
                printf("<%s> ", startJob->execHosts[i]);
            printf("\n");
            continue;

        case EVENT_JOB_STATUS:
            statusJob = &(record->eventLog.jobStatusLog);
            if (!isMatch(statusJob->jobId))
                continue;
            printf("%s: Job <%d> status changed to: ",
                   ctime(&when), statusJob->jobId);
            switch (statusJob->jStatus) {
            case JOB_STAT_PEND:
                printf("pending\n");
                continue;
            case JOB_STAT_RUN:
                printf("running\n");
                continue;
            case JOB_STAT_SSUSP:
            case JOB_STAT_USUSP:
            case JOB_STAT_PSUSP:
                printf("suspended\n");
                continue;
            case JOB_STAT_UNKWN:
                printf("unknown (sbatchd unreachable)\n");
                continue;
            case JOB_STAT_EXIT:
                printf("exited\n");
                continue;
            case JOB_STAT_DONE:
                printf("done\n");
                continue;
            default:
                printf("\nError: unknown job status %d\n",
                       statusJob->jStatus);
                continue;
            }

        default:
            /* only display a few selected event types */
            continue;
        }
    }
}
Note that in the above program, events that are of no interest are skipped. The job status codes are defined in lsbatch.h. The lsb.acct file stores job accounting information and can be processed similarly. Since there is currently only one event type (EVENT_JOB_FINISH) in the lsb.acct file, the processing is simpler than in the above example.
Copyright © 1994-1997 Platform Computing Corporation.
All rights reserved.