

Chapter 3. Programming with LSBLIB


This chapter shows how to use LSBLIB to access the services provided by LSF Batch and LSF JobScheduler. Since LSF Batch and LSF JobScheduler are built on top of the LSF Base, LSBLIB relies on services provided by LSLIB. Thus if you use LSBLIB functions, you must link your program with both LSLIB and LSBLIB.

LSF Batch and LSF JobScheduler services are mostly provided by mbatchd, except services for processing event and job log files which do not involve any daemons. LSBLIB is shared by both LSF Batch and LSF JobScheduler. The functions described for LSF Batch in this chapter also apply to LSF JobScheduler, unless explicitly indicated otherwise.

Initializing LSF Batch Applications

Before accessing any of the services provided by LSF Batch and LSF JobScheduler, an application must initialize LSBLIB. It does this by calling the following function:

int lsb_init(appname);

On success, it returns 0; otherwise, it returns -1 and sets lsberrno to indicate the error.

The parameter appname is used only if you want to log detailed messages about the transactions inside LSLIB for debugging purposes. The messages are logged only if LSB_CMD_LOG_MASK is defined as LOG_DEBUG1.

The messages will be logged in file LSF_LOGDIR/appname. If appname is NULL, the log file is LSF_LOGDIR/bcmd.

Note
This function must be called before any other function in LSBLIB can be called.

Getting Information about LSF Batch Queues

LSF Batch queues hold the jobs in the LSF Batch system and set scheduling policies and limits on resource usage.

LSBLIB provides a function to get information about the queues in the LSF Batch system. This includes queue name, parameters, statistics, status, resource limits, scheduling policies and parameters, and users and hosts associated with the queue.

The example program under discussion in this section uses the following LSBLIB function to get the queue information:

struct queueInfoEnt *lsb_queueinfo(queues, numQueues, hostname, username, options)

On success, this function returns an array containing a queueInfoEnt structure (see below) for each queue of interest and sets *numQueues to the size of the array. On failure, it returns NULL and sets lsberrno to indicate the error. It has the following parameters:

char  **queues;   /* An array containing names of queues of interest */
int   *numQueues; /* The number of names in queues */
char  *hostname;  /* Only queues using hostname are of interest */
char  *username;  /* Only queues enabled for username are of interest */
int   options;    /* Reserved for future use; supply 0 */

To get information on all queues, set *numQueues to 0; *numQueues will be updated to the actual number of queues returned on a successful return.

If *numQueues is 1 and queues is NULL, information on the system default queue is returned.

If hostname is not NULL, then all queues using host hostname as a batch server host will be returned. If username is not NULL, then all queues to which user username is allowed to submit jobs will be returned.

The queueInfoEnt structure is defined in lsbatch.h as:

struct queueInfoEnt {
    char  *queue;                    /* Name of the queue */
    char  *description;              /* Description of the queue */
    int   priority;                  /* Priority of the queue */
    short nice;                      /* Nice value at which jobs in the queue will be run */
    char  *userList;                 /* Users allowed to submit jobs to the queue */
    char  *hostList;                 /* Hosts to which jobs in the queue may be dispatched */
    int   nIdx;                      /* Size of the loadSched and loadStop arrays */
    float *loadSched;                /* Load thresholds that control scheduling of jobs from the queue */
    float *loadStop;                 /* Load thresholds that control suspension of jobs from the queue */
    int   userJobLimit;              /* Number of unfinished jobs a user can dispatch from the queue */
    int   procJobLimit;              /* Number of unfinished jobs the queue can dispatch to a processor */
    char  *windows;                  /* Queue run window */
    int   rLimits[LSF_RLIM_NLIMITS]; /* The per-process resource limits for jobs */
    char  *hostSpec;                 /* Obsolete. Use defaultHostSpec instead */
    int   qAttrib;                   /* Attributes of the queue */
    int   qStatus;                   /* Status of the queue */
    int   maxJobs;                   /* Job slot limit of the queue. */
    int   numJobs;                   /* Total number of job slots required by all jobs */
    int   numPEND;                   /* Number of job slots needed by pending jobs */
    int   numRUN;                    /* Number of job slots used by running jobs */
    int   numSSUSP;                  /* Number of job slots used by system suspended jobs */
    int   numUSUSP;                  /* Number of job slots used by user suspended jobs */
    int   mig;                       /* Queue migration threshold in minutes */
    int   schedDelay;                /* Schedule delay for new jobs */
    int   acceptIntvl;               /* Minimum interval between two jobs dispatched to the same host */
    char  *windowsD;                 /* Queue dispatch window */
    char  *nqsQueues;                /* A blank-separated list of NQS queue specifiers */
    char  *userShares;               /* A blank-separated list of user shares */
    char  *defaultHostSpec;          /* Value of DEFAULT_HOST_SPEC for the queue in lsb.queues */
    int   procLimit;                 /* Maximum number of job slots a job can take */
    char  *admins;                   /* Queue level administrators */
    char  *preCmd;                   /* Queue level pre-exec command */
    char  *postCmd;                  /* Queue's post-exec command */
    char  *requeueEValues;           /* Queue's requeue exit status */ 
    int   hostJobLimit;              /* Per host job slot limit */
    char  *resReq;                   /* Queue level resource requirement */
    int   numRESERVE;                /* Reserved job slots for pending jobs */
    int   slotHoldTime;              /* Time period for reserving job slots */
    char  *sndJobsTo;                /* Remote queues to forward jobs to */
    char  *rcvJobsFrom;              /* Remote queues which can forward to me */
    char  *resumeCond;               /* Conditions to resume jobs */
    char  *stopCond;                 /* Conditions to suspend jobs */
    char  *jobStarter;               /* Queue level job starter */
    char  *suspendActCmd;            /* Action commands for SUSPEND */
    char  *resumeActCmd;             /* Action commands for RESUME */
    char  *terminateActCmd;          /* Action commands for TERMINATE */
    int   sigMap[LSB_SIG_NUM];       /* Configurable signal mapping */
    char  *preemption;               /* Preemption policy */
    int   maxRschedTime;             /* Time period for remote cluster to schedule job */
};

The variable nIdx is the number of load threshold values for job scheduling. This is in fact the total number of load indices as returned by LIM. The parameters sndJobsTo, rcvJobsFrom, and maxRschedTime are only used with LSF MultiCluster.

For a complete description of the fields in the queueInfoEnt structure, see the lsb_queueinfo(3) man page.

The program below takes the first argument as a queue name and displays information about the named queue.

#include <stdio.h>
#include <stdlib.h>                  /* for exit() */
#include <lsf/lsbatch.h>

int
main(int argc, char *argv[])
{
    struct queueInfoEnt *qInfo;
    int  numQueues = 1;
    char *queue = argv[1];

    if (argc != 2) {
        printf("Usage: %s queue_name\n", argv[0]);
        exit(-1);
    }

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init()");
        exit(-1);
    }

    qInfo = lsb_queueinfo(&queue, &numQueues, NULL, NULL, 0); 
    if (qInfo == NULL) { 
        lsb_perror("lsb_queueinfo()");
        exit(-1);
    }

    printf("Information about %s queue:\n", queue);
    printf("Description: %s\n", qInfo[0].description);
    printf("Priority: %d                     Nice:     %d     \n",
            qInfo[0].priority, qInfo[0].nice);
    printf("Maximum number of job slots:");
    if (qInfo[0].maxJobs < INFINIT_INT)
        printf("%5d\n", qInfo[0].maxJobs);
    else
        printf("%5s\n", "unlimited");

    printf("Job slot statistics: PEND(%d) RUN(%d) SUSP(%d) TOTAL(%d).\n",
            qInfo[0].numPEND, qInfo[0].numRUN,
            qInfo[0].numSSUSP + qInfo[0].numUSUSP, qInfo[0].numJobs);

    exit(0);
}

The header file lsbatch.h must be included with every application that uses LSBLIB functions. Note that lsf.h does not have to be explicitly included in your program because lsbatch.h already has lsf.h included. The function lsb_perror() is used in much the same way ls_perror() is used to print error messages regarding function call failure. You could check lsberrno if you want to take different actions for different errors.

In the above program, INFINIT_INT is defined in lsf.h and is used to indicate that there is no limit set for maxJobs. This applies to all LSF API function calls. LSF will supply INFINIT_INT automatically whenever the value for the variable is either invalid (not available) or infinity. This value should be checked for all variables that are optional. For example, if you were to display the loadSched/loadStop values, an INFINIT_INT indicates that the threshold is not configured and is ignored.
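The INFINIT_INT check described above can be wrapped in a small helper so that every optional field is printed the same way. The following is a minimal sketch, not part of LSBLIB: format_limit() is a hypothetical name, and the INFINIT_INT fallback definition is an assumed stand-in that exists only so the fragment compiles without lsf.h.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Stand-in so the sketch compiles without lsf.h. Real programs get
 * INFINIT_INT from lsf.h; the value used here is an assumption. */
#ifndef INFINIT_INT
#define INFINIT_INT INT_MAX
#endif

/* Hypothetical helper: write a printable form of an optional integer
 * field into buf and return buf, so it can be used inside printf(). */
char *format_limit(int value, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    if (value < INFINIT_INT)
        snprintf(buf, len, "%d", value);         /* a real limit is set */
    else
        snprintf(buf, len, "%s", "unlimited");   /* no limit configured */
    return buf;
}
```

With such a helper, the maxJobs line of the program could be printed as printf("%s\n", format_limit(qInfo[0].maxJobs, buf, sizeof(buf))), where buf is a local character array.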

Note
Like the data structures returned by LSLIB functions, a data structure returned by an LSBLIB function is dynamically allocated inside LSBLIB and is automatically freed the next time the same function is called. You should not attempt to free the space allocated by LSBLIB. If you need to keep this information across calls, make your own copy of the data structure.
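One way to keep selected fields across calls is to copy them into storage that your program owns. The sketch below is illustrative only: struct queueSnapshot, snapshot_queue(), and free_snapshot() are hypothetical names, not part of LSBLIB, and the record holds just three of the queueInfoEnt fields. The important point is that string fields are duplicated rather than copied as pointers, because the pointers returned by LSBLIB become invalid on the next call.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical private record holding only the fields we want to keep. */
struct queueSnapshot {
    char *queue;      /* our own copy of the queue name */
    int  maxJobs;
    int  numJobs;
};

/* Duplicate a string with malloc(); returns NULL if allocation fails. */
static char *copy_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);

    if (p != NULL)
        memcpy(p, s, n);
    return p;
}

/* Build a snapshot that stays valid after LSBLIB reuses its buffers.
 * Returns NULL if memory cannot be allocated. */
struct queueSnapshot *snapshot_queue(const char *queue, int maxJobs, int numJobs)
{
    struct queueSnapshot *snap;

    assert(queue != NULL);
    snap = malloc(sizeof(*snap));
    if (snap == NULL)
        return NULL;
    snap->queue = copy_string(queue);
    if (snap->queue == NULL) {
        free(snap);
        return NULL;
    }
    snap->maxJobs = maxJobs;
    snap->numJobs = numJobs;
    return snap;
}

/* Release a snapshot created by snapshot_queue(). */
void free_snapshot(struct queueSnapshot *snap)
{
    if (snap != NULL) {
        free(snap->queue);
        free(snap);
    }
}
```

In the queue program this could be called as snapshot_queue(qInfo[0].queue, qInfo[0].maxJobs, qInfo[0].numJobs) right after lsb_queueinfo() returns. The snapshot belongs to the application, so freeing it does not conflict with the rule against freeing LSBLIB's own storage.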

The above program will produce output similar to the following:

Information about normal queue:
Description: For normal low priority jobs
Priority: 25            Nice: 20
Maximum number of job slots : 40
Job slot statistics: PEND( 5) RUN(12) SUSP(1) TOTAL(18)

Getting Information about LSF Batch Hosts

LSF Batch server hosts execute the jobs in the LSF Batch system.

LSBLIB provides a function to get information about the server hosts in the LSF Batch system. This includes both configured static information as well as dynamic information. Examples of host information include host name, status, job limits and statistics, dispatch windows and scheduling parameters.

The example program to be discussed in this section uses the following LSBLIB function:

struct hostInfoEnt *lsb_hostinfo(hosts, numHosts)

This function gets information about LSF Batch server hosts. On success, it returns an array of hostInfoEnt structures which hold the host information and sets *numHosts to the size of the array. On failure, it returns NULL and sets lsberrno to indicate the error. It has the following parameters:

char  **hosts;      /* An array of names of hosts of interest */
int   *numHosts;    /* The number of names in hosts */

To get information on all hosts, set *numHosts to 0; *numHosts will be set to the actual number of hostInfoEnt structures when this call returns successfully.

If *numHosts is 1 and hosts is NULL, information on the local host is returned.

The hostInfoEnt structure is defined in lsbatch.h as:

struct hostInfoEnt {
    char  *host;        /* Name of the host */
    int   hStatus;      /* Status of host. (see below) */
    int   busySched;    /* Reason host will not schedule jobs */
    int   busyStop;     /* Reason host has suspended jobs */
    float cpuFactor;    /* Host CPU factor, as returned by LIM */
    int   nIdx;         /* Size of the loadSched and loadStop arrays, as returned from LIM */
    float *load;        /* Load LSF Batch used for scheduling batch jobs */
    float *loadSched;   /* Load thresholds that control scheduling of jobs on host */
    float *loadStop;    /* Load thresholds that control suspension of jobs on host */
    char  *windows;     /* Host dispatch window */
    int   userJobLimit; /* Maximum number of jobs a user can run on host */
    int   maxJobs;      /* Maximum number of jobs that host can process concurrently */
    int   numJobs;      /* Number of jobs running or suspended on host */
    int   numRUN;       /* Number of jobs running on host */
    int   numSSUSP;     /* Number of jobs suspended by sbatchd on host */
    int   numUSUSP;     /* Number of jobs suspended by a user on host */
    int   mig;          /* Migration threshold for jobs on host */
    int   attr;         /* Host attributes */
#define H_ATTR_CHKPNTABLE 0x1
#define H_ATTR_CHKPNT_COPY 0x2
    float *realLoad;    /* The load mbatchd obtained from LIM */
    int   numRESERVE;   /* Num of slots reserved for pending jobs */
    int   chkSig;       /* This variable is obsolete */
};

Note the differences between host information returned by LSLIB function ls_gethostinfo() and host information returned by the LSBLIB function lsb_hostinfo(). The former returns general information about the hosts whereas the latter returns LSF Batch specific information about hosts.

For a complete description of the fields in the hostInfoEnt structure, see the lsb_hostinfo(3) man page.

The example program below takes a host name as an argument and displays various information about the named host. It is a simplified version of the LSF Batch bhosts command:

#include <stdio.h>
#include <stdlib.h>                  /* for exit() */
#include <lsf/lsbatch.h>

int
main(int argc, char *argv[])
{
    struct hostInfoEnt *hInfo;
    int  numHosts = 1;
    char *hostname = argv[1];

    if (argc != 2) { 
        printf("Usage: %s hostname\n", argv[0]);
        exit(-1);
    }
    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    hInfo = lsb_hostinfo(&hostname, &numHosts);

    if (hInfo == NULL) {
        lsb_perror("lsb_hostinfo");
        exit (-1);
    }

    printf("HOST_NAME    STATUS    JL/U  NJOBS  RUN  SSUSP USUSP\n");

    printf ("%-18.18s", hInfo->host);

    if (hInfo->hStatus & HOST_STAT_UNLICENSED) {
        printf(" %-9s\n", "unlicensed");    
        exit(0);                  /* don't print other info */
    } else if (hInfo->hStatus & HOST_STAT_UNAVAIL)
        printf(" %-9s",  "unavail");
    else if (hInfo->hStatus & HOST_STAT_UNREACH)
        printf(" %-9s", "unreach");
    else if (hInfo->hStatus & ( HOST_STAT_BUSY | HOST_STAT_WIND
            | HOST_STAT_DISABLED | HOST_STAT_LOCKED
            | HOST_STAT_FULL | HOST_STAT_NO_LIM))
        printf(" %-9s", "closed");
    else
        printf(" %-9s", "ok");

    if (hInfo->userJobLimit < INFINIT_INT)
        printf("%4d", hInfo->userJobLimit);
    else
        printf("%4s", "-");

    printf("%7d  %4d  %4d  %4d\n",
        hInfo->numJobs, hInfo->numRUN, hInfo->numSSUSP, hInfo->numUSUSP);

    exit(0);

}

hStatus is the status of the host. It is the bitwise inclusive OR of some of the following constants defined in lsbatch.h:

HOST_STAT_BUSY
The host load is greater than a scheduling threshold. In this status, no new batch job will be scheduled to run on this host.
HOST_STAT_WIND
The host dispatch window is closed. In this status, no new batch job will be accepted.
HOST_STAT_DISABLED
The host has been disabled by the LSF administrator and will not accept jobs. In this status, no new batch job will be scheduled to run on this host.
HOST_STAT_LOCKED
The host is locked by an exclusive job. In this status, no new batch job will be scheduled to run on this host.
HOST_STAT_FULL
The host has reached its job limit. In this status, no new batch job will be scheduled to run on this host.
HOST_STAT_UNREACH
The sbatchd on this host is unreachable.
HOST_STAT_UNAVAIL
The LIM and sbatchd on this host are unreachable.
HOST_STAT_UNLICENSED
The host does not have an LSF license.
HOST_STAT_NO_LIM
The host is running an sbatchd but not a LIM.

If none of the above holds, hStatus is set to HOST_STAT_OK to indicate that the host is ready to accept and run jobs.
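Because hStatus is a bit mask, several of these conditions can hold at once, and a program may want to report all of them instead of collapsing them into "closed". The following sketch tests each bit in turn. It is a hedged illustration: describe_host_status() is a hypothetical helper, and the numeric values given to the HOST_STAT_* constants are placeholders that exist only so the fragment compiles without lsbatch.h. Real programs must use the definitions from lsbatch.h.

```c
#include <assert.h>
#include <stdio.h>

/* Placeholder bit values (assumptions) so the sketch compiles on its own.
 * When lsbatch.h is included, its definitions are used instead. */
#ifndef HOST_STAT_BUSY
#define HOST_STAT_BUSY       0x001
#define HOST_STAT_WIND       0x002
#define HOST_STAT_DISABLED   0x004
#define HOST_STAT_LOCKED     0x008
#define HOST_STAT_FULL       0x010
#define HOST_STAT_UNREACH    0x020
#define HOST_STAT_UNAVAIL    0x040
#define HOST_STAT_UNLICENSED 0x080
#define HOST_STAT_NO_LIM     0x100
#endif

/* Hypothetical helper: write a comma-separated list of the flags set in
 * hStatus into buf, or "ok" when no flag is set. Output is truncated to
 * fit in len bytes. Returns buf. */
char *describe_host_status(int hStatus, char *buf, size_t len)
{
    static const struct { int bit; const char *name; } flags[] = {
        { HOST_STAT_BUSY,       "busy"       },
        { HOST_STAT_WIND,       "wind"       },
        { HOST_STAT_DISABLED,   "disabled"   },
        { HOST_STAT_LOCKED,     "locked"     },
        { HOST_STAT_FULL,       "full"       },
        { HOST_STAT_UNREACH,    "unreach"    },
        { HOST_STAT_UNAVAIL,    "unavail"    },
        { HOST_STAT_UNLICENSED, "unlicensed" },
        { HOST_STAT_NO_LIM,     "no_lim"     }
    };
    size_t i, used = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (i = 0; i < sizeof(flags) / sizeof(flags[0]); i++) {
        if ((hStatus & flags[i].bit) && used < len - 1) {
            int n = snprintf(buf + used, len - used, "%s%s",
                             used ? "," : "", flags[i].name);
            if (n < 0)
                break;
            /* snprintf() reports the untruncated length; clamp it */
            if ((size_t)n < len - used)
                used += (size_t)n;
            else
                used = len - 1;
        }
    }
    if (used == 0)
        snprintf(buf, len, "%s", "ok");
    return buf;
}
```

A call such as describe_host_status(hInfo->hStatus, buf, sizeof(buf)) could replace the chain of if statements in the program above when a more detailed status column is wanted.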

The constant INFINIT_INT defined in lsf.h is used to indicate that there is no limit set for userJobLimit.

The example output from the above program follows:

% a.out hostB
HOST_NAME    STATUS    JL/U  NJOBS  RUN  SSUSP USUSP
hostB         ok        -     2      1    1     0

Job Submission and Modification

Job submission and modification are the most common operations in the LSF Batch system. A user can submit jobs to the system and then modify them as long as they have not been started.

LSBLIB provides one function for job submission and one function for job modification:

int lsb_submit(jobSubReq, jobSubReply)
int lsb_modify(jobSubReq, jobSubReply, jobId)

On success, these calls return the job ID. Otherwise, -1 is returned with lsberrno set to indicate the error. These two functions are similar except that lsb_modify() modifies the parameters of an already submitted job.

Both of these functions use the same data structures as parameters:

struct submit      *jobSubReq;   /* Job specifications */
struct submitReply *jobSubReply; /* Results of job submission */
int    jobId;                    /* Id of the job to modify (lsb_modify() only) */

The submit structure is defined in lsbatch.h as:

struct submit {
    int    options;          /* Indicates which optional fields are present */
    char   *jobName;         /* Job name (optional) */
    char   *queue;           /* Submit the job to this queue (optional) */
    int    numAskedHosts;    /* Size of askedHosts (optional) */
    char   **askedHosts;     /* An array of names of candidate hosts (optional) */
    char   *resReq;          /* Resource requirements of the job (optional) */
    int    rLimits[LSF_RLIM_NLIMITS];
                             /* Limits on system resource use by all of the job's processes */
    char   *hostSpec;        /* Host model used for scaling rlimits (optional) */
    int    numProcessors;    /* Initial number of processors needed by the job */
    char   *dependCond;      /* Job dependency condition (optional) */
    time_t beginTime;        /* Dispatch the job on or after beginTime */
    time_t termTime;         /* Job termination deadline */
    int    sigValue;         /* This variable is obsolete */
    char   *inFile;          /* Path name of the job's standard input file (optional) */
    char   *outFile;         /* Path name of the job's standard output file (optional) */
    char   *errFile;         /* Path name of the job's standard error output file (optional) */
    char   *command;         /* Command line of the job */
    time_t chkpntPeriod;     /* Job is checkpointable with this period (optional) */
    char   *chkpntDir;       /* Directory for this job's chk directory (optional) */
    int    nxf;              /* Size of xf (optional) */
    struct xFile *xf;        /* An array of file transfer specifications (optional) */
    char   *preExecCmd;      /* Job's pre-execution command (optional) */
    char   *mailUser;        /* User E-mail address to which the job's output is mailed (optional) */
    int    delOptions;       /* Bits to be removed from options (lsb_modify() only) */
    char   *projectName;     /* Name of the job's project (optional) */
    int    maxNumProcessors; /* Requested maximum num of job slots for the job */
    char   *loginShell;      /* Login shell to be used to re-initialize environment */
};

For a complete description of the fields in the submit structure, see the lsb_submit(3) man page.

The submitReply structure is defined in lsbatch.h as:

struct submitReply {
    char   *queue;      /* The queue name the job was submitted to */
    int    badJobId;    /* dependCond contains badJobId but there is no such job */
    char   *badJobName; /* dependCond contains badJobName but there is no such job */
    int    badReqIndx;  /* Index of a host or resource limit that caused an error */
};

The last three variables in the submitReply structure are only used when the lsb_submit() or lsb_modify() function call fails.

For a complete description of the fields in the submitReply structure, see the lsb_submit(3) man page.

To submit a new job, all you have to do is fill out the submit structure and then call lsb_submit(). The LSF Batch system ignores the delOptions variable in an lsb_submit() call.

The example job submission program below takes the job command line as an argument and submits the job to the LSF Batch system. For simplicity, it assumes that the command does not have arguments.

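#include <stdio.h>
#include <stdlib.h>                  /* for exit() */
#include <string.h>                  /* for memset() */
#include <lsf/lsbatch.h>

int
main(int argc, char *argv[])
{
    struct submit  req;
    struct submitReply  reply;
    int  jobId;
    int  i;

    /* Zero every field so that unused optional fields hold no garbage */
    memset(&req, 0, sizeof(req));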

    if (argc != 2) {
        fprintf(stderr, "Usage: %s command\n", argv[0]);
        exit(-1);
    }

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    req.options = 0;
    req.resReq = NULL;

    for (i = 0; i < LSF_RLIM_NLIMITS; i++)
        req.rLimits[i] = DEFAULT_NUMBER;

    req.hostSpec = NULL;
    req.numProcessors = 1;
    req.beginTime = 0;
    req.termTime  = 0;
    req.command = argv[1];
    req.nxf = 0;
    req.delOptions = 0;

    jobId = lsb_submit(&req, &reply);

    if (jobId < 0) {
        switch (lsberrno) {
        case LSBE_QUEUE_USE:
        case LSBE_QUEUE_CLOSED:
            lsb_perror(reply.queue);
            exit(-1);
        default:
            lsb_perror(NULL);
            exit(-1);
        }
    }

    printf("Job <%d> is submitted to default queue <%s>.\n", jobId,
            reply.queue);

    exit(0);
}

The options field of the submit structure is the bitwise inclusive OR of some of the SUB_* flags defined in lsbatch.h. These flags serve two purposes. Some flags indicate which of the optional fields of the submit structure are present. Those that are not present have default values. Other flags indicate submission options. For a description of these flags, see lsb_submit(3).

Since options indicates which of the optional fields are meaningful, you do not need to initialize the optional fields that are not selected by options. All parameters that are not optional must be initialized properly.
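The pairing of an optional field with its options bit can be sketched as follows. This fragment is illustrative only and makes several assumptions: struct miniSubmit is a cut-down stand-in for struct submit, request_queue() and request_out_file() are hypothetical helpers, and SUB_QUEUE and SUB_OUT_FILE are assumed to be among the SUB_* flags, with placeholder values supplied only so the fragment compiles without lsbatch.h. Check lsb_submit(3) for the actual flag names and use the definitions from lsbatch.h in real code.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder flag values (assumptions); lsbatch.h provides the real ones. */
#ifndef SUB_QUEUE
#define SUB_QUEUE    0x02
#define SUB_OUT_FILE 0x10
#endif

/* Cut-down stand-in for struct submit with only the fields used here. */
struct miniSubmit {
    int  options;     /* which optional fields are present */
    char *queue;      /* meaningful only if SUB_QUEUE is set */
    char *outFile;    /* meaningful only if SUB_OUT_FILE is set */
};

/* Set the queue field and mark it as present in options. */
void request_queue(struct miniSubmit *req, char *queue)
{
    assert(req != NULL && queue != NULL);
    req->queue = queue;
    req->options |= SUB_QUEUE;
}

/* Set the output file field and mark it as present in options. */
void request_out_file(struct miniSubmit *req, char *outFile)
{
    assert(req != NULL && outFile != NULL);
    req->outFile = outFile;
    req->options |= SUB_OUT_FILE;
}
```

Setting the field and its flag in one place makes it harder to fill in a field and forget the bit that tells the LSF Batch system to read it.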

If the resReq field of the submit structure is NULL, LSBLIB will try to obtain resource requirements for command from the remote task list (see 'Getting Task Resource Requirements'). If the task does not appear in the remote task list, then NULL is passed to the LSF Batch system. mbatchd will then use the default resource requirements with the DFT_FROMTYPE option bit set when making an LSLIB call for host selection from LIM. See 'Handling Default Resource Requirements' for more information about default resource requirements.

The constant DEFAULT_NUMBER defined in lsbatch.h indicates that there is no limit on a resource.

The constants used to index the rLimits array of the submit structure are defined in lsf.h, and the resource limits currently supported by LSF Batch are listed in Table 3.

Table 3. Resource Limits Supported by LSF Batch

Resource limit                             Index in rLimits array
CPU time limit                             LSF_RLIMIT_CPU
File size limit                            LSF_RLIMIT_FSIZE
Data size limit                            LSF_RLIMIT_DATA
Stack size limit                           LSF_RLIMIT_STACK
Core file size limit                       LSF_RLIMIT_CORE
Resident memory size limit                 LSF_RLIMIT_RSS
Number of open files limit                 LSF_RLIMIT_OPEN_MAX
Virtual memory limit                       LSF_RLIMIT_SWAP
Wall-clock time run limit                  LSF_RLIMIT_RUN
Maximum num of processes a job can fork    LSF_RLIMIT_PROCESS
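Filling in the resource limit array usually follows the pattern used in the submission program: mark every entry as unset with DEFAULT_NUMBER, then overwrite only the limits of interest. The sketch below shows this for the CPU time and run time limits. set_time_limits() is a hypothetical helper, and the fallback definitions of LSF_RLIM_NLIMITS, LSF_RLIMIT_CPU, LSF_RLIMIT_RUN, and DEFAULT_NUMBER are placeholder values (assumptions) that exist only so the fragment compiles without the LSF headers. The units of each limit are not covered here; see lsb_submit(3).

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder values (assumptions); lsf.h and lsbatch.h provide the real ones. */
#ifndef LSF_RLIM_NLIMITS
#define LSF_RLIM_NLIMITS 11
#define LSF_RLIMIT_CPU   0
#define LSF_RLIMIT_RUN   9
#endif
#ifndef DEFAULT_NUMBER
#define DEFAULT_NUMBER   (-1)
#endif

/* Hypothetical helper: mark every limit as "not set", then apply the two
 * time limits. Pass DEFAULT_NUMBER for a limit that should stay unset. */
void set_time_limits(int limits[LSF_RLIM_NLIMITS], int cpuLimit, int runLimit)
{
    int i;

    assert(limits != NULL);
    for (i = 0; i < LSF_RLIM_NLIMITS; i++)
        limits[i] = DEFAULT_NUMBER;

    limits[LSF_RLIMIT_CPU] = cpuLimit;
    limits[LSF_RLIMIT_RUN] = runLimit;
}
```

In the submission program, the loop that assigns DEFAULT_NUMBER could be replaced by a call that passes the limit array of the submit structure to this helper.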

The hostSpec field of the submit structure specifies the host model to use for scaling rLimits[LSF_RLIMIT_CPU] and rLimits[LSF_RLIMIT_RUN] (see lsb_queueinfo(3)). If hostSpec is NULL, the local host's model is assumed.

If the beginTime field of the submit structure is 0, the job is started as soon as possible.

If the termTime field of the submit structure is 0, the job is allowed to run until it reaches a resource limit.

The above example checks the value of lsberrno when lsb_submit() fails. Different actions can be taken depending on the type of the error. All possible error numbers are defined in lsbatch.h. For example, error number LSBE_QUEUE_USE indicates that the user is not authorized to use the queue. The error number LSBE_QUEUE_CLOSED indicates that the queue is closed.

Since a queue name was not specified for the job, the job will be submitted to the default queue. The queue field of the submitReply structure contains the name of the queue to which the job was submitted.

The above program will produce output similar to the following:

Job <5602> is submitted to default queue <default>.

The output from the job will be mailed to the user because the program did not specify a file name for the outFile parameter in the submit structure.

To modify an already submitted job, fill out a new submit structure to override the existing parameters, and set delOptions to clear option bits that were previously specified for the job. Modifying a submitted job is essentially like re-submitting it, so the program above can be reused with minor changes. The one additional parameter required for job modification is the job ID.

Note
All applications that call lsb_submit() and lsb_modify() are subject to authentication constraints described in 'Authentication'.

Getting Information about Batch Jobs

LSBLIB provides functions to get status information about batch jobs. Since the number of jobs in the LSF Batch system could be on the order of many thousands, getting all this information in one message could potentially use a lot of memory space. LSBLIB allows the application to open a stream connection and then read the job records one by one. This way the memory space needed is always the size of one job record.

The function calls used to get job information are:

int lsb_openjobinfo(jobId, jobName, user, queue, host, options);
struct jobInfoEnt *lsb_readjobinfo(more);
void lsb_closejobinfo(void);

These functions are used to open a job information connection with mbatchd, read job records, and then close the job information connection.

The lsb_openjobinfo() function takes the following arguments:

int   jobId;    /* Select job with the given job Id */
char  *jobName; /* Select job(s) with the given job name */
char  *user;    /* Select job(s) submitted by the named user or user group */
char  *queue;   /* Select job(s) submitted to the named queue */
char  *host;    /* Select job(s) that are dispatched to the named host */
int   options;  /* Selection flags constructed from the bits defined in lsbatch.h */

The options parameter contains additional job selection flags defined in lsbatch.h. These are:

ALL_JOB
Select jobs matching any status, including unfinished jobs and recently finished jobs. LSF Batch remembers finished jobs within the CLEAN_PERIOD, as defined in the lsb.params file.
CUR_JOB
Return jobs that have not finished yet.
DONE_JOB
Return jobs that have finished recently.
PEND_JOB
Return jobs that are in the pending status.
SUSP_JOB
Return jobs that are in the suspended status.
LAST_JOB
Return jobs that are submitted most recently.

If options is 0, then the default is CUR_JOB.

lsb_openjobinfo() returns the total number of matching job records in the connection. It returns -1 on failure and sets lsberrno to indicate the error.

lsb_readjobinfo() takes one argument:

int   *more;  /* If not NULL, contains the remaining number of jobs unread */

Either this parameter or the return value from the lsb_openjobinfo() can be used to keep track of the number of job records that can be returned from the connection. This parameter is updated each time lsb_readjobinfo() is called.

lsb_closejobinfo() should be called after receiving all job records in the connection.

Below is an example of a simplified bjobs command. This program displays all pending jobs belonging to all users.

#include <stdio.h>
#include <stdlib.h>                  /* for exit() */
#include <time.h>                    /* for ctime() */
#include <lsf/lsbatch.h>

int
main(int argc, char *argv[])
{
    int  options = PEND_JOB;
    char *user = "all";             /* match jobs for all users */
    struct jobInfoEnt *job;
    int more;

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    if (lsb_openjobinfo(0, NULL, user, NULL, NULL, options) < 0) {
        lsb_perror("lsb_openjobinfo");
        exit(-1);
    }

    printf("All pending jobs submitted by all users:\n");
    for (;;) {
        job = lsb_readjobinfo(&more);
        if (job == NULL) {
            lsb_perror("lsb_readjobinfo");
            exit(-1);
        }
        /* display the job; %.24s drops the newline that ctime() appends */
        printf("%.24s:\nJob <%d> of user <%s>, submitted from host <%s>\n",
                ctime(&job->submitTime), job->jobId, job->user, job->fromHost);

        if (! more) 
            break;
    }

    lsb_closejobinfo();
    exit(0);
}

If you want to print out the reasons why the job is still pending, you can use the function lsb_pendreason(). See lsb_pendreason(3) for details.

The above program will produce output similar to the following:

All pending jobs submitted by all users:
Mon Mar 1 10:34:04 EST 1996:
Job <123> of user <john>, submitted from host <orange>
Mon Mar 1 11:12:11 EST 1996:
Job <126> of user <john>, submitted from host <orange>
Mon Mar 1 14:11:34 EST 1996:
Job <163> of user <ken>, submitted from host <apple>
Mon Mar 1 15:00:56 EST 1996:
Job <199> of user <tim>, submitted from host <pear>

Job Manipulation

After a job has been submitted, it can be manipulated by users in different ways. It can be suspended, resumed, killed, or sent an arbitrary signal.

Note
All applications that manipulate jobs are subject to authentication provisions described in 'Authentication'.

Sending a Signal to a Job

Users can send signals to submitted jobs. If the job has not been started, you can send KILL, TERM, INT, and STOP signals. These will cause the job to be cancelled (KILL, TERM, or INT) or suspended (STOP). If the job is already started, then any signal can be sent to the job.

The LSBLIB call to send a signal to a job is:

int lsb_signaljob(jobId, sigValue);

The jobId and sigValue parameters are self-explanatory.

The following example takes a job ID as the argument and sends a SIGSTOP signal to the job.

#include <stdio.h>
#include <stdlib.h>                  /* for exit() and atoi() */
#include <signal.h>                  /* for SIGSTOP */
#include <lsf/lsbatch.h>

int
main(int argc, char *argv[])
{
    int  jobId;

    if (argc != 2) {
        printf("Usage: %s jobId\n", argv[0]);
        exit(-1);
    }

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    jobId = atoi(argv[1]);          /* lsb_signaljob() takes an integer job ID */

    if (lsb_signaljob(jobId, SIGSTOP) < 0) {
        lsb_perror("lsb_signaljob");
        exit(-1);
    }

    printf("Job %d is signaled\n", jobId);
    exit(0);
}
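Job IDs given on the command line arrive as strings and must be converted to integers before they are passed to functions such as lsb_signaljob(). The sketch below shows one way to validate the conversion with strtol(), so that input such as "12x" or an empty string is rejected instead of being silently turned into a number. parse_job_id() is a hypothetical helper, not part of LSBLIB, and it assumes that valid job IDs are positive values that fit in an int.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: convert a command-line argument to a job ID.
 * Returns 0 and stores the ID in *jobId on success, -1 on bad input. */
int parse_job_id(const char *arg, int *jobId)
{
    char *end;
    long value;

    assert(arg != NULL && jobId != NULL);

    errno = 0;
    value = strtol(arg, &end, 10);
    if (errno != 0 || end == arg || *end != '\0')
        return -1;              /* overflow, no digits, or trailing junk */
    if (value <= 0 || value > INT_MAX)
        return -1;              /* assumed: job IDs are positive ints */

    *jobId = (int)value;
    return 0;
}
```

A program would call parse_job_id(argv[1], &jobId) after checking argc and print its usage message if the call returns -1.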

Switching a Job to a Different Queue

A job can be switched to a different queue after submission. This can be done even after the job has already started.

The LSBLIB function to switch a job from one queue to another is:

int lsb_switchjob(jobId, queue);

Below is an example program that switches a specified job to the specified new queue.

#include <stdio.h>
#include <stdlib.h>                  /* for exit() and atoi() */
#include <lsf/lsbatch.h>

int
main(int argc, char *argv[])
{
    int  jobId;

    if (argc != 3) {
        printf("Usage: %s jobId new_queue\n", argv[0]);
        exit(-1);
    }

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    jobId = atoi(argv[1]);          /* lsb_switchjob() takes an integer job ID */

    if (lsb_switchjob(jobId, argv[2]) < 0) {
        lsb_perror("lsb_switchjob");
        exit(-1);
    }

    printf("Job %d is switched to new queue <%s>\n", jobId, argv[2]);

    exit(0);
}

Processing LSF Batch Log Files

LSF Batch saves a lot of valuable information about the system and jobs. Such information is logged by mbatchd in files lsb.events and lsb.acct under the directory LSB_SHAREDIR/cluster/logdir, where LSB_SHAREDIR is defined in the lsf.conf file and cluster is the name of your LSF cluster.

mbatchd logs such information for several purposes. First, some of the events serve as a backup of mbatchd's memory, so that if mbatchd crashes, a newly started mbatchd can read all the critical information from the event file and restore the current state of LSF Batch. Second, the events can be used to produce historical information about the LSF Batch system and user jobs. Last, such information can be used to produce accounting or statistical reports.
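The location of a log file follows directly from the directory layout described above. The following is a minimal sketch of assembling the path. The helper name buildLogPath() is hypothetical and not part of LSBLIB, and the caller is assumed to already know the value of LSB_SHAREDIR and the cluster name.

```c
#include <stdio.h>

/* Hypothetical helper, not part of LSBLIB.
 * Writes <sharedir>/<cluster>/logdir/<file> into buf.
 * Returns 0 on success, or -1 if buf is too small. */
int
buildLogPath(char *buf, size_t len, const char *sharedir,
             const char *cluster, const char *file)
{
    int n = snprintf(buf, len, "%s/%s/logdir/%s", sharedir, cluster, file);

    return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```

For example, with sharedir "/local/lsf/work", cluster "cluster1", and file "lsb.events", the result is /local/lsf/work/cluster1/logdir/lsb.events. A program should open the resulting file for reading only.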

CAUTION!
The lsb.events file contains critical user job information. It should never be modified by your program. Writing into this file may cause the loss of user jobs.

LSBLIB provides a function to read this information from these files into a well-defined data structure:

struct eventRec *lsb_geteventrec(log_fp, lineNum)

The parameters are:

FILE  *log_fp;  /* File handle for either an event log file or job log file */
int   *lineNum; /* Line number of the next event record */

The parameter log_fp is a file pointer returned by a successful fopen() call. On a successful return, the content of lineNum is updated to the line number of the next event record in the log file. This value can then be used to report the line number when an error occurs while reading the log file. It should be initialized to 0 before lsb_geteventrec() is called for the first time.

This call returns the following data structure:

struct eventRec {
    char  version[MAX_VERSION_LEN]; /* Version number of the mbatchd */
    int   type;                     /* Type of the event */
    int   eventTime;                /* Event time stamp */
    union eventLog eventLog;        /* Event data */
};

The event type is used to determine the structure of the data in eventLog. LSBLIB remembers the storage allocated for the previously returned data structure and automatically frees it before returning the next event record.
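The pattern of a type field that selects one member of a union can be illustrated with a cut-down model. The following is a minimal sketch: all of the types, constants, and the function describeEvent() are simplified, hypothetical stand-ins and not the definitions from lsbatch.h. It also assumes that the time stamp counts seconds since the UNIX epoch, and copies it into a time_t before formatting because the record stores it as an int.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-ins; the real definitions are in lsbatch.h. */
enum { MINI_JOB_NEW = 1, MINI_JOB_STATUS = 2 };

struct miniNewLog    { int jobId; char jobName[32]; };
struct miniStatusLog { int jobId; int jStatus; };

union miniLog {
    struct miniNewLog    newLog;     /* valid if type is MINI_JOB_NEW */
    struct miniStatusLog statusLog;  /* valid if type is MINI_JOB_STATUS */
};

struct miniRec {
    int   type;             /* selects the valid union member */
    int   eventTime;        /* assumed: seconds since the epoch */
    union miniLog log;
};

/* Writes a one-line description of rec into buf, time stamp in UTC.
 * Returns 0 on success, or -1 for an event type this sketch ignores. */
int
describeEvent(const struct miniRec *rec, char *buf, size_t len)
{
    time_t     when = (time_t) rec->eventTime;  /* int to time_t */
    struct tm *tm = gmtime(&when);
    char       stamp[32];

    if (tm == NULL ||
        strftime(stamp, sizeof(stamp), "%Y-%m-%d %H:%M:%S", tm) == 0)
        return -1;

    switch (rec->type) {
    case MINI_JOB_NEW:
        snprintf(buf, len, "%s job %d <%s> submitted", stamp,
                 rec->log.newLog.jobId, rec->log.newLog.jobName);
        return 0;
    case MINI_JOB_STATUS:
        snprintf(buf, len, "%s job %d status %d", stamp,
                 rec->log.statusLog.jobId, rec->log.statusLog.jStatus);
        return 0;
    default:
        return -1;
    }
}
```

The point of the sketch is that a union member is read only after the type field has been checked. Reading a member that does not correspond to the event type yields meaningless data.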

lsb_geteventrec() returns NULL and sets lsberrno to LSBE_EOF when there are no more records in the event file.

Events are logged by mbatchd for many different purposes. There are job-related events and system-related events. Applications can choose to process certain events and ignore other events. For example, the bhist command processes job-related events only. The currently available event types are listed in Table 4.

Table 4. Event Types

Event type              Description
EVENT_JOB_NEW           New job event
EVENT_JOB_START         mbatchd is trying to start a job
EVENT_JOB_STATUS        Job status change event
EVENT_JOB_SWITCH        Job switched to a new queue
EVENT_JOB_MOVE          Job moved within a queue
EVENT_QUEUE_CTRL        Queue status changed by LSF admin
EVENT_HOST_CTRL         Host status changed by LSF admin
EVENT_MBD_START         New mbatchd start event
EVENT_MBD_DIE           mbatchd resign event
EVENT_MBD_UNFULFILL     mbatchd has an action to be fulfilled
EVENT_JOB_FINISH        Job has finished (logged in lsb.acct only)
EVENT_LOAD_INDEX        Complete list of load index names
EVENT_MIG               Job has migrated
EVENT_PRE_EXEC_START    The pre-execution command started
EVENT_JOB_ROUTE         The job has been routed to NQS
EVENT_JOB_MODIFY        The job has been modified
EVENT_JOB_SIGNAL        Job signal to be delivered
EVENT_CAL_NEW           New calendar event (1)
EVENT_CAL_MODIFY        Calendar modified (1)
EVENT_CAL_DELETE        Calendar deleted (1)
EVENT_JOB_FORWARD       Job forwarded to another cluster
EVENT_JOB_ACCEPT        Job from a remote cluster dispatched
EVENT_STATUS_ACK        Job status successfully sent to submission cluster
EVENT_JOB_EXECUTE       Job started successfully
EVENT_JOB_REQUEUE       Job is requeued
EVENT_JOB_SIGACT        A signal action on a job has been initiated or finished
EVENT_JOB_START_ACCEPT  Job accepted by sbatchd

(1) Available only if the LSF JobScheduler component is enabled.

Note that the event type EVENT_JOB_FINISH is used by the lsb.acct file only and all other event types are used by the lsb.events file only. For detailed formats of these log files, see lsb.events(5) and lsb.acct(5).

Each event type corresponds to a different data structure in the union:

union  eventLog { 
    struct jobNewLog     jobNewLog;             /* EVENT_JOB_NEW */
    struct jobStartLog   jobStartLog;           /* EVENT_JOB_START */
    struct jobStatusLog  jobStatusLog;          /* EVENT_JOB_STATUS */
    struct jobSwitchLog  jobSwitchLog;          /* EVENT_JOB_SWITCH */
    struct jobMoveLog    jobMoveLog;            /* EVENT_JOB_MOVE */
    struct queueCtrlLog  queueCtrlLog;          /* EVENT_QUEUE_CTRL */
    struct hostCtrlLog   hostCtrlLog;           /* EVENT_HOST_CTRL */
    struct mbdStartLog   mbdStartLog;           /* EVENT_MBD_START */
    struct mbdDieLog     mbdDieLog;             /* EVENT_MBD_DIE */
    struct unfulfillLog  unfulfillLog;          /* EVENT_MBD_UNFULFILL */
    struct jobFinishLog  jobFinishLog;          /* EVENT_JOB_FINISH */
    struct loadIndexLog  loadIndexLog;          /* EVENT_LOAD_INDEX */
    struct migLog        migLog;                /* EVENT_MIG */
    struct calendarLog   calendarLog;           /* Shared by all calendar events */
    struct jobForwardLog jobForwardLog;         /* EVENT_JOB_FORWARD */
    struct jobAcceptLog  jobAcceptLog;          /* EVENT_JOB_ACCEPT */
    struct statusAckLog  statusAckLog;          /* EVENT_STATUS_ACK */
    struct signalLog     signalLog;             /* EVENT_JOB_SIGNAL */
    struct jobExecuteLog jobExecuteLog;         /* EVENT_JOB_EXECUTE */
    struct jobRequeueLog jobRequeueLog;         /* EVENT_JOB_REQUEUE */
    struct sigactLog sigactLog;                 /* EVENT_JOB_SIGACT */
    struct jobStartAcceptLog jobStartAcceptLog; /* EVENT_JOB_START_ACCEPT */
};

The detailed data structures in the above union are defined in lsbatch.h and described in lsb_geteventrec(3).

Below is an example program that takes a job name as its argument and displays a chronological history of all jobs matching that name. This program assumes that the lsb.events file is in /local/lsf/work/cluster1/logdir:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <lsf/lsbatch.h>

#define MAX_MATCHED 1024    /* arbitrary limit on remembered job IDs */

static int matched[MAX_MATCHED];    /* IDs of jobs whose name matched */
static int numMatched = 0;

/* Return 1 if jobId belongs to a job whose name matched, else 0 */
static int isMatched(jobId)
    int  jobId;
{
    int  i;

    for (i = 0; i < numMatched; i++)
        if (matched[i] == jobId)
            return 1;
    return 0;
}

int main(argc, argv)
    int  argc;
    char *argv[];
{
    char *eventFile = "/local/lsf/work/cluster1/logdir/lsb.events";
    FILE *fp;
    struct eventRec *record;
    struct jobNewLog *newJob;
    struct jobStartLog *startJob;
    struct jobStatusLog *statusJob;
    int  lineNum = 0;
    char *jobName;
    time_t when;
    int  i;

    if (argc != 2) {
        printf("Usage: %s jobname\n", argv[0]);
        exit(-1);
    }
    jobName = argv[1];

    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    fp = fopen(eventFile, "r");
    if (fp == NULL) {
        perror(eventFile);
        exit(-1);
    }

    for (;;) {

        record = lsb_geteventrec(fp, &lineNum);
        if (record == NULL) {
            if (lsberrno == LSBE_EOF)
                exit(0);
            fprintf(stderr, "%s: line %d: ", eventFile, lineNum);
            lsb_perror("lsb_geteventrec");
            exit(-1);
        }

        /* eventTime is an int, but ctime() expects a time_t pointer */
        when = (time_t) record->eventTime;

        /* "%.24s" below drops the newline that ctime() appends */
        switch (record->type) {

        case EVENT_JOB_NEW:
            /* Only this event carries the job name; remember the job ID
               so that later events of the same job can be recognized */
            newJob = &(record->eventLog.jobNewLog);
            if (strcmp(newJob->jobName, jobName) != 0)
                continue;
            if (numMatched < MAX_MATCHED)
                matched[numMatched++] = newJob->jobId;
            printf("%.24s: job <%d> submitted by <%s> from <%s> to <%s> queue\n",
                    ctime(&when), newJob->jobId, newJob->userName,
                    newJob->fromHost, newJob->queue);
            continue;

        case EVENT_JOB_START:
            startJob = &(record->eventLog.jobStartLog);
            if (!isMatched(startJob->jobId))
                continue;
            printf("%.24s: job <%d> started on ",
                    ctime(&when), startJob->jobId);
            for (i = 0; i < startJob->numExHosts; i++)
                printf("<%s> ", startJob->execHosts[i]);
            printf("\n");
            continue;

        case EVENT_JOB_STATUS:
            statusJob = &(record->eventLog.jobStatusLog);
            if (!isMatched(statusJob->jobId))
                continue;
            printf("%.24s: Job <%d> status changed to: ",
                    ctime(&when), statusJob->jobId);
            switch (statusJob->jStatus) {

            case JOB_STAT_PEND:
                printf("pending\n");
                continue;

            case JOB_STAT_RUN:
                printf("running\n");
                continue;

            case JOB_STAT_SSUSP:
            case JOB_STAT_USUSP:
            case JOB_STAT_PSUSP:
                printf("suspended\n");
                continue;

            case JOB_STAT_UNKWN:
                printf("unknown (sbatchd unreachable)\n");
                continue;

            case JOB_STAT_EXIT:
                printf("exited\n");
                continue;

            case JOB_STAT_DONE:
                printf("done\n");
                continue;

            default:
                printf("\nError: unknown job status %d\n", statusJob->jStatus);
                continue;
            }
        default:            /* only display a few selected event types */
            continue;
        }
    }

    exit(0);
}

Note that in the above program, events that are of no interest are skipped. The job status codes are defined in lsbatch.h. The lsb.acct file stores job accounting information and can be processed in a similar way. Because the lsb.acct file currently contains only one event type (EVENT_JOB_FINISH), processing it is simpler than in the above example.



doc@platform.com

Copyright © 1994-1997 Platform Computing Corporation.
All rights reserved.