

Chapter 4. Advanced Programming Topics


The LSF API gives programmers the flexibility to write complex load sharing applications. Previous chapters covered basic programming techniques using the LSF APIs. This chapter looks at a few more advanced topics in LSF application programming.

Both LSLIB and LSBLIB are used in the examples of this chapter.

Getting Load Information on Selected Load Indices

'Getting Dynamic Load Information' showed an example that gets load information from the LIM. Depending on the size of your LSF cluster and how often the ls_load() function is called, returning load information about all hosts can place unnecessary overhead on the hosts and the network.

LSLIB provides a function that lets an application name the load indices it is interested in and get back only those load indices.

Getting a List of All Load Index Names

Since LSF allows a site to install an ELIM (External LIM) to collect additional load indices, the names and the total number of load indices are often dynamic and have to be found out at run time unless the application is only using the built-in load indices.

Below is an example routine that returns a list of all available load index names and the total number of load indices.

#include <stdio.h>
#include <lsf/lsf.h>

char **getIndexList(listSize)
    int *listSize;
{
    static struct lsInfo *lsInfo = NULL;  /* must survive between calls */
    static char *nameList[MAXLOADINDEX];
    static int first = 1;
    int i;

    if (first) {           /* only need to do so when called for the first time */
        lsInfo = ls_info();
        if (lsInfo == NULL)
            return (NULL);
        first = 0;
    }

    if (listSize != NULL)
        *listSize = lsInfo->numIndx;

    for (i=0; i<lsInfo->numIndx; i++)
        nameList[i] = lsInfo->resTable[i].name;

    return (nameList);
}

The above routine returns a list of the load index names currently installed in the LSF cluster. The integer that listSize points to is set to the total number of load indices. The routine returns NULL if the ls_info() function fails. The data structure returned by ls_info() contains all the load index names before any other resource names. The load index names start with the 11 built-in load indices, followed by the site's external load indices (collected through an ELIM).
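Since the built-in names always come first, the external names can be located with simple pointer arithmetic. The following is a minimal sketch of that step, not part of LSLIB: externalIndexNames() and NUM_BUILTIN_INDICES are hypothetical names introduced here, and the count of 11 is taken from the text above.

```c
#include <stddef.h>

#define NUM_BUILTIN_INDICES 11   /* count of built-in indices stated in the text */

/* Hypothetical helper: given the full name list and its length, return a
 * pointer to the first external (ELIM) index name and store the number of
 * external indices in *numExternal. Returns NULL when there are none. */
char **externalIndexNames(char **names, int total, int *numExternal)
{
    if (names == NULL || total <= NUM_BUILTIN_INDICES) {
        if (numExternal != NULL)
            *numExternal = 0;
        return NULL;
    }

    if (numExternal != NULL)
        *numExternal = total - NUM_BUILTIN_INDICES;

    return names + NUM_BUILTIN_INDICES;   /* skip the built-in names */
}
```

An application could pass the result of getIndexList() straight to this helper and treat a NULL return as "no external load indices defined".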

Displaying Selected Load Indices

By providing a list of load index names to an LSLIB function, you can get the load information about the specified load indices.

The following example shows how you can display the values of the external load indices. This program uses the following LSLIB function:

struct hostLoad *ls_loadinfo(resreq, numhosts, options, fromhost, 
                  hostlist, listsize, namelist)

The parameters for this routine are:

char *resreq;     /* Resource requirement */
int  *numhosts;   /* Return parameter, number of hosts returned */
int  options;     /* Host and load selection options */
char *fromhost;   /* Used only if DFT_FROMTYPE is set in options */
char **hostlist;  /* A list of candidate hosts for selection */
int  listsize;    /* Number of hosts in hostlist */
char ***namelist; /* Input/output parameter -- load index name list */

This call is similar to ls_load() except that it allows an application to supply both a list of load indices and a list of candidate hosts. If both of these parameters are NULL, it behaves exactly like the ls_load() function.

The parameter namelist allows an application to specify a list of load indices of interest. The function then returns only the specified load indices. On return, this parameter is modified to point to another name list that contains the same set of load index names, but possibly in a different order, so that each name lines up with the corresponding load value returned in the hostLoad array.

The example program follows:

#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsf.h>

extern char **getIndexList();   /* defined in the previous example */

int main() 
{
    struct hostLoad *load;
    char **loadNames;
    int numIndx;
    int numUsrIndx;
    int nHosts;
    int i, j;

    loadNames = getIndexList(&numIndx);
    if (loadNames == NULL) {
        ls_perror("Unable to get load index names\n");
        exit(-1);
    }

    numUsrIndx = numIndx - 11;  /* this is the total num of site defined indices*/
    if (numUsrIndx == 0) {
        printf("No external load indices defined\n");
        exit(-1);
    }

    loadNames += 11;            /* skip the 11 built-in load index names */

    load = ls_loadinfo(NULL, &nHosts, 0, NULL, NULL, 0, &loadNames);
    if (load == NULL) {
        ls_perror("ls_loadinfo");
        exit(-1);
    }

    printf("Report on external load indices\n");

    for (i=0; i<nHosts; i++) {
        printf("Host %s:\n", load[i].hostName);
        for (j=0; j<numUsrIndx; j++) 
            printf("       index name: %s, value %5.0f\n", 
                loadNames[j], load[i].li[j]);
    }

    exit(0);
}

The above program uses the getIndexList() function described in the previous example program to get a list of all available load index names. Sample output from the above program follows:

Report on external load indices
Host apple:
       index name: usr_tmp, value 87 
       index name: num_licenses, value 1
Host orange:
       index name: usr_tmp, value 18
       index name: num_licenses, value 2
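Because ls_loadinfo() may hand back the name list in a different order from the one supplied, a position in the original list cannot be assumed to match a position in the li array. One way to deal with this is to look values up by name in the returned list. The sketch below illustrates the idea on plain arrays rather than a live cluster. findLoadIndex() and printLoadValue() are hypothetical helpers, not LSLIB calls, and float is assumed as the element type of the value array.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the position of `name` in a name list of
 * `count` entries, or -1 if the name is not in the list. */
int findLoadIndex(char **names, int count, const char *name)
{
    int i;

    for (i = 0; i < count; i++)
        if (names[i] != NULL && strcmp(names[i], name) == 0)
            return i;
    return -1;
}

/* Hypothetical helper: print one named value from a row of load values
 * ordered the same way as `names` (float element type is an assumption).
 * Returns 0 on success, or -1 if the name is unknown. */
int printLoadValue(char **names, float *values, int count, const char *name)
{
    int pos = findLoadIndex(names, count, name);

    if (pos < 0)
        return -1;
    printf("%s = %5.0f\n", name, values[pos]);
    return 0;
}
```

In the program above, the names argument would be the loadNames list as modified by ls_loadinfo(), and values would be load[i].li for the host of interest.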

Writing a Parallel Application

LSF provides job placement and remote execution support for parallel applications. The LIM's host selection (placement) service can return an array of good hosts for an application. The application can then use the remote execution service provided by the RES to run tasks on these hosts concurrently.

This section gives examples of writing a parallel application using LSLIB.

ls_rtask() Function

'Running a Task Remotely' discussed the use of the ls_rexecv() function for remote execution. There is another LSLIB call for remote execution: ls_rtask(). These two functions differ in how the client side behaves.

ls_rexecv() is useful when the local side does not need to do anything but wait for the remote task to finish. After initiating the remote task, ls_rexecv() replaces the current program with the Network I/O Server (NIOS) by calling the execv() system call. The NIOS then handles the rest of the work on the local side: it delivers input and output between the local terminal and the remote task, and exits with the same status as the remote task. ls_rexecv() can be thought of as the remote execution version of the UNIX execv() system call.

ls_rtask() provides more flexibility if the client side has to do other things after the remote task is initiated. For example, the application may want to start more than one task on several hosts. Unlike ls_rexecv(), ls_rtask() returns immediately after the remote task is started. The syntax of ls_rtask() is:

int ls_rtask(host, argv, options)

The parameters are:

char *host;     /* Name of the remote host to start task on */
char **argv;    /* Program name and arguments */
int options;    /* Remote execution options */

The options parameter is similar to that of the ls_rexecv() function. This function returns the task ID of the remote task, which the application can use to tell apart multiple outstanding remote tasks. When a remote task finishes, its status is sent back to the NIOS running on the local host, which then notifies the application by sending it a SIGUSR1 signal. The application can then call ls_rwait() to collect the status of the remote task. ls_rwait() behaves in much the same way as the wait(2) system call. ls_rtask() can be thought of as a combination of a remote fork() and execv().
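For readers more familiar with the local calls, the analogy can be made concrete. The sketch below uses only standard POSIX functions and does not call LSLIB at all: startLocalTask() plays the role that ls_rtask() plays remotely (start a task and return at once with an ID), and waitLocalTask() plays the role of ls_rwait(). Both function names are hypothetical, and execvp() is used in place of execv() so that the command is found through PATH.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start argv[0] as a child process and return immediately with its
 * process ID (the local counterpart of a task ID), or -1 on failure. */
pid_t startLocalTask(char **argv)
{
    pid_t pid = fork();

    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);            /* only reached if execvp() failed */
    }
    return pid;                /* child's pid, or -1 if fork() failed */
}

/* Block until any child finishes. Store its exit code in *exitCode
 * (or -1 if it did not exit normally) and return its process ID. */
pid_t waitLocalTask(int *exitCode)
{
    int status;
    pid_t pid = wait(&status);

    if (pid > 0 && exitCode != NULL)
        *exitCode = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    return pid;
}
```

As with the remote calls, several tasks can be started before any of them is waited for, and the ID returned by the wait call identifies which one finished.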

Note
Applications calling ls_rtask() must set up a signal handler for the SIGUSR1 signal, or the application could be killed by SIGUSR1.

You need to be careful if your application handles the SIGTSTP, SIGTTIN, or SIGTTOU signals. If the handlers for these signals are SIG_DFL, the ls_rtask() function automatically installs a handler for them to properly coordinate with the NIOS when these signals are received. If you intend to handle these signals yourself instead of using the default set by LSLIB, you need to call the low level LSLIB function ls_stoprex() before the end of your signal handler.
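One common way to meet the SIGUSR1 requirement is a handler that only records that a notification arrived and leaves the real work to the main program. The following is a sketch using standard POSIX signal handling, with no LSLIB calls. installUsr1Handler() and consumeTaskEvent() are hypothetical names. Note that several signals arriving close together may be merged into one, so a set flag means only that at least one task has finished.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler; sig_atomic_t is safe to write from a signal handler. */
static volatile sig_atomic_t taskEvent = 0;

static void onUsr1(int sig)
{
    (void)sig;
    taskEvent = 1;              /* just record the notification */
}

/* Install the SIGUSR1 handler. Returns 0 on success, -1 on failure. */
int installUsr1Handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = onUsr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart system calls interrupted by the signal */
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Return 1 and clear the flag if a notification arrived, otherwise 0. */
int consumeTaskEvent(void)
{
    if (taskEvent) {
        taskEvent = 0;
        return 1;
    }
    return 0;
}
```

The main program would call installUsr1Handler() before starting any remote tasks, then check consumeTaskEvent() at convenient points and collect task status when it returns 1.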

Running Tasks on Many Machines

Below is an example program that uses ls_rtask() to run 'rm -f /tmp/core' on user specified hosts.

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <lsf/lsf.h>

int main (argc, argv)
    int argc;
    char *argv[];
{
    char *command[4];
    int numHosts;
    int i;
    int tid;

    if (argc <= 1) {
        printf("Usage: %s host1 [host2 ...]\n", argv[0]);
        exit(-1);
    }

    numHosts = argc - 1;
    command[0] = "rm";
    command[1] = "-f";
    command[2] = "/tmp/core";
    command[3] = NULL;

    if (ls_initrex(numHosts, 0) < 0) {
        ls_perror("ls_initrex");
        exit(-1);
    }

    signal(SIGUSR1, SIG_IGN);

    /* Run command on the specified hosts */
    for (i=1; i<=numHosts; i++) {
        if ((tid = ls_rtask(argv[i], command, 0)) < 0) {
            fprintf(stderr, "ls_rtask failed for host %s: %s\n", 
                    argv[i], ls_sysmsg());
            exit(-1);
        }
        printf("Task %d started on %s\n", tid, argv[i]);
    }

    /* Collect the status of each remote task as it finishes */
    while (numHosts) {
        LS_WAIT_T status;

        tid = ls_rwait(&status, 0, NULL);
        if (tid < 0) {
            ls_perror("ls_rwait");
            exit(-1);
        }
    
        printf("Task %d finished\n", tid);
        numHosts--;
    }

    exit(0);
}

In the above program, the signal handler for SIGUSR1 is set to SIG_IGN, so the signal is ignored. The program uses ls_rwait() to poll the status of remote tasks. Alternatively, you could install a signal handler and call ls_rwait() inside that handler.

The task ID can be used to perform an operation on the task. For example, you can send a signal to a remote task explicitly by calling ls_rkill().

If you want to run the task on remote hosts one after another, instead of concurrently, you can call ls_rwait() right after ls_rtask().

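Also note the use of ls_sysmsg() instead of ls_perror(). ls_perror() does not allow a flexible printing format, whereas the string returned by ls_sysmsg() can be placed in any message the program chooses.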

The above example program produces output similar to the following:

% a.out orange apple pear
Task 1 started on orange
Task 2 started on apple
Task 3 started on pear
Task 1 finished
Task 3 finished
Task 2 finished

Note that remote tasks are run concurrently, so the order in which tasks finish is not necessarily the same as the order in which tasks are started.
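Since tasks can finish in any order, an application that wants to report which host a finished task ran on has to remember the mapping from task ID to host itself. The sketch below shows one simple way to keep that mapping in a fixed-size table. It is plain C with no LSLIB calls. rememberTask(), hostOfTask(), and MAX_TASKS are hypothetical names chosen for this illustration.

```c
#include <stddef.h>

#define MAX_TASKS 64          /* arbitrary limit chosen for this sketch */

struct taskEntry {
    int tid;                  /* task ID, as returned when the task was started */
    const char *host;         /* host the task was started on */
};

static struct taskEntry taskTable[MAX_TASKS];
static int numTasks = 0;

/* Record that task `tid` runs on `host`. Returns 0, or -1 if the table is full. */
int rememberTask(int tid, const char *host)
{
    if (numTasks >= MAX_TASKS)
        return -1;
    taskTable[numTasks].tid = tid;
    taskTable[numTasks].host = host;
    numTasks++;
    return 0;
}

/* Return the host recorded for `tid`, or NULL if the ID is unknown. */
const char *hostOfTask(int tid)
{
    int i;

    for (i = 0; i < numTasks; i++)
        if (taskTable[i].tid == tid)
            return taskTable[i].host;
    return NULL;
}
```

In the example program, rememberTask(tid, argv[i]) could be called right after each successful ls_rtask(), and hostOfTask(tid) could be used in the final loop to print the host name alongside the task ID.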

Finding Out Why the Job Is Still Pending

In 'Getting Information about Batch Jobs', you saw how to get information about submitted jobs. It is often useful to know why a job is in a particular state. LSBLIB provides a function that returns such information as text. This section describes a routine that reports why a job is pending.

When lsb_readjobinfo() reads a record of a pending job, the variables reasons and subreasons contained in the returned jobInfoEnt data structure can be used to call the following LSBLIB function to get the reason text explaining why the job is still in pending state:

char *lsb_pendreason(pendReasons, subReasons, ld)

where pendReasons and subReasons are integer reason flags as returned by the lsb_readjobinfo() function, and ld is a pointer to the following data structure:

struct loadIndexLog {
    int nIdx;     /* Number of load indices configured for the LSF cluster */
    char **name;  /* List of the load index names */
};

The example routine below should be called by your application after lsb_readjobinfo() is called.

#include <stdio.h>
#include <lsf/lsbatch.h>

extern char **getIndexList();   /* defined in the earlier example */

char *
reasonToText(reasons, subreasons)
    int reasons;
    int subreasons;
{
    struct loadIndexLog indices;

    /* first get the list of all load index names */
    indices.name = getIndexList(&indices.nIdx);

    /* then let LSBLIB translate the reason flags into text */
    return (lsb_pendreason(reasons, subreasons, &indices));
}

A similar routine can be written to print out the reason why a job was suspended. The corresponding LSBLIB call is:

char *lsb_suspreason(reasons, subreasons, ld)

The parameters for this function are the same as those for the lsb_pendreason() function.

Reading lsf.conf Parameters

It is frequently desirable for your applications to read the contents of the lsf.conf file or even define your own site specific variables in the lsf.conf file.

The lsf.conf file follows the syntax of the Bourne shell, so it can be sourced by a shell script and its variables set into your environment before your C program is started. Your program can then read these variables as environment variables.
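On the C side, this approach comes down to calling the standard getenv() function. The following is a minimal sketch, assuming the wrapper script has exported the variables after sourcing lsf.conf. confFromEnv() is a hypothetical helper name, and the fallback argument is an addition for illustration.

```c
#include <stdlib.h>

/* Return the value of an lsf.conf parameter that was exported into the
 * environment, or `fallback` if the variable is unset or empty. */
const char *confFromEnv(const char *name, const char *fallback)
{
    const char *value = getenv(name);

    if (value == NULL || value[0] == '\0')
        return fallback;
    return value;
}
```

For example, confFromEnv("MY_PARAM1", "none") yields the exported value of MY_PARAM1, or the string "none" when the wrapper script did not set it.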

LSLIB provides a function to read the lsf.conf variables in your C program:

int ls_readconfenv(paramList, confPath)

where confPath is the directory in which the lsf.conf file is stored and paramList is an array of the following data structure:

struct config_param {
    char *paramName;   /* Name of the parameter, input */
    char *paramValue;  /* Value of the parameter, output */
};

ls_readconfenv() reads the values of the parameters defined in lsf.conf matching the names described in the paramList array. Each resulting value is saved into the paramValue variable of the array element matching paramName. If a particular parameter mentioned in the paramList is not defined in lsf.conf, then on return its value is left NULL.

The following example program reads the variables LSF_CONFDIR, MY_PARAM1, and MY_PARAM2 from the lsf.conf file and displays them on screen. Note that LSF_CONFDIR is a standard LSF parameter, while the other two parameters are site-specific. The program assumes that lsf.conf is in the /etc directory.

#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsf.h>

struct config_param myParams[] =
{
#define LSF_CONFDIR      0
     {"LSF_CONFDIR", NULL},
#define MY_PARAM1        1
     {"MY_PARAM1", NULL},
#define MY_PARAM2        2
     {"MY_PARAM2", NULL},
     {NULL, NULL}              /* end-of-array marker */
};

int main()
{
    if (ls_readconfenv(myParams, "/etc") < 0) {
        ls_perror("ls_readconfenv");
        exit(-1);
    }

    if (myParams[LSF_CONFDIR].paramValue == NULL) 
        printf("LSF_CONFDIR is not defined in /etc/lsf.conf\n");
    else
        printf("LSF_CONFDIR=%s\n", myParams[LSF_CONFDIR].paramValue);

    if (myParams[MY_PARAM1].paramValue == NULL)
        printf("MY_PARAM1 is not defined in /etc/lsf.conf\n");
    else
        printf("MY_PARAM1=%s\n", myParams[MY_PARAM1].paramValue);

    if (myParams[MY_PARAM2].paramValue == NULL)
        printf("MY_PARAM2 is not defined in /etc/lsf.conf\n");
    else
        printf("MY_PARAM2=%s\n", myParams[MY_PARAM2].paramValue);

    exit(0);
}

The paramValue parameter in the config_param data structure must be initialized to NULL and is then modified to point to a result string if a matching paramName is found in the lsf.conf file. The array must end with a NULL paramName.
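Because the array ends with a NULL paramName, it can be walked with a simple loop instead of indexing each entry by hand. The sketch below shows two hypothetical helpers built on that convention: clearParams() resets every value to NULL so a table can be reused, and printParams() prints each entry and counts how many have a value. The structure is redeclared here only so that the fragment stands alone.

```c
#include <stdio.h>

/* Mirrors the structure shown earlier; leave this out when <lsf/lsf.h>
 * is included, since the header already defines it. */
struct config_param {
    char *paramName;   /* Name of the parameter, input */
    char *paramValue;  /* Value of the parameter, output */
};

/* Reset every paramValue to NULL, stopping at the NULL paramName marker. */
void clearParams(struct config_param *params)
{
    for (; params->paramName != NULL; params++)
        params->paramValue = NULL;
}

/* Print each entry and return the number of entries that have a value. */
int printParams(struct config_param *params)
{
    int defined = 0;

    for (; params->paramName != NULL; params++) {
        if (params->paramValue == NULL) {
            printf("%s is not defined\n", params->paramName);
        } else {
            printf("%s=%s\n", params->paramName, params->paramValue);
            defined++;
        }
    }
    return defined;
}
```

With these helpers, the three if/else blocks in the example program could be replaced by a single call to printParams(myParams).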



doc@platform.com

Copyright © 1994-1997 Platform Computing Corporation.
All rights reserved.