This chapter describes how to set up a simple LSF cluster that lets your users run LSF jobs with the default configuration. You need to read the remaining chapters only if you want to use the full features and options of the LSF products.
The initial configuration uses default settings for most parameters. Some of these parameters are mandatory. They tell the LSF daemons about the structure of your cluster by defining hosts, host types, and resources in your cluster.
The lsfsetup command is used to perform the initial configuration.
You can add more than one host at this time. Enter the host information for all hosts in the cluster. The master LIM and mbatchd daemons run on the first available host in this list, so you should list reliable batch server hosts first. For more information see 'Fault Tolerance'.
If your cluster includes client hosts, enter the host names in this step and set the SERVER field for these hosts to 0 (zero).
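The result of this step is recorded in the Host section of the lsf.cluster.cluster file. The following is a minimal sketch of how that section might look; the exact columns depend on the template chosen during installation, and hostE is a hypothetical client-only host:

Begin   Host
HOSTNAME   model     type    server  RESOURCES        # server=1: LSF server host
hostA      HP735     hppa    1       (fserver hpux)
hostD      ORIGIN2K  sgi     1       (cserver)
hostE      PENT200   NTX86   0       ()               # server=0: client-only host
End     Host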
When you have finished adding hosts, enter 'q' to select 'Done Modifying Host Configuration'.
CAUTION!
LSF daemons must be started as root. If you are creating a private cluster, do not attempt to use lsf_daemons to start your daemons; start them manually.
Note
If you choose the option to start the LSF daemons on all machines using the lsadmin and badmin commands, run 'lsadmin limstartup', 'lsadmin resstartup', and 'badmin hstartup' (to start the LIM, RES, and sbatchd daemons respectively) instead of performing Step 6 manually.
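For example, the LSF administrator could start the daemons on all server hosts with commands like the following. This is only a sketch; it assumes these commands accept a list of host names, and the host names shown are the example hosts used throughout this chapter:

% lsadmin limstartup hostA hostD hostB
% lsadmin resstartup hostA hostD hostB
% badmin hstartup hostA hostD hostB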
lsfsetup creates a default LSF Batch configuration, including a set of batch queues. You do not need to change any LSF Batch files to use the default configuration. After you have the system running, you may want to reconfigure LSF Batch. See 'Managing LSF Batch' for a discussion of how to do this.
You need to read this section only if you are configuring some hosts as client-only hosts.
LSF client hosts have transparent access to the resources on LSF server hosts, but do not run the LSF daemons. This means that you can run all LSF commands and tools from a client host, but your submitted jobs will only run on server hosts.
On client hosts the lsf.conf file must contain the names of some server hosts. LSF applications running on a client host contact the servers named in the lsf.conf file to find the master host.
Client hosts need access to the LSF configuration directory to read the lsf.task files. This can be a private read-only copy of the files; you do not need access to a shared copy on a file server host. If you install local copies of the lsf.task files, you must remember to update them when the files are changed on the server hosts. See 'The lsf.task and lsf.task.cluster Files' for a detailed description of these files.
Client hosts must have access to the LSF user commands in the LSF_BINDIR directory and to the nios program in the LSF_SERVERDIR directory. These can be installed directly on the client host or mounted from a file server.
The client hosts must be configured as described in Step 3 of the chapter 'Initial Configuration'.
In the lsf.conf file read by the client hosts, list some server hosts in LSF_SERVER_HOSTS, for example:
LSF_SERVER_HOSTS="hostA hostD hostB"
Replace the example host names with the names of some LSF server hosts listed in your lsf.cluster.cluster file. You should choose reliable hosts for the LSF_SERVER_HOSTS list; if all the hosts on this list are unavailable, the client host cannot use the LSF cluster.
Before you can start any LSF daemons, you should make sure that your cluster configuration is correct. The lsfsetup program includes an option to check the LSF configuration. The default LSF Batch configuration should work as it is installed following the steps described in 'Installation'.
You should have checked the configuration during Step 5. If you want to check it again, you can run the lsadmin and badmin commands.
To check the LIM configuration, log in to the first host listed in lsf.cluster.cluster as the LSF administrator and run lsadmin ckconfig:
% lsadmin ckconfig -v
Checking configuration files ...
LSF 3.0, Dec 10, 1996
Copyright 1992-1996 Platform Computing Corporation

Reading configuration from /etc/lsf.conf
Dec 21 21:15:51 13412 /usr/local/lsf/etc/lim -C
Dec 21 21:15:52 13412 initLicense: Trying to get license for LIM from source </usr/local/lsf/conf/license.dat>
Dec 21 21:15:52 13412 main: Got 1 licenses
Dec 21 21:15:52 13412 main: Configuration checked. No fatal errors found.
---------------------------------------------------------
No errors found.
The messages shown above are the normal output from lsadmin ckconfig -v. Other messages may indicate problems with the LSF configuration. See 'LSF Base Configuration Reference' and 'Troubleshooting and Error Messages' if any problem is found.
To check the LSF Batch configuration files, LIM must be running on the master host. If the LIM is not running, log in as root and start LSF_SERVERDIR/lim. Wait a minute and then run the lsid program to make sure LIM is available. Then run badmin ckconfig -v:
% badmin ckconfig -v
Checking configuration files ...
Dec 21 21:22:14 13545 mbatchd: LSF_ENVDIR not defined; assuming /etc
Dec 21 21:22:15 13545 minit: Trying to call LIM to get cluster name ...
Dec 21 21:22:17 13545 readHostFile: 3 hosts have been specified in file </usr/local/lsf/conf/lsbatch/test_cluster/configdir/lsb.hosts>; only these hosts will be used by lsbatch
Dec 21 21:22:17 13545 Checking Done
---------------------------------------------------------
No fatal errors found.
The above messages are normal; other messages may indicate problems with the LSF Batch configuration. See 'LSF Batch Configuration Reference' and 'Troubleshooting and Error Messages' if any problem is found.
After you have started the LSF daemons in your cluster, you should run some simple tests. Wait a minute or two for the LIMs on all hosts to contact each other, elect a master, and exchange setup information.
The testing should be performed as a non-root user. This user's PATH must include the LSF user binaries (LSF_BINDIR as defined in /etc/lsf.conf).
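To confirm that the PATH is set correctly, check that the LSF commands resolve to LSF_BINDIR; the path shown below is only an example of a typical installation:

% which lsid
/usr/local/lsf/bin/lsid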
Testing consists of running a number of LSF commands and making sure that correct results are reported for all hosts in the cluster. This section shows suggested tests and examples of correct output. The output you see on your system will reflect your local configuration.
The following steps may be performed from any host in the cluster.
% lsid
LSF 3.0, Dec 10, 1996
Copyright 1992-1996 Platform Computing Corporation

My cluster name is test_cluster
My master name is hostA
The master name may vary but is usually the first host configured in the Hosts section of the lsf.cluster.cluster file.
If the LIM is not available on the local host, lsid displays the following message:
lsid: ls_getmastername failed: LIM is down; try later
If the LIM is not running, try running lsid a few more times. If the LIM still does not respond, see 'Troubleshooting and Error Messages'.
The message:
lsid: ls_getmastername failed: Cannot locate master LIM now, try later
means that the local LIM is running, but the master LIM has not yet contacted the local LIM. Check the LIM on the first host listed in lsf.cluster.cluster. If it is running, wait 30 seconds and try lsid again. If the LIM on the first host is not running, another LIM takes over as master after one or two minutes.
% lsinfo
RESOURCE_NAME   TYPE     ORDER  DESCRIPTION
r15s            Numeric  Inc    15-second CPU run queue length
r1m             Numeric  Inc    1-minute CPU run queue length (alias: cpu)
r15m            Numeric  Inc    15-minute CPU run queue length
ut              Numeric  Inc    1-minute CPU utilization (0.0 to 1.0)
pg              Numeric  Inc    Paging rate (pages/second)
ls              Numeric  Inc    Number of login sessions (alias: login)
it              Numeric  Dec    Idle time (minutes) (alias: idle)
tmp             Numeric  Dec    Disk space in /tmp (Mbytes)
mem             Numeric  Dec    Available memory (Mbytes)
ncpus           Numeric  Dec    Number of CPUs
maxmem          Numeric  Dec    Maximum memory (Mbytes)
maxtmp          Numeric  Dec    Maximum /tmp space (Mbytes)
cpuf            Numeric  Dec    CPU factor
type            String   N/A    Host type
model           String   N/A    Host model
status          String   N/A    Host status
server          Boolean  N/A    LSF server host
cserver         Boolean  N/A    Compute Server
solaris         Boolean  N/A    Sun Solaris operating system
fserver         Boolean  N/A    File Server
NT              Boolean  N/A    Windows NT operating system

TYPE_NAME
hppa
SUNSOL
alpha
sgi
NTX86
rs6000

MODEL_NAME      CPU_FACTOR
HP735           4.0
ORIGIN2K        8.0
DEC3000         5.0
PENT200         3.0
The resource names, host types, and host models should be those configured in LSF_CONFDIR/lsf.shared.
% lshosts
HOST_NAME  type   model     cpuf  ncpus  maxmem  maxswp  server  RESOURCES
hostA      hppa   HP735     4.00      1    128M    256M  Yes     (fserver hpux)
hostD      sgi    ORIGIN2K  8.00     32    512M   1024M  Yes     (cserver)
hostB      NTX86  PENT200   3.00      1     96M    180M  Yes     (NT)
The output should contain one line for each host configured in the cluster, and the type, model, and RESOURCES should be those configured for that host in lsf.cluster.cluster. cpuf should match the CPU factor given for the host model in lsf.shared.
% lsload
HOST_NAME  status  r15s  r1m  r15m   ut   pg  ls  it   tmp   swp   mem
hostA      ok       0.3  0.1   0.0   3%  1.0   1  12  122M  116M   56M
hostD      ok       0.6  1.2   2.0  23%  3.0  14   0   63M  698M  344M
hostB      ok       0.6  0.3   0.0   5%  0.3   1   0   55M   41M   37M
The output contains one line for each host in the cluster.
If any host has unavail in the status column, the master LIM is unable to contact the LIM on that host. This can occur if the LIM was started recently and has not yet contacted the master LIM, or if no LIM was started on that host, or if that host was not configured correctly.
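If no LIM was started on a host, you can log in to that host as root and start the LIM manually, as described earlier in this chapter. For example, assuming LSF_SERVERDIR is /usr/local/lsf/etc (your path may differ):

% /usr/local/lsf/etc/lim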
If the entry in the status column begins with - (for example, -ok), the RES is not available on that host. RES status is checked every 90 seconds, so after starting a RES allow up to 90 seconds for the status to be updated.
See 'Troubleshooting and Error Messages' if any of these tests do not display the expected results. If all these tests succeed, the LIMs on all hosts are running correctly.
% lsgrun -v -m "hostA hostD hostB" hostname
<<Executing hostname on hostA>>
hostA
<<Executing hostname on hostD>>
hostD
<<Executing hostname on hostB>>
hostB
If remote execution fails on any host, check the RES error log on that host.
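The RES error log is written to LSF_LOGDIR if that directory is defined in lsf.conf; the directory and file name below are assumptions based on a typical installation:

% tail /usr/local/lsf/log/res.log.hostB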
Testing LSF Batch consists of running a number of LSF Batch commands and making sure that correct results are reported for all hosts in the cluster.
% bhosts
HOST_NAME  STATUS   JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
hostD      ok          -   10      1    1      0      0    0
hostA      ok          -   10      4    2      2      0    0
hostC      unavail     -    3      1    1      0      0    0
The STATUS column shows the status of sbatchd on that host. If the STATUS column contains unavail, that host is not available. Either the sbatchd on that host has not started or it has started but has not yet contacted the mbatchd. If hosts are still listed as unavailable after roughly three minutes, check the error logs on those hosts. See 'Troubleshooting and Error Messages'.
See the bhosts(1) manual page for explanations of the other columns.
% bsub sleep 60
Job <1> is submitted to default queue <normal>
If the job you submitted was the first ever, it should have job ID 1. Otherwise, the number varies.
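You can also submit a job to a specific queue with the -q option of bsub. For example, using the short queue shown in the bqueues output below (the job ID in the reply will vary):

% bsub -q short sleep 30
Job <2> is submitted to queue <short>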
% bqueues
QUEUE_NAME   PRIO  STATUS         MAX  JL/U  JL/P  JL/H  NJOBS  PEND  RUN  SUSP
interactive   400  Open:Active      -     -     -     -      1     1    0     0
fairshare     300  Open:Active      -     -     -     -      2     0    2     0
owners         43  Open:Active      -     -     -     -      0     0    0     0
priority       43  Open:Active      -     -     -     -     29    29    0     0
night          40  Open:Inactive    -     -     -     -      1     1    0     0
short          35  Open:Active      -     -     -     -      0     0    0     0
normal         30  Open:Active      -     -     -     -      0     0    0     0
idle           20  Open:Active      -     -     -     -      0     0    0     0
See the bqueues(1) manual page for an explanation of the output.
% bjobs
JOBID  USER   STAT  QUEUE   FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
1      user1  RUN   normal  hostA      hostD      sleep 60  Dec 10 22:44
Note that if all hosts are busy, the job is not started immediately and the STAT column shows PEND. The job should take one minute to run. When it completes, you should receive mail reporting the completion.
You do not need to read this section if you are not using the LSF MultiCluster product.
LSF MultiCluster unites multiple LSF clusters so that they can share resources transparently, while individual clusters still maintain resource ownership and autonomy.
LSF MultiCluster extends the functionality of a single cluster. Configuration involves a few more steps: first you set up a single cluster as described above, then you perform some additional steps specific to LSF MultiCluster. See 'Managing LSF MultiCluster' for more details.
You do not need to read this section if you are not using the LSF JobScheduler product.
LSF JobScheduler provides reliable production job scheduling according to user-specified calendars and events. It runs user-defined jobs automatically at the right time, under the right conditions, and on the right machines.
The configuration of LSF JobScheduler is almost the same as that of an LSF Batch cluster, except that you may have to define system-level calendars for your cluster and you might need to add events to monitor your site. Some LSF Batch configuration options, such as NQS interoperation and fairshare, are less useful for LSF JobScheduler; ignore the features you do not need and use those that are useful for your LSF JobScheduler cluster. The concepts behind LSF JobScheduler are described in more detail in the LSF JobScheduler User's Guide.
See 'Managing LSF JobScheduler' for details about the additional configuration options for LSF JobScheduler.
When you have finished installing and testing the LSF cluster, you can let users try it out. LSF users must add LSF_BINDIR to their PATH environment variable to run the LSF utilities.
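For example, if LSF_BINDIR is /usr/local/lsf/bin (check lsf.conf for the actual value on your system), users can add one of the following lines to their shell startup files:

setenv PATH ${PATH}:/usr/local/lsf/bin            # csh or tcsh (.cshrc)
PATH=$PATH:/usr/local/lsf/bin; export PATH        # sh or ksh (.profile)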
If users wish to use lstcsh as their login shell, most systems require that you modify the /etc/shells file. Add a line containing the full path name of lstcsh. Users may then use chsh(1) to choose lstcsh as their login shell. If your site uses NIS for password information, you must change /etc/shells on the NIS master host and update the NIS database. Otherwise, you must change /etc/shells on all hosts.
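For example, as root on a host where LSF_BINDIR is /usr/local/lsf/bin (the path is an assumption; use your actual LSF_BINDIR):

echo /usr/local/lsf/bin/lstcsh >> /etc/shells     # append lstcsh to the list of valid login shells

Users can then run chsh and give the full path of lstcsh as their new login shell.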
Note
If you configure lstcsh as the login shell for users but do not add LSF_BINDIR/lstcsh to the /etc/shells file, users will not be able to log in using ftp on some versions of UNIX.
Users also need access to the on-line manual pages, which were installed in LSF_MANDIR (as defined in lsf.conf) by the lsfsetup installation procedure. For most versions of UNIX, users should add the directory LSF_MANDIR to their MANPATH environment variable. If your system has a man command that does not understand MANPATH, you should either install the manual pages in the /usr/man directory or get one of the freely available man programs.
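For example, if LSF_MANDIR is /usr/local/lsf/man, users would add one of the following lines to their shell startup files (both forms assume MANPATH is already set):

setenv MANPATH ${MANPATH}:/usr/local/lsf/man             # csh or tcsh
MANPATH=$MANPATH:/usr/local/lsf/man; export MANPATH      # sh or ksh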
You can use the xlsadmin GUI to do most of the cluster configuration and management work that has been described in this chapter. Details about xlsadmin can be found in 'Managing LSF Cluster Using xlsadmin'.