This chapter describes the operation, maintenance, and tuning of the LSF Base cluster. Since LSF Base underlies all other LSF components, its correct operation is essential to the whole system. This chapter should be read by all LSF cluster administrators.
The error logs contain important information about daemon operations. When you see abnormal behavior in any of the LSF daemons, check the relevant error logs to find the cause of the problem.
LSF log files grow over time. These files should occasionally be cleared, either by hand or using automatic scripts run by cron(1). If you are using LSF JobScheduler, you can define a calendar-driven job to do the cleanup regularly.
All LSF log files are reopened each time a message is logged, so if you rename or remove a log file of an LSF daemon, the daemons will automatically create a new log file.
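For example, a minimal cleanup script run nightly by cron(1) might simply rename the current log files; the directory below is illustrative, and the daemons create fresh log files automatically:

#!/bin/sh
# Sketch: rotate LSF daemon logs; assumes LSF_LOGDIR is /usr/local/lsf/log.
LOGDIR=/usr/local/lsf/log
for f in $LOGDIR/*.log.*; do
    case $f in *.old) continue ;; esac  # skip files rotated on a previous run
    mv "$f" "$f.old"    # safe: daemons reopen the log on each message
done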
The LSF daemons log messages when they detect problems or unusual situations. The daemons can be configured to put these messages into files, or to send them to the system error logs using the syslog facility.
If LSF_LOGDIR is defined in the /etc/lsf.conf file, LSF daemons try to store their messages in files in that directory. Note that LSF_LOGDIR must be writable by root. The error log file names for the LSF Base system daemons, LIM and RES, are lim.log.hostname and res.log.hostname.
The error log file names for the LSF Batch daemons are sbatchd.log.hostname, mbatchd.log.hostname, and pim.log.hostname. LSF JobScheduler also has an eeventd.log.hostname file, in addition to all the log files of LSF Batch.
If LSF_LOGDIR is defined but the daemons cannot write to files there, the error log files are created in /tmp.
If LSF_LOGDIR is not defined, then errors are logged to syslog using the LOG_DAEMON facility. syslog messages are highly configurable, and the default configuration varies widely from system to system. Start by looking for the file /etc/syslog.conf, and read the manual pages for syslog and/or syslogd.
LSF daemons log messages at different levels of severity, so you can choose to log all messages or only those that are sufficiently critical. This is controlled by the LSF_LOG_MASK parameter in the lsf.conf file. The value of this parameter can be any log priority symbol defined in syslog.h. The default value of LSF_LOG_MASK is LOG_WARNING.
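For example, to store daemon messages in a log directory and log everything at priority LOG_DEBUG and above, lsf.conf might contain lines such as the following (the directory shown is illustrative):

LSF_LOGDIR=/usr/local/lsf/log
LSF_LOG_MASK=LOG_DEBUG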
If the error log is managed by syslog, it is probably already being automatically cleared.
If the LSF daemons cannot find the lsf.conf file when they start, they will not find the definition of LSF_LOGDIR. In this case, error messages go to syslog. If you cannot find any error messages in the log files, they are likely in the syslog.
See 'Troubleshooting and Error Messages' for discussion of the more common problems and error log messages.
The FLEXlm license server daemons log messages about the state of the license servers, and when licenses are checked in or out. This log helps to resolve problems with the license servers and to track license use.
The FLEXlm log is configured by the lsflicsetup command as described in 'Installing a New Permanent License'. This log file grows over time. You can remove or rename the existing FLEXlm log file at any time. The script lsf_license used to run the FLEXlm daemons creates a new log file when necessary.
Note
If you already have a FLEXlm server running for other products and the LSF licenses are added to the existing license file, the FLEXlm log messages go to the same place you previously set up for the other products.
The LSF cluster administrator can monitor the status of the hosts in a cluster, start and stop the LSF daemons, and reconfigure the cluster. Many operations are done using the lsadmin command, which performs administrative operations on LSF Base daemons, LIM and RES.
The lshosts and lsload commands report the current status and load levels of hosts in an LSF cluster. The lsmon and xlsmon commands provide a running display of the same information. The LSF administrator can find unavailable or overloaded hosts with these tools.
% lsload
HOST_NAME  status  r15s  r1m   r15m  ut   pg   ls  it  tmp  swp   mem
hostD      ok      1.3   1.2   0.9   92%  0.0  2   20  5M   148M  88M
hostB      -ok     0.1   0.3   0.7   0%   0.0  1   67  45M  25M   34M
hostA      busy    8.0   *7.0  4.9   84%  4.6  6   17  1M   81M   27M
When the status of a host is preceded by a '-', RES is not running on that host. In the above example, the RES on hostB is down.
LIM and RES can be restarted to upgrade software or clear persistent errors. Jobs running on the host are not affected by restarting the daemons. The LIM and RES daemons are restarted using the lsadmin command:
% lsadmin
lsadmin>limrestart hostD
Checking configuration files ...
No errors found.
Restart LIM on <hostD> ...... done
lsadmin>resrestart hostD
Restart RES on <hostD> ...... done
lsadmin>quit
Note
You must be logged in as the LSF cluster administrator to run the lsadmin command.
The lsadmin command can be applied to all available hosts by using the host name 'all'; for example, lsadmin limrestart all. If a daemon is not responding to network connections, lsadmin displays an error message with the host name. In this case you must kill and restart the daemon by hand.
LSF administrators can start any, or all, LSF daemons, on any, or all, LSF hosts, from any host in the LSF cluster. For this to work, the /etc/lsf.sudoers file must be set up properly to allow you to start up daemons as root, and you must be able to run rsh across LSF hosts without having to enter a password. See 'The lsf.sudoers File' for configuration details of lsf.sudoers.
The 'limstartup' and 'resstartup' options of lsadmin start the LIM and RES daemons, respectively. Specifying a host name starts the daemon on a particular host. For example:
% lsadmin limstartup hostA
Starting up LIM on <hostA> ...... done

% lsadmin resstartup hostA
Starting up RES on <hostA> ...... done
The startup commands can be applied to all available hosts by using the host name 'all'; for example, 'lsadmin limstartup all'. All LSF daemons, including LIM, RES, and sbatchd, can be started on all LSF hosts using the lsfstartup command.
All LSF daemons can be shut down at any time. If the LIM daemon on the current master host is shut down, another host automatically takes over as master. If the RES daemon is shut down while remote interactive tasks are running on the host, the running tasks continue but no new tasks are accepted. To shut down LIM and RES, use the lsadmin command:
% lsadmin
lsadmin>limshutdown hostD
Shut down LIM on <hostD> ...... done
lsadmin>resshutdown hostD
Shut down RES on <hostD> ...... done
lsadmin>quit
You can run lsadmin reconfig while the LSF system is in use; users may be unable to submit new jobs for a short time, but all current remote executions are unaffected.
A LIM can be locked to temporarily prevent any further jobs from being sent to the host. The lock can be set to last either for a specified period of time, or until the host is explicitly unlocked. Only the local host can be locked and unlocked.
% lsadmin limlock
Host is locked
% lsload
HOST_NAME  status  r15s  r1m   r15m  ut   pg   ls  it  tmp  swp   mem
hostD      ok      1.3   1.2   0.9   92%  0.0  2   20  5M   148M  28M
hostA      busy    8.0   *7.0  4.9   84%  0.6  0   17  *1M  31M   7M
hostC      lockU   0.8   1.0   1.1   73%  1.2  3   0   4M   44M   12M
% lsadmin limunlock
Host is unlocked
Only root and the LSF administrator can lock and unlock hosts.
LSF configuration consists of several levels: the lsf.conf file, the LIM configuration files, the task list files, and the LSF Batch configuration files. Each is described below.
lsf.conf is the generic LSF environment configuration file. It defines general installation parameters so that all LSF executables can find the necessary information. This file is typically installed in the directory where the LSF server binaries are installed, with a symbolic link from a convenient directory as defined by the environment variable LSF_ENVDIR, or from the default directory /etc. The file is created by lsfsetup during installation. Note that many of the parameters in this file are machine specific. The detailed contents of this file are described in 'The lsf.conf File'.
LIM is the kernel of your cluster: it provides the single system image to all applications. LIM reads the LIM configuration files and determines the makeup of your cluster and the cluster master host.
The LIM configuration files are lsf.shared and lsf.cluster.cluster, where cluster is the name of your LSF cluster. These files define the host members, general host attributes, and resource definitions for your cluster. The individual functions of each file are described below.
lsf.shared defines the available resource names, host types, host models, cluster names and external load indices that can be used by all clusters. This file is shared by all clusters.
The lsf.cluster.cluster file is a per-cluster configuration file. It contains two types of configuration information: cluster definition information and LIM policy information. Cluster definition information affects all LSF applications, while LIM policy information affects applications that rely on LIM's policy for job placement.
The cluster definition information defines the cluster administrators, all the hosts that make up the cluster, and the attributes of each individual host, such as host type, host model, and resources, using the names defined in lsf.shared.
LIM policy information defines the load sharing and job placement policies provided by LIM. More details about LIM policies are described in 'Tuning LIM Load Thresholds'.
LIM configuration files are stored in directory LSF_CONFDIR as defined in the lsf.conf file. Details of LIM configuration files are described in 'The lsf.shared File'.
lsf.task is a system-wide file that maps task names to default resource requirement strings. LSF maintains a task list for each user in the system. The lsf.task file lets the cluster administrator set task-to-resource-requirement mappings at the system level. Individual users can customize their own lists using the lsrtasks command (see the lsrtasks(1) manual page for details).
When you run a job with an LSF command such as bsub or lsrun, the command consults your task list to find the default resource requirement string for the job if one is not specified explicitly. If a match is not found in your task list, the system assumes a default, which typically means running the job on a host of the same host type as the local host.
There is also a per-cluster file, lsf.task.cluster, which applies only to that cluster and overrides the system-wide definitions. Individual users can override both the system-wide and cluster-wide files by using the lsrtasks command.
lsf.task and lsf.task.cluster files are installed in directory LSF_CONFDIR as defined in the lsf.conf file.
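As an illustrative sketch, a task file entry maps a task name to a resource requirement string. The section name and the entries below are assumptions about the task file format, not taken from this chapter:

Begin RemoteTasks
latex
cc/"select[swp>20] order[cpu]"
f77/"select[f77]"
End RemoteTasks

With entries like these, cc jobs would default to hosts with more than 20 megabytes of available swap, ordered by CPU load, and f77 jobs would default to hosts with the f77 resource.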
The LSF Batch configuration files define LSF Batch-specific configuration such as queues, batch server hosts, and batch user controls. These files are read only by mbatchd. The LSF Batch configuration relies on the LIM configuration; the LSF Batch daemons get the cluster configuration information from the LIM via the LSF API.
LSF Batch configuration files are stored in directory LSB_CONFDIR/cluster, where LSB_CONFDIR is defined in lsf.conf, and cluster is the name of your cluster. Details of LSF Batch configuration files are described in 'Managing LSF Batch'.
All configuration files except lsf.conf use a section-based format. Each file contains a number of sections. Each section starts with a line beginning with the reserved word Begin followed by a section name, and ends with a line beginning with the reserved word End followed by the same section name. Begin, End, section names and keywords are all case insensitive.
Sections can either be vertical or horizontal. A horizontal section contains a number of lines, each having the format: keyword = value, where value is one or more strings. For example:
Begin exampleSection
key1 = string1
key2 = string2 string3
key3 = string4
End exampleSection
Begin exampleSection
key1 = STRING1
key2 = STRING2 STRING3
End exampleSection
In many cases you can define more than one object of the same type by giving more than one horizontal section with the same section name.
A vertical section has a line of keywords as the first line. The lines following the first line are values assigned to the corresponding keywords. Values that contain more than one string must be bracketed with '(' and ')'. The above examples can also be expressed in one vertical section:
Begin exampleSection
key1     key2               key3
string1  (string2 string3)  string4
STRING1  (STRING2 STRING3)  -
End exampleSection
Each line in a vertical section is equivalent to a horizontal section with the same section name.
Some keys in certain sections are optional. For a horizontal section, an optional key does not appear in the section if its value is not defined. For a vertical section, an optional keyword must appear in the keyword line if any line in the section defines a value for that keyword. To specify the default value use '-' or '()' in the corresponding column, as shown for key3 in the example above.
Each line may have multiple columns, separated by either spaces or TAB characters. Lines can be extended by a '\' (back slash) at the end of a line. A '#' (pound sign) indicates the beginning of a comment; characters up to the end of the line are not interpreted. Blank lines are ignored.
Below are some examples of LIM configuration and LSF Batch configuration files. Detailed explanations of the variables are described in 'LSF Base Configuration Reference'.
Example lsf.shared file:

Begin Cluster
ClusterName            # This line is keyword(s)
test_cluster
End Cluster

Begin HostType
TYPENAME               # This line is keyword(s)
hppa
SUNSOL
rs6000
alpha
NTX86
End HostType

Begin HostModel
MODELNAME  CPUFACTOR   # This line is keyword(s)
HP735      4.0
DEC3000    5.0
ORIGIN2K   8.0
PENTI120   3.0
End HostModel

Begin Resource
RESOURCENAME  DESCRIPTION          # This line is keyword(s)
hpux          (HP-UX operating system)
decunix       (Digital Unix)
solaris       (Sun Solaris operating system)
NT            (Windows NT operating system)
fserver       (File Server)
cserver       (Compute Server)
End Resource
Example lsf.cluster.test_cluster file:
Begin ClusterManager
Manager = lsf user7
End ClusterManager

Begin Host
HostName  Model     Type   server  swp  Resources
hostA     HP735     hppa   1       2    (fserver hpux)
hostD     ORIGIN2K  sgi    1       2    (cserver)
hostB     PENT200   NTX86  1       2    (NT)
End Host
In the above file, the ClusterManager section uses the horizontal format, while the Host section uses the vertical format.
Other LSF Batch configuration files are described in 'Example LSF Batch Configuration Files'.
This section gives procedures for some common changes to the LIM configuration. You can change the LIM configuration in more than one way, including editing the configuration files with a text editor and using the xlsadmin graphical tool.
The following discussions focus on changing configuration files using an editor so that you can understand the concepts behind the configuration changes. See 'Managing LSF Cluster Using xlsadmin' for the use of xlsadmin in changing configuration files.
Note:
If you run LSF Batch, you must restart mbatchd with the 'badmin reconfig' command each time you change the LIM configuration, even if the LSF Batch configuration files have not changed. This is necessary because the LSF Batch configuration depends on the LIM configuration.
CAUTION!
LSF daemons must be started as root. If you are creating a private cluster, do not attempt to use lsf_daemons to start your daemons. Start them manually.
To remove hosts from your cluster, first shut down the LSF daemons on them; for example:

% lsadmin resshutdown host1 host2 ...

where host1, host2, ... are the hosts you want to remove from your cluster.
Your cluster is most likely heterogeneous. Even if your computers are all of the same type, the cluster may still be heterogeneous. For example, some machines are configured as file servers while others are compute servers; some have more memory, others less; some have four CPUs, others only one; some have host-locked software licenses installed, others do not.
LSF provides powerful resource selection mechanisms so that the correct hosts with the required resources are chosen to run your jobs. For maximum flexibility, you should characterize your resources clearly enough that users have enough choices. For example, if some of your machines are connected to both Ethernet and FDDI while others are connected only to Ethernet, you probably want to define a resource called fddi and associate it with the machines connected to FDDI. Users can then specify the resource fddi if they want their jobs to run on machines connected to FDDI.
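A sketch of such a configuration follows; the host attributes shown are illustrative. Define the Boolean resource in lsf.shared, then attach it to the FDDI-connected hosts in lsf.cluster.cluster:

# In lsf.shared:
Begin Resource
RESOURCENAME  DESCRIPTION
fddi          (Host connected to FDDI ring)
End Resource

# In lsf.cluster.cluster:
Begin Host
HostName  Model  Type  server  swp  Resources
hostA     HP735  hppa  1       2    (fserver hpux fddi)
End Host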
To customize host resources for your cluster, follow the procedures below.
Some hosts are dedicated to running a particular application or class of applications. For example, a software group might have a compute server with very fast local disk and a special compiler license. This compute server is intended to run compilation jobs only.
You can dedicate a host to a resource to prevent unwanted jobs from being sent to that host. Dedicated resources are defined just like Boolean resources. To dedicate a host to a resource, precede the resource name with an exclamation mark '!' in the RESOURCES column of the lsf.cluster.cluster file. If a host is dedicated to a resource, LIM selects that host only if the application requires the dedicated resource. For example, add an f77 resource to the cluster, and dedicate a host to that resource.
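A sketch of the corresponding configuration follows; the host attributes are illustrative and match the lshosts output below:

# In lsf.shared:
Begin Resource
RESOURCENAME  DESCRIPTION
f77           (FORTRAN 77 compiler available)
End Resource

# In lsf.cluster.cluster; '!' dedicates hostC to f77:
Begin Host
HostName  Model   Type  server  swp  Resources
hostC     SunIPC  SUN4  1       2    (sparc !f77)
hostE     SunIPC  SUN4  1       2    (sparc f77)
End Host

After reconfiguration, lshosts shows the dedicated resource: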
% lshosts
HOST_NAME  type  model   cpuf  ncpus  maxmem  maxswp  server  RESOURCES
hostC      SUN4  SunIPC  2.7   1      24M     48M     Yes     (sparc !f77)
hostE      SUN4  SunIPC  2.7   2      96M     170M    Yes     (sparc f77)
Now when you run an ordinary load shared command, only hostE is eligible. If you ask for the f77 resource in the resource requirement, both hosts are eligible.
Note
Dedicated resources are not a secure way of controlling access to a host. Users can specify the dedicated resource on the command line, and they can also force remote execution on a particular host by explicitly specifying the host name. Use LSF Batch policy to enforce access control.
After changing the LIM configuration files, you must tell LIM to read the new configuration. Use the lsadmin command to do this.
Operations can be specified on the command line or entered at a prompt. Run the lsadmin command with no arguments, and type help to see the available operations.
The lsadmin reconfig command checks the LIM configuration files for errors. If no errors are found, the command confirms that you want to restart the LIMs on all hosts, and reconfigures all the LIM daemons:
% lsadmin reconfig
Checking configuration files ...
No errors found.
Do you really want to restart LIMs on all hosts? [y/n] y
Restart LIM on <hostD> ...... done
Restart LIM on <hostA> ...... done
Restart LIM on <hostC> ...... done
In the above example no errors are found. If any non-fatal errors are found, the command asks you to confirm the reconfiguration. If fatal errors are found, the reconfiguration is aborted.
If you want to see details on any errors, run the command lsadmin ckconfig -v. This reports all errors to your terminal.
If you change the LIM configuration files, you should also reconfigure LSF Batch by running badmin reconfig, because LSF Batch depends on the LIM configuration. If you change only the LSF Batch configuration, you need only run badmin reconfig.
The LIM can be extended to support additional load information through external load indices. Examples of external load indices are network traffic, free disk space on scratch file systems, and available software licenses.
LSF supports over 100 external load indices, which can be defined for each cluster. The external load indices are calculated by a separate program, called the External LIM or ELIM. By providing your own ELIM you can configure your cluster to keep track of any dynamic information you choose.
The ELIM can be any executable program, either an interpreted script or compiled code. Example code for an ELIM is included in the misc directory in the LSF distribution. The elim.c file is an ELIM written in C. You can customize this example to collect the load indices you want.
The ELIM communicates with the LIM by periodically writing a load update string to its standard output. The load update string contains the number of indices followed by a list of name-value pairs in the following format:
N name1 value1 name2 value2 ... nameN valueN
3 tmp2 47.5 nio 344.0 licenses 5
This string reports 3 indices: tmp2, nio, and licenses, with values 47.5, 344.0, and 5 respectively. Index names must be defined in the NewIndex section of the lsf.shared file (see 'Configuring External Load Indices'). Index values must be numbers between -INFINIT_LOAD and INFINIT_LOAD as defined in the lsf.h header file.
If the ELIM is implemented as a C program, as part of initialization it should use setbuf(3) to establish unbuffered output to stdout.
The ELIM should ensure that the entire load update string is written successfully to stdout. This can be done by checking the return value of printf(3s) if the ELIM is implemented as a C program or the return code of /bin/echo(1) from a shell script. The ELIM should exit if it fails to write the load information.
Each LIM sends updated load information to the master every 15 seconds. Depending on how quickly your external load indices change, the ELIM should write the load update string at most once every 15 seconds. If the external load indices rarely change, the ELIM can write the new values only when a change is detected. The LIM continues to use the old values until new values are received.
The executable for the ELIM must be in LSF_SERVERDIR and must have the name 'elim'. If any external load indices are defined in the NewIndex section of the lsf.shared file, the LIM invokes the ELIM automatically on startup. The ELIM runs with the same user id and file access permission as the LIM.
The LIM restarts the ELIM if it exits; to prevent problems in case of a fatal error in the ELIM, it is restarted at most once every 90 seconds. When the LIM terminates, it sends a SIGTERM signal to the ELIM. The ELIM must exit upon receiving this signal.
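For illustration, here is a minimal ELIM sketch written as a Bourne shell script. It reports a single index, tmp2 (free space in /usr/tmp in megabytes, as in the example above). The df output field used here is an assumption and varies between UNIX versions, so adjust it for your system; a shell script also exits on SIGTERM by default, satisfying the termination requirement above.

#!/bin/sh
# Minimal ELIM sketch: report one external index every 15 seconds.
# tmp2 = free space in /usr/tmp in MB (the df field position is an assumption).
while :; do
    tmp2=`df -k /usr/tmp | awk 'NR==2 { printf "%.1f", $4/1024 }'`
    # Load update string: <number of indices> name1 value1 ...
    echo "1 tmp2 $tmp2" || exit 1   # exit if the write fails
    sleep 15
done

As noted above, the tmp2 name must be defined in the NewIndex section of the lsf.shared file.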
Configure external load indices in the NewIndex section of the lsf.shared file. For each index, provide a name, an update interval, the direction of change with increasing load, and a description. For the ELIM above, the NewIndex section might be as follows.
Begin NewIndex
NAME      INTERVAL  INCREASING  DESCRIPTION
tmp2      30        N           (Disk space in /usr/tmp in MB)
nio       15        Y           (Network I/O in KB/second)
licenses  60        N           (Number of licenses available)
End NewIndex
Note
The name of the load index must not be one of the resource name aliases cpu, idle, logins, or swap.
The update interval is for user information only. The actual update interval is controlled by your ELIM.
For a complete explanation of the meaning of the keywords, see 'External Load Indices'.
By default, the LIM polls the ELIM every five seconds for information. If the interval for the most frequently updated external index is less than five seconds, then the ELIM_POLL_INTERVAL parameter in the Parameters section of the lsf.cluster.cluster file can be used to specify a shorter sampling interval. Additionally it is necessary to set the EXINTERVAL (minimum exchange interval) parameter in the lsf.cluster.cluster file to the same value as ELIM_POLL_INTERVAL to ensure that the load index values collected from the ELIM are sent to the master LIM promptly.
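For example, if the ELIM updates its most frequent index every two seconds, the Parameters section of lsf.cluster.cluster might contain the following (the value shown is illustrative):

Begin Parameters
ELIM_POLL_INTERVAL=2
EXINTERVAL=2
End Parameters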
After configuring the external load indices in lsf.shared, run lsadmin reconfig to check the configuration and restart the LIM on all hosts.
The ELIM can also return values for the built-in load indices. In this case the value produced by the ELIM overrides the value produced by the LIM. The ELIM must ensure that the semantics of any index it supplies is the same as that of the corresponding index returned by the lsinfo(1) command.
For example, some sites prefer to use /usr/tmp for temporary files. To override the tmp load index, write a program that periodically measures the space in the /usr/tmp file system, and writes the value to standard output. Name this program elim and put it in the LSF_SERVERDIR directory.
Note
The name of an external load index must not be one of the resource name aliases cpu, idle, logins, or swap. To override one of these indices, use its formal name: r1m, it, ls, or swp.
You must configure the external load index even if you are overriding a built-in load index.
LIM provides critical services to all LSF components. In addition to the timely collection of resource information, LIM provides host selection and job placement policies. If you are using the LSF MultiCluster product, LIM policies also determine how different clusters exchange load and resource information.
LIM policies are advisory information for applications. Applications can either use the placement decision from the LIM, or make further decisions based on information from the LIM.
Most of the LSF interactive tools, such as lsrun, lsmake, and lstcsh, use LIM policies to place jobs on the network. LSF Batch and LSF JobScheduler use load and resource information from LIM and make their own placement decisions based on other factors in addition to load information.
As described in 'Overview of LSF Configuration Files', the LIM configuration files define load-sharing policies. The LIM configuration parameters that affect LIM policies include the load thresholds and the run windows, described below.
If a particular load index is not specified, then LIM assumes that there is no threshold for that load index. Define looser values for load thresholds if you want to aggressively run jobs on a host. See 'Threshold Fields' for details about load thresholds.
If you do not want LIM to place jobs on certain hosts during certain hours, you can define run windows for those hosts in the lsf.cluster.cluster file. Run windows in lsf.cluster.cluster cause hosts to become locked outside the time windows, so that LIM will not advise jobs to go to those hosts. Details of this parameter are described in 'Hosts'.
Note
LIM thresholds and run windows affect the job placement advice of the LIM. Job placement advice is not enforced by LIM; LSF Batch, for example, does not follow the policies of the LIM.
There are two main goals in adjusting the LIM configuration parameters: improving response time, and reducing interference with interactive use. To improve response time, LSF should be tuned to correctly select the best available host for each job. To reduce interference, LSF should be tuned to avoid overloading any host.
CPU factors are used to differentiate the relative speed of different machines. LSF runs jobs on the best possible machines so that the response time is minimized. To achieve this, it is important that you define correct CPU factors for each machine model in your cluster by changing the HostModel section of your lsf.shared file.
CPU factors should be set based on a benchmark that reflects your work load. (If there is no such benchmark, CPU factors can be set based on raw CPU power.) The CPU factor of the slowest hosts should be set to one, and faster hosts should be proportional to the slowest. For example, consider a cluster with two hosts, hostA and hostB, where hostA takes 30 seconds to run your favourite benchmark and hostB takes 15 seconds to run the same test. hostA should have a CPU factor of 1, and hostB (since it is twice as fast) should have a CPU factor of 2.
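For the benchmark example above, the HostModel section of lsf.shared would contain entries such as the following (the model names are illustrative):

Begin HostModel
MODELNAME  CPUFACTOR
ModelA     1.0
ModelB     2.0
End HostModel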
LSF uses a normalized CPU performance rating to decide which host has the most available CPU power. The normalized ratings can be seen by running the lsload -N command. The hosts in your cluster are displayed in order from best to worst. Normalized CPU run queue length values are based on an estimate of the time it would take each host to run one additional unit of work, given that an unloaded host with CPU factor 1 runs one unit of work in one unit of time.
Incorrect CPU factors can reduce performance in two ways. If the CPU factor for a host is too low, that host may not be selected for job placement when a slower host is available, so jobs do not always run on the fastest available host. If the CPU factor is too high, jobs are run on the fast host even when they would finish sooner on a slower but lightly loaded host. This causes the faster host to be overused while the slower hosts are underused.
Both of these conditions are somewhat self-correcting. If the CPU factor for a host is too high, jobs are sent to that host until the CPU load threshold is reached. The LIM then marks that host as busy, and no further jobs will be sent there. If the CPU factor is too low, jobs may be sent to slower hosts. This increases the load on the slower hosts, making LSF more likely to schedule future jobs on the faster host.
The Host section of the lsf.cluster.cluster file can contain busy thresholds for load indices. You do not need to specify a threshold for every index; indices that are not listed do not affect the scheduling decision. These thresholds are a major factor influencing LSF performance. This section does not describe all LSF load indices; see 'Resource Requirements' and 'Threshold Fields' for more complete discussions.
The parameters that most often affect performance are the paging rate (pg) and CPU run queue length (r15s, r1m, and r15m) thresholds.
To tune these parameters, compare the output of lsload with the thresholds reported by lshosts -l.
The lsload and lsmon commands display an asterisk '*' next to each load index that exceeds its threshold. For example, consider the following output from lshosts -l and lsload:
% lshosts -l
HOST_NAME: hostD
...
LOAD_THRESHOLDS:
 r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
 -     3.5  -     -   15  -   -   -   -    2M   1M

HOST_NAME: hostA
...
LOAD_THRESHOLDS:
 r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
 -     3.5  -     -   15  -   -   -   -    2M   1M
% lsload
HOST_NAME  status  r15s  r1m  r15m  ut   pg     ls  it  tmp  swp  mem
hostD      ok      0.0   0.0  0.0   0%   0.0    6   0   30M  32M  10M
hostA      busy    1.9   2.1  1.9   47%  *69.6  21  0   38M  96M  60M
In this example, hostD is ok. However, hostA is busy; the pg (paging rate) index is 69.6, above the threshold of 15.
Other monitoring tools such as xlsmon also help to show the effects of changes.
If the LIM often reports a host to be busy when the CPU run queue length is low, the most likely cause is the paging rate threshold. Different versions of UNIX assign subtly different meanings to the paging rate statistic, so the threshold needs to be set at different levels for different host types. In particular, HP-UX systems need to be configured with significantly higher pg values; try starting at a value of 50 rather than the default 15.
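For example, to start an HP-UX host at a pg busy threshold of 50, add a pg column to the Host section of the lsf.cluster.cluster file (the other host attributes shown are illustrative):

Begin Host
HostName  Model  Type  server  pg  Resources
hostH     HP735  hppa  1       50  (hpux)
End Host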
If the LIM often shows systems busy when the CPU utilization and run queue lengths are relatively low and the system is responding quickly, try raising the pg threshold. There is a point of diminishing returns; as the paging rate rises, eventually the system spends too much time waiting for pages and the CPU utilization decreases. Paging rate is the factor that most directly affects perceived interactive response. If a system is paging heavily, it feels very slow.
The CPU run queue threshold can be reduced if you find that interactive jobs slow down your response too much while the LIM still reports your host as ok. Likewise, it can be increased if hosts become busy at too low a load.
On multiprocessor systems the CPU run queue threshold is compared to the effective run queue length as displayed by the lsload -E command. The run queue threshold should be configured as the load limit for a single processor. Sites with a variety of uniprocessor and multiprocessor machines can use a standard value for r15s, r1m, and r15m in the configuration files, and the multi-processor machines will automatically run more jobs. Note that the normalized run queue length printed by lsload -N is scaled by the number of processors. See the 'Resources' chapter of the LSF User's Guide and lsfintro(1) for the concept of effective and normalized run queue lengths.
Because LSF takes a wide variety of measurements on the hosts in your network, it can be a powerful tool for monitoring and capacity planning. The lsmon command gives updated information that can quickly identify problems such as inaccessible hosts or unusual load levels. The lsmon -L option logs the load information to a file for later processing. See the lsmon(1) and lim.acct(5) manual pages for more information.
For example, if the paging rate (pg) on a host is always high, adding memory to the system will give a significant increase in both interactive performance and total throughput. If the pg index is low but the CPU utilization (ut) is usually more than 90 percent, the CPU is the limiting resource. Getting a faster host, or adding another host to the network, would provide the best performance improvement. The external load indices can be used to track other limited resources such as user disk space, network traffic, or software licenses.
The xlsmon program is a Motif graphic interface to the LSF load information. The xlsmon display uses colour to highlight busy and unavailable hosts, and can show both the current levels and scrolling histories of selected load indices.
See the 'Cluster Information' chapter of the LSF User's Guide for more information about xlsmon.
LSF software is licensed using the FLEXlm license manager from Globetrotter Software, Inc. The LSF license key controls the hosts allowed to run LSF. The procedures for obtaining, installing and upgrading license keys are described in 'Getting License Key Information' and 'Setting Up the License Key'. This section provides background information on FLEXlm.
FLEXlm controls the total number of hosts configured in all your LSF clusters. You can organize your hosts into clusters however you choose. Each server host requires at least one license; multiprocessor hosts require more than one, as a function of the number of processors. Each client host requires 1/5 of a license.
LSF uses two kinds of FLEXlm license: time-limited DEMO licenses and permanent licenses.
The DEMO license allows you to try LSF out on an unlimited number of hosts on any supported host type. The trial period has a fixed expiry date, and the LSF software will not function after that date. DEMO licenses do not require any additional daemons.
Permanent licenses are the most common. A permanent license limits only the total number of hosts that can run the LSF software, and normally has no time limit. You can choose which hosts in your network will run LSF, and how they are arranged into clusters. Permanent licenses are counted by a license daemon running on one host on your network.
For permanent licenses, you need to choose a license server host and send hardware host identification numbers for the license server host to your software vendor. The vendor uses this information to create a permanent license that is keyed to the license server host. Some host types have a built-in hardware host ID; on others, the hardware address of the primary LAN interface is used.
FLEXlm is used by many UNIX software packages because it provides a simple and flexible method for controlling access to licensed software. A single FLEXlm license server can handle licenses for many software packages, even if those packages come from different vendors. This reduces the systems administration load, since you do not need to install a new license manager every time you get a new package.
FLEXlm uses a daemon called lmgrd to manage permanent licenses. This daemon runs on one host on your network, and handles license requests from all applications. Each license key is associated with a particular software vendor. lmgrd automatically starts a vendor daemon; the LSF version is called lsf_ld and is provided by Platform Computing Corporation. The vendor daemon keeps track of all licenses supported by that vendor. DEMO licenses do not require you to run license daemons.
The license server daemons should be run on a reliable host, since licensed software will not run if it cannot contact the server. The FLEXlm daemons create very little load, so they are usually run on the file server. If you are concerned about availability, you can run lmgrd on a set of three or five hosts. As long as a majority of the license server hosts are available, applications can obtain licenses.
Software licenses are stored in a text file. The default location for this file is /usr/local/flexlm/licenses/license.dat, but this can be overridden. The license file must be readable on every host that runs licensed software. It is most convenient to place the license file in a shared NFS directory.
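For example, FLEXlm applications generally locate a license file outside the default path through the LM_LICENSE_FILE environment variable; in a csh-style shell (the path shown is illustrative):

% setenv LM_LICENSE_FILE /usr/share/lsf/conf/license.dat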
The license.dat file normally contains SERVER lines identifying the license server hosts, a DAEMON line giving the location of the vendor daemon, and FEATURE lines listing the licensed products.
The FEATURE line contains an encrypted code to prevent tampering. For permanent licenses, the licenses granted by the FEATURE line can be accessed only through license servers listed on the SERVER lines.
For DEMO licenses no FLEXlm daemons are needed, so the license file contains only the FEATURE line.
Here is an example of a DEMO license file. This file contains one line for each separate component (see 'Modifying LSF Components and Licensing'). However, no SERVER or DAEMON information is needed. The license is for LSF version 3.0 and is valid until Jan. 10, 1997.
FEATURE lsf_base lsf_ld 3.000 10-Jan-1997 0 5C51F231E238555BAD7F "Platform" DEMO
FEATURE lsf_batch lsf_ld 3.000 10-Jan-1997 0 6CC1D2C137651068E23C "Platform" DEMO
FEATURE lsf_mc lsf_ld 3.000 10-Jan-1997 0 2CC1F2E132C85B8D1806 "Platform" DEMO
The following is an example of a permanent license file. The license server is configured to run on hostD, using TCP port 1700. This allows 10 hosts to run LSF, with no expiry date.
SERVER hostD 08000962cc47 1700
DAEMON lsf_ld /usr/local/lsf/etc/lsf_ld
FEATURE lsf_base lsf_ld 3.000 01-Jan-0000 0 51F2315CE238555BAD7F "Platform"
FEATURE lsf_batch lsf_ld 3.000 01-Jan-0000 0 C1D2C1376C651068E23C "Platform"
FEATURE lsf_mc lsf_ld 3.000 01-Jan-0000 0 C1F2E1322CC85B8D1806 "Platform"
FLEXlm provides several utility programs for managing software licenses. These utilities and their manual pages are included in the LSF software distribution.
Because these utilities can be used to shut down the FLEXlm license server, and thus prevent licensed software from running, they are installed in the LSF_SERVERDIR directory. The file permissions are set so that only root and members of group 0 can use them.
For complete details on these commands, see the on-line manual pages.
FLEXlm only accepts one license key for each feature listed in a license key file. If there is more than one FEATURE line for the same feature, only the first FEATURE line is used. To add hosts to your LSF cluster, you must replace the old FEATURE line with a new one listing the new total number of licenses.
The procedure for updating a license key file to include new license keys is described in 'Adding a Permanent License'.
The fourth field on the SERVER line specifies the TCP port number that the FLEXlm server uses. Choose an unused port number. LSF usually uses port numbers in the range 3879 to 3882, so the numbers from 3883 on are good choices. If the lmgrd daemon complains that the license server port is in use, you can choose another port number and restart lmgrd.
For example, if your license file contains the line:
SERVER hostname host-id 1700
and you want your FLEXlm server to use TCP port 3883, change the SERVER line to:
SERVER hostname host-id 3883
LSF Suite V3.0 includes the following components: LSF Base, LSF Batch, LSF JobScheduler (formerly LSF-PJS), and LSF MultiCluster.
The configuration changes needed to enable a particular component in a cluster are handled during installation by lsfsetup. If at some later time you want to modify the components of your cluster, edit the FEATURES line in the Parameters section of the lsf.cluster.cluster file. You can specify any combination of the strings 'lsf_base', 'lsf_batch', 'lsf_js', and 'lsf_mc' to enable the operation of LSF Base, LSF Batch, LSF JobScheduler, and LSF MultiCluster, respectively. If any of 'lsf_batch', 'lsf_js', or 'lsf_mc' is specified, then 'lsf_base' is automatically enabled as well.
If the lsf.cluster.cluster file is shared, adding a component name to the FEATURES line enables that component for all hosts in the cluster. For example, enable the operation of LSF Base, LSF Batch and LSF MultiCluster:
Begin Parameters
FEATURES=lsf_batch lsf_mc
End Parameters
Enable the operation of LSF Base only:
Begin Parameters
FEATURES=lsf_base
End Parameters
Enable the operation of LSF JobScheduler:
Begin Parameters
FEATURES=lsf_js
End Parameters
It is possible to indicate that only certain hosts within a cluster run LSF Batch or LSF JobScheduler. This is done by specifying 'lsf_batch' or 'lsf_js' in the RESOURCES field of the Host section of the lsf.cluster.cluster file. For example, the following configuration enables hosts hostA, hostB, and hostC to run LSF JobScheduler, and hosts hostD, hostE, and hostF to run LSF Batch.
Begin Parameters
FEATURES=lsf_batch
End Parameters

Begin Host
HOSTNAME  type    model     server  RESOURCES
hostA     SUN41   SPARCSLC  1       (sparc bsd lsf_js)
hostB     HPPA9   HP735     1       (linux lsf_js)
hostC     SGI     SGIINDIG  1       (irix cs lsf_js)
hostD     SUNSOL  SunSparc  1       (solaris)
hostE     HP_UX   A900      1       (hpux cs bigmem)
hostF     ALPHA   DEC5000   1       (alpha)
End Host
The license file used to serve the cluster must contain the corresponding features. A host shows as unlicensed if the license for the component it is configured to run is unavailable. For example, if a cluster is configured to run LSF JobScheduler on all hosts and the license file does not contain the LSF JobScheduler feature, then the hosts will be unlicensed, even if there are licenses for LSF Base or LSF Batch.
Copyright © 1994-1997 Platform Computing Corporation.
All rights reserved.