http://www.scalemp.com/industries/lifescience/computational-chemistry/ lists Schrodinger Jaguar, DOCK, Glide, Amber, Gaussian, OpenEye Fred, Omega, HMMER, mpiBLAST (but not the GPU BLAST?). Touted for departments w/o dedicated IT.
MPI API Implementation Details
MPICH versions
- MPICH 1.x - Original implementation by Argonne National Lab and Mississippi State Univ. Schrodinger is compiled with MPICH 1.2.
- MPICH2 - Successor to MPICH 1.x, also from Argonne; implements MPI-2. Written mostly in C.
- MVAPICH - MPI-1 implementation over InfiniBand by Ohio State Univ, derived from MPICH. Doesn't work with Schrodinger Jaguar?
- OpenMPI - one of the newer implementations, supporting MPI-2. It absorbed the original LAM/MPI project.
It is quite popular now and widely supported. The InfiniBand OFED stack ships builds for it (and MVAPICH).
SGE touts OpenMPI as the best thing to use, as it is tightly integrated: SGE has full control of it and can start/stop it as needed. Likewise, OpenMPI automatically communicates with Torque/PBS Pro and retrieves host and np info, so inside a batch job mpirun can be called directly without a hostfile (see the sketch below).
http://www.open-mpi.org/faq/?category=tm
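A minimal sketch of what that integration looks like in a Torque job, assuming an OpenMPI build with TM support (script name and binary are placeholders):
#!/bin/bash
#PBS -l nodes=2:ppn=8
cd $PBS_O_WORKDIR
# with TM support, mpirun gets the host list and slot count from Torque;
# no -hostfile or -np needed (it defaults to all allocated slots)
mpirun ./my_mpi_app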
But many older programs compiled for MPICH won't work out of the box
with OpenMPI.
There are many other implementations, including commercial ones, MATLAB, Java, etc.
See:
Wikipedia: MPI Implementations
MPICH v1
(See config-backup/sw/mpi/mpich1.test.txt for more info).
Starting MPICH
Environment VARs:
MPI_HOME
MPI_USEP4SSPORT=yes
MPI_P4SSPORT=4644
/etc/hosts.equiv or .rhosts needs to be set up, even if using ssh!! Some system calls in MPICH need this for auth.
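A minimal /etc/hosts.equiv sketch (node names are placeholders; one trusted host per line):
node1
node2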
$MPI_HOME/share/machines.LINUX # host (+cpu) definition file
# node1:2 would indicate a 2-cpu machine, but that implies shared memory,
# which parallel Jaguar doesn't support. Instead, repeat the line per node,
# once per CPU, e.g.:
# node1
# node1
# node2
# node2
To start a shared daemon as root:
ssh node1 "serv_p4 -o -p 1235 -l /nfs/mpilogs/node1.log"
ssh node2 "serv_p4 -o -p 1235 -l /nfs/mpilogs/node2.log"
# an rc script on each node to start it up would be good,
# but a centralized script in the above form to start/kill would also be useful (see the sketch after this block).
# Alternatively, the Schrodinger mpich utility can start serv_p4 correctly
# (without the problem of chp4_servs, which results in non-sharable daemons).
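A minimal sketch of such a centralized start/kill script (node names, port, and log path follow the example above; the pkill pattern is an assumption):
#!/bin/sh
# serv_p4_ctl.sh {start|stop} -- start or kill the shared serv_p4 daemon on all nodes
NODES="node1 node2"
PORT=1235
case "$1" in
  start)
    for n in $NODES; do
      ssh "$n" "serv_p4 -o -p $PORT -l /nfs/mpilogs/$n.log"
    done ;;
  stop)
    for n in $NODES; do
      ssh "$n" "pkill -f 'serv_p4 -o -p $PORT'"   # assumes pkill is available on the nodes
    done ;;
  *)
    echo "usage: $0 {start|stop}"; exit 1 ;;
esac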
For a per-user process, can start/monitor MPICH as:
tools (scripts) in $MPI_HOME/sbin/
chp4_servs -port=4644 # script to start serv_p4 on all nodes, DOESN'T obey MPI_P4SSPORT (def to 1234)
# at some point in the past also used port 1235
chp4_servs -hosts=filename # use filename to get list of hosts to start serv_p4 (def to machines.LINUX)
chp4_servs -hunt # list all serv_p4 process on all mpi nodes (on all ports)
chkserv -port 4644 # see which nodes don't have the mpi daemon running
# DOESN'T obey MPI_P4SSPORT (def to 1234)
# no output = all good.
# (parallel jaguar will trigger it to start anyway)
NOTE: Schrodinger also has an mpich utility to monitor MPICH status.
Testing MPICH
$MPI_HOME/sbin/tstmachines -v
# see if daemons are fine.
cat $HOME/.server_apps
exact path to each binary, should be populated automatically.
cd $MPI_HOME/examples
mpirun -np 16 cpi
# run pi calculation test on 16 procs.
# doesn't really start a serv_p4 process, so it can't be used to test sharing daemons b/w users.
Per-User Environment
(There is no need for this unless the shared root daemon process doesn't work.)
MPICH allows a per-user instance of MPICH daemon rings instead of depending on a shared daemon run by root. This has been tested to work with Parallel Jaguar. To use this, add an environment variable defining the port you want to use for your own MPI daemon ring. Your 4-digit phone extension would be a good number to use. It may be best to add it to your $HOME/.cshrc, like this:
setenv MPI_P4SSPORT 4644 #change number to a unique port for yourself
After this, parallel Jaguar (or mpirun) jobs should work. If there are problems, check that you have sourced /protos/package/skels/local.cshrc.linux.apps and that these variables are defined:
setenv MPI_HOME /protos/package/linux/mpich
setenv PATH "${PATH}:${MPI_HOME}/bin"
setenv MPI_USEP4SSPORT yes
setenv P4_RSHCOMMAND ssh
setenv SCHRODINGER_MPI_START yes # Parallel Jaguar to start its own MPICH serv_p4 on user defined port
After this setup, a parallel Jaguar job can be run like:
$SCHRODINGER/jaguar run -HOST "vic1 vic2 vic3 vic4" -PROCS 4 piperidine
OpenMPI
mpirun hostname
mpirun -H n0301,n0302 hostname
mpirun --hostfile myHostfile hostname
Example mpi host file
n0000
n0001
n0002
n0003
Example mpi host file with cpu info
n0000 slots=16
n0001 slots=16
n0002 slots=16
n0003 slots=16
If FQDNs are needed, e.g. when there are multiple "domains"/clusters, use this setting:
export OMPI_MCA_orte_keep_fqdn_hostnames=t
some help options I poked at
mpirun --help bind-to
--bind-to {arg0} Policy for binding processes. Allowed values: none,
hwthread, core, l1cache, l2cache, l3cache, socket,
numa, board, cpu-list ("none" is the default when
oversubscribed, "core" is the default when np<=2,
and "socket" is the default when np>2). Allowed
qualifiers: overload-allowed, if-supported,
ordered
AMD Epyc CPUs seem to need --bind-to none, because of the CCX structure?
https://developer.amd.com/spack/hpl-benchmark/
-d|-debug-devel|--debug-devel
Enable debugging of OpenRTE
-v|--verbose Be verbose
-V|--version Print version and exit
Example mpirun commands
# example mpi run with docker container (nvidia gpu test)
CONT='nvcr.io/nvidia/hpc-benchmarks:21.4-hpl' # linpack
CONT='registry.local:443/hpc-benchmarks:21.4-hpl' # linpack
docker run -v /global/home/users/tin:/mnt --gpus all registry:443/tin/hpc-benchmarks:21.4-hpl \
mpirun --mca btl smcuda,self -x UCX_TLS=sm,cuda,cuda_copy,cuda_ipc --bind-to none -np 2 \
hpl.sh --xhpl-ai --cpu-affinity "0-7:120-127" --gpu-affinity "0:1" --cpu-cores-per-rank 8 --dat /mnt/hpl_dat/abhi.hpl-a40.dat-2.norepeat
# ran on c03 ==> run but poor perf.
# adapted from https://www.pugetsystems.com/labs/hpc/Outstanding-Performance-of-NVIDIA-A100-PCIe-on-HPL-HPL-AI-HPCG-Benchmarks-2149/
docker run -v /global/home/users/tin:/mnt --gpus all \
registry.local:443/tin/hpc-benchmarks:21.4-hpl \
mpirun --mca btl smcuda,self -x MELLANOX_VISIBLE_DEVICES="none" -x UCX_TLS=sm,cuda,cuda_copy,cuda_ipc \
--bind-to none --report-bindings --debug-devel \
-np 2 hpl.sh --xhpl-ai --dat /mnt/hpl_dat/abhi.hpl-a40.dat.norepeat.Ns20k --cpu-affinity 0:127 --cpu-cores-per-rank 4 --gpu-affinity 0:1
docker run -v /global/home/users/tin:/mnt --gpus all registry:443/tin/hpc-benchmarks:21.4-hpl mpirun --mca btl smcuda,self -x UCX_TLS=sm,cuda,cuda_copy,cuda_ipc -np 1 /bin/bash
docker run -v /global/home/users/tin:/mnt --gpus all registry:443/tin/hpc-benchmarks:21.4-hpl mpirun --mca btl smcuda,self -x UCX_TLS=all -np 1 ucx_info -f | grep DEVICES
mpirun arguments
--mca # specify a Modular Component Architecture parameter
--mca btl # byte transfer layer (point-to-point network stack)
--mca btl smcuda,self -x UCX_TLS=sm,cuda,cuda_copy,cuda_ipc
# UCX = Unified Communication X [high-bandwidth, low-latency network offload with RDMA for IB and RoCE, GPU, shared memory]
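Any --mca parameter can equivalently be set as an environment variable with the OMPI_MCA_ prefix, e.g. (binary name is a placeholder):
export OMPI_MCA_btl=smcuda,self
export UCX_TLS=sm,cuda,cuda_copy,cuda_ipc
mpirun -np 2 ./my_app   # same effect as --mca btl smcuda,self -x UCX_TLS=...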
MPI stacked with OpenMP -- needed on AMD Epyc with CCX.
Dell method to test JGI node... binary compilation info ??
/set_AMD_params.sh
su - hpl
mpirun -np 16 -x OMP_NUM_THREADS=4 --map-by ppr:1:l3cache:pe=4 --bind-to core -report-bindings -display-map numactl -l ~/amd-hpl/bin/xhpl.4
MPI stacked with OpenMP -- needed on AMD Epyc with CCX.
See
https://developer.amd.com/spack/hpl-benchmark/
and
https://www.dell.com/support/kbdoc/en-in/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study
sudo modprobe -v knem
export OMP_NUM_THREADS=4
# threads * PxQ of HPL.dat = total num of core
# their eg was for 2 socket system
# seems like should have 2 threads per socket
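# worked example of that arithmetic (using the n02 numbers from the appfile notes below):
#   n02: 2 sockets x 16 cores = 32 cores; OMP_NUM_THREADS=4
#   => 32 / 4 = 8 MPI ranks => HPL.dat PxQ product = 8, e.g. 2x4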
mpi_options="--mca mpi_leave_pinned 1 --bind-to none --report-bindings --mca btl self,vader --allow-run-as-root"
mpi_options="$mpi_options --map-by ppr:1:l3cache -x OMP_NUM_THREADS=4 -x OMP_PROC_BIND=TRUE -x OMP_PLACES=cores"
mpirun $mpi_options -app ./appfile_ccx
appfile...
# appfile...
# Bind memory to node $1 and four child threads to CPUs specified in $2
# [last digit] 4 cores per line, because OMP_NUM_THREADS=4.
# num of lines * (4 cores per line) = total cores in the system
# n02 is EPYC 7343 3.2GHz, 2x16c, so the appfile has 8 lines * 4 cores.
# HPL.dat should have PxQ = 8, e.g. 2x4
# having many entries in the appfile is ok... or I should say tolerable...
# so -np 4 ... 8
# change the run script not to membind
# some ranks would fail,
# but as long as enough ranks start to satisfy the PxQ in HPL.dat,
# then things will run.
# if you need to eke out perf, then really fine-tune.
# xref: CF_BK/greta/benchmark_epyc_ccx
-np 1 ./xhpl_ccx.sh 0 0-3 4
-np 1 ./xhpl_ccx.sh 0 4-7 4
-np 1 ./xhpl_ccx.sh 0 8-11 4
-np 1 ./xhpl_ccx.sh 0 12-15 4
-np 1 ./xhpl_ccx.sh 1 16-19 4
-np 1 ./xhpl_ccx.sh 1 20-23 4
-np 1 ./xhpl_ccx.sh 1 24-27 4
-np 1 ./xhpl_ccx.sh 1 28-31 4
Env vars related to xhpl with MPI/OpenMP; the MKL one is
"undocumented" and specific to binaries built for AMD with the Intel icc compiler...
Once these were set, mpirun without OpenMP threads was performant enough (for hardware acceptance test).
export MKL_DEBUG_CPU_TYPE=5 # this is important to address the CCX of epycs and allow xhpl to run on all cores. OMP isn't really required.
export HPL_HOST_ARCH=3 # to use AVX2 and get 16 DP OPS/cycle
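Putting those together, a minimal sketch of an acceptance-test run (rank count and binary path are placeholders; adjust -np to the node's core count):
export MKL_DEBUG_CPU_TYPE=5
export HPL_HOST_ARCH=3
mpirun -np 32 --bind-to none --report-bindings ./xhpl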
OSU IB microbenchmark on AMD Epyc
https://www.dell.com/support/kbdoc/en-in/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study
mpirun -np 16 --allow-run-as-root --host server1,server2 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc_x -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 -mca btl_openib_if_include mlx5_0:1 --report-bindings --bind-to core -mca rmaps_dist_device mlx5_0 numactl --cpunodebind=${numanode_number} osu-micro-benchmarks-5.4.3/mpi/pt2pt/osu_mbw_mr
osu microbenchmark, mpirun with explicit request to use ucx
# these work:
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -mca btl self,vader osu_latency
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -mca pml ucx ucx_info
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -x LD_LIBRARY_PATH -mca pml ucx osu_latency
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -mca pml ucx osu_latency
mpirun -np 2 -npernode 1 -hostfile node2 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 osu_latency
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 -mca btl_openib_if_include mlx5_0:1 --report-bindings --bind-to core osu_latency
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 -mca btl_openib_if_include mlx5_0:1 --report-bindings --bind-to core -mca rmaps_dist_device mlx5_0 osu_latency
mpirun -np 2 -npernode 1 -hostfile node2 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 -mca btl_openib_if_include mlx5_0:1 --report-bindings --bind-to core -mca rmaps_dist_device mlx5_0 osu_latency
mpirun -np 2 -npernode 1 -hostfile node2 -mca btl ^openib osu_latency # still good perf in lr7, but slow in savio4
mpirun -np 2 -npernode 1 -hostfile node2 -mca pml ob1 osu_latency # still good perf in lr7, but slow in savio4
# ^ is negation in mpirun
# ^tcp means DON'T use tcp (not sure what it is using then... openib didn't work)
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -mca btl ^tcp osu_latency
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -mca btl ^openib osu_latency # -mca btl openib DOESN'T seem to work
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -mca btl ^foofoo osu_latency # negating non-existent stuff gets no error, just no net change.
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -mca pml ^ucx osu_latency # no ucx then? still works... using what?
other -mca btl stuff
-mca btl self,sm,gm
/global/software/sl-7.x86_64/modules/gcc/11.3.0/openmpi/5.0.0-ucx/bin/ompi_info | grep pml
/global/software/sl-7.x86_64/modules/gcc/11.3.0/openmpi/5.0.0-ucx/bin/ompi_info --param pml ucx --level 9
ompi_info --param all all --level 9
# -x can be used to set environment variables
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -x OMPI_MCA_mpi_show_handle_leaks=1 osu_latency
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -x pml_ucx_verbose=1 osu_latency
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -x UCX_LOG= osu_latency
pml_ucx_verbose
pml_ucx_devices
pml_ucx_opal_mem_hooks
pml_ucx_tls
# Nope:
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 -mca btl_openib_if_include mlx5_0:1 --report-bindings --bind-to core -mca rmaps_dist_device mlx5_0 -x UCX_TLS=rc_x osu_latency
i.e., -x UCX_TLS=rc_x causes an issue
# this seems to correspond to the pml_ucx_tls param
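To inspect that param on the build in question (same ompi_info path as above):
/global/software/sl-7.x86_64/modules/gcc/11.3.0/openmpi/5.0.0-ucx/bin/ompi_info --param pml ucx --level 9 | grep pml_ucx_tls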
UCX tool
Since UCX needs to be compiled/installed on its own, it comes with some binary tools of its own:
ldd /global/software/sl-7.x86_64/modules/gcc/11.3.0/openmpi/5.0.0-ucx/ucx-1.14.0/bin/ucx_info
ucx_info -v
ucx_info -c
ucx_perftest ... needs a server and a client (see sketch below)
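A minimal two-node ucx_perftest sketch (hostnames are placeholders; start the server side first):
ucx_perftest -c 0                    # on n0000 (server): listens, bound to core 0
ucx_perftest n0000 -t tag_lat -c 1   # on n0001 (client): tag-matching latency test, bound to core 1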
ref:
https://openucx.readthedocs.io/en/master/running.html has docker capabilities info.
UCX
ucx_info -d | grep Device # -d show device + transport
ucx_info -f | grep DEVICE # -f full decorated output
ucx_info # show help by default
-b Show build configuration
-y Show type and structures information
-s Show system information
-c Show UCX configuration
-a Show also hidden configuration
-f Display fully decorated output
MPI needs to be compiled with --with-ucx,
then add the option
--mca pml ucx
Framework: pml
Component: ucx
try:
export OMPI_MCA_orte_base_help_aggregate=0; mpirun --mca pml ucx -np 2 -npernode 1 -H 10.0.44.59,10.0.44.60 -x LD_LIBRARY_PATH osu_latency
export OMPI_MCA_orte_base_help_aggregate=0; mpirun --mca pml ucx -np 2 -npernode 1 --host node1,node2 -x LD_LIBRARY_PATH osu_latency
# node1,node2 are mapped to IP addresses over TCP, not IPoIB.
(Hmm... may need to add device info?)
mpirun -np 2 -host n0000.lr7,n0001.lr7 --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc_x --mca coll_fca_enable 0 --mca coll_hcoll_enable 0 --mca btl_openib_if_include mlx5_0:1 --report-bindings --bind-to core --mca rmaps_dist_device mlx5_0 -npernode 1 -x LD_LIBRARY_PATH osu_latency
mpirun -np 16 --allow-run-as-root --host server1,server2 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc_x -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 -mca btl_openib_if_include mlx5_0:1 --report-bindings --bind-to core -mca rmaps_dist_device mlx5_0 numactl --cpunodebind= osu-micro-benchmarks-5.4.3/mpi/pt2pt/osu_mbw_mr
ref:
OMPI v5.0.x Network Support
Intel MPI
impi_info # intel mpi info
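A minimal sketch of launching with Intel MPI's mpirun (hostnames and binary are placeholders):
export I_MPI_DEBUG=5                              # print pinning/fabric info at startup
mpirun -n 4 -ppn 2 -hosts node1,node2 ./osu_latency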
PVM
Ref:
http://www.csm.ornl.gov/pvm/
source pvm.env # get PVM_ROOT, etc
pvm # starts monitor, starting pvmd* daemon if needed.
$PVM_ROOT/lib/pvmd pvmhost.conf
# starts the PVM daemon on the hosts specified in the conf file, one host per line.
# may want to put it in background. ^C will end everything.
# it uses RSH (ssh if defined correctly) to log in to the remote hosts and start the process
# Need to ensure ssh login will source the env correctly for pvm/pvmd to run.
# Can be started by any user (what about more than one user??)
kill -SIGTERM {PVMID} can be used to kill the daemon.
If kill -9 (or another non-catchable signal) is used, be sure to clean up /tmp/pvmd.
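A minimal cleanup sketch for a per-user pvmd (the pvmd3 process name matches the lsof output below; the exact /tmp/pvmd.* file name is an assumption, check your PVM build):
kill -TERM $(pgrep -u $USER pvmd3)   # graceful shutdown of your own daemon
rm -f /tmp/pvmd.*                    # assumed leftover file name if it was killed with -9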
pvm> commands
ps
conf
halt
exit
To run an OpenEye omega/rocs job, the
$PVM_ROOT/bin/$PVM_ARCH dir must have access to the desired binary (e.g. a symlink to omega);
PATH from .login will not be sourced.
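For example, a symlink could be created like this (the omega install path is a placeholder):
ln -s /app/openeye/bin/omega $PVM_ROOT/bin/$PVM_ARCH/omega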
run command as:
omega -pvmconf omega.pvmconf -in carboxylic_acids_1--100.smi -out carboxylic_acids_1--100.oeb.gz -log omega_pvm.log
Each user that starts pvm will have their own independent instance of pvmd3.
pvm uses rsh/ssh to the remote hosts to start itself, so port numbers are likely not static.
It uses UDP for communication.
from lsof -i4 -n
process name / pid uid ...
pvmd3 27808 tinh 7u IPv4 17619158 UDP 10.220.3.20:33430
pvmd3 27808 tinh 8u IPv4 17619159 UDP 10.220.3.20:33431
tin 27808 1 0 14:25 pts/29 00:00:00 /app/pvm/pvm345/lib/LINUX/pvmd3
## omega.pvmconf
## host = required keyword
## hostname: sometimes may need to be the FQDN, depending on what the command "hostname" returns
## n = number of PVM instances to run
host phpc-cn01 1
host phpc-cn02 2
host phpc-cn03 2
##/home/common/Environments/pvm.env
# csh environment setup for PVM 3.4.5
# currently only available for LINUX64 (LSF cluster)
setenv PVM_ROOT /app/pvm/pvm345
source ${PVM_ROOT}/lib/cshrc.stub
# http://mail.hudat.com/~ken/help/unix/.cshrc
#alias ins2path 'if ("$path:q" !~ *"\!$"* ) set path=( \!$ $path )'
#alias add2path 'if ("$path:q" !~ *"\!$"* ) set path=( $path \!$ )'
##add2path ${PVM_ROOT}/bin
## : has special meaning in cshrc, so need to escape it for it to be taken verbatim
## there is no automatic shell conversion between $manpath and $MANPATH as there is for PATH
## csh is convoluted.
setenv MANPATH $MANPATH\:${PVM_ROOT}/man