http://www.scalemp.com/industries/lifescience/computational-chemistry/ lists Schrodinger Jaguar, DOCK, Glide, Amber, Gaussian, OpenEye Fred, Omega, HMMER, mpiBLAST (but not the GPU BLAST?). Touted for departments w/o dedicated IT.
MPI API Implementation Details
MPICH versions
- MPICH 1.x - Original implementation by Argonne National Lab and Mississippi State Univ. Schrodinger is compiled with MPICH 1.2.
- MPICH2 - Successor to MPICH 1.x, also from Argonne; implements MPI-2. Written mostly in C.
- MVAPICH - MPI-1 implementation over InfiniBand by Ohio State Univ, derived from MPICH. Doesn't work with Schrodinger Jaguar?
- OpenMPI - one of the newer implementations, supporting MPI-2. It absorbed the original LAM/MPI project.
It is quite popular now and widely supported. The InfiniBand OFED stack ships builds for it (and MVAPICH).
SGE touts OpenMPI as the best thing to use, as it is tightly integrated: SGE has full control of it and can start/stop it as needed. Likewise, OpenMPI automatically communicates with Torque/PBS Pro and retrieves host and np info, so inside a batch job mpirun can be called directly without a hostfile (see the sketch below).
http://www.open-mpi.org/faq/?category=tm
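A minimal sketch of what that integration looks like in a Torque job, assuming an OpenMPI build with TM support (script name and binary are placeholders):
#!/bin/bash
#PBS -l nodes=2:ppn=8
cd $PBS_O_WORKDIR
# with TM support, mpirun gets the host list and slot count from Torque;
# no -hostfile or -np needed (it defaults to all allocated slots)
mpirun ./my_mpi_app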
But many older programs compiled for MPICH won't work out of the box
with OpenMPI.
There are many other implementations, including commercial ones, MATLAB, Java, etc.
See:
Wikipedia: MPI Implementations
MPICH v1
(See config-backup/sw/mpi/mpich1.test.txt for more info).
Starting MPICH
Environment VARs:
MPI_HOME
MPI_USEP4SSPORT=yes
MPI_P4SSPORT=4644
/etc/hosts.equiv or .rhosts needs to be set up, even if using ssh!! Some system calls in MPICH need this for auth.
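A minimal /etc/hosts.equiv sketch (node names are placeholders; one trusted host per line):
node1
node2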
$MPI_HOME/share/machines.LINUX # host (+cpu) definition file
# node1:2 would indicate a 2-cpu machine, but that implies shared memory,
# which parallel Jaguar doesn't support. Instead, repeat the line per node,
# once per CPU, e.g.:
# node1
# node1
# node2
# node2
To start a shared daemon as root:
ssh node1 "serv_p4 -o -p 1235 -l /nfs/mpilogs/node1.log"
ssh node2 "serv_p4 -o -p 1235 -l /nfs/mpilogs/node2.log"
# an rc script on each node to start it up would be good,
# but a centralized script in the above form to start/kill would also be useful (see the sketch after this block).
# Alternatively, the Schrodinger mpich utility can start serv_p4 correctly
# (without the problem of chp4_servs, which results in non-sharable daemons).
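A minimal sketch of such a centralized start/kill script (node names, port, and log path follow the example above; the pkill pattern is an assumption):
#!/bin/sh
# serv_p4_ctl.sh {start|stop} -- start or kill the shared serv_p4 daemon on all nodes
NODES="node1 node2"
PORT=1235
case "$1" in
  start)
    for n in $NODES; do
      ssh "$n" "serv_p4 -o -p $PORT -l /nfs/mpilogs/$n.log"
    done ;;
  stop)
    for n in $NODES; do
      ssh "$n" "pkill -f 'serv_p4 -o -p $PORT'"   # assumes pkill is available on the nodes
    done ;;
  *)
    echo "usage: $0 {start|stop}"; exit 1 ;;
esac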
For a per-user process, can start/monitor MPICH as:
tools (scripts) in $MPI_HOME/sbin/
chp4_servs -port=4644 # script to start serv_p4 on all nodes, DOESN'T obey MPI_P4SSPORT (def to 1234)
# at some point in the past also used port 1235
chp4_servs -hosts=filename # use filename to get list of hosts to start serv_p4 (def to machines.LINUX)
chp4_servs -hunt # list all serv_p4 process on all mpi nodes (on all ports)
chkserv -port 4644 # see which nodes don't have the mpi daemon running
# DOESN'T obey MPI_P4SSPORT (def to 1234)
# no output = all good.
# (parallel jaguar will trigger it to start anyway)
NOTE: Schrodinger also has an mpich utility to monitor MPICH status.
Testing MPICH
$MPI_HOME/sbin/tstmachines -v
# see if daemons are fine.
cat $HOME/.server_apps
exact path to each binary, should be populated automatically.
cd $MPI_HOME/examples
mpirun -np 16 cpi
# run pi calculation test on 16 procs.
# doesn't really start a serv_p4 process, so it can't be used to test sharing daemons b/w users.
Per-User Environment
(There is no need for this unless the shared root daemon process doesn't work.)
MPICH allows a per-user instance of MPICH daemon rings instead of depending on a shared daemon run by root. This has been tested to work with Parallel Jaguar. To use this, add an environment variable defining the port you want to use for your own MPI daemon ring. Your 4-digit phone extension would be a good number to use. It may be best to add it to your $HOME/.cshrc, like this:
setenv MPI_P4SSPORT 4644 #change number to a unique port for yourself
After this, parallel Jaguar (or mpirun) jobs should work. If there are problems, check that you have sourced /protos/package/skels/local.cshrc.linux.apps and that these variables are defined:
setenv MPI_HOME /protos/package/linux/mpich
setenv PATH "${PATH}:${MPI_HOME}/bin"
setenv MPI_USEP4SSPORT yes
setenv P4_RSHCOMMAND ssh
setenv SCHRODINGER_MPI_START yes # Parallel Jaguar to start its own MPICH serv_p4 on user defined port
After this setup, a parallel Jaguar job can be run like:
$SCHRODINGER/jaguar run -HOST "vic1 vic2 vic3 vic4" -PROCS 4 piperidine
OpenMPI
mpirun hostname
mpirun -H n0301,n0302 hostname
mpirun --hostfile myHostfile hostname
Example mpi host file
n0000
n0001
n0002
n0003
Example mpi host file with cpu info
n0000 slots=16
n0001 slots=16
n0002 slots=16
n0003 slots=16
If FQDNs are needed, e.g. when there are multiple "domains"/clusters, use this setting:
export OMPI_MCA_orte_keep_fqdn_hostnames=t
some help options I poked at
mpirun --help bind-to
--bind-to {arg0} Policy for binding processes. Allowed values: none,
hwthread, core, l1cache, l2cache, l3cache, socket,
numa, board, cpu-list ("none" is the default when
oversubscribed, "core" is the default when np<=2,
and "socket" is the default when np>2). Allowed
qualifiers: overload-allowed, if-supported,
ordered
AMD Epyc CPUs seem to need --bind-to none, because of the CCX structure?
https://developer.amd.com/spack/hpl-benchmark/
-d|-debug-devel|--debug-devel
Enable debugging of OpenRTE
-v|--verbose Be verbose
-V|--version Print version and exit
Example mpirun commands
# example mpi run with docker container (nvidia gpu test)
CONT='nvcr.io/nvidia/hpc-benchmarks:21.4-hpl' # linpack
CONT='registry.local:443/hpc-benchmarks:21.4-hpl' # linpack
docker run -v /global/home/users/tin:/mnt --gpus all registry:443/tin/hpc-benchmarks:21.4-hpl \
mpirun --mca btl smcuda,self -x UCX_TLS=sm,cuda,cuda_copy,cuda_ipc --bind-to none -np 2 \
hpl.sh --xhpl-ai --cpu-affinity "0-7:120-127" --gpu-affinity "0:1" --cpu-cores-per-rank 8 --dat /mnt/hpl_dat/abhi.hpl-a40.dat-2.norepeat
# ran on c03 ==> run but poor perf.
# adapted from https://www.pugetsystems.com/labs/hpc/Outstanding-Performance-of-NVIDIA-A100-PCIe-on-HPL-HPL-AI-HPCG-Benchmarks-2149/
docker run -v /global/home/users/tin:/mnt --gpus all \
registry.local:443/tin/hpc-benchmarks:21.4-hpl \
mpirun --mca btl smcuda,self -x MELLANOX_VISIBLE_DEVICES="none" -x UCX_TLS=sm,cuda,cuda_copy,cuda_ipc \
--bind-to none --report-bindings --debug-devel \
-np 2 hpl.sh --xhpl-ai --dat /mnt/hpl_dat/abhi.hpl-a40.dat.norepeat.Ns20k --cpu-affinity 0:127 --cpu-cores-per-rank 4 --gpu-affinity 0:1
docker run -v /global/home/users/tin:/mnt --gpus all registry:443/tin/hpc-benchmarks:21.4-hpl mpirun --mca btl smcuda,self -x UCX_TLS=sm,cuda,cuda_copy,cuda_ipc -np 1 /bin/bash
docker run -v /global/home/users/tin:/mnt --gpus all registry:443/tin/hpc-benchmarks:21.4-hpl mpirun --mca btl smcuda,self -x UCX_TLS=all -np 1 ucx_info -f | grep DEVICES
mpirun arguments
--mca # specify a Modular Component Architecture parameter
--mca btl # byte transfer layer (point-to-point network stack)
--mca btl smcuda,self -x UCX_TLS=sm,cuda,cuda_copy,cuda_ipc
# UCX = Unified Communication X [high-bandwidth, low-latency network offload with RDMA for IB and RoCE, GPU, shared memory]
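Any --mca parameter can equivalently be set as an environment variable with the OMPI_MCA_ prefix, e.g. (binary name is a placeholder):
export OMPI_MCA_btl=smcuda,self
export UCX_TLS=sm,cuda,cuda_copy,cuda_ipc
mpirun -np 2 ./my_app   # same effect as --mca btl smcuda,self -x UCX_TLS=...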
MPI stacked with OpenMP -- needed on AMD Epyc with CCX.
Dell method to test JGI node... binary compilation info ??
/set_AMD_params.sh
su - hpl
mpirun -np 16 -x OMP_NUM_THREADS=4 --map-by ppr:1:l3cache:pe=4 --bind-to core -report-bindings -display-map numactl -l ~/amd-hpl/bin/xhpl.4
MPI stacked with OpenMP -- needed on AMD Epyc with CCX.
See
https://developer.amd.com/spack/hpl-benchmark/
and
https://www.dell.com/support/kbdoc/en-in/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study
sudo modprobe -v knem
export OMP_NUM_THREADS=4
# threads * PxQ of HPL.dat = total num of core
# their eg was for 2 socket system
# seems like should have 2 threads per socket
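# worked example of that arithmetic (using the n02 numbers from the appfile notes below):
#   n02: 2 sockets x 16 cores = 32 cores; OMP_NUM_THREADS=4
#   => 32 / 4 = 8 MPI ranks => HPL.dat PxQ product = 8, e.g. 2x4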
mpi_options="--mca mpi_leave_pinned 1 --bind-to none --report-bindings --mca btl self,vader --allow-run-as-root"
mpi_options="$mpi_options --map-by ppr:1:l3cache -x OMP_NUM_THREADS=4 -x OMP_PROC_BIND=TRUE -x OMP_PLACES=cores"
mpirun $mpi_options -app ./appfile_ccx
appfile...
# appfile...
# Bind memory to node $1 and four child threads to CPUs specified in $2
# [last digit] 4 cores per line, because OMP_NUM_THREADS=4.
# num of lines * (4 cores per line) = total cores in the system
# n02 is EPYC 7343 3.2GHz, 2x16c, so the appfile has 8 lines * 4 cores.
# HPL.dat should have PxQ = 8, e.g. 2x4
# having many entries in the appfile is ok... or I should say tolerable...
# so -np 4 ... 8
# change the run script not to membind
# some ranks would fail,
# but as long as enough ranks start to satisfy the PxQ in HPL.dat,
# then things will run.
# if you need to eke out perf, then really fine-tune.
# xref: CF_BK/greta/benchmark_epyc_ccx
-np 1 ./xhpl_ccx.sh 0 0-3 4
-np 1 ./xhpl_ccx.sh 0 4-7 4
-np 1 ./xhpl_ccx.sh 0 8-11 4
-np 1 ./xhpl_ccx.sh 0 12-15 4
-np 1 ./xhpl_ccx.sh 1 16-19 4
-np 1 ./xhpl_ccx.sh 1 20-23 4
-np 1 ./xhpl_ccx.sh 1 24-27 4
-np 1 ./xhpl_ccx.sh 1 28-31 4
Env vars related to xhpl with MPI/OpenMP; the MKL one is
"undocumented" and specific to binaries built for AMD with the Intel icc compiler...
Once these were set, mpirun without OpenMP threads was performant enough (for hardware acceptance test).
export MKL_DEBUG_CPU_TYPE=5 # this is important to address the CCX of epycs and allow xhpl to run on all cores. OMP isn't really required.
export HPL_HOST_ARCH=3 # to use AVX2 and get 16 DP OPS/cycle
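Putting those together, a minimal sketch of an acceptance-test run (rank count and binary path are placeholders; adjust -np to the node's core count):
export MKL_DEBUG_CPU_TYPE=5
export HPL_HOST_ARCH=3
mpirun -np 32 --bind-to none --report-bindings ./xhpl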
OSU IB microbenchmark on AMD Epyc
https://www.dell.com/support/kbdoc/en-in/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study
mpirun -np 16 --allow-run-as-root --host server1,server2 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc_x -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 -mca btl_openib_if_include mlx5_0:1 --report-bindings --bind-to core -mca rmaps_dist_device mlx5_0 numactl --cpunodebind=${numanode_number} osu-micro-benchmarks-5.4.3/mpi/pt2pt/osu_mbw_mr
osu microbenchmark, mpirun with explicit request to use ucx
# these work:
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -mca btl self,vader osu_latency
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -mca pml ucx ucx_info
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -x LD_LIBRARY_PATH -mca pml ucx osu_latency
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -mca pml ucx osu_latency
mpirun -np 2 -npernode 1 -hostfile node2 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 osu_latency
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 -mca btl_openib_if_include mlx5_0:1 --report-bindings --bind-to core osu_latency
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 -mca btl_openib_if_include mlx5_0:1 --report-bindings --bind-to core -mca rmaps_dist_device mlx5_0 osu_latency
mpirun -np 2 -npernode 1 -hostfile node2 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 -mca btl_openib_if_include mlx5_0:1 --report-bindings --bind-to core -mca rmaps_dist_device mlx5_0 osu_latency
mpirun -np 2 -npernode 1 -hostfile node2 -mca btl ^openib osu_latency # still good perf in lr7, but slow in savio4
mpirun -np 2 -npernode 1 -hostfile node2 -mca pml ob1 osu_latency # still good perf in lr7, but slow in savio4
# ^ is negation in mpirun
# ^tcp means DON'T use tcp (not sure what it is using then... openib didn't work)
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -mca btl ^tcp osu_latency
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -mca btl ^openib osu_latency # -mca btl openib DOESN'T seem to work
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -mca btl ^foofoo osu_latency # negating non-existent stuff gets no error, just no net change.
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -mca pml ^ucx osu_latency # no ucx then? still works... using what?
other -mca btl stuff
-mca btl self,sm,gm
/global/software/sl-7.x86_64/modules/gcc/11.3.0/openmpi/5.0.0-ucx/bin/ompi_info | grep pml
/global/software/sl-7.x86_64/modules/gcc/11.3.0/openmpi/5.0.0-ucx/bin/ompi_info --param pml ucx --level 9
ompi_info --param all all --level 9
# -x can be used to set environment variables
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -x OMPI_MCA_mpi_show_handle_leaks=1 osu_latency
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -x pml_ucx_verbose=1 osu_latency
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -x UCX_LOG= osu_latency
pml_ucx_verbose
pml_ucx_devices
pml_ucx_opal_mem_hooks
pml_ucx_tls
# Nope:
mpirun -np 2 -npernode 1 -host n0000.lr7,n0001.lr7 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 -mca btl_openib_if_include mlx5_0:1 --report-bindings --bind-to core -mca rmaps_dist_device mlx5_0 -x UCX_TLS=rc_x osu_latency
i.e., -x UCX_TLS=rc_x causes an issue
# this seems to correspond to the pml_ucx_tls param
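To inspect that param on the build in question (same ompi_info path as above):
/global/software/sl-7.x86_64/modules/gcc/11.3.0/openmpi/5.0.0-ucx/bin/ompi_info --param pml ucx --level 9 | grep pml_ucx_tls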
UCX tool
Since UCX needs to be compiled/installed on its own, it comes with some binary tools of its own:
ldd /global/software/sl-7.x86_64/modules/gcc/11.3.0/openmpi/5.0.0-ucx/ucx-1.14.0/bin/ucx_info
ucx_info -v
ucx_info -c
ucx_perftest ... needs a server and a client (see sketch below)
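A minimal two-node ucx_perftest sketch (hostnames are placeholders; start the server side first):
ucx_perftest -c 0                    # on n0000 (server): listens, bound to core 0
ucx_perftest n0000 -t tag_lat -c 1   # on n0001 (client): tag-matching latency test, bound to core 1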
ref:
https://openucx.readthedocs.io/en/master/running.html has docker capabilities info.
UCX
ucx_info -d | grep Device # -d show device + transport
ucx_info -f | grep DEVICE # -f full decorated output
ucx_info # show help by default
-b Show build configuration
-y Show type and structures information
-s Show system information
-c Show UCX configuration
-a Show also hidden configuration
-f Display fully decorated output
MPI needs to be compiled with --with-ucx,
then add the option
--mca pml ucx
Framework: pml
Component: ucx
try:
export OMPI_MCA_orte_base_help_aggregate=0; mpirun --mca pml ucx -np 2 -npernode 1 -H 10.0.44.59,10.0.44.60 -x LD_LIBRARY_PATH osu_latency
export OMPI_MCA_orte_base_help_aggregate=0; mpirun --mca pml ucx -np 2 -npernode 1 --host node1,node2 -x LD_LIBRARY_PATH osu_latency
# node1,node2 are mapped to IP addresses over TCP, not IPoIB.
(Hmm... may need to add device info?)
mpirun -np 2 -host n0000.lr7,n0001.lr7 --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc_x --mca coll_fca_enable 0 --mca coll_hcoll_enable 0 --mca btl_openib_if_include mlx5_0:1 --report-bindings --bind-to core --mca rmaps_dist_device mlx5_0 -npernode 1 -x LD_LIBRARY_PATH osu_latency
mpirun -np 16 --allow-run-as-root --host server1,server2 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc_x -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 -mca btl_openib_if_include mlx5_0:1 --report-bindings --bind-to core -mca rmaps_dist_device mlx5_0 numactl --cpunodebind= osu-micro-benchmarks-5.4.3/mpi/pt2pt/osu_mbw_mr
ref:
OMPI v5.0.x Network Support
Intel MPI
impi_info # intel mpi info
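A minimal sketch of launching with Intel MPI's mpirun (hostnames and binary are placeholders):
export I_MPI_DEBUG=5                              # print pinning/fabric info at startup
mpirun -n 4 -ppn 2 -hosts node1,node2 ./osu_latency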
PVM
Ref:
http://www.csm.ornl.gov/pvm/
source pvm.env # get PVM_ROOT, etc
pvm # starts monitor, starting pvmd* daemon if needed.
$PVM_ROOT/lib/pvmd pvmhost.conf
# starts the PVM daemon on the hosts specified in the conf file, one host per line.
# may want to put it in background. ^C will end everything.
# it uses RSH (ssh if defined correctly) to log in to the remote hosts and start the process
# Need to ensure ssh login will source the env correctly for pvm/pvmd to run.
# Can be started by any user (what about more than one user??)
kill -SIGTERM {PVMID} can be used to kill the daemon.
If kill -9 (or another non-catchable signal) is used, be sure to clean up /tmp/pvmd.
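A minimal cleanup sketch for a per-user pvmd (the pvmd3 process name matches the lsof output below; the exact /tmp/pvmd.* file name is an assumption, check your PVM build):
kill -TERM $(pgrep -u $USER pvmd3)   # graceful shutdown of your own daemon
rm -f /tmp/pvmd.*                    # assumed leftover file name if it was killed with -9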
pvm> commands
ps
conf
halt
exit
To run an OpenEye omega/rocs job, the
$PVM_ROOT/bin/$PVM_ARCH dir must have access to the desired binary (e.g. a symlink to omega);
PATH from .login will not be sourced.
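For example, a symlink could be created like this (the omega install path is a placeholder):
ln -s /app/openeye/bin/omega $PVM_ROOT/bin/$PVM_ARCH/omega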
run command as:
omega -pvmconf omega.pvmconf -in carboxylic_acids_1--100.smi -out carboxylic_acids_1--100.oeb.gz -log omega_pvm.log
Each user that starts pvm will have their own independent instance of pvmd3.
pvm uses rsh/ssh to the remote hosts to start itself, so port numbers are likely not static.
It uses UDP for communication.
from lsof -i4 -n
process name / pid uid ...
pvmd3 27808 tinh 7u IPv4 17619158 UDP 10.220.3.20:33430
pvmd3 27808 tinh 8u IPv4 17619159 UDP 10.220.3.20:33431
tin 27808 1 0 14:25 pts/29 00:00:00 /app/pvm/pvm345/lib/LINUX/pvmd3
## omega.pvmconf
## host = required keyword
## hostname: sometimes may need to be the FQDN, depending on what the command "hostname" returns
## n = number of PVM instances to run
host phpc-cn01 1
host phpc-cn02 2
host phpc-cn03 2
##/home/common/Environments/pvm.env
# csh environment setup for PVM 3.4.5
# currently only available for LINUX64 (LSF cluster)
setenv PVM_ROOT /app/pvm/pvm345
source ${PVM_ROOT}/lib/cshrc.stub
# http://mail.hudat.com/~ken/help/unix/.cshrc
#alias ins2path 'if ("$path:q" !~ *"\!$"* ) set path=( \!$ $path )'
#alias add2path 'if ("$path:q" !~ *"\!$"* ) set path=( $path \!$ )'
##add2path ${PVM_ROOT}/bin
## : has special meaning in cshrc, so need to escape it for it to be taken verbatim
## there is no automatic shell conversion between $manpath and $MANPATH as there is for PATH
## csh is convoluted.
setenv MANPATH $MANPATH\:${PVM_ROOT}/man