Sys Admin Pocket Survival Guide - SCI

Bioinformatic 101

Biostar Handbook: covers getting started with Unix, scripting, many of the bioinformatics file formats, and which tools are available.

Stack Overflow for bioinformatics :)

Galaxy: web-based bioinformatics toolkit. They also have a lot of tutorials on how to use many of the available tools, eg Snippy Tutorial in Galaxy.

JBrowse: bioinformatics file viewer used by Galaxy, also available as a downloadable desktop app.

Bioinformatic Pipeline

Overview of sample workflows.

creating fasta

App		IN			OUT		Note
--------------	----------------	-------------	----------------------------------

FastQC							Quality Control
Trimmomatic						QC
Unicycler	paired.fastq.gz		.fasta		de novo assembly (or unpaired)

Genomic analysis

App		IN			OUT		Note
--------------	----------------	-------------	----------------------------------
Abricate	.fasta			.tsv		resistance and virulence gene match list, as TAB-separated values

mlst		.fasta			.tsv		extract ST# E.coli Sequence Type
ezClermont	.fa			.txt		MLST: extract phylogroup A, B1, B2, C, D...

Phylogenetics analysis


App		IN			OUT		Note
--------------	----------------	-------------	----------------------------------
Prokka		.fasta (contigs)	.gff		annotation of core genes, assembled genome as .gff (best when gff filenames are 9 chars long)
Roary		*.gff			core_gene_alignment.aln 	multi-sequence alignment of the core genes across genomes
Snp-sites	core_gene_alignment.aln	alignment.phy		
Paup		alignment.phy		tree		alt: MrBayes, RAxML, Mascot

Finding duplicates. Not much of a workflow here, more like a list of programs to try.

App		IN			OUT		Note
--------------	----------------	-------------	----------------------------------
blastn		.fasta			txt		text table of gene matches
clustal omega
muscle
cd-hit-est-2d	2 fasta			.txt.clstr

Bioinformatic Apps

Webified

Cipres Phylo tools

Contig Assembly

FastQC - Quality Control

fastqc  --noextract --threads 52 *.fq.gz -o ./fastqc_output
# --noextract: do not uncompress the zipped report after creating it; the original input files are never removed.
# there is an option to set the temp dir, see --help.
# each thread gets a fixed memory allocation (--help on older releases says 250MB per thread, not 8G); check --help for the exact amount.  It is a Java-based program.

Perform initial check of raw reads (usually fastq files). Meaning of outputs:
https://dnacore.missouri.edu/PDF/FastQC_Manual.pdf
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/10%20Adapter%20Content.html

Trimmomatic (Input: FastQ)

Unicycler ( via biocontainer )


for i in *_1.fq.gz   # dir contains the paired fastq files, gz'ed
do
    string1="${i%_1.fq.gz}"  
    echo "$string1" >> task.lst   # >> appends, so remove any old task.lst first
done 

parallel -j 8 -a task.lst  unicycler -1 {}_1.fq.gz -2 {}_2.fq.gz -o ./assembled-seq/{}.fasta --min_fasta_length 500
# each unicycler instance takes several cores by default (4? set explicitly with -t/--threads), so size -j accordingly
# use -s instead of -1 -2 combo for unpaired fq.gz 
# there is a -l for long read... RTFM https://github.com/rrwick/Unicycler#installation

# unicycler may fail to generate a fasta file, eg ecuador23 shrimp data.
# the failure is not obvious; there may be some crash data in the slurm stderr... don't remember.
# one way to double check that it generated the fasta files is to list *fasta/assembly.fasta and check that none is size 0.
# also,
# egrep "Saving.*fasta$" *fasta/unicycler.log | wc
# ie, look for the last line in the unicycler log that says eg
# Saving /global/scratch/users/tin/fc_graham/ecuador_2023_.../ALL/assembled-sequences_par3/Z_CKDN230030153-1A_HGKHYDSX7_L2.fasta/assembly.fasta
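
The size check described above can be scripted. A minimal Python sketch, assuming each unicycler output directory is named <sample>.fasta and sits under ./assembled-seq as in the parallel command; the function name find_failed_assemblies is made up for illustration:

```python
from pathlib import Path


def find_failed_assemblies(root, pattern="*.fasta"):
    """Return names of output dirs under root whose assembly.fasta is missing or empty."""
    failed = []
    for outdir in sorted(Path(root).glob(pattern)):
        if not outdir.is_dir():
            continue  # skip stray plain files that happen to match the pattern
        assembly = outdir / "assembly.fasta"
        if not assembly.is_file() or assembly.stat().st_size == 0:
            failed.append(outdir.name)
    return failed


if __name__ == "__main__":
    # "./assembled-seq" is the -o prefix used in the parallel command (assumption)
    for name in find_failed_assemblies("./assembled-seq"):
        print("no usable assembly:", name)
```

This only catches missing or zero-size files; a truncated assembly would still need the unicycler.log check.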

Price - Fast De novo assembly - UCSF Peter Skewes-Cox

ablab/spades: SPAdes Genome Assembler (github.com, maybe old)

Prokka - generate gff, annotation of core genes.
Bakta is newer/better than Prokka ?

Roary is a high speed stand alone pan genome pipeline, which takes annotated assemblies in GFF3 format (produced by Prokka).
Per the Roary docs, 128 samples can be analysed in under 1 hour using 1 GB of RAM on a desktop computer.
https://sanger-pathogens.github.io/Roary/

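assemble gff into single MSA alignment file (fasta format):
roary -e --mafft -p 28 *.gff
# -e --mafft = build the core gene alignment, using MAFFT (faster than the default PRANK)
# -p 28 = 28 threads
# input: *.gff file, 1 per taxa, from prokka
# output: 
#  - core_gene_alignment.aln    # multi-FASTA alignment (MSA) of the core genes
#  - pan_genome_reference.fa    # one representative sequence per gene in the pan genome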
# plot for viz
roary_plots.py name_of_your_newick_tree_file.tre gene_presence_absence.csv

Sequence Alignment

Clustal, ClustalW (serial)

MAFFT (C)
mafft --thread -1 --nomemsave gisaid_selection.fasta > gisaid_aln.fasta

Lagan? http://lagan.stanford.edu/lagan_web/index.shtml MSA for WGS of human? 2006.

RC Edgar, MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32, 1792–1797 (2004).

Muscle User Guide
muscle -in seqs.fa -out seqs.afa
can't handle long sequences (eg whole E. coli genomes), even with only 3 sequences

Muscle Parallel

Mauve. Simple to use, but may have some compatibility problem?

LastZ = improved version of blastz

MUMmer http://mummer.sourceforge.net/ version 3.20 from 2007

Whole Genome Alignment (WGA)

Local vs global genomic alignment, Orthology mapping, Hierarchical WGA

Homology, Orthology, Toporthology, paralogy, colinear homology.

MUMmer

UCSC Genome Browser

Phylo

Cactus

Evolutionary Genomics book, a chapter covers WGA (2012) [Springer paywall]
Table 1 - List of WGA methods [Springer paywall]

Global Genomic alignment:
MAVID
LAGAN/Multi-LAGAN
DIALIGN
SeqAn::T-Coffee
FSA
Pecan

Hierarchical WGA:
progressiveMauve
MUGSY

Sequence Alignment and Tree Viewer

belvu (old school Unix X), can wrap long lines for printing. custom coloring scheme.
can open the core_gene_alignment*aln (from Roary)

aliview (alignment editor, GUI) can open core_gene_alignment.aln (from Roary) and .phy (from snp-sites, but this is not aligned?). very large files are chunked to reduce load time (and memory footprint?). no option to wrap long lines

seaview (Unix/X)
No option to wrap long lines
uses a lot of RAM for large MSA, slow. able to open the core_gene_alignment*aln (from Roary)
Menu to open Mase, Phylip, Clustal, MSF, Fasta, NEXUS. but could not open .phy from (snp-sites?)
has option to open nexus file, but a - in a taxa name is treated as the start of the sequence, and multiple --- are probably collapsed into one, breaking the space-based alignment (.nex produced by paup).

IGB - Integrated Genomic Browser (Java)

GenBeans (Java, NetBeans recast, 2000s?)

FigTree - open .tre from RAxML output (java GUI)

IcyTree Drop file into browser to view, but didn't handle .tre from Paup, worked from MrBayes output. YMMV. Can open fasta aln?

DensiTree (cladogram)

Other tree view sw: https://en.wikipedia.org/wiki/List_of_phylogenetic_tree_visualization_software

Phylogenetic Tree

MrBayes (+ Beagle gpu lib) Mr Bayes Manual (pdf at their github).
EEOB563 tutorial on MrBayes

mb> log start filename=mblog.txt
mb> execute mb_anim66.nex		# didnt like .nex from paup, need some slight tweaks, see sci-file.html
mb> lset nst=6 rates=gamma		# lset = likelihood model setting
mb> prset shapepr=exponential(0.05)	# setting priors
mb> showmodel
mb> mcmcp ...                                     	# mcmcp = make settings, dont actually start
mb> mcmcp ngen=300000 printfreq=100 samplefreq=100	# chain length = 300k generations; for kicking the tires maybe ngen=20k
mb> mcmcp nruns=2 nchains=4 savebrlens=yes		# MB def really: Metropolis-coupled MCMC, 2 independent analyses, 4 chains each
mb> mcmcp diagnfreq=1000 diagnstat=maxstddev		# convergence diagnostic settings (these are def)
mb> mcmcp filename=MB_anim66itv_uni			# filename prefix, _uni for uniform mcmc ??
mb> mcmc						# actually start the analysis, only uses 1 cpu
mb> 
mb> sumt						# save tree / show cladogram
mb> sump 						# 
mb> sump filename=
mb> 
mb> ssp							# stepping-stone sampling setting
mb> ssp	...						# stepping-stone sampling setting
mb> ss 							# actually start the stepping-stone analysis
mb>

MrBayes used single CPU core. MrBayes MPI vs Beagle...

RevBayes

ExaBayes

PAUP* Ref: Paup tutorial by P Lewis

paup      primate-mtDNA.nex # can invoke paup and load/execute nexus file as cli arg
paup> exe primate-mtDNA.nex # to load file, it is "exe" cuz nexus file can contain command as to what to do.
paup> log file=paup_primates.log
paup> hsearch                     # perform heuristic search, fast, 
paup> set maxtrees=1000 increase=no; hsearch addseq=random nreps=10 nchuck=100 chuckscore=1; # parsimony ratchet  
paup> alltrees                    # exhaustive search of all possible trees, slow
paup> showtrees all               # AllTrees only retained (2) trees, show them
paup> showtrees 1
paup> showtree 2 / taxLabels=truncate semiGraph=yes userBrLens=yes showTaxNum=yes
#                / indicate options , cmds and options arent case sensitive

paup> toNexus ?
paup> toNexus / format=PHYLIP fromFile=ggqrs9.phy toFile=paup_ggqrs9.nex interleaved=yes
# MSA needs interleaved=yes even when the sequences are continuous.
# does the header/name of each seq in the MSA need to have the same number of chars?
# else it removes chars from the seq and errors while processing the alignment
# or was the issue that the "header" needs to be exactly 10 chars long?
paup> exe paup_ggqrs9.nex
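
On the 10-char question: strict PHYLIP reserves exactly 10 characters for the name field, with the sequence starting right after it. Whether toNexus insists on that is still the open question noted above. A minimal Python sketch of writing a strict sequential PHYLIP file, where to_strict_phylip and the sample names are made up for illustration:

```python
def to_strict_phylip(records):
    """records: list of (name, aligned_sequence). Returns strict sequential PHYLIP text."""
    if not records:
        raise ValueError("no sequences")
    length = len(records[0][1])
    seen = set()
    lines = [f" {len(records)} {length}"]          # header: number of taxa, alignment length
    for name, seq in records:
        if len(seq) != length:
            raise ValueError(f"{name}: length {len(seq)} != {length}")
        short = name[:10].ljust(10)                # name field is exactly 10 chars
        if short in seen:
            raise ValueError(f"{name}: not unique after truncation to 10 chars")
        seen.add(short)
        lines.append(short + seq)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_strict_phylip([("K1", "ACGT"), ("rabbit_R21_long_name", "AC-T")]), end="")
```

Truncation can make two names collide, which is why the sketch refuses duplicates instead of silently writing them.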

paup> SaveTrees  # store tree(s) to file
paup> SaveTrees / format=Nexus brLens=yes trees=all file=anim66_paup_ml1.nex  # .nex is tree only, score in comment section, no seq data, figtree ok.  PREFER
paup> SaveTrees / format=Newick brLens=yes trees=firstOnly file=anim66.tree ; # figtree can render Newick , but tab in taxa name trip it

paup> GetTrees file=anim66_paup_ml1.nex  # load trees from NEXUS or Newick format 

paup> desc # describeTree 

Note: tree scores are given in -ve log-likelihood, ie -ln(L).

settings for:
pset  parsimony
dset  distance
lset  likelihood 

paup> GammaPlot 	# display gamma distribution plot in ascii terminal

paup> export / format=Nexus file=paup_anim66itv_tax.nex charsPerLine=100 nexusBlocks=taxaChar interleaved=yes
paup> export / format=Nexus file=paup_anim66itv.nex     charsPerLine=100 nexusBlocks=data     interleaved=yes

# essentially, the taxa "block" has an extra list of all the taxa/name from .phy header.
# the data "block" format is thus marginally shorter.  used this for MrBayes

Paup ML search for tree - Tutorial

paup> set crit=l   	# set criterion=likelihood;
paup> automodel 	# if model param are estimated manually, use `lset fixall` to fix these est param into the model.
paup> hsearch		# single core, start ~2:30

paup> lset estall; 	# EstAllParams specified; all model parameters will be estimated
paup> runraxml; 	# could invoke raxmlHPC via shell to do ml tree generation
			#  /usr/bin/raxmlHPC-PTHREADS-AVX -s /tmp/paup.XXQAhsCT/paupdata.txt -m GTRCAT -c 1 -V --HKY85 -T 2 -n paup -f d -N 1 -p 127526724
paup> lscores;

likelihood search & settings
Per https://phylosolutions.com/paup-tutorial/ : Searches in PAUP* are extremely slow if model parameters are estimated during a tree search. It is almost always better to estimate model parameters on a fixed tree, and then fix those parameters prior to initiating the tree search.
so, avoid using `lset estAllParams`


using /global/software/vector/sl-7.x86_64/modules/paup/4.0a/paup4a168_ubuntu64
singularity container version has an issue where nthreads=[actual number of cores on the machine] takes a long time.  workaround: use 1 fewer core.

Begin Paup;

  log file=mytree.LOG;

  set autoclose=yes warnreset=no increase=auto;
  set crit=likelihood;

  nj;         [! create a seed tree quickly ]

  lset nst=2; [! HKY85+G ]
  lset nst=6; [! GTR+G   ]
  lset nthread=auto;   [! auto set num of threads to num of cores on machine ]
  lset nthread=55;     [! use 1 fewer core than avail for singularity container issue work around ]
  lset tratio=est;     [! estimate nucleotide transition/transversion ratio ]
  lset basefreq=est rates=gamma shape=est;  [! estimate base frequency, use gamma dist for rate, estimate shape ]
  lscores 1;  
  [! lscores above estimate param #MLE  ]
  [! lset below fix these estimated param as model param before running search.  in lieu of fixall ]
  lset tratio=prev basefreq=prev shape=prev;
  hs start=1; [! start heuristic search, get 1 tree only ]

  savetrees file=mytrees.tre replace=yes;   [! overwrite file if it exists ]
  savetrees file=mytrees.tre  append=yes;   [! append tree to file, fig tree pick 1 only though, first or last? ]
End;

BEAST (v1)

     java -cp beast.jar dr.app.beast.BeastMain -seed 2020 -beagle_double -beagle_gpu -save_every 1000000 -save_state travelHist.checkpoint  ../files/Protocol3/282_GISAID_sarscov2_travelHist_masked.xml
     java -cp beast.jar dr.app.tools.TaxaMarkovJumpHistoryAnalyzer -taxaToProcess "hCoV-19/Brazil/SP-02/2020/EPI_ISL_413016|2020-02-28" -stateAnnotation location -burnin 100 -mrsd 2020.174

BEAST (v2)

ImgDir=/global/home/groups/consultsw/sl-7.x86_64/modules/beast2/2.6.4/

singularity exec --nv $ImgDir/beast2.6.4-beagle.sif \
/usr/bin/java -Dlauncher.wait.for.exit=true -Xms256m -Xmx8g -Duser.language=en -cp /opt/gitrepo/beast/lib/launcher.jar beast.app.beastapp.BeastLauncher -beagle_info

singularity exec --nv $ImgDir/beast2.6.4-beagle.sif \
/usr/bin/java -Dlauncher.wait.for.exit=true -Xms256m -Xmx8g -Duser.language=en -cp /opt/gitrepo/beast/lib/launcher.jar beast.app.beastapp.BeastLauncher -beagle_GPU testHKY.xml


treeannotator -b 10 ..."dot"...trees > mcc.tree
# -b 10 take first 10% as burn in (remove these trees)
# BEAST logs 2 versions of the .trees file, "dash" (smaller file) and "dot" (larger file); use the "dot" version.
# need to hand edit to remove header line from resulting mcc.tree

RAxML - generate max likelihood tree, fast, but be careful with the statistics.

raxmlHPC-PTHREADS-AVX   -s duck.phy          -n duck.tree          -m GTRCAT -f a -x 123 -N autoMRE -p 456  -T 28
-m GTRCAT = model, GTRCAT argued to be one of the best/fastest computationally
-f a = perform the ML (Max Likelihood) search in the same run as the rapid bootstrap
-N autoMRE = bootstrap stopping criterion, autoMRE found to work best
-x 123 = random number seed for the rapid bootstrap
-p 456 = random seed for parsimony inference
-T 28 = number of threads
-s input.phy = input alignment; the .phy comes from snp-sites -p
-n = run name; output is a series of files prefixed with RAxML_
figtree can open the RAxML_*Tree.*.tre file

# 
## could input be .aln from roary? (skip snp-sites?)

Ref: https://evomics.org/learning/phylogenetics/raxml/ https://isu-molphyl.github.io/EEOB563/computer_labs/lab4/models.html

--model 
(nucleotide): JC, K80, HKY, GTR, etc
(protein):    Blosum62, Dayhoff (PAM),  etc
(binary data): BIN
(...)

RAxML-NG

Dendrogram/cladogram

K Tamura, et al., MEGA5: Molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol 28, 2731–2739 (2011).

BAli-Phy

FastTree

GUBBINS - Genealogies Unbiased By recomBinations In Nucleotide Sequences - alignment and tree generation.
https://github.com/nickjcroucher/gubbins

iqtree

rapidNJ - an algorithmically engineered implementation of canonical neighbour-joining (needed by Gubbins) https://github.com/johnlees/rapidnj

Genomic characterization

Abricate (ABR = AntiBiotic Resistance): screens contigs for antimicrobial resistance or virulence genes, using bundled databases.


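AbricateDB_list="vfdb resfinder ecoli_vf"
for AbricateDB in $AbricateDB_list; do    # Filename = per-isolate prefix, set by the caller
    abricate --db "$AbricateDB" "${Filename}.fasta/assembly.fasta" > "${Filename}_${AbricateDB}.txt"
    abricate --summary *_"${AbricateDB}".txt > "${AbricateDB}_summary.csv"   # content is TAB-separated despite the .csv name
done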

# https://github.com/tseemann/abricate
# ABRicate can combine results into a simple matrix of gene presence/absence. 
# An absent gene is denoted `.` and 
# a present gene is represented by its `%COVERAGE`. 
# This can be individual abricate reports, or a combined one.
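
The presence/absence matrix can be read back for further filtering. A minimal Python sketch, assuming the summary is TAB-separated with the file name in column 1, a hit count in column 2, and one column per gene after that. That layout and the name read_abricate_summary are assumptions here, so check them against an actual summary file:

```python
import csv
import io


def read_abricate_summary(text):
    """Parse `abricate --summary` text into {file: {gene: True/False}}."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    genes = header[2:]                     # assumed: col 0 = #FILE, col 1 = NUM_FOUND
    matrix = {}
    for row in reader:
        if not row:
            continue
        # "." means absent; anything else is the %COVERAGE of a present gene
        matrix[row[0]] = {gene: cell != "." for gene, cell in zip(genes, row[2:])}
    return matrix


if __name__ == "__main__":
    demo = "#FILE\tNUM_FOUND\tgeneA\tgeneB\nA62.txt\t1\t100.00\t.\n"   # made-up data
    print(read_abricate_summary(demo))
```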

Abricate - PlasmidFinder

mlplasmids (web), eg find whether the bla_CTX-M gene is plasmid-borne or chromosomal.

MOB-suite - annotate contig as chromosomal vs plasmid

Abricate VR - find resistance genes

MLST - Multi-Locus sequence typing (ST for bacteria)
Default output columns are Filename PubMLST_scheme_name SeqType Allele_IDs
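
To pull just the ST out of that output, a minimal Python sketch, assuming one TAB-separated line per input file with the four column groups listed above; parse_mlst_line and the sample line are made up for illustration:

```python
def parse_mlst_line(line):
    """Split one mlst output line into file, scheme, ST and the allele IDs."""
    fields = line.rstrip("\n").split("\t")
    filename, scheme, seq_type = fields[:3]
    return {
        "file": filename,
        "scheme": scheme,
        "st": seq_type,          # kept as a string; may not be a number for untypable isolates
        "alleles": fields[3:],   # remaining columns, one per locus
    }


if __name__ == "__main__":
    demo = "A62.fasta\tecoli\t131\tadk(53)\tfumC(40)"   # made-up example line
    print(parse_mlst_line(demo)["st"])
```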

PubMLST typing scheme (find ST? eg ST131, ST69, etc?)

cgMLST

EnteroBase group ST into STc, Sequence Type complex. Scheme used: Achtman 7 Gene MLST, cgMLST V1 + HierCC V1, rMLST, wgMLST.

cgMLST from The center for genomic epidemiology web based, create auto alignment + categorization.

EzClermont E coli phylotyping tool (A, B1, B2, C, D, E, F, G)
singularity pull --name ezclermont docker://quay.io/biocontainers/ezclermont:0.7.0--pyhdfd78af_0

	ls *.fasta > ezc.input.lst
	cat ezc.input.lst | parallel "ezclermont {} 1>> ezclermont.results.tsv  2>> ezclermont.results.log"

Gene Annotation

prokka - https://github.com/tseemann/prokka - genome annotation (bacterial, archaeal and viral genomes), output std compliant files. annotation = blast for gene info

prokka --outdir mydir      --prefix mygenome contigs.fa
prokka --outdir prokka_K1  --prefix gnm_K1 --centre _pilon --compliant K1.fasta
# --centre [X]       Sequencing centre ID. (default '')
# --centre NAME get stripped
# --compliant        Force Genbank/ENA/DDJB compliance: --addgenes --mincontiglen 200 --centre XXX (default OFF)
# --prefix           filename for result, get .gff, .tsv, etc.  so 1 prefix per isolate should be good for PRISA.
#                    ## best if basename.gff is 9 chars max, allow 1 tab char added by paup, if need conversion to .nex
# should give a better prefix name, it will be the filename and the FASTA header in .gff and other downstream files
# but not too long, 37 chars limit
# --force            prokka wants to create the output dir, but prefix will keep files apart.
prokka --outdir PROKKA  --prefix genome_rabbit_R21 --centre _pilon --compliant --force R21.fasta
## input:  fasta file with many fragments (contigs, nodes)
## output: lot of stuff, .gff is suitable for downstream alignment and phy tree generation.  
##         txt, tsv = table of annotated genes

cd-hit, wiki doc

cd-hit finds duplicate proteins, or duplicate entries between 2 databases. I used it to find out if the sequences of two isolates are actually the same, ie from the same source bacteria.

export MAX_SEQ=10000000  # build to support larger sequence length, 10M 
make clean
make


ulimit -s unlimited
/opt/cd-hit/cd-hit-MAX_SEQ-10M/cd-hit-est-2d -M 15000 -T 0 \
-i  A62_CKDN220053932-1A_HK7Y3DSX5_L1.fasta   \
-i2 A63_CKDN220053933-1A_HK77NDSX5_L2.fasta   \
-o A62_A63_cd-hit-est-2d.TXT

similarity scoring result in the .TXT.clstr file
 
# arguments and defaults:
-c 0.9  #  seq identity threshold
-n 10   # word length
-T 1    # num of threads, 0 = use all cpu
-M 800  # ram, in MB.

Reading the .clstr output
biostars post on reading the .clstr output

>Cluster 0                                     # cd-hit create 1 cluster per > entry in fasta from -i
0       686065nt, >1... *     
1       685859nt, >1... at -/100.00%           # - is match from reverse strand


>Cluster 3
0       403385nt, >4... *
>Cluster 4                                     # these are cluster without matches (eg isolate A33 vs A34 in my data)
0       321741nt, >5... *


>Cluster 15
0       59656nt, >16... *
1       4688nt, >39... at -/96.37%
2       4687nt, >40... at -/95.71%
3       1746nt, >58... at -/99.03%
4       1745nt, >59... at +/99.03%
^       ^       └── the >id from inside the fasta file (often just contig chunk number); trailing * = representative seq of the cluster, "at -/96.37%" = strand and % identity to it
│       └──  sequence length (nt)
└── col 1 = member number within the cluster


# easier eyeball for non match (ie seq that aren't duplicate)
cat A62_A63_cd-hit-est-2d.TXT.clstr | egrep '^>|^1'
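
The same eyeballing can be done programmatically. A minimal Python sketch that parses the .clstr layout shown above and lists clusters with a single member (ie no match); the names parse_clstr and singletons are made up for illustration, and the regex only covers the line shapes seen in this example:

```python
import re

# member line, eg "1       4688nt, >39... at -/96.37%"
MEMBER = re.compile(r"^(\d+)\s+(\d+)(?:nt|aa), >(.+?)\.\.\. (.*)$")


def parse_clstr(text):
    """Return {cluster_id: [(seq_id, length, note), ...]} from cd-hit .clstr text."""
    clusters = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = []
        elif current is not None:
            m = MEMBER.match(line)
            if m:
                clusters[current].append((m.group(3), int(m.group(2)), m.group(4).strip()))
    return clusters


def singletons(clusters):
    """Cluster ids with only the representative sequence, ie nothing matched it."""
    return [cid for cid, members in clusters.items() if len(members) == 1]


if __name__ == "__main__":
    with open("A62_A63_cd-hit-est-2d.TXT.clstr") as fh:
        print(singletons(parse_clstr(fh.read())))
```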


blast, blastn, blastp

blastn is for nucleotides. The NCBI web site limits max fasta file input to 1M bases (a whole E. coli fasta tends to be roughly 5 M bases)

SNP

snp-sites

GATK

snp-sites -p  -o rabbit_K1_R21.phy core_gene_alignment.aln   

input:   MSA alignment eg core_gene_alignment.aln from roary 

output:  phy  - phylip file, may have no gaps as -.  eg used as input by RAxML 


Usage: snp-sites [-mvph] [-o output_filename] {file}
This program finds snp sites from a multi fasta alignment file.
 -r     output internal pseudo reference sequence
 -m     output a multi fasta alignment file (default)
 -v     output a VCF file                                           ##
 -p     output a phylip file
 -o STR specify an output filename [STDOUT]
 -c     only output columns containing exclusively ACGT             ## removed gaps?!
 -b     output monomorphic sites, used for BEAST                    ##
 {file} input alignment file which can optionally be gzipped

snp-sites -cb -o outputfile.aln   inputfile.aln.gz
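
To make the idea concrete, a simplified Python sketch of what "finding SNP sites" in an alignment means: keep only the columns where at least two different bases occur. This is an illustration, not snp-sites' actual implementation. The handling of gaps/N here (ignored when deciding variability, and dropped entirely with acgt_only, loosely mirroring -c) is an assumption, and read_fasta and variable_sites are made-up names:

```python
def read_fasta(text):
    """Return [(name, sequence), ...] from multi-FASTA text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def variable_sites(records, acgt_only=False):
    """Return (column indexes kept, records reduced to those columns)."""
    seqs = [seq for _, seq in records]
    keep = []
    for i in range(len(seqs[0])):
        column = {seq[i] for seq in seqs}
        if len(column & set("ACGT")) < 2:
            continue                       # fewer than two distinct bases: not a SNP column
        if acgt_only and column - set("ACGT"):
            continue                       # column contains a gap, N, etc
        keep.append(i)
    return keep, [(name, "".join(seq[i] for i in keep)) for name, seq in records]


if __name__ == "__main__":
    aln = ">a\nACGT-A\n>b\nACGTTA\n>c\nATGT-C\n"   # made-up alignment
    print(variable_sites(read_fasta(aln)))
```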

snippy - rapid haploid variant calling and core genome alignment
https://github.com/tseemann/snippy

snippy-vcf_report - look at variants from vcf -- text or html file output

TBD

Phylip - infer phylogenies. methods: parsimony, distance matrix, likelihood

squizz - convert or check format

squizz -l # list supported format
squizz     core_gene_alignment_ggqrs9.aln -c NEXUS        > core_gene_alignment_ggqrs9.nex
# convert  ^^^^^input file aln FASTA^^^^^ ^^to format^^   ^^ output is to stdout
squizz -A core_gene_alignment_ggqrs9.nex   # check/validate named file

SamTools

SantaCruz genomic browser

BROAD GATK

Galaxy Web browser based bioinformatics toolkit

InGenius? EnGenius?

ParallelStructure (R)

Migrate-N (C)

ModelTest-NG

IMA3

GARLI

G-PhoCS

EPA-NG

ASTRAL (CPU, GPU) Java/C++

MASCOT

art (artemis) - view gff3 files (master genome annotation, eg from prokka)

sequin

exonerate

Bactopia - workflow for bacterial analysis. Use Nextflow WML?

Geneious - GUI app, $575/yr academic, $200/yr student. https://www.geneious.com/pricing/

Data Wrangling Apps

Pipeline tool

Pipeline Pilot

Knime

Orange lab

tableau

TIBCO Spotfire

ChemInfo/StructBio Apps

Schrodinger: Jaguar ...

Topspin

Charm

OpenEye

http://www.eyesopen.com
PVM info, see lsf.html ...


Omega - For database.  Search thru smiles, can be done in parallel easily.
Flipper - ??.
OpenEye support PVM 3.4.4

License file in /b/app/openeye/etc/oe_license.txt
Just update with new version, unix or DOS format okay.

Omega2, PVM and LSF


The following will use lsf to submit a job to run PVM, using omega2 pvm features,
it must be started from a 64-bit OS machine, eg phpc-mn001:

bsub -o lsf.out -e lsf.err -q omega -n 12 pvmjob-openeye omega2 -in ./input.smi -out output.oeb.gz -fraglib /b/app/openeye/data/omega2/fraglib.oeb.gz -log omega2.log

-o will capture standard output to the specified file
-e will capture standard error
-q omega is the queue we will use
-n 12    is the number of CPUs used.  For now, bsub only allows a max of 12, 
         which means spreading to 3 machines, 4 CPUs each.
	 Will investigate ways to raise this to 32.

pvmjob-openeye  is an adapted version of the pvmjob provided by LSF, 
		so far only tested with omega2, but may work for flipper, rocs.

Rest of the command line is omega2 specific parameter, and should be adjusted accordingly.
More importantly, the fraglib may need to point to a user defined database.
When LSF runs, the "-pvmconf" parameter will be appended to the end, with a config
file generated by LSF on the fly.
omega2 produces files like omega2_status.txt to keep user abreast of its progress.

Each user that submits a PVM job will have his/her own pvmd daemon running, 
so they are independent.  
However, for the same user, multiple overlapping PVM jobs may conflict; 
this is not tested yet, so it may be best to run only one job at a time.

Schrodinger

http://www.schrodinger.com

A suite of software for molecular modeling.

Maestro:
Unified interface for all of schrodinger app, a modeling env for research work.
Free for academic use.

MacroModel: 	molecular modeling
Prime: 		protein structure prediction
Glide:  ligand-receptor docking (virtual screening from HTVS to SP to XP)
Jaguar: ab initio (quantum mechanics) electronic structure package 

Induced Fit:
prediction of ligand induced conformational changes in receptor active sites.
Utilize Prime and Glide

$SCHRODINGER 	= /b/app/schrodinger

Schrodinger FlexLM license

process on lic svr:

sbio     28647     1  0  2006 ?        00:03:49 /b/app/schrodinger/mmshare-v15113/bin/Linux-x86/lmgrd -c /b/app/schrodinger/license -l /b/app/schrodinger/lmgrd.log
sbio     28648 28647  0  2006 ?        00:05:36 SCHROD -T pdir-nis01.geneusa.com 9.5 3 -c /b/app/schrodinger/license -lmgrd_port 6978 --lmgrd_start 4xxxc9c7


su - sbio #uid 700
# Update license file.  beta license can be appended to end of production license file.
# schrodinger-beta/license file can be sym link to the production one.
setenv SCHRODINGER /b/app/schrodinger
$SCHRODINGER/licadmin REREAD

lmutil - Copyright (c) 1989-2004 by Macrovision Corporation. All rights reserved.
lmreread successful


Sometimes licadmin REREAD on a beta license seems to complain about errors, 
but it actually worked; tail the log and in about a minute it will say the license 
was reread correctly.

tail -f /b/app/schrodinger/lmgrd.log

Schrodinger Prime Multi-CPU work around

(until Maestro can do ssh b/w different hosts w/o password, which work, but may have other problems. Need more investigation).

 So, for the multi-CPU prime problem, our scientist tried the work 
 around  specified earlier as:
 
 
  multirefine  -LOCAL -HOST prime:4   prime_chad6
  -----------  -----  --------------  -----------
        1	2	    3		   4
 
  1 - prime command for running refine
  2 - this is what will keep it from fizzling. 
  3 - this is the queue and number of cpu to use
  4 - input file name (.inp is optional)

General Schrodinger commands

$SCHRODINGER/jobcontrol -list -c volvoland-0-45b90452
	See jobs submitted by Schrodinger
	-c = see specific JobId, can be omitted to see all jobs
$SCHRODINGER/jobcontrol -delete [jobid]
	Delete a job

$SCHRODINGER/hunt -rtest
	Test that the queue systems listed in schrodinger.hosts are reachable.
	Only test entries that are hosts entries, not those that are batch queues.

$SCHRODINGER/utilities/mpich status -d
	Check status of mpich as known by Schrodinger.  -d = debug  (def port = 1234)

Schrodinger and MPICH

(See config-backup/sw/mpi/mpich1.test.txt for more info)

Schrodinger 2007 Install guide talks about basic MPICH req, setup.
Jaguar ch 12 discusses Parallel Jaguar and MPICH installation (p294).

req
- MPICH1
- kernel compiled for SMP

two special lib compiled for mpich1 (source included):
- libcmp.so
- libprun.so construct command to run mpirun.

compile mpich with --with-device=ch_p4
RSHCOMMAND=ssh (./configure ... -rsh=ssh)

Even if using ssh, still need $HOME/.rhosts or /etc/hosts.equiv,
serv_p4 needs it! As per Dale Braden of Schrodinger.

SCHRODINGER_MPI_FLAGS="-v" env pass -v (verbose/debug) option to mpirun.

Consider update template and/or submit with path to MPIRUN (add to beginning),
but so far not needed (script in $SCHRODINGER/queues/{LSF|PBS}/.

While Schrodinger use sh/bash for its script, user whose native shell is csh/tcsh
does not need to create .profile/.bashrc. The session to remote host will start with
user's shell and .cshrc, then create the bourne shell, inheriting all the environment
settings.

If each user will have their own ring of MPICH daemons, then set env as:
SCHRODINGER_MPI_START=yes
MPI_P4SSPORT=4644 # unique port per user
MPICH would be started automatically by Schrodinger on demand
(the daemons seem to be actually started by mpirun).

Setting a per-user daemon is said to be the more fail-safe way of running parallel jaguar.
A shared ring of mpich daemons sometimes breaks (security, etc).
The most important thing is to start serv_p4 on each node of the MPICH ring
manually (as root), and DON'T use the $MPIHOME/sbin/serv_chp4 script,
which somehow doesn't pass all the required env for parallel jaguar to run!!
The Schrodinger mpich utility is okay, it does the right job.
Set env as:
SCHRODINGER_MPI_START=no
MPI_P4SSPORT=1235 # static port to be used by all users.
And start the process as root (a user process is not shareable):
$SCHRODINGER/utilities/mpich start -m $SCHRODINGER_NODES -p $MPI_P4SSPORT
Having an rc script on the head node of a cluster to start this
is acceptable (instead of a script on each node that starts serv_p4 individually,
though that may provide a way to write to a central log dir).
The $SCHRODINGER_NODES is essentially the machines.LINUX file needed by MPICH.
Bear in mind Jaguar does not support shared memory, so DON'T use the host:n syntax!
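
The host:n rule is easy to check before starting the ring. A minimal Python sketch, assuming the nodes file lists one host per line with optional # comments; that file layout and the name check_machines_file are assumptions for illustration:

```python
def check_machines_file(text):
    """Return [(line_no, host), ...] for entries that use the host:n syntax."""
    bad = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        entry = raw.split("#", 1)[0].strip()   # drop comments and surrounding blanks
        if not entry:
            continue
        host = entry.split()[0]
        if ":" in host:
            bad.append((line_no, host))
    return bad


if __name__ == "__main__":
    demo = "node1\nnode2:2\n# spare nodes below\nnode3\n"   # made-up nodes file
    for line_no, host in check_machines_file(demo):
        print(f"line {line_no}: {host} uses host:n, list the host on its own instead")
```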

$SCHRODINGER/utilities/mpich subcommands

start           Start servers, really just call MPICH serv_p4 -o -p PORT on each nodes.   # don't use!
stop            Kill servers
restart         Kill and restart servers
status -d       Report server status, -d = debug
pid             Report server PID, but only work if schrodinger started mpich
config          Describe the MPICH configuration.

sems            Report semaphore sets in use
rmsems          Delete all semaphore sets
shm             Report shared memory segments in use
rmshm           Delete shared memory segments (see text below)
rmipcs          Delete both semaphores and shared memory segments

-m HOSTFILE     act on given list of nodes only.  def = $MPI_HOME/.../machines.LINUX
-d              debug
-p 4644         use specified port


If just running MPICH for Schrodinger, mpich start would be good to use.
But definitely don't use the chp4_servs script, which either doesn't start for root, 
or creates a process that is not shareable with other users.
If an mpich daemon log is desired, then start serv_p4 manually on each node and 
tell it where to log (see mpi page for details).

Testing Schrodinger Programs


$SCHRODINGER/multirefine -DDEBUG -HOST hpc prime_loop.inp
	Run a prime_loop job with debug options
	Job submitted to LSF
	Note that the .inp (and possibly .mae) files need to be unix format.

$SCHRODINGER/bmin -HOST macromodel  tintest
	# need tintest.com and tintest.mae
	# see ~tin/sci/bmin3 for small job that run in ~ 10 min 
	# omit "-HOST queuename" and it will run on local machine.

Schrodinger Jaguar commands


Jaguar "run" jobs will work with MPICH daemon started by user or a shared root process.

$SCHRODINGER/jaguar run -PROCS 4 -HOST remotemachine waterTest
	# command line version of jaguar to run against waterTest.mae
	# -PROCS 4 tell it to run on 4 CPU, which automatically invokes mpirun
	# -HOST "host1 host2" tell it to run on the specified named hosts
	#       they don't need to be defined in schrodinger.hosts file
	# see sci/jaguar_parallel for files

$SCHRODINGER/jaguar run -HOST "node1 node2 node3 node4" -PROCS 4 piperidine
	# (standard sample test file from $schrodinger/jaguar-v.../samples)

$SCHRODINGER/jaguar run -HOST "vic2 vic3 vic4 vic5 vic6" -PROCS 6 parajag-test-feb6-2008-realpara
	# parajag test, adapted from sample (?).  runs with up to 6 procs only, pjag dies with 8 procs.


$SCHRODINGER/jaguar batch pjag07.bat
	# run batch job defined in the .bat file
	# see sci/jaguar_parallel_perf for files
	# doesn't seem to run with root mpich daemon :(

Schrodinger Environment Vars


setenv SCHRODINGER_JOB_DEBUG 1		# or 2 for even more verbosity to std out.
setenv SCHRODINGER_RETAIN_JOBDIR 1

[MPI related]
SCHRODINGER_MPI_START=yes
SCHRODINGER_MPI_FLAGS="-v" 		# (verbose/debug) option to mpirun.
MPI_HOME=/protos/package/linux/mpich
MPI_USEP4SSPORT=yes			# yes means one mpich daemon per user,
					# which require user to define their own port
					# "no" would mean use a shared (root) process.
MPI_P4SSPORT=1234			# def port is 1234 and would not need to be defined.


needed?
setenv RSHCOMMAND ssh
setenv LM_LICENSE_FILE @flexlm-license

Accelrys

http://www.accelrys.com/


Install location:
/b/app/Accelrys
/opt/Accelrys17beta/SciTegic/linux_bin

1.7 beta key = q4xxxkk, k2xxx3x
default port for scitegic server (installed by discovery studio) 
9944, 9943 for http, https.
change to 9844, 9843 on chris' windows desktop.



pipeline pilot svr cn002

Allegedly Pipeline Pilot server does not support 64-bit linux, but we got it to work, 
albeit with errors in the log.
PP Server is aka Discovery Studio Server.
The Pipeline Pilot package only needs to be installed (and purchased?) separately 
when the Pipeline Pilot client is needed to modify protocols.


Starting/Stopping server:

cd /opt/Accelrys17beta/SciTegic/linux_bin
./startserver
./stopserver

Logs:
/opt/Accelrys17beta/SciTegic/logs/messages
/opt/Accelrys17beta/SciTegic/apps/scitegic/core/packages_linux32/apache/httpd-2.0.55/logs

Updating FlexLM license:

login to license server (pdir) as user accelrys
back up old /b/app/Accelrys/LicensePack/Licenses/msi.lic
update /path/to/new/msi.lic with correct port number (1715)
and 2nd line should read:
DAEMON msi /b/app/Accelrys/License_Pack/Linux_2_Intel_32/exe/msid


source /b/app/Accelrys/LicensePack/msi_lic_cshrc
lpver   # should say 7.0.2
source /b/app/Accelrys/LicensePack/etc/lp_cshrc
lp_install /path/to/new/msi.lic

lp_admin	# gui to manipulate/view licenses.


NOTE: don't keep any other files ending in msi.lic in the License File Directory,
or they will all be read as actual msi.lic license files!
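
A quick way to spot such stray files before running lp_install. This is a minimal sketch in POSIX sh. `stray_msi_lic` is a made-up helper name, and the directory in the usage comment is the one used earlier in these notes.

```shell
# Hypothetical helper: list files in a license directory whose names end in
# "msi.lic", other than the real license file "msi.lic" itself.
stray_msi_lic() {
    dir=$1
    for f in "$dir"/*msi.lic; do
        [ -e "$f" ] || continue                      # glob matched nothing
        [ "$(basename "$f")" = "msi.lic" ] && continue
        basename "$f"
    done
}

# Example (path as used above in these notes):
#   stray_msi_lic /b/app/Accelrys/LicensePack/Licenses
```

Backups named like msi.lic.old are not reported, since they no longer end in msi.lic.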

Updating beta license:


1.  Replace feature file in \xml\objects\InstallInfo.xml
(scitegicroot=/opt/Accelrys17beta/SciTegic/) 
with a vendor provided file, eg geneusa.xml


2.  Update /opt/Accelrys17beta/LicensePack/Licenses/msidemo.lic ::

This file is created by Discovery Studio!  Do:
source /opt/Accelrys17beta/LicensePack/etc/lp_cshrc 
/opt/Accelrys17beta/LicensePack/linux/bin/lp_admin
	# --> remove expired features
	# this step seems optional; it probably just removes nagging when the new license has fewer features.
cd /opt/Accelrys17beta/DiscoveryStudio17/bin
	./install_temp_license
	# use new temporary key for license eg E8632nX or /7n32mt
	# provided by vendor.
3.  All done.  No server restart is required.  

There was some check for /opt/Accelrys17beta/LicensePack/... msi_server , 
which was not found anywhere.  Alan Lopez said don't worry about it.

Tripos

Overview

Tripos Sybyl is a program for manipulating molecules. It supports stereo hardware for 3D display. It should be viewed as a user GUI program that runs on the user's computer; Sybyl itself has no server/daemon process. It was originally written for SGI because of its good graphics support, but a Linux port is available. Some features, such as multi-processor support, are available only on SGI (as of Sybyl 7.2). SGI and most Unix platforms are big-endian, but Intel is little-endian. Not all file formats have a built-in ability to swap byte order, so some of them are not cross-platform compatible between SGI and Linux.
Trigo is a wrapper tool that sets up the environment and ties the various pieces together, including license checkout settings.
Unity is a database and search program for searching molecules according to pharmacophoric parameters, etc.
TPS, Tripos Property Service, is a daemon (tpsd) that runs on one machine, typically the FlexLM server, so that it can cache and interface info between Sybyl and Unity.
NetBatch is a batch processor. Most Sybyl commands have built-in NetBatch options, but some require special batching commands. It is just queuing with the local host, with no canned support for LSF. Licensing in a multi-host cluster environment may also prove tricky. There is probably no daemon process, but trigo may need to be running for jobs to be processed.
Tripos uses FlexLM to manage its licenses. A node-locked license exists if no server is desired.
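
To illustrate the byte-order issue mentioned above: the same four bytes are read in opposite order on SGI and Intel. This is a minimal sketch in POSIX sh; `swap32` is a made-up helper, not a Tripos tool, and it works on a hex string purely for illustration.

```shell
# Reverse the byte order of a 32-bit value given as 8 hex digits.
swap32() {
    printf '%s\n' "$1" | sed 's/^\(..\)\(..\)\(..\)\(..\)$/\4\3\2\1/'
}

swap32 0A0B0C0D    # prints 0D0C0B0A
```

A binary file format without this kind of swap built in will show garbage values when moved between the two platforms.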

Sybyl

Starting

trigo -shell sybyl7.3  	# source the environment config
sybyl7.3		# start the program. Must be done from xterm; problem with gnome-term.

Environments

TA_LICENSE = /b/app/tripos-linux/AdminTools9.2	# old sybyl 7.2
license    = /b/app/Tripos/AdminTools9.2/tables/license_file # (old SGI)

TA_LICENSE = /b/app/tripos-linux/AdminTools10.8	# sybyl 7.3
license    = /b/app/Tripos/AdminTools10.8/tables/license_file 	

TA_ROOT	   = /b/app/tripos-linux/sybyl7.3

netBatch   = $TA_ROOT/batch/bin/submit.sh
		script defines what to do on Solaris, AIX, but no LSF?

Trigo commands

$TA_LICENSE/bin/unix/ta_stat 	# show license info  (FlexLM)

trigo -list		# show which programs are available, and their $TA_ROOT
			# defined in app/tripos-linux/trigo/tables/ta_config

trigo -shell sybyl    	# alias, now points to sybyl7.3 also
trigo -shell sybyl7.2	# specify version of the tool

trigo -shell sybyl7.3  	# source the environment config
sybyl7.3		# start the program. Must be done from xterm; problem with gnome-term.

NetBatch Config

cd $TA_ROOT	(eg /b/app/tripos-linux/sybyl7.2)
./bin/linux/NetConfig
NetConfig> load
NetConfig> list machine
NetConfig> list connection

Config file is stored in 
/b/app/tripos-linux/sybyl7.2/batch/admin/COMMUNICATION

Some machines need the FQDN, others just the hostname;
it really depends on how the machine reports $HOSTNAME.
Listing both may not be good, as it creates too many choices when submitting NetBatch jobs.
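
To see which form a given machine reports before adding it to the NetBatch config, something like the following may help. This is a sketch in POSIX sh; `short_host` and `is_fqdn` are made-up helper names, and `uname -n` is used because $HOSTNAME is not set in every shell.

```shell
# Strip the domain part of a host name, leaving the short name.
short_host() {
    printf '%s\n' "${1%%.*}"
}

# Succeeds if the name contains a dot, i.e. looks like an FQDN.
is_fqdn() {
    case $1 in
        *.*) return 0 ;;
        *)   return 1 ;;
    esac
}

name=$(uname -n)
if is_fqdn "$name"; then
    echo "$name reports an FQDN (short name: $(short_host "$name"))"
else
    echo "$name reports a short hostname"
fi
```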

NetBatch Test


Test 1:
open pdb file
Compute, Search, Grid Search: it should list the machines usable with NetBatch.
If it can't find the machine in the netbatch config, then this feature will complain.

-----

Test 2 from Mario using GASP:
Go to Tools, Pharmacophore Alignment , GASP this will open the GASP window.
Then put any Run Name. 
In the input section choose Sybyl MultiMol2 and 
click on the box with the 3 dots so to browse for the file. 
eg use Tin_test.mol2
Once the file loads it will create a MSS. 
Just go to Batch session in GASP and 
select run gasp in batch and pick a machine.
It will be good to set NetBatch options to do logging, at the INFO level.

Then run in Batch.  If the machine is remote, it currently fails with
"cannot communicate".  Probably rsh failing.
The job will seem to run, but the output gets vaporized.

Tripos Galahad test


Need 3 mol2 files, open them into the molecule view window
create a molecular spreadsheet (MSS)
Then run Galahad using this MSS.  Can run in NetBatch if desired.

Details:

File, Read, (open first mol2 file), ok.
Repeat File, Read and open two more mol2 files into separate layers.
Use files from dir ~tinh/sci/galahad/tgr79_galah1_set2new_tinTest/

File, Molecular Spreadsheet, New, Data Source = Molecules in Mol Areas.
Give it a MSS name and a DB name (db will be saved as file in the dir where sybyl is running).

Tools, Pharmacophore alignment, galahad.
For the MSS source, use the MSS spreadsheet created above.
give it a run name and run it.

If running in NetBatch mode, if successful, SYBYL text window would display message.
Also see galahad_results dir, as well as scores.out file.
log file would also not display any error.

Tripos Property Service (TPS)

TPS is used to store session data when working with UNITY DB.
It starts as an xinetd.d service

setup using a script:   $TA_ROOT/bin/unix/install_tps

The service needs to be run as root, with the following settings:
2. Make sure the following line is in your /etc/services file:
   tripos_tpsv41   4080/tcp     #TPS listens to this port
3. Enter: cp /b/app/tripos-linux/tpsv41/tripos_tpsv41 /etc/xinetd.d/tripos_tpsv41
4. Enter: chmod 644 /etc/xinetd.d/tripos_tpsv41
5. Enter: /etc/init.d/xinetd restart

(It is running on pdir-nis01, presumably for Sybyl 7.2 and older.
Subsequent config didn't do anything.  No need to start a new one in 7380.)
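
The /etc/services step above can be made repeatable. This is a minimal sketch in POSIX sh. `ensure_tps_service` is a made-up helper, and the file is passed as a parameter so it can be tried on a copy before touching the real /etc/services as root.

```shell
# Hypothetical helper: make sure a services file contains the TPS entry.
ensure_tps_service() {
    services_file=$1
    if grep -q '^tripos_tpsv41[[:space:]]' "$services_file"; then
        echo "already present"
    else
        printf 'tripos_tpsv41   4080/tcp     #TPS listens to this port\n' \
            >> "$services_file"
        echo "added"
    fi
}

# Example, on a scratch copy first:
#   cp /etc/services /tmp/services.test && ensure_tps_service /tmp/services.test
```

Running it twice leaves a single entry, so it is safe to rerun after a failed install.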

Licenses

Each feature should appear only once; if renewing, comment out the old ones.
The last number in the FEATURE line is the number of users, not the version :)
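
A quick check for features that appear more than once among the active lines. This is a sketch in POSIX sh with awk. `dup_features` is a made-up helper name, and it assumes the standard FlexLM layout where the feature name is the second field of a FEATURE line.

```shell
# Print FEATURE names that occur more than once among the active
# (non-commented) lines of a FlexLM license file.
dup_features() {
    awk '$1 == "FEATURE" { n[$2]++ }
         END { for (f in n) if (n[f] > 1) print f }' "$1" | sort
}

# Example (license path as used elsewhere in these notes):
#   dup_features /b/app/tripos-linux/AdminTools10.8/tables/license_file
```

Commented-out lines are ignored because their first field is no longer exactly FEATURE.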

To specify which port the license server listens on for requests, place the number
at the end of the SERVER line, eg port 1717 for tripos :

SERVER pdir-nis01.geneusa.com 00xxxxxx2b7c 1717
DAEMON triposlm
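
To confirm which port a license file actually pins, the SERVER line can be read back. This is a minimal sketch in POSIX sh with awk. `license_port` is a made-up helper, and it prints "none" when the SERVER line has no port field.

```shell
# Print the port from the first SERVER line of a FlexLM license file,
# or "none" if the line has no fourth field.
license_port() {
    awk '$1 == "SERVER" { print ($4 == "" ? "none" : $4); exit }' "$1"
}

# Example:
#   license_port /b/app/tripos-linux/AdminTools10.8/tables/license_file
```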

### - If the SYBYL 7.0 license manager is currently running,
### make the license manager reread the license file by
### entering:
### $TA_LICENSE/bin/unix/ta_reread
### If the SYBYL 7.0 license manager is not running, start
### the license manager by entering:
### $TA_LICENSE/bin/unix/triposlm.sh -up
### - If using Trigo, close the shell by entering:
### exit

So: update the license file, su - triposl, then run:
/b/app/tripos-linux/AdminTools9.2/bin/unix/ta_reread -c /b/app/tripos-linux/AdminTools9.2/tables/license_file

FlexLM log file that track check out?
/b/app/tripos-linux/AdminTools9.2/LicenseLog

Then the environment will be set correctly for $TA_LICENSE, etc.
and can run ta_reread w/o specifying license path.

In pdir, the wrong license server was still running.
(old sgi version, so ta
/b/app/Tripos_SGI_to_be_removed/AdminTools9.2/bin/unix/triposlm.sh stop

relogin as triposl (l for linux).
/b/app/tripos-linux/AdminTools9.2/bin/unix
./triposlm.sh -up
to start the flexlm vendor daemon.

- fix startup script on pdir -- done.
- cross check licenses to make sure they are good. -- old license files backed up, review if needed.
- fix rollup.env, whereby the tripos path gets mangled by another intermediate script.

- trigo is the program to start a GUI program for Tripos Sybyl 7.2
- There is a bookshelf (html help files).

-------- tripos license processes old triposl user, sybyl 7.2 -----------------

[triposl@pdir-nis01 unix]$ ps -ef | grep tripos
root     28651 28367  0 Sep14 ?        00:00:00 /bin/sh /etc/rc5.d/S98triposlm start
root     28652 28651  0 Sep14 ?        00:00:00 /bin/sh ./triposlm.sh -up
triposl  22601 22193  0 16:28 pts/6    00:00:00 /bin/sh /b/app/tripos-linux/trigo/trigo -shell sybyl7.2
triposl  23400     1  0 16:32 pts/6    00:00:00 /b/app/tripos-linux/AdminTools9.2/bin/linux/lmgrd -c /b/app/tripos-linux/AdminTools9.2/tables/license_file -l /b/app/tripos-linux/AdminTools9.2/LicenseLog -local
triposl  23402 23400  0 16:32 ?        00:00:00 triposlm -T pdir-nis01.geneusa.com 9.2 3 -c /b/app/tripos-linux/AdminTools9.2/tables/license_file --lmgrd_start 45xxxxx81


license file changed to use port 1717 (or else it defaulted to 6979, or a random port?)
somehow the bloody license under tripos-linux was 1717 for a while, then taken out again.


-------- tripos license processes new tripos user LDAP UID 605, sybyl 7.3 -----------------

USER = tripos !! UID 605, Ankur Gupta and Harold South said this UID is okay.

/etc/rc5.d/S98tripos start


bash-2.05b# ps -ef | grep tripos
tripos   15157     1  0 11:24 pts/0    00:00:00 /b/app/tripos-linux/AdminTools10.8/bin/linux/lmgrd -c /b/app/tripos-linux/AdminTools10.8/tables/license_file -l /b/app/tripos-linux/AdminTools10.8/LicenseLog -local
tripos   15161 15157  0 11:24 ?        00:00:00 triposlm -T pdir-nis01.geneusa.com 10.8 3 -c /b/app/tripos-linux/AdminTools10.8/tables/license_file --lmgrd_start 45xxxxdb
root     15185 13335  0 11:24 pts/0    00:00:00 grep tripos
bash-2.05b#

Tripos DVS, Concord


Session with Sam Pan (zhengp)


/b/home/zgp
run on mordant 

trigo -shell sybyl7.2
sybyl7.2
menu: 
tools, dvs
dvs addons
run diverse solution
session, open
browse to the gprs23.dvs file

data dir for source and output:
/appdata/assays/bioinformatics/ProjectManagement/HTS/computational_chemistry_analysis/
gpr23_clustering.log
                                                                                
trigo, start sybyl7.2 
use bin dir of:
/b/app/tripos-linux/sybyl7.2

ToDoList.txt has more info on environment req, etc.

eg cp $TA_ROOT/partner/PipeComm/pipecomm.cshrc $HOME/.pipecomm_cshrc (dated 2006-040)


env setup files:
.pipecomm_cshrc		--> copied from /b/app/tripos-linux/...   needed by DVS addon in Tripos Sybyl7.2
.concordrc		--> has definitions of the type %Outonerr that are proprietary to the SW, not case sensitive, may break DVS.

CCP4

Install

on volvoland, created link /b/app/ccp4 to 
/net/b/vol/vol2/asf-scratch2/t/tinh/ccpt_test_ins

	ran install.sh ::
Where do you want to install/extract the packages?
	/b/app/ccp4
Where do you want to install python?
	/b/app/ccp4/python			# did not exist before 
Where do you want to install tcl, tk and blt
	/b/app/ccp4/TclTkBlt		# ghaa had this dir already
Where do you want to install gsl and cgraph?
	/b/app/ccp4/gsl			# did not exist before
					# probably not needed, also exists in
					# chooch/lib
Where do you want to install chooch?
	/b/app/ccp4/ccp4-6.0.2/		# default


python, gsl, probably best left in usr_local or some such dir

Then copy the files to the real /b/app/ccp4
Not everything is needed, and some files, like setup, need renaming from the previous version.


removing duplicate or unnecessary files:
rm $SCR/bin/Lin/TclTkBlt-bin.tar
rm packages.tar tools.tar

move to real unix drive before copying to /b/app
asf-scratch2-tinh] 406) tar cf - ccp4_test_ins | (cd /net/b/vol/vol2/asf-scratch/user/tinh-old/ ; tar xf - )

SCR=/mnt/b/asf-scratch/user/tinh-old//ccp4_test_ins	# on verso
DST=/b/app/ccp4
mv  $SCR/... $DST/... 

(cd $SCR/ ; tar cf - ccp4-6.0.2 ) | (cd $DST ; tar xf - )
(cd $SCR/ ; tar cf - Coot-0.1.2 ) | (cd $DST ; tar xf - )
(cd $SCR/ ; tar cf - ccp4mg-1.0 ) | (cd $DST ; tar xf - )
mv bin/Lin/Python-bin.tar $DST/bin/Lin

(cd $SCR/ ; tar cf - python ) | (cd $DST ; tar xf - )
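
The repeated tar pipe above can be folded into one function. This is a sketch in POSIX sh; `copy_tree` is a made-up helper name. It uses && instead of ; so that a failed cd does not tar or untar in the wrong directory.

```shell
# Copy directory NAME from SRC_PARENT into DST_PARENT with a tar pipe.
# Usage: copy_tree SRC_PARENT NAME DST_PARENT
copy_tree() {
    src_parent=$1
    name=$2
    dst_parent=$3
    (cd "$src_parent" && tar cf - "$name") | (cd "$dst_parent" && tar xf -)
}

# Example, with SCR and DST set as above:
#   for pkg in ccp4-6.0.2 Coot-0.1.2 ccp4mg-1.0 python; do
#       copy_tree "$SCR" "$pkg" "$DST"
#   done
```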

sudo cp tmp_setup /b/app/ccp4/tmp_setup.602
sudo cp setup-scripts/csh/ccp4.setup        /b/app/ccp4/setup-scripts/csh/ccp4.setup.602
sudo cp setup-scripts/csh/ccp4-others.setup /b/app/ccp4/setup-scripts/csh/ccp4-others.setup.602
sudo cp setup-scripts/sh/ccp4-others.setup  /b/app/ccp4/setup-scripts/sh/ccp4-others.setup.602 
sudo cp setup-scripts/sh/ccp4.setup         /b/app/ccp4/setup-scripts/sh/ccp4.setup.602

# gsl should not be needed, see above.

created new /b/common/bin/ccp4i.602
sourced updated setup script in /b/app/ccp4/.../setup-script/...

HKL2000

Install

Need to create dir in /usr/local/hklint
Copy all the files from D's computer sfo201838
so that the list of detectors shows up.

MOPAC

Terms


1800 1000 taxa = number of species, sample, isolate, etc.  
(tends to be lined up vertically)

sequences lined up horizontally.  simply columns?

CDS = Coding Sequence  (exclude 5' and 3' UTR), include introns?

UTR = UnTranslated Region

TBR = tree-bisection-reconnection, a tree branch-swapping algorithm

[Doc URL: http://tin6150.github.io/psg/sci-app.html]
(cc) Tin Ho. See main page for copyright info.
