Sys Admin Pocket Survival Guide - SCI

HDF5

.hdf5

Hierarchical Data Format
API and library for packing large datasets/arrays into a single file (for transfer over a network?)
The library comes in serial and parallel builds.


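h5import  	# import ASCII or binary data into an hdf5 file
h5repack    	# copy an hdf5 file to a new file, changing layout/compression along the way
h5ls   		# list the contents (groups, datasets) of an hdf5 file
h5dump  	# dump file contents as text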
h5diff  	ph5diff
h5debug  
h5stat        
h5copy   

h5cc   	h5pcc
h5c++  
h5fc    h5pfc

h5jam  
h5unjam 

h5mkgrp        
h5redeploy  
h5repart  
h52gif  

gif2h5  

h5perf_serial

NetCDF

.nc

NetCDF - Network Common Data Form.
Often used to store array-oriented scientific data, much of it geospatial (eg input for CMAQ modeling, oceanography, ArcGIS).
The older CDF format was flat, whereas HDF is hierarchical. Thus HDF originally provided more grouping functionality to organize data, good for long term maintenance of the data "archive".
NetCDF v4 uses the HDF5 format underneath, with some restrictions. (It is typically considered a simpler API with essentially the same functionality.)
The CDF5 format comes from the parallel-netcdf project.
There are serial vs parallel versions (even for NetCDF v4?)

One can learn to use the nc commands without invoking hdf5 commands directly. For a plain user, the choice "between" .hdf5 and .nc is dictated by the application at hand, unless one is debating writing a new app.

NetCDF files are self-describing, with all the info in the header. The array structure recorded in the header also allows random seeks to the desired record.
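
Since NetCDF v4 files are HDF5 underneath, the first few bytes of a file are enough to tell the flavors apart. A minimal Python sketch, assuming the signature sits at offset 0 (HDF5 also allows a user block that pushes the signature to offset 512, 1024, ...; that case is not handled here). The names sniff_format and sniff_file are made up for this example:

```python
# Sketch: guess the container format from the leading bytes of a file.
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"   # HDF5 signature; netCDF-4 files start with it too

NETCDF_CLASSIC = {
    b"CDF\x01": "netCDF classic",
    b"CDF\x02": "netCDF 64-bit offset",
    b"CDF\x05": "netCDF CDF5 (64-bit data)",
}

def sniff_format(head: bytes) -> str:
    """Return a format label given the first 8 (or more) bytes of a file."""
    if head.startswith(HDF5_MAGIC):
        return "HDF5 (possibly netCDF-4)"
    return NETCDF_CLASSIC.get(head[:4], "unknown")

def sniff_file(path: str) -> str:
    """Convenience wrapper: read the first 8 bytes of path and classify them."""
    with open(path, "rb") as fh:
        return sniff_format(fh.read(8))
```

This only identifies the container; it cannot tell a netCDF-4 file from a plain HDF5 file, for that ncdump or h5dump is still needed.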


module load  gcc/6.3.0 hdf5/1.8.20-gcc-p netcdf/4.6.1-gcc-p 

ncdump 			# show structure of file in ascii (from NCAR)

ocprint 
nccopy  
ncgen  
ncgen3  
nc-config

Related tools

ncBrowse - view java graphics, 3D viz
ncview - view multi-dimensional data with changing color map, etc. X-based?
NCL - NCAR Command Language - viz netCDF files (and other formats)

IOAPI

IOAPI is the library used by CMAQ to read/write .nc files. (Fortran files are machine specific)
Much of the data in CMAQ is time-series, ie, for the variables/arrays defined, IOAPI automatically applies time stamp or time-step info to them.


JDATE: YYYYDDD (year, then 3-digit day of year) eg 2019035 is Feb 4 of 2019.
JTIME: HHMMSS = (10000 * Hour) + (100 * Minute) + Seconds
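
A small Python sketch of the two encodings, assuming the usual I/O API convention of a zero-padded, three-digit day of year (JDATE = 1000 * year + day-of-year). The helper names to_jdate, to_jtime and from_jdate are made up for this example, they are not IOAPI routines:

```python
from datetime import date, timedelta

def to_jdate(d: date) -> int:
    """Encode a calendar date as YYYYDDD (1000 * year + day of year)."""
    return d.year * 1000 + d.timetuple().tm_yday

def to_jtime(hour: int, minute: int, second: int) -> int:
    """Encode a time of day as HHMMSS."""
    return 10000 * hour + 100 * minute + second

def from_jdate(jdate: int) -> date:
    """Decode YYYYDDD back to a calendar date."""
    year, day_of_year = divmod(jdate, 1000)
    return date(year, 1, 1) + timedelta(days=day_of_year - 1)

print(to_jdate(date(2019, 2, 4)))   # 2019035
print(to_jtime(13, 5, 0))           # 130500
print(from_jdate(2019035))          # 2019-02-04
```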

Logical filename.  Max 16 chars.  Hides the NetCDF internal file structure from the user/programmer.

IOAPI cmd

From CMAQ manual (v4.6) p50 of pdf

M3XTRACT
extract a subset of variables from a file for a specified time interval

M3DIFF 
compute statistics for pairs of variables

M3STAT 
compute statistics for variables in a file

BCWNDW
build a boundary-condition file for a sub-grid window of a gridded file

M3EDHDR 
edit header attributes/file descriptive parameters

M3TPROC
compute time period aggregates and write them to an output file

M3TSHIFT 
copy/time shift data from a file

M3WNDW 
window data from a gridded file to a sub-grid

M3FAKE
build a file according to user specifications, filled with dummy data

VERTOT 
compute vertical-column totals of variables in a file

UTMTOOL 
coordinate conversions and grid-related computations for Lat/Lon, Lambert, and UTM

CMAQ

Community Modeling ... By EPA.
Ref: CMAQ manual (v4.6)
Modeling components and workflow overview - (see p56 of pdf).

CCTM - CMAQ chemistry-transport model - main program of the CMAQ modeler. Most of the other programs are pre-processors that prepare data for CCTM use.

Pre-processors

MM5 or WRF - Meteorology model (input?)
SMOKE - Emission model (not part of CMAQ)
MCIP - processes meteorological model output; produces GRIDDESC (one sub-type of IOAPI format) suitable for CCTM.
ICON - prepares initial conditions (eg from ascii) and produces a netcdf file as input to CCTM. The inputs are specific to the modeling grid and chem parameterization.
Data source could be ascii or previous CCTM output.
BCON - Boundary Condition. Produces a .nc file as input for CCTM.
Data source could be ascii or previous CCTM output.
JPROC - output is a lookup table of photolysis rates (in clear sky condition), which CCTM needs to do its modeling.

Support tools

PARIO - Governs communication in parallel runs of CCTM.
STENEX - diagnostic tool for CMAQ

BioInformatics File Format

sam

sam/bam/cram :
BCFtools: BCF2/VCF/gVCF
VCF: Variant call format. Mutation centric view

sequence alignment


Pearson/FASTA

FASTA by default wraps each seq at 80 chars, eg the first 3 lines of a DNA seq file:

>NODE_1_length_227998_cov_13.115952_pilon
ATATAACTGATATCCCTGAAGGTTATTAACTTGAATTATGACGCACTGATATTATTCATCAAATAATAACAAAATAGCCA
TTGCACCGGGTTGAACAATTTACAAAAGAAAGATATTCCATATCGTATAATGCGATTAAATACGCCGTCTTATAGAAAAT

BEAST2 expects an alignment where each seq is just 1 long line for the whole seq 
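
To go from the wrapped form to the one-line-per-seq form, the wrapped lines just need to be joined. A minimal Python sketch; the function name unwrap_fasta and the toy sequences are made up for this example, and no claim is made about what else BEAST2 needs from the file:

```python
def unwrap_fasta(lines):
    """Yield (header, sequence) pairs with each sequence joined onto one line."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # flush the previous record
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:                # flush the last record
        yield header, "".join(chunks)

# Toy input, wrapped at 10 chars instead of 80 to keep it short.
wrapped = """>seq1
ATATAACTGA
TTGCACCGGG
>seq2
ACGT
""".splitlines()

for name, seq in unwrap_fasta(wrapped):
    print(f">{name}\n{seq}")
```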


FASTQ = fasta + quality values (from instrument)
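
A sketch of what the quality values look like once decoded, in Python. It assumes plain 4-line records (no wrapped sequences) and Phred+33 encoding (quality = ASCII code minus 33), which is common but not universal; older Illumina output used a different offset. The function name read_fastq and the sample record are made up:

```python
def read_fastq(lines):
    """Yield (name, sequence, phred_scores) from unwrapped 4-line FASTQ records."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(rows) - 3, 4):
        name, seq, plus, qual = rows[i:i + 4]
        if not name.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at non-blank line {i + 1}")
        # Phred+33: each quality character encodes (score + 33) as ASCII.
        yield name[1:], seq, [ord(c) - 33 for c in qual]

record = ["@r1\n", "ACGT\n", "+\n", "II5!\n"]
for name, seq, scores in read_fastq(record):
    print(name, seq, scores)   # r1 ACGT [40, 40, 20, 0]
```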

ClustalW
GCG MSF
Phylip interleaved
Phylip sequential

The following phylip example is from scikit-bio.
paup can import it and generate a nexus file without complaint.

One key detail seems to be that the 2nd number is the length of the shortest seq. Thus, if the .phy output of snp-sites is rejected by paup, reducing this 2nd number a tad can help it parse things without bailing out.

Another issue previously found by trial and error is that the taxa name has to be fixed length (I used 8.3); if not, paup misreads where the sequence begins.
The doc actually says:
Strict PHYLIP requires that each sequence identifier is exactly 10 characters long (padded with spaces as necessary)
I used 8 chars and snp-sites added a tab, which makes 9 chars, which is why I had to shorten the length (probably by 1 is enough).
In future, all gff files should have exactly 9 chars in the filename.

      5    42
Turkey    AAGCTNGGGC ATTTCAGGGT GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT ACAGGTTGGC CGTTCAGGGT AA
Chimp     AAACCCTTGC CGTTACGCTT AAACCGAGGC CGGGACACTC AT
Gorilla   AAACCCTTGC CGGTACGCTT AAACCATTGC CGGTACGCTT AA
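
The 10-character name rule is easy to enforce when writing the file oneself. A minimal Python sketch of a sequential strict-PHYLIP writer; write_strict_phylip is a made-up helper, it refuses over-long names rather than truncating them, and it writes the sequence unbroken (without the groups of 10 shown above):

```python
def write_strict_phylip(records):
    """records: list of (name, sequence). Returns sequential strict-PHYLIP text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError(f"sequences differ in length: {sorted(lengths)}")
    nchar = lengths.pop()
    out = [f"{len(records)} {nchar}"]         # header: ntax nchar
    for name, seq in records:
        if len(name) > 10:
            raise ValueError(f"name longer than 10 chars: {name!r}")
        out.append(f"{name:<10}{seq}")        # pad name with spaces to 10 chars
    return "\n".join(out) + "\n"

print(write_strict_phylip([("Turkey", "AAGC"), ("H. Sapiens", "ACCG")]), end="")
```

Checking that all sequences have the same length up front also catches the case where the header count and the data disagree.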

prokka genome annotation related - https://github.com/tseemann/prokka

Extension	Description
.gff	This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.
.gbk	This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence.
.fna	Nucleotide FASTA file of the input contig sequences.
.faa	Protein FASTA file of the translated CDS sequences.
.ffn	Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA)
.sqn	An ASN1 format "Sequin" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc.
.fsa	Nucleotide FASTA file of the input contig sequences, used by "tbl2asn" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines.
.tbl	Feature Table file, used by "tbl2asn" to create the .sqn file.
.tsv	Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product
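
The .tsv is the easiest of these to script against. A Python sketch that tallies features by type, assuming the file starts with a header row that includes an ftype column as listed above; the sample rows and the helper name count_feature_types are made up for illustration:

```python
import csv
import io
from collections import Counter

def count_feature_types(handle):
    """Count rows per ftype in a prokka-style .tsv (header row expected)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["ftype"] for row in reader)

# Made-up sample rows, using the column names listed above.
sample = io.StringIO(
    "locus_tag\tftype\tlen_bp\tgene\tEC_number\tCOG\tproduct\n"
    "X_00001\tCDS\t1353\tdnaA\t\t\tChromosomal replication initiator protein\n"
    "X_00002\tCDS\t1101\tdnaN\t\t\tBeta sliding clamp\n"
    "X_00003\ttRNA\t76\t\t\t\ttRNA-Ala\n"
)
print(count_feature_types(sample))
```

For a real file, pass an open file handle instead of the StringIO.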

.nex = Nexus tree format aka PAUP format!
.nex eg
The example .nex below was generated by PAUP* 4.0a via the command: toNexus / format=PHYLIP fromFile=EG.phy toFile=paup_EG.nex ;

#NEXUS

[!
PHYLIP file "/global/scratch/users/tin/EG.phy"
  (imported Sat Nov  5 22:18:07 2022)
]
Begin data;
    Dimensions ntax=5 nchar=42;
    Format datatype=nucleotide gap=- missing=? matchchar=.;
    Matrix
Turkey      AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT
Salmo_gair  AAGCCTTGGCAGTGCAGGGTGAGCCGTGGCCGGGCACGGTAT
H._Sapiens  ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA
Chimp       AAACCCTTGCCGTTACGCTTAAACCGAGGCCGGGACACTCAT
Gorilla     AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA
    ;
End;

MrBayes' expectation of a .nex file differs slightly from what paup generates (eg when it converts from .phy). Maybe MrBayes doesn't have any import function, no equivalent of "toNexus" from .phy :-\ Differences:

PAUP:    Format datatype=nucleotide gap=- missing=? matchchar=.;
MrBayes: Format datatype=DNA        gap=- missing=? matchchar=.;

MrBayes does not like quotes around 'taxaname' 
s/'//g
MrBayes does not like the tab that paup adds between taxa and seq data
s/\t/\ /g
may need an actual tab char in the pattern, which looks like ^I in list mode.
ALSO, the header and footer sections have indentation; replacing tab with space seems ok, _ will not work there, so hand edit a few lines.
With MrBayes, nchar= has to be exact.  Unlike with Paup, it can't be smaller (MrBayes complains when it reads extra stuff that it does not expect)
Dimensions ntax=65 nchar=217746;

PS. if the .gff name doesn't have - in it (all the way from the roary step), then taxa names in the .nex won't have quotes around them.  paup toNexus adds the quotes, but mb dislikes them.

MrBayes handles a max line length of 99,990, so if a seq is longer than that, the interleave format must be used.
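
The hand edits above can be scripted. A rough Python sketch that applies the same substitutions (datatype, quotes, tabs) and flags over-long lines; paup_nex_to_mrbayes is a made-up helper, it only covers the differences listed in these notes, and it does not fix nchar or convert to interleave:

```python
MAX_MB_LINE = 99_990  # MrBayes line-length limit quoted in the notes above

def paup_nex_to_mrbayes(text: str) -> str:
    """Apply the edits described above to PAUP-generated NEXUS text."""
    fixed = []
    for line in text.splitlines():
        # Only touch the Format line when renaming the datatype.
        if line.lstrip().lower().startswith("format "):
            line = line.replace("datatype=nucleotide", "datatype=DNA")
        # Same effect as s/'//g and s/\t/ /g on every line.
        line = line.replace("'", "").replace("\t", " ")
        if len(line) > MAX_MB_LINE:
            raise ValueError("line too long for MrBayes; use interleave format")
        fixed.append(line)
    return "\n".join(fixed) + "\n"

src = (
    "Begin data;\n"
    "\tFormat datatype=nucleotide gap=- missing=? matchchar=.;\n"
    "'chi__QQ14'\tACA\n"
    "End;\n"
)
print(paup_nex_to_mrbayes(src), end="")
```

Like the sed one-liners, this strips quotes everywhere, including inside comment blocks, so check the result by eye before feeding it to mb.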

nexus file sketch, eg with tree at the end:

#NEXUS

[!
	comment block (when .nex created by paup)
]

Begin taxa;
        Dimensions ntax=71;
        Taxlabels
                1940_AF369693_M112_4_Sudan
                1946_U52403_Nigeria46_Nigeria
                1948_AF369694_A709_4_A2_Uganda
                1953_AH005112_RENDU_Senegal
		[...]
                ;
End;

Begin trees;
        Translate
                1 1940_AF369693_M112_4_Sudan,
                2 1946_U52403_Nigeria46_Nigeria,
                3 1948_AF369694_A709_4_A2_Uganda,
		[...]
                71 2009_TVP11767_Trinidad
                ;
tree TREE1 = [&R] ((((4[&length_range={20.114391506501633,217.16984306215267},height_95%_HPD={55.99999999999977,56.00000000000023},rate_95%_HPD={2.0089799481037352E-4,6.327035462243147E-4},length_95%_HPD={38.79833130234596,169.78101583683446},rate=3.2336282842901544E-4,length=112.03265216906932,rate_median=2.660354276765452E-4,length_median=117.72025942197638,height_median=56.0,rate_range={1.8617778258136157E-4,0.001075846428088821},height_range={55.99...
End;

.nex for paup with commands to execute

#NEXUS

[!
	comment block 
]

Begin data;
    Dimensions ntax=65 nchar=218039;
    Format datatype=nucleotide gap=- missing=? matchchar=. interleave;
    Matrix
'chi__QQ14'  ACA
'chi__QQ16'  ACAA
[! more data here ]
    ;
End;

begin SKIP__paup;
  log file=mytrees_jh.LOG2;
  set autoclose=yes warnreset=no increase=auto;
  set crit=likelihood;

  nj;
  lset nst=2 tratio=est basefreq=est rates=gamma shape=est;
  lscores 1;  [! 1221  12min on E5, 10min on csl]
  lset tratio=prev basefreq=prev shape=prev;
  hs start=1;  [! 1250 ]
  savetrees file=mytrees.tre replace=yes;

  lset tratio=est basefreq=est shape=est;
  lscores 1;
  lset tratio=prev basefreq=prev shape=prev;
  hs start=1;
  savetrees file=mytrees.tre append=yes;

  lset tratio=est basefreq=est shape=est;
  lscores 1;
  lset tratio=prev basefreq=prev shape=prev;
  hs start=1;
  savetrees file=mytrees.tre append=yes;
end;


ChemInformatics File Format



PDB
                                         
Protein databank:

http://pdb.org or http://rcsb.org/pdb/

 


Sample protein:
1tii		A protein of 900+ residues, not huge, but a sizable molecule for testing 3D rendering (especially in space filling model)

11AS, 117E	Some pdb files around the size of Haemoglobin

1A00            Hemoglobin
1z1g            Topoisomerases, a large protein, "Molecule of the Month" for Dec 2006.





.smi

Smiles - 
http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html

SMILES (Simplified Molecular Input Line Entry System) is a line notation (a typographical method using printable characters) for entering and representing molecules and reactions. Some examples are:
CC  			ethane  	
[OH3+]  		hydronium ion
c1ccccc1  		benzene  	
N[C@H](C)C(=O)O  	D-alanine
COc1cc(cc(OC)c1OC)C(=O)N\N=C(/C)\c2ccc(cc2)S(=O)(=O)N[C@@H](C)C(=O)O	some carboxylic acid
                  

OpenEye Omega takes .smi input (smiles) and produces .oeb.gz files.





.oeb

.oeb receptor file containing the active site for docking.
Probably OpenEye proprietary.  Fred can open these files (and maybe ROCS?)












Engineering is the art of making compromises.
Science is the reverse engineering of the compromises made by nature.
Medicine is the hacking of the scientific knowledge base.   - A comp sci student :-)














  [Doc URL: http://tin6150.github.io/psg/sci-file.html] 


(cc) Tin Ho. See main page for copyright info.





