AggregatePlotter
AggregateQCStats
AlignmentEndTrimmer
AllelicExpressionComparator
AllelicExpressionDetector
AllelicExpressionMerger
AllelicExpressionRNASeqWriter
AllelicMethylationDetector
AnnotateBedWithGenes
AnnotatedVcfParser
ArupPipelineWrapper
AvatarAssembler
AvatarComparator
BamConcordance
BamContextInspector
BamHg19ToB37Converter
BamBlaster
BamMixer
Bar2Gr
Bar2USeq
Bed2Bar
BedStats
BedTabix
Bed2UCSCRefFlat
BedRegionSplitter
BisSeq
BisSeqAggregatePlotter
BisSeqErrorAdder
BisStat
BisStatRegionMaker
CalculatePerCycleErrorRate
ChIPSeq
ClusterMultiSampleVCF
CollectBamStats
CompareIntersectingRegions
CompareIntersectingVcfs
CompareParsedAlignments
ConcatinateFastas
CorrectVCFEnds
CorrelatePointData
CountChromosomes
BisulfiteConvertFastas
Consensus
CorrelationMaps
ConvertFasta2GCBarGraph
DbNSFPCoordinateConverter
DefinedRegionBisSeq
DefinedRegionDifferentialSeq
DefinedRegionRNAEditing
DefinedRegionScanSeqs
DRDSAnnotator
EnrichedRegionMaker
EstimateErrorRates
ExactBamMixer
ExportExons
ExportIntergenicRegions
ExportIntronicRegions
ExportTrimmedGenes
FastqBarcodeTagger
FastqInterlacer
FastqRenamer
FetchGenomicSequences
FindNeighboringGenes
FindOverlappingGenes
FindSharedRegions
FileCrossFilter
FileMatchJoiner
FileJoiner
FileSplitter
FilterIntersectingRegions
FilterPointData
FoundationVcfComparator
FoundationXml2Vcf
FreebayesVCFParser
GatkCalledSegmentAnnotator
GatkRunner
GeneiASEParser
Graph2Bed
GenerateOverlapStats
Gr2Bar
InosinePredict
IntersectLists
IntersectKeyWithRegions
IntersectRegions
JointGenotypeVCFParser
KeggPathwayEnrichment
KnownSpliceJunctionScanner
LofreqVCFParser
MafParser
MakeSpliceJunctionFasta
MakeTranscriptome
MaskExonsInFastaFiles
MaskRegionsInFastaFiles
MatchMates
MaxEntScanScore3
MaxEntScanScore5
MergeAdjacentRegions
MergeExonMetrics
MergeOverlappingGenes
MergePairedAlignments
MergePointData
MergeRegions
MergeSams
MergeUCSCGeneTable
MethylationArrayScanner
MethylationArrayDefinedRegionScanner
MicrosatelliteCounter
MiRNACorrelator
MpileupParser
MpileupRandomizer
MultipleReplicaScanSeqs
MultiSampleVCFFilter
MutectVCFParser
Mutect4VCFParser
NonReferenceRegionMaker
NovoalignBisulfiteParser
NovoalignIndelParser
NovoalignParser
NovoalignPairedParser
OligoTiler
OverdispersedRegionScanSeqs
ParseExonMetrics
ParseIntersectingAlignments
ParsePointDataContexts
PeakShiftFinder
PointDataManipulator
PoReCNV
Primer3Wrapper
PrintSelectColumns
QCSeqs
QueryIndexer
RandomizeTextFile
RankedSetAnalysis
ReadCoverage
ReferenceMutator
RNAEditingPileUpParser
RNAEditingScanSeqs
RNASeq
RNASeqSimulator
S3UrlMaker
Sam2Fastq
SamFastqLoader
Sam2USeq
SamAlignmentDepthMatcher
SamAlignmentExtractor
SamComparator
SamParser
SamTranscriptomeParser
SamSplitter
SamReadDepthSubSampler
SamSVFilter
SamSubsampler
ScalpelVCFParser
ScanSeqs
SubtractRegions
ScoreChromosomes
ScoreParsedBars
ScoreSequences
Sgr2Bar
Simulator
StrandedBisSeq
SRAProcessor
SubSamplePointData
Tag2Point
TempusJson2Vcf
TempusVcfComparator
Text2USeq
TomatoFarmer
Telescriptor
TNRunner
TRunner
UCSCBig2USeq
USeq2UCSCBig
USeq2Text
VarScanVCFParser
VCFBackgroundChecker
VCFBamAnnotator
VCF2Bed
VCFAnnotator
VCFCallFrequency
VCFComparator
VCFConsensus
VCFFdrEstimator
VCFMerger
VCFMpileupAnnotator
VCFVariantMaker
VCFNoCallFilter
VCFRegionFilter
VCFRegionMarker
VCFReporter
VCFSelector
VCFSpliceScanner
VCFTabix
VCF2Tsv
Wig2Bar
Wig2USeq
ScoreMethylatedRegions
ScoreEnrichedRegions
SomaticSniperVCFParser
StrelkaVCFParser
**************************************************************************************
** Aggregate Plotter: Nov 2017 **
**************************************************************************************
Fetches point data contained within each region, inverts - stranded annotation, zeros
the coordinates, sums, and window averages the values. Useful for generating
class averages from a list of annotated regions. Use a spreadsheet app to graph the
results.
Options:
-t PointData directories, full path, comma delimited. These should contain chromosome
specific xxx.bar.zip files.
-b Bed file (chr, start, stop, text, score, strand(+/-/.)), full path, containing
regions to stack. Must be all the same size.
-p Peak shift, average distance between + and - strand peaks. Will be used to shift
the PointData by 1/2 the peak shift, defaults to 0.
-u Strand usage, defaults to 0 (combine), 1 (use only same strand), 2 (opposite
strand), or 3 (ignore). Use this option to select particular stranded data to
aggregate.
-r Replace scores with 1.
-f Pad start and stop of each bed region xxx bps, defaults to 0.
-d Delog2 scores. Do it if your data is in log2 space.
-v Convert each region scores to % of total.
-n Divide scores by the number of regions.
-k Divide each region's score by this value.
-l Divide each region's score by the total number of observations.
-s Scale all regions to a particular size. Defaults to max region size.
-a Average region scores instead of summing.
Example: java -Xmx1500M -jar pathTo/USeq/Apps/AggregatePlotter -t
/Data/PolIIRep1/,/Data/PolIIRep2/ -b /Anno/tssSites.bed -p 73 -u 1
-l
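The stacking step described above can be sketched in Python. This is an illustrative sketch only, not the app's code: `aggregate` and its in-memory inputs are hypothetical, and peak shifting, padding, scaling, and window averaging are left out.

```python
def aggregate(regions, point_data):
    """Sum point data across equally sized regions after zeroing coordinates.

    regions: list of (start, stop, strand) tuples on one chromosome.
    point_data: dict mapping position -> score (hypothetical stand-in for PointData).
    """
    size = regions[0][1] - regions[0][0]
    sums = [0.0] * size
    for start, stop, strand in regions:
        # Zero the coordinates: index 0 is the first base of the region.
        values = [point_data.get(pos, 0.0) for pos in range(start, stop)]
        # Invert minus strand annotation so all regions share one orientation.
        if strand == "-":
            values.reverse()
        for i, value in enumerate(values):
            sums[i] += value
    return sums


regions = [(10, 13, "+"), (20, 23, "-")]
point_data = {10: 1.0, 12: 2.0, 20: 4.0}
print(aggregate(regions, point_data))
```

Dividing each entry of the returned list by the number of regions gives a per-region mean, which is one way to read the -n option.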
**************************************************************************************
**************************************************************************************
** Aggregate QC Stats: May 2019 **
**************************************************************************************
Parses and aggregates alignment quality statistics from json files produced by the
SamAlignmentExtractor, MergePairedAlignments, Sam2USeq, BamConcordance and Fastq rule.
Options:
-j Directory containing xxx.json.gz files for parsing. Recurses through all other
directories contained within.
-r Results directory for writing the summary xls spreadsheets for graphing.
Default Options:
-f FastqCount regex for parsing sample name, note the name must be identical across
the json files, defaults to (.+)_FastqCount.json.gz, case insensitive.
-s SAE regex, defaults to (.+)_SamAlignmentExtractor.json.gz
-m MPA regex, defaults to (.+)_MergePairedAlignments.json.gz
-u S2U regex, defaults to (.+)_Sam2USeq.json.gz
-b BC regex, defaults to (.+)_BamConcordance.json.gz
-p String to prepend onto output file names.
-c Don't calculate detailed region read coverage statistics, saves memory and time.
-v Print verbose debugging output.
-e Replace Exome with DNA in all file reference names.
Example: java -Xmx1G -jar pathToUSeq/Apps/AggregateQCStats -j . -r QCStats/ -p TR774_
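To check that file names will yield identical sample names across the json files, the default -f pattern can be tried outside the app. The Python sketch below assumes the pattern is applied to the whole file name; `sample_name` is a hypothetical helper, not part of USeq.

```python
import re

# Default -f pattern from the help text, compiled case insensitive as documented.
# The unescaped dots match any character, exactly as in the original pattern.
FASTQ_COUNT = re.compile(r"(.+)_FastqCount.json.gz", re.IGNORECASE)


def sample_name(file_name):
    """Return the captured sample name, or None when the file name does not match."""
    match = FASTQ_COUNT.fullmatch(file_name)  # assumption: whole-name match
    return match.group(1) if match else None


for name in ("TR774_Tumor_FastqCount.json.gz", "TR774_Tumor_Sam2USeq.json.gz"):
    print(name, "->", sample_name(name))
```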
**************************************************************************************
**************************************************************************************
** Alignment End Trimmer: Dec 2017 **
**************************************************************************************
This application can be used to trim alignments according to the density of mismatches.
Each base of the alignment is compared to the reference sequence from the start of the
alignment to the end. If the bases match, the score is increased by -m. If the bases
don't match, the score is decreased by -n. The alignment position with the highest
score is used as the new alignment end point. The cigar string, alignment position,
mpos and flags are all updated to reflect trimming.
Notes:
1) Insertions, deletions and skips are currently not counted as matches or mismatches
Required:
-i Path to the original alignment, sam/bam/sam.gz OK.
-r Path to the reference sequence, gzipped OK.
-o Name of the trimmed alignment output. Output is bam and bai.
Optional:
-m Score of match. Default 1
-n Score of mismatch. Default 2
-v Verbose output. This will write out detailed information for every trimmed read.
It is suggested to use this option only on small test files.
-l Min length. If the trimmed length is less than this value, the read is switched
to unaligned. Default 10bp
-e Turn on RNA Editing mode. A>G (forward reads) and T>C (reverse reads) are considered
matches.
-s Turn on mismatch scoring mode. Reads with more than -x mismatches are dropped. If
RNA Editing mode is on, A>G (forward reads) and T>C (reverse reads) are considered
matches.
-x Max number of mismatches allowed in mismatch scoring mode. Default 0
Examples:
1) java -Xmx4G -jar /path/to/AlignmentEndTrimmer -i 1000X1.bam -o 100X1.trim.bam
-r /path/to/hg19.fasta
2) java -Xmx4G -jar /path/to/AlignmentEndTrimmer -i 1000X1.bam -o 100X1.trim.bam
-r /path/to/hg19.fasta -m 0.5 -n 3
3) java -Xmx4G -jar /path/to/AlignmentEndTrimmer -i 1000X1.test.bam
-o 100X1.test.trim.bam -r /path/to/hg19.fasta -v
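The scoring rule described above can be illustrated with a small Python sketch. It covers only ungapped read/reference strings and is not the app's implementation. `trim_end` is a hypothetical name, and the tie-breaking and the all-mismatch behavior are assumptions.

```python
def trim_end(read, ref, match=1.0, mismatch=2.0):
    """Return how many bases to keep, scanning from the alignment start.

    The running score rises by `match` when bases agree and falls by
    `mismatch` when they differ; the highest-scoring position is the new end.
    """
    score = 0.0
    best_score = 0.0
    best_len = 0  # assumption: a read that never scores above 0 trims to length 0
    for i, (read_base, ref_base) in enumerate(zip(read, ref)):
        score += match if read_base == ref_base else -mismatch
        if score >= best_score:  # assumption: ties favor the longer alignment
            best_score = score
            best_len = i + 1
    return best_len


# Two trailing mismatches pull the score down, so the end moves back to base 4.
print(trim_end("ACGTAA", "ACGTCC"))
# A single internal mismatch is outweighed by the matches that follow it.
print(trim_end("ACGAACGT", "ACGTACGT"))
```

Raising -n relative to -m makes trimming more aggressive, because each mismatch then needs more downstream matches to recover the score.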
**************************************************************************************
**************************************************************************************
** Allelic Expression Comparator: Oct 2014 **
**************************************************************************************
Looks for changes in allelic expression between two conditions. First run the
AllelicExpressionDetector on each condition. Only snps that have the minimum # samples
and that pass the FDR and log2Rto thresholds in one of the conditions are compared.
Required Options:
-d Directory containing nameSnp.obj files to compare from the AllelicExpressionDetector.
-s Save directory.
Default Options:
-f Minimum -10Log10(FDR) for individual condition allelic expression, default 13
-l Minimum abs(log2Ratio) for individual condition allelic expression, default 1
-m Minimum samples in each condition to compare Snp count data, defaults to 2.
-r Full path to R. Defaults to '/usr/bin/R'
Example: java -Xmx4G -jar pathTo/USeq/Apps/AllelicExpressionComparator -s EyeAEC/
-d EyeAED
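The -f threshold is expressed as -10Log10(FDR), so the default of 13 corresponds to an FDR of roughly 0.05. The Python helpers below are hypothetical and only illustrate the conversion.

```python
import math


def fdr_to_score(fdr):
    """Convert an FDR (0-1) to the -10Log10(FDR) scale used by -f."""
    return -10.0 * math.log10(fdr)


def score_to_fdr(score):
    """Convert a -10Log10(FDR) score back to an FDR."""
    return 10.0 ** (-score / 10.0)


print(round(score_to_fdr(13), 4))
print(round(fdr_to_score(0.01), 1))
```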
**************************************************************************************
**************************************************************************************
** Allelic Expression Detector: Sept 2016 **
**************************************************************************************
Application for identifying allelic expression based on a table of snps and bam
alignments that have been filtered for alignment bias. See the ReferenceMutator and
SamComparator apps. Uses DESeq2 to identify differential expression between alleles.
Required Options:
-n Sample names to process, comma delimited, no spaces.
-b Directory containing coordinate sorted bam and index files named according to their
sample name.
-d SNP data file containing all sample snp calls.
-e Results directory.
-s SNP map bed file from the ReferenceMutator app.
-t Tabix gz indexed bed file of exons where the name column is the gene name, see
ExportExons and https://github.com/samtools/htslib
Default Options:
-g Minimum GenCall score, defaults to 0.25
-q Minimum alignment base quality at snp, defaults to 20
-m Minimum alignment read coverage, defaults to 4
-p Minimum number replicas with heterozygous snp to score, defaults to 3
-r Full path to R (version 3+) loaded with DESeq2, see http://www.bioconductor.org
Type 'library(DESeq2)' in R to see if it is installed. Defaults to '/usr/bin/R'
Example: java -Xmx4G -jar pathTo/USeq/Apps/AllelicExpressionDetector -b Bam/RPENormal/
-n D002-14,D005-14,D006-14,D009-14 -d GenotypingResults.txt.gz -s SNPMap_Ref2Alt_Int.txt
-e RPENormal -t ~/Anno/b37EnsGenes7Sept2016_Exons.bed.gz
**************************************************************************************
**************************************************************************************
** Allelic Expression Merger : Sept 2016 **
**************************************************************************************
App to merge two GeneiASE tables from the AllelicExpressionDetector and the
AllelicExpressionRNASeqWriter. Where geneName coor duplicates are found, writes out
the first table's record to the merged file.
Required Arguments:
-f First GeneiASE table (gene snp.id alt.dp ref.dp)
-s Second GeneiASE table (gene snp.id alt.dp ref.dp)
-m Merged output table results.
Example: java -Xmx4G -jar pathTo/USeq/Apps/AllelicExpressionMerger -f snpTable.txt
-s rnaSeqTable.txt -m mergedGeneiASETable.txt
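The first-table-wins rule for duplicates can be sketched as below. This Python sketch assumes the duplicate key is the first two columns (gene and snp.id); `merge_tables` is a hypothetical helper, not the app's code.

```python
def merge_tables(first_rows, second_rows):
    """Merge rows of (gene, snp_id, alt_dp, ref_dp), keeping the first table's row on duplicates."""
    merged = {}
    for row in list(first_rows) + list(second_rows):
        key = (row[0], row[1])  # assumption: gene plus snp.id identifies a duplicate
        merged.setdefault(key, row)  # later rows with the same key are ignored
    return list(merged.values())


first = [("GENE1", "1_100", "5", "7")]
second = [("GENE1", "1_100", "9", "9"), ("GENE2", "2_50", "3", "4")]
for row in merge_tables(first, second):
    print("\t".join(row))
```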
**************************************************************************************
**************************************************************************************
** Allelic Expression RNASeq Writer : Sept 2016 **
**************************************************************************************
Application for parsing count data for downstream Allele Specific Gene Expression
detection, e.g. GeneiASE. Avoids snvs with vars within the read length, skips INDELs.
Required Arguments:
-b Bam file with associated index from an RNASeq experiment after filtering for
allelic alignment bias.
-v Vcf file containing snvs to use in extracting alignment counts from the bam. These
will be filtered using the args below before saving.
-t Tabix gz indexed bed file of exons where the name column is the gene name, see
ExportExons and https://github.com/samtools/htslib
-o Output file.
Default Arguments:
-l Read length, defaults to 50
-c Minimum alignment depth for quality bases, defaults to 10
-q Minimum base quality, defaults to 20
-a Minimum Alt alignment depth, defaults to 2
-r Minimum Ref alignment depth, defaults to 2
-m Minimum allele frequency, defaults to 0.05
-x Maximum allele frequency, defaults to 0.95
-p Don't print header
Example: java -Xmx4G -jar pathTo/USeq/Apps/AllelicExpressionRNASeqWriter -b proc.bam
-v lofreq.vcf.gz -t ~/Anno/b37EnsGenes7Sept2016_Exons.bed.gz -o forGeneiASE.txt.gz
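The default count filters can be read as a single pass/fail test per snv. The Python sketch below is an interpretation, not the app's code: it assumes depth is alt plus ref and that allele frequency is alt / (alt + ref). `passes` is a hypothetical name.

```python
def passes(alt_dp, ref_dp, min_depth=10, min_alt=2, min_ref=2,
           min_af=0.05, max_af=0.95):
    """Apply the documented default thresholds (-c, -a, -r, -m, -x) to one snv."""
    depth = alt_dp + ref_dp  # assumption: only ref and alt quality bases are counted
    if depth < min_depth or alt_dp < min_alt or ref_dp < min_ref:
        return False
    allele_freq = alt_dp / depth
    return min_af <= allele_freq <= max_af


print(passes(5, 5))    # balanced counts with enough depth
print(passes(1, 20))   # too few alt observations
print(passes(3, 3))    # total depth below the minimum
```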
**************************************************************************************
**************************************************************************************
** Allelic Methylation Detector: March 2014 **
**************************************************************************************
AMD identifies regions displaying allelic methylation, e.g. ~50% average mCG
methylation yet individual read pairs show a bimodal fraction distribution of either
fully methylated or unmethylated.
Options:
-s Save directory.
-f Fasta file directory.
-t BAM file directory containing one or more xxx.bam files with their associated xxx.bai
index. The BAM files should be sorted by coordinate and have passed Picard
validation.
-a Minimum number alignments per region, defaults to 15.
-e Minimum number Cs in each alignment, defaults to 6
-m Minimum region fraction methylation, defaults to 0.4
-x Maximum region fraction methylation, defaults to 0.6
-r Full path to R, defaults to /usr/bin/R
-c Converted CG context PointData directories, full path, comma delimited. These
should contain stranded chromosome specific xxx_-/+_.bar.zip files. One
can also provide a single directory that contains multiple PointData
directories. Use the ParsePointDataContexts on the output of the
NovoalignBisulfiteParser to select CG contexts.
-n Non-converted PointData directories, ditto.
-b Provide a bed file (chr, start, stop,...), full path, to scan a list of regions
instead of the genome. See, http://genome.ucsc.edu/FAQ/FAQformat#format1
Example: java -Xmx4G -jar pathTo/USeq/Apps/AllelicMethylationDetector -s AMD
-f Fastas/ -t Bams/ -c PointData/Con -n PointData/NonCon
**************************************************************************************
**************************************************************************************
** Annotate Bed With Genes: Nov 2018 **
**************************************************************************************
Takes a bed like file and a UCSC gene table, intersects them and adds a new column to
the file with the gene names that intersect the gene exons or regions.
Parameters:
-u UCSC RefFlat or RefSeq gene table file, full path. See,
http://genome.ucsc.edu/cgi-bin/hgTables, (geneName name2(optional) chrom strand
txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds).
-b Bed like file of regions to intersect with genes, gz/zip OK
-i Indexes defining chr start stop columns, defaults to 0,1,2 for bed format.
-r Gzipped results file.
-p Bp padding to expand the bed regions when intersecting with genes.
-g Intersect gene regions with bed, not gene exons.
Example: java -Xmx2G -jar pathTo/USeq/Apps/AnnotateBedWithGenes -p 100 -g -i 1,2,3
-b targetRegions.bed -r targetRegionsWithGenes.txt.gz -u hg19EnsGenes.ucsc.gz
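The effect of the -p padding on intersection can be shown with a short Python sketch. It assumes half-open bed style coordinates and whole gene regions (as with -g); `genes_for_region` is a hypothetical helper.

```python
def genes_for_region(start, stop, genes, pad=0):
    """Return names of genes overlapping [start - pad, stop + pad).

    genes: list of (name, gene_start, gene_stop) on the same chromosome.
    """
    padded_start, padded_stop = start - pad, stop + pad
    return [name for name, gene_start, gene_stop in genes
            if gene_start < padded_stop and padded_start < gene_stop]


genes = [("A", 100, 200), ("B", 300, 400)]
print(genes_for_region(210, 250, genes))          # no overlap without padding
print(genes_for_region(210, 250, genes, pad=20))  # padding reaches gene A
```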
**************************************************************************************
**************************************************************************************
** Annotated Vcf Parser: Sept 2018 **
**************************************************************************************
Splits VCF files that have been annotated with SnpEff w/ dbNSFP and clinvar, plus the
VCFBackgroundChecker and VCFSpliceScanner USeq apps into passing and failing records.
Use the -r option to inspect the effect of the various filters on each record. Use
the VCFRegionFilter app to restrict variants to particular gene regions.
Options:
-v File path or directory containing xxx.vcf(.gz/.zip OK) file(s) to filter.
-s Directory for saving the results.
-f Perform a candidate somatic variant processing. Setting the following overrides
the defaults.
-d Minimum DP alignment depth
-m Minimum AF allele frequency
-x Maximum AF allele frequency
-j Ignore the max AF filter for ACMG incidental germline gene variants.
-p Maximum population allele frequency, only applies if present.
-b Maximum fraction of BKAF samples with allele frequency >= VCF AF, only applies
if present.
-g Splice junction types to scan.
-n Minimum difference in splice junction scores, only applies if present.
-a Comma delimited list of SnpEff ANN impact categories to select for.
-c Comma delimited list of CLINSIG terms to select for.
-e Comma delimited list of CLINSIG terms to select against.
-i Comma delimited list of VCF ID keys to select for. If the VCF ID contains one or
more, the record is passed regardless of other filters. The match is not exact.
-o Only require, if set or present, SnpEff ANN or CLINSIG or Splice to be true to pass.
Defaults to require that all set pass.
-r Verbose per record output.
-y Path to a config txt file for setting the above.
Example: java -jar pathToUSeq/Apps/AnnotatedVcfParser -v VCFFiles/ -s Parsed/
-d 75 -m 0.05 -x 0.75 -j -p 0.02 -b 0.1 -g D5S,D3S,G5S,G3S -n 3.5 -a
HIGH,MODERATE -c Pathogenic,Likely_pathogenic -i Foundation,Tempus -r
**************************************************************************************
**************************************************************************************
** ArupPipelineWrapper: April 2016 **
**************************************************************************************
This app wraps ARUP's pipeline.jar app for generating QC metrics, annotating variants,
and, lastly, creating the review directory.
Params:
-o Job ID
-m Submitter
-y Analysis type
-w Provide a root path for web links if you'd like to make them.
-i Minimum alignment depth
-t Threads
-s Sample ID, defaults to name of output directory.
-d Path to the output directory
-j Path to the pipeline.jar application
-p Path to the truncated pipeline properties file needing Reference prepending.
-c Path to the properties Reference directory containing the Data, Apps, and Bed dirs.
-q Path to the bed file for coverage QC
-b Path to the bed file for variant calling
-r Path to the fasta reference file w/ index and dict
-u Path to the unfiltered bam file
-f Path to the filtered bam file
-v Path to the final vcf file
-e SnpEff genome, defaults to hg19_ucsc_20150427
-l Upload variants to NGSWeb, defaults to not uploading
Example: java -Xmx4G -jar pathTo/USeq/Apps/ArupPipelineWrapper -o MyJobNix3 -m DNix
-j ~/BioApps/Pipeline-1.0-SNAPSHOT-jar-with-dependencies.jar -y TestAnaly -w
~/WebLinks -i 300 -d Results -p truncPipeProp.xml -c /Pipe/Reference/
-q 0758221_compPad25bp_v1.bed -b 0758221_v1.bed -t 24 -r
~/HCIAtlatl/data/Human/B37/human_g1k_v37_decoy.fasta -u CNV36B_unfiltered.bam -f
CNV36B_final.bam -v CNV36B_snvIndel.vcf
**************************************************************************************
**************************************************************************************
** Avatar Assembler : April 2019 **
**************************************************************************************
Tool for assembling fastq avatar datasets based on the results of three sql queries.
See https://ri-confluence.hci.utah.edu/x/KwBFAg Log in as root on hci-clingen1
Options:
-i Info
-d Diagnosis
-g Gender
-p Path to Exp dir w/o Year, e.g. /Repository/PersonData/
-y Year dirs to examine for fastq linking, defaults to 2017,2018,2019,2020,2021
-j Job dir to place linked fastq
-f Only keep patients with a diagnosis containing this String, defaults to all.
-l Create Fastq links for all patient datasets, defaults to just those with both a
Tumor and Normal exome.
-r Patient stats output file.
Example: java -jar -Xmx2G ~/USeqApps/AvatarAssembler -p /Repository/PersonData/
-r avatarAssembler.log.gz -i sampleInfo.txt -d sampleDiagnosis.txt -g
sampleGender.txt > avatarAssemblerProblemSamples.txt -f HEM -y 2018,2019
**************************************************************************************
**************************************************************************************
** Avatar Comparator : Feb 2019 **
**************************************************************************************
Tool for identifying AVATAR datasets that are ready for analysis or need attention.
Options:
-j Patient job directory
-v Verbose output
Example: java -jar -Xmx2G ~/USeqApps/AvatarComparator -j AJobs/
**************************************************************************************
**************************************************************************************
** Bam Concordance: April 2019 **
**************************************************************************************
BC calculates sample level concordance based on uncommon homozygous SNVs found in bam
files. Samples from the same person will show high similarity (>0.9). Run BC on
related sample bams (e.g. tumor & normal exomes, tumor RNASeq) plus an unrelated bam
for comparison. Mismatches passing filters are written to file. BC also generates a
variety of AF histograms for checking gender and sample contamination. Although
threaded, BC runs slowly with more than a few bams. Use the USeq ClusterMultiSampleVCF
app to check large batches of vcfs to identify the likely mismatched sample pairs.
WARNING: Mpileup does not check that the chr order is the same across samples and the
fasta reference, be certain they are or mpileup will produce incorrect counts. Use
Picard's ReorderSam app if in doubt.
Note re FFPE derived RNASeq data: A fair bit of systematic error is found in these
datasets. As such, the RNA->DNA similarities are low, yet the DNA->RNA are > 0.9
Options:
-r Path to a bed file of regions to interrogate.
-s Path to the samtools executable.
-f Path to an indexed reference fasta file.
-b Path to a directory containing indexed bam files.
-c Path to a tabix indexed bed file of common dbSNPs. Download 00-common_all.vcf.gz
from ftp://ftp.ncbi.nih.gov/snp/organisms/, grep for 'G5;' containing lines,
run VCF2Bed, bgzip and tabix it with https://github.com/samtools/htslib,
defaults to no exclusion from calcs.
-d Minimum read depth, defaults to 25.
-a Minimum allele frequency to count as a homozygous variant, defaults to 0.95
-m Minimum allele frequency to count a homozygous match, defaults to 0.9
-q Minimum base quality, defaults to 20.
-u Minimum mapping quality, defaults to 20.
-n Minimum fraction similarity to pass sample set, defaults to 0.85
-x Maximum log2Rto score for calling a sample female, defaults to 1.5
-y Minimum log2Rto score for calling a sample male, defaults to 2.5
-e Sample name to ignore in scoring similarity and gender, defaults to 'RNA'
-j Write gzipped summary stats in json format to this file.
-t Number of threads to use. If not set, determines this based on the number of
threads and memory available to the JVM so set the -Xmx value to the max.
Example: java -Xmx100G -jar pathTo/USeq/Apps/BamConcordance -r ~/exomeTargets.bed
-s ~/Samtools1.3.1/bin/samtools -b ~/Patient7Bams -d 10 -a 0.9 -m 0.8 -f
~/B37/human_g1k_v37.fasta -c ~/B37/b38ComSnps.bed.gz -j bc.json.gz
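One way to read the -d, -a, and -m options together is as a directional similarity: homozygous variants are selected in one sample and then checked in the other. The Python sketch below is an interpretation under that assumption, not the app's code; `similarity` and its per-position count dictionaries are hypothetical.

```python
def similarity(a, b, min_depth=25, min_hom_af=0.95, min_match_af=0.9):
    """Fraction of sample a's homozygous variants that are also found in sample b.

    a, b: dicts mapping position -> (alt_count, depth).
    Returns None when no position could be compared.
    """
    compared = matched = 0
    for pos, (alt_a, depth_a) in a.items():
        # Select homozygous variants in a with adequate depth (-d, -a).
        if depth_a < min_depth or alt_a / depth_a < min_hom_af:
            continue
        if pos not in b:
            continue
        alt_b, depth_b = b[pos]
        if depth_b < min_depth:
            continue
        compared += 1
        # Count a match when b also looks homozygous at the relaxed cutoff (-m).
        if alt_b / depth_b >= min_match_af:
            matched += 1
    return matched / compared if compared else None


dna = {1: (30, 30), 2: (29, 30), 3: (15, 30), 4: (40, 40)}
rna = {1: (28, 30), 2: (10, 30), 3: (30, 30), 4: (5, 10)}
print(similarity(dna, rna), similarity(rna, dna))
```

Because the variants are picked from the first sample, the two directions need not agree, which is consistent with the FFPE note about RNA->DNA and DNA->RNA contrasts differing.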
**************************************************************************************
**************************************************************************************
** Bam Context Inspector: Oct 2016 **
**************************************************************************************
Application for scanning the surrounding context of a set of regions for non reference
bps. Use to flag variants with adjacent potentially confounding changes.
Required Options:
-b Sorted bam alignment file with associated index.
-r Bed file of regions to split into pass (no non refs) or fail (with a non ref bp).
-f Path to the reference fasta with an xxx.fai index
Default Options:
-p Bp of flanking bases to scan for non ref bases, defaults to 25
-c Minimum alignment coverage of a base before scanning, defaults to 6
-n Minimum non reference base count, defaults to 2
-q Minimum base quality, defaults to 13
-m Minimum alignment mapping quality, defaults to 13
-a Maximum non ref allele frequency, defaults to 0.03
-x Maximum number non ref bps in flanks, defaults to 1
-i Don't fail regions with an indel in the flanks
Example: java -Xmx4G -jar pathTo/USeq/Apps/BamContextInspector -b Bam/rPENormal.bam
-r rPENormal_calls.bed -f Ref/human_g1k_v37_decoy.fasta -q 20 -m 20 -a 2
**************************************************************************************
**************************************************************************************
** Bam Hg19 to B37 Converter: Aug 2016 **
**************************************************************************************
Cuts off the chr from each reference chromosome, converts chrM to MT, and swaps out
the header to convert hg19 alignments to b37 alignments.
Options:
-b Bam files to convert to b37, a directory with such or a single file.
-e A bam file with a good b37 header to add to the converted hg19 alignments.
Example: java -Xmx1500M -jar pathToUSeq/Apps/BamHg19ToB37Converter -b . -e ~/b37.bam
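The chromosome renaming rule described above amounts to the following. This is a Python sketch for illustration only: `hg19_to_b37` is a hypothetical helper, and leaving names without a chr prefix unchanged is an assumption.

```python
def hg19_to_b37(name):
    """Map an hg19 style reference name to its b37 style name."""
    if name == "chrM":
        return "MT"
    # Cut off the leading chr; names without it are left unchanged (assumption).
    return name[3:] if name.startswith("chr") else name


for name in ("chr1", "chrX", "chrM"):
    print(name, "->", hg19_to_b37(name))
```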
**************************************************************************************
**************************************************************************************
** Bam Blaster : April 2019 **
**************************************************************************************
Injects SNVs and INDELs from a vcf file into bam alignments. These and their mates are
extracted as fastq for realignment. For SNVs, only alignment bases that match the
reference and have a CIGAR of M are modified. Not all alignments can be modified.
Secondary/supplemental/not proper are skipped. One var per alignment. Variants within
read length distance of prior are ignored and saved to file for iterative processing.
Be sure to normalize and decompose your vcf file (e.g. https://github.com/atks/vt).
The first base of each INDEL must be reference. Use the ExactBamMixer or BamMixer to add
realignments (e.g. 10%) with the unmodified.bams (e.g. 90%). Use the VCFVariantMaker
to generate random vcf variants or pull a VCF from Clinvar/ Cosmic.
Required:
-b Path to a coordinate sorted bam file with index.
-v Path to a trimmed, normalized, decomposed vcf variant file, zip/gz OK.
-r Full path to a directory to save the results.
-s Max size INDEL, defaults to 50
-d Min alignment depth, defaults to 25
-m Min distance between variants, defaults to 150
Example: java -Xmx10G -jar pathTo/USeq/Apps/BamBlaster -b ~/BMData/na12878.bam
-r ~/BMData/BB0 -v ~/BMData/clinvar.pathogenic.SnvIndel.vcf.gz
**************************************************************************************
**************************************************************************************
** Bam Mixer : April 2018 **
**************************************************************************************
Combines bam alignment files in different fractions to simulate multiple variant
frequencies. Run BamBlaster first.
Required:
-r Path to a directory to save the results
-u Path to the xxx_unmodified.bam from your BamBlaster run
-f Path to the xxx_filtered.bam from your BamBlaster run
-p Path to your realigned paired end bam
-s Path to your realigned single end bam
Optional:
-m Fractions to mix in the variant alignments, comma delimited, no spaces, defaults to
0.025,0.05,0.1,0.2
-v Verbose output.
Example: java -Xmx10G -jar pathTo/USeq/Apps/BamMixer -r ~/TumorSim/
-u ~/bb_unmodified.bam -f ~/bb_filtered.bam -p ~/bb_paired.bam -s ~/bb_single.bam
**************************************************************************************
**************************************************************************************
** Bar2Gr: Nov 2006 **
**************************************************************************************
Converts xxx.bar to text xxx.gr files.
-f The full path directory/file text for your xxx.bar file(s).
Example: java -Xmx1500M -jar pathTo/T2/Apps/Bar2Gr -f /affy/BarFiles/
**************************************************************************************
**************************************************************************************
** Bar 2 USeq: Mar 2011 **
**************************************************************************************
Recurses through directories and sub directories of xxx.bar(.zip/.gz OK) files
converting them to xxx.useq files (http://useq.sourceforge.net/useqArchiveFormat.html).
Required Options:
-f Full path directory containing bar files or directories of bar files.
Default Options:
-i Index size for slicing split chromosome data (e.g. # rows per file),
defaults to 10000.
-r For graphs, select a style, defaults to 0
0 Bar
1 Stairstep
2 HeatMap
3 Line
-h Color, hexadecimal (e.g. #6633FF), enclose in quotations
-d Description, enclose in quotations
-g Reset genome version, defaults to that indicated by the bar files.
-e Delete original folders, use with caution.
-m Replace bar files with new xxx.useq file in bar file directory, use with caution.
Example: java -Xmx4G -jar pathTo/USeq/Apps/Bar2USeq -f
/AnalysisResults/ -i 5000 -h '#6633FF' -g D_rerio_Jul_2010
-d 'Final processed chIP-Seq results for Bcd and Hunchback, 30M reads'
**************************************************************************************
**************************************************************************************
** Bed2Bar: June 2010 **
**************************************************************************************
Bed2Bar builds stair step graphs from bed files for display in IGB. Strands are merged
and text information removed. Will also generate a merged bed file by thresholding the
graph at the -t level.
-f Full path file or directory containing xxx.bed(.zip/.gz OK) files
-v Genome version (eg H_sapiens_Mar_2006), get from UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases.
-s Sum bed scores for overlapping regions, defaults to assigning the highest score.
-t Threshold, defaults to 0.
-g Maximum gap, defaults to 0.
Example: java -Xmx4G -jar pathTo/Apps/Bed2Bar -f /affy/res/zeste.bed.gz -v
M_musculus_Jul_2007 -g 1000 -s -t 100
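The interplay of the -s, -t, and -g options can be sketched in Python. This is a per-base illustration for a single chromosome, not the app's implementation: the function names are hypothetical, and treating the threshold as inclusive is an assumption.

```python
def stair_step(regions, sum_scores=False):
    """Build a per-base graph {base: score} from (start, stop, score) bed regions.

    Overlapping regions take the highest score by default, or the sum with -s.
    """
    graph = {}
    for start, stop, score in regions:
        for base in range(start, stop):
            if sum_scores:
                graph[base] = graph.get(base, 0) + score
            else:
                graph[base] = max(graph.get(base, score), score)
    return graph


def threshold_blocks(graph, threshold=0, max_gap=0):
    """Merge bases scoring >= threshold into (start, stop) blocks, bridging gaps <= max_gap."""
    blocks = []
    for base in sorted(b for b, s in graph.items() if s >= threshold):
        if blocks and base - blocks[-1][1] <= max_gap:
            blocks[-1][1] = base + 1  # extend the current block
        else:
            blocks.append([base, base + 1])
    return [tuple(block) for block in blocks]


regions = [(0, 10, 5), (5, 15, 3), (20, 25, 5)]
print(threshold_blocks(stair_step(regions), threshold=4))
print(threshold_blocks(stair_step(regions), threshold=4, max_gap=10))
```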
**************************************************************************************
**************************************************************************************
** BedStats: June 2010 **
**************************************************************************************
Calculates several statistics on bed files where the name column contains a short read
sequence. This includes a read length distribution and frequencies of the 1st and last
bps. Can also trim your read to a particular length.
Options:
-b Full path file name for your alignment bed file or directory containing such. The
name column should contain just your sequence or seq;qual.
-t Trim the 3' ends of your reads to the indicated length, defaults to not trimming.
-s Calculate base frequencies for the given 0 indexed base instead of the last base.
-r Reverse complement sequences before calculating stats and trimming.
Example: java -Xmx1500M -jar pathToUSeq/Apps/BedStats -b /Res/ex1.bed.gz -s 9 -t 10
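The statistics described above can be sketched in Python on a list of read sequences. This is an illustration, not the app's code: `bed_stats` is a hypothetical helper, and trimming before the statistics are computed is an assumption.

```python
from collections import Counter


def bed_stats(seqs, trim_to=None, base_index=None):
    """Return (length counts, first-base counts, chosen-base counts) for read sequences.

    trim_to mirrors -t (trim the 3' end); base_index mirrors -s (0 indexed base
    to tally instead of the last base).
    """
    if trim_to is not None:
        seqs = [s[:trim_to] for s in seqs]  # keep the 5' end
    lengths = Counter(len(s) for s in seqs)
    first = Counter(s[0] for s in seqs if s)
    chosen = Counter()
    for s in seqs:
        if not s:
            continue
        if base_index is None:
            chosen[s[-1]] += 1
        elif base_index < len(s):
            chosen[s[base_index]] += 1  # reads shorter than the index are skipped
    return lengths, first, chosen


lengths, first, chosen = bed_stats(["ACGTACGTACGT", "TTGCA"], trim_to=10)
print(dict(lengths), dict(first), dict(chosen))
```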
**************************************************************************************
**************************************************************************************
** BedTabix: Jan 2013 **
**************************************************************************************
Converts bed files to a SAMTools compressed bed tabix format. Recursive.
Required Options:
-v Full path file or directory containing xxx.bed(.gz/.zip OK) file(s). Recursive!
-t Full path tabix directory containing the compiled bgzip and tabix executables. See
http://sourceforge.net/projects/samtools/files/tabix/
-f Force overwriting of existing indexed bed files, defaults to skipping.
-d Do not delete non gzipped bed files after successful indexing, defaults to deleting.
-e Only print error messages.
Example: java -jar pathToUSeq/Apps/BedTabix -v /VarScan2/BEDFiles/
-t /Samtools/Tabix/tabix-0.2.6/
**************************************************************************************
**************************************************************************************
** Bed 2 UCSC RefFlat: June 2015 **
**************************************************************************************
Takes a bed file and a UCSC gene table, intersects them and assigns each bed region
to a gene, then builds a new gene table using the bed region coordinates. Note, each
bed region must intersect only one gene. Modify the input gene table
(MergeUCSCGeneTable and manually trim) based on the errors. Lastly, all bed regions
must be assigned to genes.
Parameters:
-u UCSC RefFlat or RefSeq gene table file, full path. See,
http://genome.ucsc.edu/cgi-bin/hgTables, (geneName name2(optional) chrom strand
txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds).
-b Bed file of regions to intersect with the gene table.
-t Don't remove UTRs if present, from the gene table.
-r Results file.
Example: java -Xmx2G -jar pathTo/USeq/Apps/Bed2UCSCRefFlat -u refSeqJun2015.ucsc
-b targetRegionsFat.bed -r targetRegionsFat.ucsc
**************************************************************************************
**************************************************************************************
** Bed Region Splitter : June 2017 **
**************************************************************************************
Regions exceeding the chunk size are split into multiple parts.
Required:
-d Path to a file or directory containing such to chunk.
Optional:
-c BP chunk size, defaults to 2000.
Example: java -Xmx4G -jar pathTo/USeq/Apps/BedRegionSplitter -d ToSplit/ -c 5000
**************************************************************************************
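The splitting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name split_region is hypothetical, coordinates are treated as interbase (0-based start, exclusive stop), and the region is cut into consecutive pieces of at most the chunk size. The app itself may divide regions differently, for example into equal sized parts.

```python
def split_region(chrom, start, stop, chunk=2000):
    """Split one region into consecutive pieces no longer than `chunk` bp.

    Assumes interbase coordinates (0-based start, exclusive stop).
    Regions at or under the chunk size are returned unchanged.
    """
    if stop - start <= chunk:
        return [(chrom, start, stop)]
    pieces = []
    for piece_start in range(start, stop, chunk):
        # The last piece is truncated at the original stop.
        pieces.append((chrom, piece_start, min(piece_start + chunk, stop)))
    return pieces


if __name__ == "__main__":
    for piece in split_region("chr1", 0, 4500, chunk=2000):
        print(*piece, sep="\t")
```

With a 2000 bp chunk, a 4500 bp region yields two full pieces and one shorter remainder.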
**************************************************************************************
** BisSeq: June 2016 **
**************************************************************************************
Takes two condition (treatment and control) PointData from converted and non-converted
C bisulfite sequencing data parsed using the NovoalignBisulfiteParser and scores
regions for differential methylation using either a Fisher exact or chi-square test
for changes in methylation. A Benjamini & Hochberg correction is applied to convert
the pvalues to FDRs. Data is only collected on bases that meet the minimum
read coverage threshold in both datasets. The fraction differential methylation
statistic is calculated by taking the pseudomedian of all of the log2 paired base level
fraction methylations in a given window. Overlapping windows that meet both the
FDR and pseLog2Ratio thresholds are merged when generating enriched and reduced
regions. BisSeq generates several tracks for browsing and lists of differentially
methylated regions. To examine only mCG contexts, first filter your PointData using the
ParsePointDataContexts app.
Options:
-s Save directory, full path.
-c Treatment converted PointData directories, full path, comma delimited. These should
contain stranded chromosome specific xxx_-/+_.bar.zip files from the NBP app.
One can also provide a single directory that contains multiple PointData
directories.
-C Control converted PointData directories, ditto.
-n Treatment non-converted PointData directories, ditto.
-N Control non-converted PointData directories, ditto.
-a Scramble control data.
Default Options:
-d Minimum per base read coverage, defaults to 5.
-w Window size, defaults to 250.
-m Minimum number methy C observations in window, defaults to 5.
-f FDR threshold, defaults to 30 (-10Log10(0.001)).
-l Log2Ratio threshold, defaults to 1.585 (3x).
-r Full path to R, defaults to '/usr/bin/R'
-g Don't print graph files.
Example: java -Xmx10G -jar pathTo/USeq/Apps/BisSeq -c /Sperm/Converted -n
/Sperm/NonConverted -C /Egg/Converted -N /Egg/NonConverted -s /Res/BisSeq
-w 500 -m 10 -l 2 -f 50
**************************************************************************************
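The -f and -l thresholds are given on transformed scales: FDRs as -10Log10 values and ratios as log2 values. The short Python sketch below shows the conversions in both directions. The helper names are hypothetical and are not part of USeq; they only illustrate the arithmetic.

```python
import math


def to_phred(probability):
    """Convert a p-value or FDR (0 < probability <= 1) to a -10Log10 score."""
    return -10.0 * math.log10(probability)


def from_phred(score):
    """Convert a -10Log10 score back to a p-value or FDR."""
    return 10.0 ** (-score / 10.0)


def fold_to_log2(fold):
    """Convert a fold change (e.g. 3 for 3x) to a log2 ratio."""
    return math.log2(fold)


if __name__ == "__main__":
    for fdr in (0.1, 0.01, 0.001):
        print(f"FDR {fdr} -> -10Log10 score {to_phred(fdr):.1f}")
    for fold in (1.5, 2.0, 3.0):
        print(f"{fold}x -> log2 ratio {fold_to_log2(fold):.3f}")
```

Larger -10Log10 scores are more stringent, so raising -f from 30 to 50 tightens the FDR cutoff by two orders of magnitude.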
**************************************************************************************
** Bis Seq Aggregate Plotter: October 2012 **
**************************************************************************************
BSAP merges bisulfite data over equally sized regions to generate data for class
average aggregate plots of fraction methylation. A smoothing window is also applied.
Data for unstranded, sense, and antisense are produced.
Options:
-c Converted PointData directories, full path, comma delimited. These should
contain stranded chromosome specific xxx_-/+_.bar.zip files. One
can also provide a single directory that contains multiple PointData
directories. See the NovoalignBisulfiteParser app.
-n Non-converted PointData directories, ditto.
-b Bed file (tab delim: chr start stop name score strand(+/-/.)), full path.
-i Don't invert - stranded regions, defaults to inverting.
-s Scale all regions to a particular size. Defaults to scaling to max region size.
-m Calculate individual base fractions and then take a mean, ignoring zeros, over
the window, instead of summing the obs in the window and taking the fraction.
-o Minimum number of observations before scoring base fraction methylation, defaults
to 8.
Example: java -Xmx1500M -jar pathTo/USeq/Apps/BisSeqAggregatePlotter -c
/NBP/Con -n /NBP/NonCon -b /Anno/tssSites.bed -m
**************************************************************************************
**************************************************************************************
** BisSeqErrorAdder: June 2012 **
**************************************************************************************
Takes PointData from converted and non-converted C bisulfite sequencing data parsed
using the NovoalignBisulfiteParser and simulates a worse non-conversion rate by
randomly picking converted observations and making them non-converted. This is
accomplished by first measuring the non-conversion rate in the test chromosome (e.g.
chrLambda), calculating the fraction of converted C's needed to flip to non-converted
to reach the target fraction non-converted and then using this flip fraction
to modify the other chromosome data.
Options:
-s Save directory, full path.
-c Converted PointData directories, full path, comma delimited. These should
contain stranded chromosome specific xxx_-/+_.bar.zip files. One
can also provide a single directory that contains multiple PointData
directories.
-n Non-converted PointData directories, ditto.
-f Target fraction non-converted for test chromosome, this cannot be less than the
current fraction.
-t Test chromosome, defaults to chrLambda* .
Example: java -Xmx12G -jar pathTo/USeq/Apps/BisSeqErrorAdder -c /Data/Sperm/Converted
-n /Data/Sperm/NonConverted -f 0.02
**************************************************************************************
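The flip fraction described above follows from simple counting. The Python sketch below is an illustration only, assuming the non-conversion rate is measured as non-converted observations over all observations on the test chromosome. The function name flip_fraction and its arguments are hypothetical, not taken from the USeq source.

```python
def flip_fraction(n_converted, n_non_converted, target_fraction):
    """Fraction of converted observations to flip to non-converted.

    Solves (n_non_converted + x) / total = target_fraction for x,
    then expresses x as a fraction of the converted observations.
    """
    total = n_converted + n_non_converted
    if total == 0 or n_converted == 0:
        raise ValueError("need converted observations on the test chromosome")
    current = n_non_converted / total
    if target_fraction < current:
        raise ValueError("target fraction cannot be less than the current fraction")
    needed = target_fraction * total - n_non_converted
    return needed / n_converted


if __name__ == "__main__":
    # 990 converted and 10 non-converted lambda observations, target 0.02
    print(flip_fraction(990, 10, 0.02))
```

The same fraction would then be applied as a per observation flip probability on the other chromosomes.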
**************************************************************************************
** BisStat: June 2018 **
**************************************************************************************
Takes PointData from converted and non-converted C bisulfite sequencing data parsed
using the NovoalignBisulfiteParser and generates several xxCxx context statistics and
graphs (bp and window level fraction converted Cs) for visualization in IGB.
BisStat estimates whether a given C is methylated using a binomial distribution where
the expectation can be calculated using the fraction of non-converted Cs present in the
lambda data. Binomial p-values are converted to FDRs using the Benjamini & Hochberg
method. This app requires considerable RAM (10-64G).
Options:
-s Save directory, full path.
-c Converted PointData directories, full path, comma delimited. These should
contain stranded chromosome specific xxx_-/+_.bar.zip files. One
can also provide a single directory that contains multiple PointData
directories.
-n Non-converted PointData directories, ditto.
-f Directory containing chrXXX.fasta(/.fa .zip/.gz OK) files for each chromosome.
Default Options:
-p Minimal FDR for non-converted C's to be counted as methylated, defaults to 20, a
-10Log10(FDR = 0.01) conversion.
-e Expected fraction non-converted Cs due to partial bisulfite conversion and
sequencing error, defaults to 0.005 .
-l Use the unmethylated lambda alignment data to set the expected fraction of
non-converted Cs due to partial conversion and sequencing error. This is
predicated on including a 'chrLambda' fasta sequence while aligning your data.
-o Minimum read coverage to count mC fractions, defaults to 8
-w Window size, defaults to 1000.
-m Minimum number Cs passing read coverage in window to score, defaults to 5.
-r Full path to R, defaults to '/usr/bin/R'
-g Don't merge stranded data, defaults to running a non stranded analysis. Affects CG's.
-a First density quartile fraction methylation threshold, defaults to 0.25
-b Fourth density quartile fraction methylation threshold, defaults to 0.75
Example: java -Xmx12G -jar pathTo/USeq/Apps/BisStat -c /Data/Sperm/Converted -n
/Data/Sperm/NonConverted -s /Data/Sperm/BisSeq -w 5000 -m 10 -f
/Genomes/Hg18/Fastas -o 10
**************************************************************************************
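The per base test described above, a binomial p-value against an expected non-conversion rate followed by a Benjamini & Hochberg adjustment, can be sketched in plain Python. This is a generic illustration and not the BisStat implementation, which calls R. The function names are hypothetical, and the expected rate of 0.005 simply mirrors the -e default.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of seeing k or more
    non-converted reads out of n if the C is actually unmethylated."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def benjamini_hochberg(pvalues):
    """Return B&H adjusted p-values (FDRs) in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    expected = 0.005  # assumed background non-conversion rate
    observations = [(0, 20), (1, 20), (5, 20), (12, 20)]  # (non-converted, coverage)
    pvals = [binomial_upper_tail(k, n, expected) for k, n in observations]
    for (k, n), fdr in zip(observations, benjamini_hochberg(pvals)):
        print(f"{k}/{n} non-converted, FDR {fdr:.3g}")
```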
**************************************************************************************
** BisStat Region Maker: Nov 2014 **
**************************************************************************************
Takes serialized window objects from BisStat, thresholds based on the min and max
fraction methylation params and prints regions in bed format meeting the criteria.
May also build regions based on the density of a given fraction methylation quartile.
For example, to identify regions where at least 0.8 of the sequenced Cs are low
methylated (<= 0.25 default settings in BisStat) set -q 1 -m 0.8 . To find regions
with >= 0.9 of the Cs with high methylation (>= 0.75 default BisStat setting), set
-q 3 -m 0.9 .
Options:
-s SerializedWindowObject directory from BisStat, full path.
-m Minimum fraction.
-x Maximum fraction.
-g Maximum gap, defaults to 0.
-q Merge windows based on their quartile density score, not fraction methylation, by
indicating 1, 2, or 3 for 1st, 2nd+3rd, or 4th, respectively.
-r Full path to R, defaults to '/usr/bin/R'
Example: java -Xmx4G -jar pathTo/USeq/Apps/BisStatRegionMaker -m 0.8 -x 1.0 -g 100
-s /Data/BisStat/SerializedWindowObjects
**************************************************************************************
**************************************************************************************
** Calculate Per Cycle Error Rate : July 2015 **
**************************************************************************************
Calculates per cycle snv error rate provided a sorted indexed bam file and a fasta
sequence file. Only checks CIGAR M bases, not masked or INDEL bases.
Required Options:
-b Full path to a coordinate sorted bam file (xxx.bam) with its associated (xxx.bai)
index or directory containing such. Multiple files are processed independently.
-f Full path to the single fasta file you wish to use in calculating the error rate.
Default Options:
-s Perform separate first read second read analysis, defaults to merging.
-c Maximum fraction failing cycles, defaults to 0.1
-1 Maximum first read or merged read error rate, defaults to 0.01
-2 Maximum second read error rate, defaults to 0.0175
-o Write coverage statistics to this log file instead of stdout.
-j Write summary stats in json format to this file. Only stats for the first bam file
are saved. Only separate strand analysis permitted.
-m Set minimum mapping quality for inclusion. Default: 0.
-p Require that a read be mapped in a proper pair for inclusion in error rate calculations.
Example: java -Xmx1500M -jar pathTo/USeq/Apps/CalculatePerCycleErrorRate
-b /Data/Bam/ -f /Fastas/chrPhiX_Illumina.fasta.gz
**************************************************************************************
**************************************************************************************
** ChIPSeq: May 2014 **
**************************************************************************************
The ChIPSeq application is a wrapper for processing ChIP-Seq data through a variety of
USeq applications. It:
1) Parses raw alignments (sam, eland, bed, or novoalign) into binary PointData
2) Filters PointData for duplicate alignments
3) Makes relative ReadCoverage tracks from the PointData (reads per million mapped)
4) Runs the PeakShiftFinder to estimate the peak shift and optimal window size
5) Runs the MultipleReplicaScanSeqs to window scan the genome generating enrichment
tracks using DESeq2's negative binomial pvalues and B&H's FDRs
6) Runs the EnrichedRegionMaker to identify likely chIP peaks (FDR < 1%, >2x).
Options:
-s Save directory, full path.
-t Treatment alignment file directories, full path, comma delimited, no spaces, one
for each biological replica. These should each contain one or more text
alignment files (gz/zip OK) for a particular replica. Alternatively, provide
one directory that contains multiple alignment file directories.
-c Control alignment file directories, ditto.
-y Type of alignments, either novoalign, sam, bed, or eland (sorted or export).
-v Genome version (e.g. H_sapiens_Feb_2009, M_musculus_Jul_2007), see UCSC FAQ,
http://genome.ucsc.edu/FAQ/FAQreleases.
-r Full path to R, defaults to '/usr/bin/R'. Be sure to install DESeq2, gplots, and
qvalue Bioconductor packages.
Advanced Options:
-m Combine any replicas and run single replica analysis (ScanSeqs), defaults to
using DESeq2.
-a Maximum alignment score. Defaults to 60, smaller numbers are more stringent.
-q Minimum mapping quality score. Defaults to 13, bigger numbers are more stringent.
This is a phred-scaled posterior probability that the mapping position of the read
is incorrect. Set to 0 for RNASeq data.
-p Peak shift, defaults to the PeakShiftFinder peak shift or 150bp. Set to 0 for
RNASeq data.
-w Window size, defaults to the PeakShiftFinder peak shift + stnd dev or 250bp.
-i Minimum number reads in window, defaults to 10.
-f Filter bed file (tab delimited: chr start stop) to use in excluding intersecting
windows while making peaks, e.g. satelliteRepeats.bed .
-g Print verbose output from each application.
-e Don't look for reduced regions.
Example: java -Xmx2G -jar pathTo/USeq/Apps/ChIPSeq -y eland -v D_rerio_Dec_2008 -t
/Data/PolIIRep1/,/Data/PolIIRep2/ -c /Data/PolIINRep1/,/Data/PolIINRep2/ -s
/Data/Results/WtVsNull -f /Anno/satelliteRepeats.bed
**************************************************************************************
**************************************************************************************
** Cluster Multi Sample VCF: Nov 2014 **
**************************************************************************************
Clusters samples based on the genotypes that differ in one or more samples.
Options:
-v Full path to a multi sample vcf file (xxx.vcf/xxx.vcf.gz). Note, Java often fails
to parse tabix compressed vcf files. Best to uncompress.
-r Minimum record QUAL score, defaults to 20.
-g Minimum sample genotype GT score, defaults to 20.
-i Use sample index instead of trimmed name in output.
-c Minimum # samples with given genotype, defaults to 1.
Example: java -Xmx2G -jar pathTo/USeq/Apps/ClusterMultiSampleVCF -v ~/UGP/suicide.vcf
**************************************************************************************
**************************************************************************************
** Collect Bam Stats: Dec 2014 **
**************************************************************************************
Parses and plots bam alignment quality statistics from a log file containing the
output of the MergePairedAlignments and Sam2USeq apps. Will flag datasets that fail
the set thresholds.
Options:
-l Directory containing a combined log file of MergePairedAlignments and Sam2USeq,
one per sample.
Default Options:
-x Minimum alignment coverage threshold, defaults to 10.
-c Minimum fraction interrogated bases at the coverage threshold, defaults to 0.95
-u Maximum fraction unmapped reads, defaults to 0.01
-d Maximum fraction duplicate reads, defaults to 0.15
-p Minimum fraction passing alignments, defaults to 0.8
-o Maximum fraction overlapping bps in paired alignments, defaults to 0.1
Example: java -Xmx1500M -jar pathToUSeq/Apps/CollectBamStats -l /QC/Sam2USeqLogs/
-x 15 -c 0.9
**************************************************************************************
**************************************************************************************
** Compare Intersecting Regions: Nov 2012 **
**************************************************************************************
Compares test region file(s) against a master set of regions for intersection.
Reports the results as columns relative to the master. Assumes interbase coordinates.
Options:
-m Full path for the master bed file (tab delim: chr start stop ...).
-t Full path to the test bed file to intersect or directory of files.
-g Maximum bp gap allowed for scoring an intersection, defaults to 0 bp. Negative gaps
force overlaps, positive gaps allow non intersecting bases between regions.
Example: java -Xmx4G -jar pathTo/Apps/CompareIntersectingRegions -g 1000
-m /All/mergedRegions.bed.gz -t /IndividualERs/
**************************************************************************************
**************************************************************************************
** Compare Intersecting Vcfs : February 2019 **
**************************************************************************************
Compares vcf files by creating a master list of variants and then scores each for the
presence of the same CHROM POS ALT REF in each vcf file.
Options:
-v A directory of vcf files to compare (xxx.vcf(.gz/.zip OK)).
-r Name of a spreadsheet results file, should end in xxx.txt.gz
Example: java -Xmx10G -jar pathTo/USeq/Apps/CompareIntersectingVcfs -v VCFs/
-r comparisonVcf.txt.gz
**************************************************************************************
**************************************************************************************
** Compare Parsed Alignments: Nov 2009 **
**************************************************************************************
Compares two parsed alignments for a common distribution of snps using R's Fisher's
Exact. Run the ParseIntersectingAlignments with the same snp table first.
Options:
-a Full path file name for the first xxx.alleles file.
-b Full path file name for the second xxx.alleles file.
-d Full path directory name for writing temporary files.
-r Full path file name for R, defaults to '/usr/bin/R'
Example: java -Xmx1500M -jar pathToUSeq/Apps/CompareParsedAlignments
-a /SeqData/lymphSNPs.alleles -b /SeqData/normalSNPs.alleles -d /temp/
**************************************************************************************
**************************************************************************************
** Concatinate Fastas: Oct 2010 **
**************************************************************************************
Concatenates a directory of fasta files into a single sequence separated by a defined
number of Ns. Outputs the merged fasta as well as bed files for the junctions and
spacers as well as a file to be used to shift UCSC gene table annotations. Use this
app to create artificial chromosomes for poorly assembled genomes.
Options:
-d Full path directory for saving the results.
-f Full path directory containing fasta files to concatenate.
-n Number of Ns to use as a spacer, defaults to 1000.
-c Name to give the concatenated sequence, defaults to chrConcat .
Example: java -Xmx4G -jar pathTo/USeq/Apps/ConcatinateFastas -n 2000 -d
/zv8/MergedNA_Scaffolds -f /zv8/BadFastas/ -c chrNA_Scaffold
**************************************************************************************
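The bookkeeping behind this kind of merge, joining sequences with N spacers while tracking where each input lands, can be sketched as follows. This is a minimal illustration, not the app's code: the function name concatenate is hypothetical, placements use interbase coordinates, and the real app also writes bed files and a gene table shift file that are not reproduced here.

```python
def concatenate(sequences, spacer_len=1000):
    """Join (name, sequence) pairs with runs of N between them.

    Returns the merged sequence and a list of (name, start, stop)
    placements in interbase coordinates on the merged sequence.
    """
    spacer = "N" * spacer_len
    parts, placements, offset = [], [], 0
    for i, (name, seq) in enumerate(sequences):
        if i > 0:
            # Spacers go only between sequences, never at the ends.
            parts.append(spacer)
            offset += spacer_len
        placements.append((name, offset, offset + len(seq)))
        parts.append(seq)
        offset += len(seq)
    return "".join(parts), placements


if __name__ == "__main__":
    merged, placements = concatenate([("scaffold1", "ACGT"), ("scaffold2", "GG")], spacer_len=3)
    print(merged)
    for name, start, stop in placements:
        print(name, start, stop, sep="\t")
```

The start offset of each placement is the shift that would be added to annotation coordinates on that scaffold.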
**************************************************************************************
** Correct VCF Ends: July 2017 **
**************************************************************************************
Use to correct the END=xxx tags in a Crossmap vcf . Removes any MC tags. Adds chr.
Required Options:
-v Path to the Crossmap vcf file.
-b Path to the VCF2Bed -> Crossmap bed file
-r Path to save the modified gzipped file.
Ex: java -jar USeq_XXX/Apps/CorrectVCFEnds -v b38.vcf -b b38.bed -r finalB38.vcf.gz
**************************************************************************************
**************************************************************************************
** CorrelatePointData: Aug 2011 **
**************************************************************************************
Calculates a Pearson Correlation Coefficient on the values of PointData found with the
same positions in the two datasets. Do NOT use on stair-step/ heat-map graph data.
Only use on point representation data.
Options:
-f First PointData set. This directory should contain chromosome specific xxx.bar.zip
files, stranded or unstranded.
-s Second PointData set, ditto.
-p Full path file name to use in saving paired scores, defaults to not printing.
Example: java -Xmx4G -jar pathTo/USeq/Apps/CorrelatePointData -f /BaseFracMethyl/X1
-s /BaseFracMethyl/X2
**************************************************************************************
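The calculation above, a Pearson correlation restricted to positions present in both datasets, can be sketched in Python. This is an illustration under stated assumptions: each dataset is modeled as a hypothetical dict of position to score for a single chromosome, and reading xxx.bar.zip files is not shown.

```python
from math import sqrt


def pearson_shared(first, second):
    """Pearson r over the positions found in both position->score dicts."""
    shared = sorted(set(first) & set(second))
    if len(shared) < 2:
        raise ValueError("need at least two shared positions")
    xs = [first[p] for p in shared]
    ys = [second[p] for p in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores are constant, correlation is undefined")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    x1 = {100: 0.10, 250: 0.50, 900: 0.90, 1200: 0.30}
    x2 = {100: 0.20, 250: 0.45, 900: 0.80, 5000: 0.70}
    print(pearson_shared(x1, x2))
```

Positions present in only one dataset (1200 and 5000 here) are ignored, which is why this approach suits point data and not stair-step graphs.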
***************************************************************
* CountChromosomes *
* *
* This script drives samtools view command. It will create *
* a report that lists counts to standard chroms, extra *
* chroms, phiX and adapter. This data will be used in the *
* ParseMetrics App. *
* *
* -i Input file (bam format) *
* -o Output file (.txt format) *
* -r Reference (hg19, hg18, mm10, mm9 etc.) *
* -p path to samtools *
***************************************************************
**************************************************************************************
** Bisulfite Convert Fastas: Dec 2008 **
**************************************************************************************
Converts all the c/C's to t/T's in fasta file(s) maintaining case.
Required Parameters:
-f Full path text for the xxx.fasta file or directory containing such.
Example: java -Xmx2000M -jar pathTo/Apps/BisulfiteConvertFastas -f /affy/Fastas/
**************************************************************************************
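The case preserving C to T conversion can be sketched in Python with a translation table. This is an illustration only; the function names are hypothetical, and header lines starting with '>' are assumed to be left unchanged.

```python
# Map c -> t and C -> T, leaving every other character untouched.
_C_TO_T = str.maketrans("cC", "tT")


def bisulfite_convert(sequence):
    """Convert all c/C to t/T while maintaining case."""
    return sequence.translate(_C_TO_T)


def convert_fasta_lines(lines):
    """Yield fasta lines with sequence converted and headers untouched."""
    for line in lines:
        yield line if line.startswith(">") else bisulfite_convert(line)


if __name__ == "__main__":
    for line in convert_fasta_lines([">chrC", "ACGTacgtCc"]):
        print(line)
```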
**************************************************************************************
** Consensus : March 2017 **
**************************************************************************************
Consensus clusters alignments sharing the same unclipped start position and molecular
barcode. It then calls consensus on the clustered alignments outputting fastq for
realignment and unmodified bam records. After running, align the fastq files and merge
the new bams with those in the save directory.
Required arguments:
-b Path to the mate matched bam file created by FastqBarcodeTagger | cutadapt | bwa |
MatchMates. See FBT and MM for details.
Optional Arguments:
-s Path to a directory to save the results, defaults to a derivative of the
bam file.
-t Number concurrent threads to run, defaults to the max available to the jvm / 2.
-c Number of alignments to process in one chunk, defaults to 1,000,000. Adjust for the
available RAM.
-x Maximum number of alignments to cluster before subsampling, defaults to 20000.
-q Minimum barcode base quality, defaults to 13, anything less is assigned an N.
-n Minimum number of non N barcode bases, defaults to 7, anything less is tossed.
-f Minimum fraction barcode identity for inclusion in a cluster, defaults to 0.875 .
-u Minimum read base quality for inclusion in consensus calling, defaults to 13.
-r Minimum read base fraction identity to call a consensus base, defaults to 0.66 .
Anything less is assigned an N.
Example: java -Xmx100G -jar pathTo/USeq/Apps/Consensus -b MM/passingMM.sorted.bam
**************************************************************************************
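The per position consensus rule implied by the -u and -r options can be sketched in Python. This is a simplified illustration, not the Consensus source: the function name is hypothetical, and it assumes that bases under the quality threshold are dropped before the identity fraction is computed, which may differ from the app's handling.

```python
from collections import Counter


def consensus_base(bases, quals, min_qual=13, min_fraction=0.66):
    """Call one consensus base from the aligned bases of a read cluster.

    Bases below min_qual (or already N) are ignored. The most common
    remaining base is called if it reaches min_fraction, otherwise N.
    """
    kept = [b for b, q in zip(bases, quals) if q >= min_qual and b != "N"]
    if not kept:
        return "N"
    base, count = Counter(kept).most_common(1)[0]
    return base if count / len(kept) >= min_fraction else "N"


if __name__ == "__main__":
    print(consensus_base("AAAT", [30, 30, 30, 30]))  # 3 of 4 agree
    print(consensus_base("AATT", [30, 30, 30, 30]))  # no base reaches 0.66
    print(consensus_base("AAT", [30, 30, 5]))        # low quality T is ignored
```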
**************************************************************************************
** Correlation Maps: Nov 2007 **
**************************************************************************************
CM calculates a correlation score for each window of genes and using permutation, an
empirical p-value. The correlation score is the mean of all pairwise Spearman rank
correlations for the gene expression profiles in each window. If a single value is
given (unlogged!) for each gene, a mean of the scores within each window is calculated.
To calculate p-values, X randomized datasets are created by shuffling the expression
profiles between genes, windows are scored and pooled. P-values for each real
score are calculated based on the area under the right side of the randomized score
distribution. In addition to a spread sheet report summary, heat map xxx.bar files
for the p-values and mean correlation are created for visualization in IGB.
Note, this analysis is not stranded. If so desired parse lists appropriately.
Parameters:
-f The full path file text for a tab delimited gene file (text,chr,start,stop,scores)
-o GenomicRegion filter file, full path file text for a tab delimited region file to use in
removing genes from correlation analysis. (chrom, start, stop).
-g Genome version for IGB visualizations (e.g. C_elegans_May_2007).
-w Window size, default is 50000bp. Setting this too small may exclude some regions.
-n Minimum number of genes required in each window, defaults to 3. Setting this too
high will exclude some regions.
-r Number random trials, defaults to 100
Example: java -Xmx256M -jar pathTo/T2/Apps/CorrelationMaps -f /Mango/geneFile.txt
-w 30000 -n 2 -o /Mango/operons.txt
**************************************************************************************
**************************************************************************************
** Convert Fasta 2 GC Bar Graphs: April 2011 **
**************************************************************************************
Converts fasta files into graph files containing a 1 over each C in a CpG context.
Required Parameters:
-f Full path name for the directory containing xxx.fasta(.gz/.zip OK).
-v Versioned Genome (ie H_sapiens_Mar_2006), see UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases.
Example: java -Xmx4G -jar pathTo/Apps/ConvertFasta2GCBarGraph -f /affy/Fastas/
-v H_sapiens_Feb_2009
**************************************************************************************
**************************************************************************************
** DbNSFP Coordinate Converter: Dec 2017 **
**************************************************************************************
Walks a directory of dbNSFP files swapping the B38 coordinates with the B37, splits
by chromosome, sorts, and writes out the final composite.
Options:
-d Path to a directory of dbNSFP files to parse.
-s Path to a directory for saving the results.
Example: java -Xmx20G -jar pathToUSeq/Apps/DbNSFPCoordinateConverter -d DbNSFP3.5a
-s B37_DbNSFP3.5a
**************************************************************************************
**************************************************************************************
** Defined Region Bis Seq: Dec 2013 **
**************************************************************************************
Takes two condition (treatment and control) PointData from converted and non-converted
C bisulfite sequencing data parsed using the NovoalignBisulfiteParser and scores user
defined regions for differential methylation using either a Fisher or chi-square test.
A Benjamini & Hochberg correction is applied to convert the pvalues to FDRs. Data is
only collected on Cs that meet the minimum read coverage threshold in both datasets.
The fraction differential methylation statistic is calculated by taking the
pseudomedian of all of the log2 paired base level fraction methylations in a given
region. To examine particular mC contexts (e.g. mCG), first filter your PointData
using the ParsePointDataContexts app.
Options:
-b A bed file of regions to score (tab delimited: chr start stop ...)
-s Save directory, full path.
-c Treatment converted PointData directories, full path, comma delimited. These should
contain stranded chromosome specific xxx_-/+_.bar.zip files from the NBP app.
One can also provide a single directory that contains multiple PointData
directories.
-C Control converted PointData directories, ditto.
-n Treatment non-converted PointData directories, ditto.
-N Control non-converted PointData directories, ditto.
Default Options:
-d Minimum per base read coverage, defaults to 5.
-r Full path to R, defaults to '/usr/bin/R'
Example: java -Xmx10G -jar pathTo/USeq/Apps/DefinedRegionBisSeq -c /Sperm/Converted
-n /Sperm/NonConverted -C /Egg/Converted -N /Egg/NonConverted -s /Res/DRBS
-b /Res/CpGIslands.bed
**************************************************************************************
**************************************************************************************
** Defined Region Differential Seq: Aug 2016 **
**************************************************************************************
DRDS takes sorted bam files, one per replica, minimum one per condition, minimum two
conditions (e.g. treatment and control or a time course/ multiple conditions) and
identifies differentially expressed genes using DESeq2 or SAMseq. DESeq2's rLog
normalized count data is used to hierarchically cluster the samples. Differential
splicing is estimated using a chi-square test of independence. When testing only a
few genes or regions, append these onto a full gene table so that DESeq2 can
appropriately estimate the library size and replica variance.
Options:
-s Save directory.
-c Conditions directory containing one directory for each condition with one xxx.bam
file per biological replica and their xxx.bai indexes. 3-4 reps recommended per
condition. The BAM files should be sorted by coordinate using Picard's SortSam.
All splice junction coordinates should be converted to genomic coordinates, see
USeq's SamTranscriptomeParser.
-r Full path to R (version 3+) loaded with DESeq2, samr, and gplots defaults to
'/usr/bin/R' file, see http://www.bioconductor.org . Type 'library(DESeq2);
library(samr); library(gplots)' in R to see if they are installed.
-u UCSC RefFlat or RefSeq gene table file, full path. Tab delimited, see RefSeq Genes
http://genome.ucsc.edu/cgi-bin/hgTables, (uniqueName1 name2(optional) chrom
strand txStart txEnd cdsStart cdsEnd exonCount (commaDelimited)exonStarts
(commaDelimited)exonEnds). Example: ENSG00000183888 C1orf64 chr1 + 16203317
16207889 16203385 16205428 2 16203317,16205000 16203467,16207889 . NOTE:
this table should contain only ONE composite transcript per gene (e.g. use
Ensembl genes NOT transcripts). Use the MergeUCSCGeneTable app to collapse
transcripts. See http://useq.sourceforge.net/usageRNASeq.html for details.
-b (Or) a bed file (chr, start, stop,...), full path, See,
http://genome.ucsc.edu/FAQ/FAQformat#format1
-g Genome Version (ie H_sapiens_Mar_2006), see UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases.
-f Turn off DESeq2 independent filtering.
Advanced Options:
-m Mask overlapping gene annotations, recommended for well annotated genomes.
-x Max per base alignment depth, defaults to 50000. Genes containing such high
density coverage are ignored.
-n Max number alignments per read. Defaults to 1, unique. Assumes 'NH' tags have
been set by processing raw alignments with the SamTranscriptomeParser.
-e Minimum number alignments per gene-region per replica, defaults to 10.
-i Score introns instead of exons.
-p Perform a stranded analysis. Only collect reads from the same strand as the
annotation.
-j Reverse stranded analysis. Only collect reads from the opposite strand of the
annotation. This setting should be used for Illumina's strand-specific
dUTP protocol.
-k Second read's strand is flipped. Otherwise, assumes this was not done in the
SamTranscriptomeParser.
-t Don't delete temp files (R script, R results, Rout, etc..).
-a Run SAMseq in place of DESeq2. This is only recommended with five or more
replicates per condition.
-v Use these 3 -10Log10(AdjPVal) thresholds, comma delimited, no spaces, defaults
to 10,20,30
-w Use these 3 absolute log2 ratio thresholds, comma delimited, no spaces, defaults
to 0.585,1,1.585
-y Add in non phred AdjPVal columns, defaults to excluding.
Example: java -Xmx4G -jar pathTo/USeq/Apps/DefinedRegionDifferentialSeq -c
/Data/TimeCourse/ESCells/ -s /Data/TimeCourse/DRDS -g H_sapiens_Feb_2009
-u /Anno/mergedHg19EnsemblGenes.ucsc.gz -w 0.322,0.585,1 -y
**************************************************************************************
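The -v and -w thresholds above are given as -10Log10(AdjPVal) and log2 ratio values.
A minimal Python sketch of the conversions follows; the function names are
illustrative and are not part of USeq. The -v defaults of 10,20,30 correspond to
adjusted p-values of 0.1, 0.01, and 0.001, and the -w defaults of 0.585,1,1.585
correspond to fold changes of roughly 1.5x, 2x, and 3x.

```python
import math


def phred_to_p(phred_score):
    """Convert a -10*log10(p) score back to a plain p-value."""
    return 10 ** (-phred_score / 10.0)


def p_to_phred(p_value):
    """Convert a plain p-value to the -10*log10(p) representation."""
    return -10.0 * math.log10(p_value)


def log2_to_fold(log2_ratio):
    """Convert an absolute log2 ratio to a fold change."""
    return 2.0 ** log2_ratio


if __name__ == "__main__":
    for phred in (10, 20, 30):        # DRDS -v defaults
        print(phred, phred_to_p(phred))
    for ratio in (0.585, 1, 1.585):   # DRDS -w defaults
        print(ratio, round(log2_to_fold(ratio), 2))
```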
**************************************************************************************
** Defined Region RNA Editing: April 2014 **
**************************************************************************************
DRRE scores regions for the pseudomedian of the base fraction edits as well as the
probability that the observations occurred by chance using a permutation test based on
the chiSquare goodness of fit statistic.
Options:
-b A bed file of regions to score (tab delimited: chr start stop ...)
-e Edited PointData directory from the RNAEditingPileUpParser.
These should contain stranded chromosome specific xxx_-/+_.bar.zip files. One
can also provide a single directory that contains multiple PointData
directories. These will be merged when scanning.
-r Reference PointData directory from the RNAEditingPileUpParser. Ditto.
-a Minimum base read coverage, defaults to 5.
-t Run a stranded analysis, defaults to non-stranded.
-i Remove base fraction edits that are non zero and represented by just one edited
base.
Example: java -Xmx4G -jar pathTo/USeq/Apps/DefinedRegionRNAEditing -b hg19UTRs.bed
-e /PointData/Edited -r /PointData/Reference
**************************************************************************************
**************************************************************************************
** Defined Region Scan Seqs: March 2011 **
**************************************************************************************
DRSS takes chromosome specific PointData xxx.bar.zip files and extracts scores under
each region to calculate several statistics including a binomial p-value, Storey
q-value FDR, an empirical FDR, a p-value for strand skew, and a chi-square test of
independence between the exon read count distributions between treatment and control
data (a test for alternative splicing). Several measures of read counts are provided
including counts for each strand, a normalized log2 ratio, and RPKMs (# reads per kb
of interrogated region per total million mapped reads). If a gene table is provided,
scores under each exon are summed to give a whole gene summary. It is also recommended
to run a gene table of introns (see the ExportIntronicRegions app) to look for
intronic retention and novel transfrags/ exons. If one provides splice junction bed
files for treatment and control RNA-Seq data, see the NovoalignParser, splice
junctions will be scored for differential expression. This is an additional
calculation unrelated to the chi-square independence test. Lastly, if control
data is not provided, simple region sums are calculated.
Options:
-s Save directory, full path.
-t Treatment PointData directories, full path, comma delimited. These should
contain unshifted stranded chromosome specific xxx_-/+_.bar.zip files. One
can also provide a single directory that contains multiple PointData
directories.
-c Control PointData directories, ditto.
-p Peak shift, average distance between + and - strand peaks for chIP-Seq data, see
PeakShiftFinder. For RNA-Seq set to the smallest expected fragment size. Will
be used to shift the PointData 3' by 1/2 the peak shift.
-r Full path to R loaded with Storey's q-value library, defaults to '/usr/bin/R'
file, see http://genomics.princeton.edu/storeylab/qvalue/
-u UCSC RefFlat or RefSeq gene table file, full path. See,
http://genome.ucsc.edu/cgi-bin/hgTables, (name1 name2(optional) chrom strand
txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds)
-b (Or) a bed file (chr, start, stop,...), full path, See,
http://genome.ucsc.edu/FAQ/FAQformat#format1
Advanced Options:
-o Don't remove overlapping exons, defaults to filtering gene annotation for overlaps.
-i Score introns instead of exons.
-f Scan for just enriched regions, defaults to look for both. Only use with chIP-Seq
datasets where the control is input. This turns on the empFDR estimation.
-d Treatment splice junction bed file(s) from the NovoalignParser, comma delimited,
full path.
-e Control splice junction bed file(s), comma delimited, full path.
-m Minimum number of reads in associated gene before scoring splice junctions.
Used in estimating the expected proportion of T and scaling the log2Ratio.
Defaults to 100.
-w Use read score probabilities (assumes scores are > 0 and <= 1), defaults to
assigning 1 to each read score. Experimental.
Example: java -Xmx4G -jar pathTo/USeq/Apps/DefinedRegionScanSeqs -t
/Data/PolIIRep1/,/Data/PolIIRep2/ -c /Data/Input1/,/Data/Input2/ -s
/Data/PolIIResults -p 100 -b /Data/selectRegions.bed -f
**************************************************************************************
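Two of the DRSS statistics described above can be sketched in a few lines of Python.
This is a hedged illustration only: rpkm() follows the definition given in the text
(reads per kb of interrogated region per million mapped reads), and
chi_square_independence() computes a textbook chi-square statistic on a 2 x N table
of per-exon read counts. Any normalization, filtering, or p-value correction that
DRSS applies is not shown, and the function names are hypothetical.

```python
def rpkm(region_reads, region_length_bp, total_mapped_reads):
    """Reads per kb of region per million mapped reads."""
    kilobases = region_length_bp / 1000.0
    millions = total_mapped_reads / 1_000_000.0
    return region_reads / kilobases / millions


def chi_square_independence(treatment_counts, control_counts):
    """Chi-square statistic and degrees of freedom for a 2 x N count table.

    Each list holds the read count for every exon of one gene.
    """
    rows = [treatment_counts, control_counts]
    row_totals = [sum(row) for row in rows]
    col_totals = [t + c for t, c in zip(treatment_counts, control_counts)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for r, row in enumerate(rows):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / grand_total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = len(treatment_counts) - 1
    return statistic, degrees_of_freedom
```

Exon counts that are proportional between treatment and control give a statistic of
zero; a shift in the relative usage of exons raises it.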
**************************************************************************************
** DRDS Annotator: January 2014 **
**************************************************************************************
This application annotates DefinedRegionDifferentialSeq xlsx files using Ensembl
biomart tab-delimited annotation files. By default, ensembl biomart output files will
list the Ensembl gene id in the first column and Ensembl transcript id in the second
column. This application assumes these defaults. It will match the gene id in the
first column of the biomart file to the name listed in the 'IGB HyperLink' column
found in the 'Analyzed Genes' tab of the DRDS xlsx output. All biomart columns after
the transcript id column are added to the output file. The data is inserted between
the 'Alt Name' and locus columns in the 'Analyzed Genes' tab.
The biomart output files can have multiple annotation lines for each gene id.
Currently, this app uses the first annotation line encountered.
Required Arguments:
-i Input file. Path to DRDS xlsx output file you wish to annotate
-a Annotation file. Path to biomart annotation file.
-o Annotated output file. Path to the annotated output file
Example: java -Xmx4G -jar pathTo/USeq/Apps/DRDSAnnotator -i geneStats.xlsx
-a mm10.biomart.txt -o geneStats.ann.xlsx
**************************************************************************************
**************************************************************************************
** Enriched Region Maker: July 2013 **
**************************************************************************************
ERM combines windows from ScanSeqs xxx.swi files into larger enriched or reduced
regions based on one or more scores. For each score index, you must provide a minimal
score. Adjacent windows that exceed the minimum score(s) are merged and the best
window scores applied to the region. If treatment and control PointData are provided,
the best 25bp peak within each region will be identified and each ER rescored. To
select for ERs with a 1% FDR and 2x enrichment above control, follow the example
assuming score indexes 1,2,4 correspond to QValFDR, EmpFDR, and
Log2Ratio. Note, if you are performing a static analysis comparing chIP vs chIP,
don't set thresholds on the EmpFDR, this was disabled and all of the values are zero.
To print descriptions of the score indexes, complete the command line and skip the
-i option. Lastly, FDRs and p-values are represented in USeq in a transformed state,
as -10Log10(FDR/p-val) where 13 = 5%, 20 = 1%, etc. To select for regions with an FDR of
less than 1% you would set a threshold of 20 for the QValFDR and, if running a static
analysis, the EmpFDR.
Options:
-f Full path file name for the serialized xxx.swi file from ScanSeqs, if a
directory is specified, all xxx.swi files will be processed.
-s Minimal score(s) one for each score index, comma delimited, no spaces.
-i Score index(s) one for each minimum score.
Advanced Options:
-n Make a given number of ERs, one or more, comma delimited, no spaces. Uses score
index 0.
-m Multiply scores by -1 to make reduced regions instead of enriched regions.
-r Remove windows that intersect a list of regions. Enter a full path tab delimited
regions file text (chr start stop) Coordinates are assumed to be zero based and
stop inclusive. Useful for excluding regions from ER generation.
-b BP buffer to subtract and add to start and stops of regions used in filtering
intersecting windows, defaults to 0.
-e Exclude entire ERs that intersect the -r regions, defaults to removing windows.
This is more exclusive and will not simply punch holes in ERs but throw out
the entire ER.
-g Max gap, defaults to the size of the window used in ScanSeqs.
-t Provide treatment PointData directories, full path, comma delimited to ID the peak
center in each ER. These should contain the same unshifted stranded chromosome
specific xxx_-/+_.bar.zip files used in ScanSeqs.
-c Control PointData directories, ditto.
-p Full path to R, defaults to '/usr/bin/R', required for rescanning ERs.
-w Sub window size, defaults to 25bp.
Example: java -Xmx500M -jar pathTo/USeq/Apps/EnrichedRegionMaker -f /solexa/zeste.swi
-i 1,2,4 -s 20,20,1 -w 50
**************************************************************************************
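The window merging described above can be sketched as follows for a single score
index on one chromosome. This is an illustration under stated assumptions, not the
ERM implementation: windows are (start, stop, score) tuples sorted by start, a window
passes when its score is greater than or equal to the minimum, and the gap is
measured from the stop of the growing region to the start of the next passing window.

```python
def merge_windows(windows, min_score, max_gap):
    """Merge passing windows into enriched regions.

    windows: list of (start, stop, score) tuples sorted by start.
    Returns a list of (start, stop, best_score) regions.
    """
    regions = []
    for start, stop, score in windows:
        if score < min_score:
            continue  # window fails the threshold and is dropped
        if regions and start - regions[-1][1] <= max_gap:
            # Close enough to the previous region: extend it and keep the best score.
            prev_start, prev_stop, prev_best = regions[-1]
            regions[-1] = (prev_start, max(prev_stop, stop), max(prev_best, score))
        else:
            regions.append((start, stop, score))
    return regions
```

With scores in the -10Log10(FDR) scale, a min_score of 20 selects windows with an FDR
of 1% or better.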
**************************************************************************************
** Estimate Error Rates: Jan 2017 **
**************************************************************************************
EER scans an mpileup file looking for short windows of adjacent bps (default 7) where
1) each base exceeds a minimum read depth of high quality bases (>100)
2) shows little evidence of indels (<0.1), and
3) the fraction of poor quality bps isn't excessive (<0.5).
The non reference snv observations are then tabulated for the center base in each
window, if low (<0.1), they are assumed error and saved. For indel error calculations,
each bp is scored as above sans the indel filter and window requirement. Insertions
are counted once regardless of the size, whereas deletions are counted for every base
affected. Run this app on samples where real snvs and indels are expected to have an
allele frequency of > ~0.5, e.g. normal or pure single clone somatic.
Required Options:
-m Path to a normal sample mpileup file (gz/zip OK), 'samtools mpileup -B -q 20 -d
1000000 -f $fastaIndex -l $bedFile *.bam | gzip > mpileup.gz' Multiple samples
in the file are merged.
Default Options:
-b Minimum base quality, default 20
-r Minimum good base coverage, default 100
-i Maximum INDEL allele freq for snv counting, default 0.1
-n Maximum non reference allelic freq, default 0.1
-p Maximum failing base allele freq, default 0.5
-f Number flanking bp to define scorable region, default 3
-s Comma delimited list (zero is 1st sample, no spaces) of sample indexes to merge,
defaults to all.
-c File path to save a count table of parsed observations, defaults to none.
Example: java -Xmx4G -jar pathToUSeq/Apps/EstimateErrorRates -m normExo.mpileup.gz
-r 200 -i 0.15 -f 2 -s 0,3,4 -c countTable.txt
**************************************************************************************
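The three window criteria and the center base check described above can be sketched
in Python. This is a hedged illustration only: the BaseStats fields are hypothetical
summaries of one mpileup position, thresholds are treated as inclusive limits, and
the actual parsing of mpileup lines is not shown. Note that the default window of 7
bp equals 2 x the -f flank of 3, plus the center base.

```python
from dataclasses import dataclass


@dataclass
class BaseStats:
    """Hypothetical per-position summary derived from one mpileup line."""
    good_depth: int      # number of bases passing the minimum base quality
    indel_freq: float    # fraction of reads supporting an indel
    failing_freq: float  # fraction of bases failing the base quality cutoff
    non_ref_freq: float  # fraction of good bases that differ from the reference


def window_size(flank):
    """Number of adjacent bps scored: the flanks plus the center base."""
    return 2 * flank + 1


def window_passes(window, min_depth=100, max_indel=0.1, max_failing=0.5):
    """True when every base in the window meets all three criteria."""
    return all(
        b.good_depth >= min_depth
        and b.indel_freq <= max_indel
        and b.failing_freq <= max_failing
        for b in window
    )


def center_is_error(window, max_non_ref=0.1):
    """True when the center base's non reference fraction is low enough
    to be counted as sequencing error rather than a real variant."""
    center = window[len(window) // 2]
    return center.non_ref_freq <= max_non_ref
```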
**************************************************************************************
** Exact Bam Mixer : March 2019 **
**************************************************************************************
Combines bam alignment files in different fractions to simulate multiple variant
frequencies. Run BamBlaster first. Threaded, so provide almost all the memory available
to java. The ExactBamMixer attempts to create bam files containing variants with very
similar AFs. The BamMixer produces more of a spread of AFs.
Required:
-r Path to a directory to save the results
-u Path to the xxx_unmodified.bam from your BamBlaster run
-f Path to the xxx_filtered.bam from your BamBlaster run
-i Path to your realigned bam containing injected variants, merge the single and
paired end alignment files with MergeSams USeq app.
-v Path to the vcf file containing variants used to modify the BamBlaster alignments
Optional:
-m Fractions to mix in the variant alignments, comma delimited, no spaces, defaults to
0.025,0.05,0.1,0.2
-t Number of threads to use, defaults to all
-a Minimum number alt read pairs to include an injected variant in a particular mixed
bam, defaults to 2
Example: java -Xmx100G -jar pathTo/USeq/Apps/ExactBamMixer -r ~/TumorSim/ -v inject.vcf
-u ~/bb_unmodified.bam -f ~/bb_filtered.bam -i ~/bb_realigned.bam
**************************************************************************************
**************************************************************************************
** Export Exons Nov 2014 **
**************************************************************************************
EE takes a UCSC Gene table and prints the exons to a bed file.
Parameters:
-g Full path file text for the UCSC Gene table.
-a Expand the size of each exon by X bp, defaults to 0
-u Remove UTRs if present, defaults to including
-n Append exon numbers to the gene name field. This makes the bed file compatible
with DRDS
-f Export just 5' UTRs
Example: java -Xmx1000M -jar pathTo/T2/Apps/ExportExons -g /user/Jib/ucscPombe.txt
-a 50
**************************************************************************************
**************************************************************************************
** Export Intergenic Regions May 2007 **
**************************************************************************************
EIR takes a gff file and uses it to mask a boolean array. Parts of the boolean array
that are not masked are returned and represent intergenic sequences. Be sure to put in
a gff line at the stop of each chromosome noting the last base so you capture the last
intergenic region. (eg chr1 GeneDB lastBase 3600000 3600001 . + . lastBase). Base
coordinates are assumed to be stop inclusive, not interbase.
Parameters:
-g Full path file text for a gff file or directory containing such.
-t Base pairs to trim from the ends of each intergenic region, defaults to 0.
-m Minimum acceptable intergenic size, those smaller will be tossed, defaults to 60bp
-s Subtract one from the start and stop coordinates.
Example: java -Xmx1000M -jar pathTo/T2/Apps/ExportIntergenicRegions -s -m 100 -g
/user/Jib/GffFiles/Pombe/sanger.gff
**************************************************************************************
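The boolean array masking described above can be sketched in Python. This is an
illustration under assumptions rather than the EIR code: features are (start, stop)
pairs with zero based, stop inclusive coordinates, the chromosome length is passed in
directly instead of being read from a lastBase gff line, and trimming is applied
before the minimum size check.

```python
def intergenic_regions(chrom_length, features, min_size=60, trim=0):
    """Return unmasked (start, stop) runs, stop inclusive."""
    masked = [False] * chrom_length
    for start, stop in features:
        for i in range(max(start, 0), min(stop, chrom_length - 1) + 1):
            masked[i] = True

    regions = []
    run_start = None
    # A trailing True acts as a sentinel so the final unmasked run is closed.
    for i, is_masked in enumerate(masked + [True]):
        if not is_masked and run_start is None:
            run_start = i
        elif is_masked and run_start is not None:
            start, stop = run_start + trim, i - 1 - trim
            if stop - start + 1 >= min_size:
                regions.append((start, stop))
            run_start = None
    return regions
```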
**************************************************************************************
** Export Intronic Regions June 2007 **
**************************************************************************************
EIR takes a UCSC Gene table and fetches the most conservative/ smallest intronic
regions. Base coordinates are assumed to be stop inclusive, not interbase.
Parameters:
-g Full path file text for the UCSC Gene table.
-m Minimum acceptable intron size, those smaller will be tossed, defaults to 60bp
-s Subtract one from the stop coordinates of your UCSC table to convert from interbase.
Example: java -Xmx1000M -jar pathTo/T2/Apps/ExportIntronicRegions -s -m 100 -g
/user/Jib/ucscPombe.txt
**************************************************************************************
**************************************************************************************
** Export Trimmed Genes May 2012 **
**************************************************************************************
ETG takes a UCSC Gene table and clips each gene back to the first intron closed by a
coding sequence exon. Thus these include all of the 5'UTRs. Genes with no introns are
removed.
Parameters:
-g Full path file text for the UCSC Gene table.
-u Print just UTRs, defaults to UTRs plus 1st CDS intron with flanking exon.
-i Print just 1st CDS intron with flanking exons.
Example: java -Xmx1000M -jar pathTo/T2/Apps/ExportTrimmedGenes -u -g
/user/Jib/ucscPombe.txt
**************************************************************************************
**************************************************************************************
** Fastq Barcode Tagger: August 2018 **
**************************************************************************************
Takes 2 or 3 fastq files (paired end reads and possibly a third containing unique
molecular barcodes/ indexes), appends the barcode and quality to the fastq header, and
writes out the modified records. For IDT inline 2 fastq UMI data sets, the barcode is
parsed from the beginning of each fastq. Be sure to clip 5Ns from the 3' end when
adapter trimming.
Options:
-f First fastq file, .gz/.zip OK.
-s Second fastq file, .gz/.zip OK.
-b Barcode fastq file, .gz/.zip OK, or set -e
-e Parse barcodes from the first 3bp of each read and combine the two 3mers into a
6mer barcode. 5bp are trimmed from the ends of each read to remove the UMI and
2bp constant seq as well as any potential read through. IDT's current strategy.
-i Write interlaced fastq to stdout for direct piping to other apps
-r Directory to save the modified fastqs, defaults to the parent of -f
-l Max length of barcode, defaults to all. Use to trim 3' end.
-a Append the line number to the read name to uniquify.
Example: java -Xmx1G -jar pathToUSeq/Apps/FastqBarcodeTagger -f lob_1.fastq.gz
-s lob_2.fastq.gz -b lob_barcode.fastq.gz -i | bwa mem -p /ref/hg19.fa
**************************************************************************************
**************************************************************************************
** Fastq Interlacer: April 2016 **
**************************************************************************************
Takes paired fastq files and writes interlaced/ interleaved fastq to stdout.
Options:
-f First fastq file, .gz/.zip OK.
-s Second fastq file, .gz/.zip OK.
Example: java -Xmx1G -jar pathToUSeq/Apps/FastqInterlacer -f lob_1.fastq.gz
-s lob_2.fastq.gz | cutadapt | bwa | samblaster ....
**************************************************************************************
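Interlacing alternates one four-line record from the first file with the matching
record from the second. A minimal Python sketch of that idea follows; it is not the
USeq code, the function names are illustrative, and it assumes both inputs hold the
same number of records in the same order (extra records in the longer input are
silently dropped by zip).

```python
from itertools import islice


def read_records(lines):
    """Yield fastq records as lists of four lines."""
    lines = iter(lines)
    while True:
        record = list(islice(lines, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated fastq record")
        yield record


def interlace(first_lines, second_lines):
    """Yield lines alternating one record from each input."""
    for rec1, rec2 in zip(read_records(first_lines), read_records(second_lines)):
        yield from rec1
        yield from rec2
```

To stream gzipped files, pass handles from gzip.open(path, "rt") and write the
result with sys.stdout.writelines(interlace(first, second)).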
**************************************************************************************
** Fastq Renamer: April 2018 **
**************************************************************************************
Takes paired fastq files and replaces the header with the record count.
Options:
-f First fastq file, .gz/.zip OK.
-s Second fastq file, .gz/.zip OK.
-d Path to a directory for saving the modified fastq files.
Example: java -Xmx1G -jar pathToUSeq/Apps/FastqRenamer -f lob_1.fastq.gz
-s lob_2.fastq.gz -d UniquifiedFastq/
**************************************************************************************
**************************************************************************************
** FetchGenomicSequences: Feb 2013 **
**************************************************************************************
Given a file containing genomic coordinates, fetches and saves the sequence (column
output: chrom origStart origStop fetchedStart fetchedStop completeFetch seq).
-f Full path to a file or directory containing tab delimited chrom, start,
stop text files. Interbase coordinates (zero based, stop excluded).
-s Full path directory text containing genomic fasta files. The fasta
header defines the name of the sequence, not the file name.
-b Fetch flanking bases, defaults to 0. Will set start to zero or stop to last base if
boundaries are exceeded.
-r Reverse complement fetched sequences, defaults to returning the + genomic strand.
-a Output fasta format.
Example: java -Xmx1000M -jar pathTo/T2/Apps/FetchGenomicSequences -f /data/miRNAs.txt
-s /genomes/human/v35.1/ -b 5000 -r
**************************************************************************************
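The coordinate handling described above (interbase coordinates, flanking bases
clamped to the chromosome boundaries, optional reverse complement) can be sketched in
Python. This is an illustration, not the app's code; the chromosome is assumed to be
already loaded as a string, and completeFetch is taken to mean that no clamping was
needed.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement a DNA sequence, preserving case."""
    return seq.translate(_COMPLEMENT)[::-1]


def fetch(chrom_seq, start, stop, flank=0, reverse=False):
    """Fetch chrom_seq[start:stop] plus flanks (zero based, stop excluded).

    Returns (fetched_start, fetched_stop, complete_fetch, sequence).
    """
    fetched_start = max(start - flank, 0)
    fetched_stop = min(stop + flank, len(chrom_seq))
    complete = fetched_start == start - flank and fetched_stop == stop + flank
    seq = chrom_seq[fetched_start:fetched_stop]
    if reverse:
        seq = reverse_complement(seq)
    return fetched_start, fetched_stop, complete, seq
```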
**************************************************************************************
** Find Neighboring Genes: Nov 2008 **
**************************************************************************************
FNG takes a list of genes in UCSC Gene Table format and intersects them with a list of
regions finding the closest gene to each region as well as all of the genes that fall
within a given neighborhood. Distance is measured from the center of the region to the
transcription start site/ 1st base position in 1st exon. See Tables link under
http://genome.ucsc.edu/ . Note, output coordinates are zero based, stop inclusive.
-g Full path file text for a tab delimited UCSC Gene Table (text chrom strand txStart
txEnd cdsStart cdsEnd exonCount exonStarts exonEnds etc...) .
-p Full path file/directory text for tab delimited region list(s) (chr, start, stop) .
-b Size of neighborhood in bp, default is 10000
-f Find genes that overlap neighborhood regardless of distance to TSS.
-c Only print closest genes.
-o Print neighbors on one line.
Example: java -jar pathTo/T2/Apps/FindNeighboringGenes -g /anno/hg17Ensembl.txt -p
/affy/p53/finalPicks.txt -b 5000 -c
**************************************************************************************
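The distance measure described above (region center to transcription start site) can
be sketched in Python. This is a hedged illustration: genes are plain dicts with
hypothetical keys, the TSS is taken as txStart on the + strand and txEnd on the -
strand, and the neighborhood is interpreted as a maximum distance from the region
center. The app's handling of ties and of the -f overlap option is not shown.

```python
def tss(gene):
    """Transcription start site: txStart for + strand genes, txEnd for - strand."""
    return gene["tx_start"] if gene["strand"] == "+" else gene["tx_end"]


def neighbors(region_start, region_stop, genes, neighborhood=10000):
    """Return (closest, within) where closest is (distance, name) or None and
    within lists every (distance, name) inside the neighborhood, nearest first."""
    center = (region_start + region_stop) // 2
    scored = sorted((abs(tss(g) - center), g["name"]) for g in genes)
    closest = scored[0] if scored else None
    within = [(dist, name) for dist, name in scored if dist <= neighborhood]
    return closest, within
```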
**************************************************************************************
** Find Overlapping Genes: Oct 2010 **
**************************************************************************************
Finds overlapping genes that converge, diverge, or contain one another given a UCSC
gene table.
Options:
-u UCSC RefFlat or RefSeq gene table file, full path. See,
http://genome.ucsc.edu/cgi-bin/hgTables, (name1 name2(optional) chrom strand
txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds). NOTE:
this table should contain only one composite transcript per gene (e.g. Use
Ensembl genes NOT transcripts. See MergeUCSCGeneTable app.).
Example: java -Xmx4G -jar pathTo/USeq/Apps/FindOverlappingGenes -u
/data/zv8EnsemblGenes.ucsc.gz
**************************************************************************************
**************************************************************************************
** Find Shared Regions: Dec 2018 **
**************************************************************************************
Writes out a bed file of shared regions. Interbase coordinates.
Options:
-f First bed file (tab delimited: chr start stop ...).
-s Second bed file.
-r Results file.
-m Minimum length, defaults to 0.
Example: java -Xmx4G -jar pathTo/USeq/Apps/FindSharedRegions -f
/Res/firstBedFile.bed -s /Res/secondBedFile.bed -r /Res/common.bed -m 100
************************************************************************************
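Finding shared regions amounts to intersecting two interval lists. A minimal Python
sketch for one chromosome follows, assuming interbase (start, stop) pairs that are
sorted and non-overlapping within each list. It illustrates the idea only and is not
the FindSharedRegions implementation.

```python
def shared_regions(first, second, min_length=0):
    """Return the intersections of two sorted, non-overlapping interval lists."""
    shared = []
    i = j = 0
    while i < len(first) and j < len(second):
        start = max(first[i][0], second[j][0])
        stop = min(first[i][1], second[j][1])
        if stop > start and stop - start >= min_length:
            shared.append((start, stop))
        # Advance whichever interval ends first.
        if first[i][1] < second[j][1]:
            i += 1
        else:
            j += 1
    return shared
```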
**************************************************************************************
** File Cross Filter: Sept 2017 **
**************************************************************************************
FCF takes one or more columns in the matcher file and uses these as a key to find and
save the matching lines in the files to parse. Use this to parse lines in files that match
those in another. Keys must be unique. The order and number of the rows in the matcher
file is preserved, if a match is not found in the parsed file, a blank line is inserted
instead.
-m Path to a tab delimited txt (.gz/.zip OK) file to use in matching.
-a One or more column indexes in the matcher file to use as the key.
-p Path to a file or directory of files to parse (.gz/.zip OK).
-b One or more column indexes in the parse file(s) to use as a key.
Example: java -jar pathTo/USeq/Apps/FileCrossFilter -m intRegions.bed -a 0,1,2
-p SpreadSheetData/ -b 0,1,2
**************************************************************************************
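The matching behavior described above (unique keys, matcher row order preserved, a
blank line where no match is found) can be sketched in Python. This is an
illustration with a hypothetical function name, operating on lists of tab delimited
lines rather than files.

```python
def cross_filter(matcher_lines, parse_lines, matcher_cols, parse_cols):
    """Return one output line per matcher line: the matching parse line,
    or an empty string when the key is absent from the parse file."""

    def key(line, cols):
        fields = line.rstrip("\n").split("\t")
        return tuple(fields[c] for c in cols)

    lookup = {}
    for line in parse_lines:
        k = key(line, parse_cols)
        if k in lookup:
            raise ValueError(f"duplicate key: {k}")  # keys must be unique
        lookup[k] = line.rstrip("\n")

    return [lookup.get(key(line, matcher_cols), "") for line in matcher_lines]
```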
**************************************************************************************
** File Match Joiner: July 2008 **
**************************************************************************************
FMJ loads a file and a particular column containing unique entries, a key, and then
appends the key line to lines in the parsed file that match a particular column.
Useful for appending say chromosome coordinates to snp ids data, etc.
-k Full path file text for a tab delimited txt file (key) containing unique entries.
-f Ditto but for the file to parse, can specify a directory too.
-i Collapse duplicate keys.
-j Skip duplicate keys.
-a Column index containing the unique IDs in key, defaults to 0.
-b Column index containing the unique IDs in parsers, defaults to 0.
-p Print only matches.
Example: java -jar pathTo/Apps/FileMatchJoiner -k /snpChromMap.txt -f /SNPData/
-b 2 -p
**************************************************************************************
**************************************************************************************
** File Joiner: Feb 2016 **
**************************************************************************************
Joins text files into a single file, avoiding line concatenations. This is a problem
with using 'cat * >> combine.txt'. Removes empty lines. Option to follow custom order.
Parameters:
-f Full path text for the directory containing the text files.
-o (Optional) Order the files using this comma delimited list, no spaces. Not all
need to exist.
-c (Optional) Concatenated results file.
Example: java -jar pathTo/T2/Apps/FileJoiner -f /affy/SplitFiles/
-o 1.fasta,2.fasta,3.fasta,4.fasta
**************************************************************************************
**************************************************************************************
** File Splitter: July 2010 **
**************************************************************************************
Splits a big text file into smaller files given a maximum number of lines.
Required Parameters:
-f Full path file text or directory for the text file(s) (.zip/.gz OK).
-n Maximum number of lines to place in each.
-g GZip split files.
Example: java -Xmx256M -jar pathTo/T2/FileSplitter -f /affy/bpmap.txt -n 50000
**************************************************************************************
**************************************************************************************
** Filter Intersecting Regions: Dec 2018 **
**************************************************************************************
Flattens the mask regions and uses it to split the split file(s) into intersecting
and non intersecting regions based on the minimum fraction intersection. For UCSC gene
tables, exons are compared and if any in the gene intersect, the whole gene is moved
accordingly.
Options:
-m Full path file text for the masking bed file (tab delim: chr start stop ...).
-s Full path file or directory containing bed, gtf/gff, or ucsc gene table files to
split.
-t Type of files to split, indicate: bed, gff, or ucsc
-i Minimum fraction of each split region required to score as an intersection with
the flattened mask, defaults to 1x10-1074
-b Expand start and stop of regions to mask by xxx bps, defaults to 0
Example: java -Xmx4000M -jar pathTo/Apps/FilterIntersectingRegions -i 0.5
-m /ArrayDesigns/repMskedDesign.bed -s /ArrayDesigns/ -t bed
************************************************************************************
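The flatten-then-split logic described above can be sketched in Python for bed style
regions on one chromosome. This is a hedged illustration, not the app's code: regions
are interbase (start, stop) pairs with stop > start, and a region is scored as
intersecting when the covered fraction is greater than or equal to the -i minimum.
The -b expansion and the whole-gene handling for UCSC tables are not shown.

```python
def flatten(regions):
    """Merge overlapping or abutting regions into a sorted, non-overlapping list."""
    merged = []
    for start, stop in sorted(regions):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged


def fraction_covered(region, flat_mask):
    """Fraction of the region's bases that fall inside the flattened mask."""
    start, stop = region
    covered = sum(
        max(0, min(stop, m_stop) - max(start, m_start))
        for m_start, m_stop in flat_mask
    )
    return covered / (stop - start)


def split_regions(regions, mask, min_fraction):
    """Split regions into (intersecting, non_intersecting) lists."""
    flat_mask = flatten(mask)
    hits, misses = [], []
    for region in regions:
        target = hits if fraction_covered(region, flat_mask) >= min_fraction else misses
        target.append(region)
    return hits, misses
```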
**************************************************************************************
** Filter Point Data: May 2016 **
**************************************************************************************
FPD drops or saves observations from PointData that intersect a list of regions
(e.g. repeats, interrogated regions).
Options:
-p Point Data directories, full path, comma delimited. These should contain
chromosome specific xxx.bar.zip files.
-r Full path file text for a tab delimited text file containing regions to use in
filtering the intersecting data (chr start stop ..., interbase coordinates).
-i Select data that intersects the list of regions, defaults to selecting data that
doesn't intersect.
-a Acceptable intersection, fraction, defaults to 0.5
-n Just calculate the number of observations after filtering, don't save any data.
-f Save directory, defaults to derivative of parent.
Example: java -Xmx1500M -jar pathTo/USeq/Apps/FilterPointData -p /data/PointData
-r /repeats/hg18RepeatMasker.bed -a 0.75
**************************************************************************************
**************************************************************************************
** Foundation Vcf Comparator: Nov 2018 **
**************************************************************************************
FVC compares a Foundation vcf generated with the FoundationXml2Vcf to a recalled vcf.
Exact recall vars are so noted and removed. Foundation vcf with no exact but one
overlapping record can be merged with -k. Be sure to vt normalize each before running.
Recall variants failing FILTER are not saved.
Options:
-f Path to a FoundationOne vcf file, see the FoundationXml2Vcf app.
-r Path to a recalled snv/indel vcf file.
-m Path to named vcf file for saving the results.
-c Append chr if absent in chromosome name.
-e Exclude Foundation ##contig header lines.
-k Attempt to merge Foundation records that overlap a recall and are the same type.
Defaults to printing both.
Example: java -Xmx2G -jar pathToUSeq/Apps/FoundationVcfComparator -f /F1/TRF145.vcf
-r /F1/TRF145_recall.vcf.gz -e -c -m /F1/TRF145_merged.vcf.gz -k
**************************************************************************************
**************************************************************************************
** Foundation Xml 2 Vcf: Nov 2018 **
**************************************************************************************
Attempts to parse xml foundation reports to vcf. This is an imprecise process with
some insertions, multi snv, and multi vars. VCF variants have not been normalized.
Consider left aligning and demultiplexing with vt. Remove PHI elements first with grep:
grep -vwE '(MRN|FullName|FirstName|LastName|ReportPDF)' TRF123.xml > clnTRF123.xml
Options:
-x Path to a FoundationOne xml report or directory containing such.
-s Path to a directory for saving the results.
-f Path to the reference fasta with xxx.fai index
-o Skip variants that clearly fail to convert, e.g. var seq doesn't match fasta.
Defaults to marking 'ci' in FILTER field.
Example: java -Xmx2G -jar pathToUSeq/Apps/FoundationXml2Vcf -x /F1/TRF145179.xml
-f /Ref/human_g1k_v37.fasta -s /F1/VCF/
**************************************************************************************
**************************************************************************************
** Freebayes VCF Parser: Mar 2017 **
**************************************************************************************
Parses Freebayes VCF files, filtering for read depth, allele frequency diff ratio, etc.
Inserts AF and DP for the tumor sample into the INFO field. Changes the sample
order to Normal and Tumor and updates the #CHROM line. Put the tumor bam first when
calling freebayes.
Required Options:
-v Full path file or directory containing xxx.vcf(.gz/.zip OK) file(s).
-t Minimum tumor allele frequency (AF), defaults to 0.
-n Maximum normal AF, defaults to 1.
-u Minimum tumor alignment depth, defaults to 0.
-o Minimum normal alignment depth, defaults to 0.
-d Minimum T-N AF difference, defaults to 0.
-r Minimum T/N AF ratio, defaults to 0.
-p Remove non PASS filter field records.
Example: java -jar pathToUSeq/Apps/FreebayesVCFParser -v /VCFFiles/ -t 0.05 -n 0.5
-u 100 -o 20 -d 0.05 -r 2
**************************************************************************************
**************************************************************************************
** Gatk Called Segment Annotator: December 2018 **
**************************************************************************************
Annotates GATKs CallCopyRatioSegments output with denoised copy ratio and heterozygous
allele frequency data from the tumor and matched normal samples. Enables filtering
using these values to remove copy ratio calls with high normal background. Adds
intersecting gene names.
Required Options:
-r Results directory to save the passing and failing segments.
-s Called segment file from GATKs CallCopyRatioSegments app, e.g. xxx.called.seg
-t Tumor denoised copy ratio file, from GATKs DenoiseReadCounts app. Bgzip compress
and tabix index it with https://github.com/samtools/htslib :
grep -vE '(@|CONTIG)' tumor.cr.tsv > tumor.cr.txt
~/HTSLib/bgzip tumor.cr.txt
~/HTSLib/tabix -s 1 -b 2 -e 3 tumor.cr.txt.gz
-n Normal denoised copy ratio file, ditto.
-u Tumor allele frequency file, from GATKs ModelSegments app. Bgzip compress
and tabix index it with https://github.com/samtools/htslib :
grep -vE '(@|CONTIG)' gbm7.hets.tsv > gbm7.hets.txt
~/HTSLib/bgzip gbm7.hets.txt
~/HTSLib/tabix -s 1 -b 2 -e 2 gbm7.hets.txt.gz
-o Normal allele frequency file, ditto.
-g RefFlat UCSC gene file, run USeq's MergeUCSCGeneTable to collapse transcripts.
Default Options:
-c Minimum absolute tumor log2 copy ratio, defaults to 0.15
-x Maximum absolute normal log2 copy ratio, defaults to 0.5
-m Minimum absolute log2 TN ratio of copy ratios, defaults to 0.15
-a Maximum bp gap for intersecting a segment with a gene, defaults to 1000
Example: java -Xmx4G -jar pathTo/USeq/Apps/GatkCalledSegmentAnnotator -r AnnoResults/
-s gbm7.called.seg -t tumor.cr.txt.gz -n normal.cr.txt.gz -u gbm7.hets.txt.gz
-o gbm7.hets.normal.txt.gz -g ~/UCSC/hg38RefSeq_Merged.refFlat.gz -a 100
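
A rough Python sketch of the three copy ratio thresholds listed under Default Options.
It assumes the inputs are already mean log2 copy ratios for a segment and that the
"log2 TN ratio" is the tumor value minus the normal value. The function name
segment_passes is hypothetical and the app may compute these differently.

```python
def segment_passes(tumor_log2, normal_log2,
                   min_tumor=0.15, max_normal=0.5, min_tn=0.15):
    """Apply the -c, -x and -m style thresholds to one called segment."""
    # Assumption: log2(T/N) equals the difference of the two log2 copy ratios.
    tn_log2 = tumor_log2 - normal_log2
    return (abs(tumor_log2) >= min_tumor
            and abs(normal_log2) <= max_normal
            and abs(tn_log2) >= min_tn)


print(segment_passes(0.80, 0.05))   # True, strong tumor signal, quiet normal
print(segment_passes(0.60, 0.55))   # False, normal background too high
print(segment_passes(0.30, 0.25))   # False, tumor and normal nearly identical
```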
**************************************************************************************
**************************************************************************************
** Gatk Runner: March 2018 **
**************************************************************************************
Takes a bed file of target regions, splits it by the number of threads, writes out
each, executes the GATK Gatktype caller, and merges the results. Set the -Xmx to the
maximum available on the machine to enable correct cpu thread usage.
Options:
-r A regions bed file (chr, start, stop,...) to intersect, see
http://genome.ucsc.edu/FAQ/FAQformat#format1 , gz/zip OK.
-s Path to a directory for saving the results.
-t Number concurrent thread override. Sets itself based on the memory and cpus
available to the JVM.
-c GATK command to execute, see the example below, modify to match your environment.
Most resources require full paths. Don't set -o or -L
-l Use lowercased l for Lofreq compatibility.
-b Add a -bamout argument and merge bam chunks.
Example: java -Xmx24G -jar pathToUSeq/Apps/GatkRunner -b -r /SS/targets.bed -s
/SS/HC/ -c 'java -Xmx4G -jar /SS/GenomeAnalysisTK.jar -T MuTect2
-R /SS/human_g1k_v37.fasta --dbsnp /SS/dbsnp_138.b37.vcf
--cosmic /SS/v76_GRCh37_CosmicCodingMuts.vcf.gz -I:tumor /SS/sar.bam -I:normal
/SS/normal.bam'
**************************************************************************************
**************************************************************************************
** GeneiASE Parser: Sept 2016 **
**************************************************************************************
Combines the GeneiASE results file with the input data file.
Required Options:
-r GeneiASE results output file
-d GeneiASE input data file
-o Output file for the summary spreadsheet
Example: java -Xmx4G -jar pathTo/USeq/Apps/GeneiASEParser -r geneiASE.results.txt
-d geneiASE.input.txt -o geneiASE.summary.txt
**************************************************************************************
**************************************************************************************
** Graph 2 Bed: Feb 2011 **
**************************************************************************************
Converts USeq stair step and heat map graphs into region bed files using a threshold.
Do not use this with non USeq generated graphs. Won't work with bar or point graphs.
Options:
-p Point Data directories, full path, comma delimited. Should contain chromosome
specific xxx.bar.zip or xxx_-_.bar files. May point this to a single directory
of such too.
-t Threshold, regions exceeding it will be saved, defaults to 0.
Example: java -Xmx1500M -jar pathTo/USeq/Apps/Graph2Bed -t 9 -p /data/ReadCoverage
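
The thresholding step can be pictured with a short Python sketch. It assumes a stair
step graph is available as sorted (start, stop, value) blocks in interbase coordinates
for one chromosome. That representation and the name steps_to_regions are illustrative
only, not the app's internal format.

```python
def steps_to_regions(steps, threshold=0.0):
    """Merge consecutive blocks whose value exceeds the threshold into regions."""
    regions = []
    for start, stop, value in steps:
        if value <= threshold:
            continue
        if regions and regions[-1][1] == start:
            # Block abuts the previous passing block, so extend that region.
            regions[-1] = (regions[-1][0], stop)
        else:
            regions.append((start, stop))
    return regions


coverage = [(0, 10, 5), (10, 20, 12), (20, 30, 15), (30, 40, 3), (40, 50, 10)]
print(steps_to_regions(coverage, threshold=9))  # [(10, 30), (40, 50)]
```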
**************************************************************************************
**************************************************************************************
** Generate Overlap Stats: Dec 2012 **
**************************************************************************************
Merges proper paired alignments that pass a variety of checks and thresholds. Only
unambiguous pairs will be merged. Increases base calling accuracy in overlap and helps
avoid non-independent variant observations and other double counting issues. Identical
overlapping bases are assigned the higher quality scores. Disagreements are resolved
toward the higher quality base. If too close in quality, then the quality is set to 0.
Be certain your input bam/sam file(s) are sorted by query name, NOT coordinate.
Options:
-f The full path file or directory containing raw xxx.sam(.gz/.zip OK)/.bam file(s)
paired alignments.
Multiple files will be merged.
Default Options:
-a Maximum alignment score (AS:i: tag). Defaults to 120, smaller numbers are more
stringent. Approx 30pts per mismatch for novoalignments.
-q Minimum mapping quality score, defaults to 13, larger numbers are more stringent.
Set to 0 if processing splice junction indexed RNASeq data.
-r The second paired alignment's strand is reversed. Defaults to not reversed.
-d Maximum acceptable base pair distance for merging, defaults to 5000.
-m Don't cross check read mate coordinates, needed for merging repeat matches. Defaults
to checking.
-l Output file name. Write merging statistics to file instead of standard output.
Example: java -Xmx1500M -jar pathToUSeq/Apps/MergePairedSamAlignments -f /Novo/Run7/
-c -s /Novo/STPParsedBams/run7.bam -d 10000
**************************************************************************************
**************************************************************************************
** Gr2Bar: Nov 2006 **
**************************************************************************************
Converts xxx.gr.zip files to chromosome specific bar files.
-f The full path directory/file text for your xxx.gr.zip file(s).
-v Genome version (ie H_sapiens_Mar_2006), get from UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases
-o Orientation of GR file. If not specified, orientation is left as '.'
Example: java -Xmx1500M -jar pathTo/T2/Apps/Gr2Bar -f /affy/GrFiles/ -v hg17
**************************************************************************************
**************************************************************************************
** Inosine Predict: Aug 2010 **
**************************************************************************************
IP estimates the likelihood of ADAR RNA editing using the multiplicative 4L,4R model
described in Eggington et al. 2010.
Options:
-f Multi fasta file containing sequence(s) to score.
-m Matrix scoring file.
-p Print an example matrix.
-o Don't include the opposite strand.
-s Save directory, defaults to parent of the fasta file.
-z Name of a zip archive to create containing the results.
Example: java -Xmx2G -jar pathTo/USeq/Apps/InosinePredict -m
~/ADARMatrix/hADAR1-D.matrix.txt -f ~/SeqsToScore/candidates.fasta.gz
**************************************************************************************
**************************************************************************************
** Intersect Lists: Dec 2008 **
**************************************************************************************
IL intersects two lists (of genes) and using randomization, calculates the
significance of the intersection and the fold enrichment over random. Note, duplicate
items are filtered from each list prior to analysis.
-a Full path file text for list A (or directory containing), one item per line.
-b Full path file text for list B (or directory containing), one item per line.
-t The total number of unique items from which A and B were drawn.
-n Number of permutations, defaults to 1000.
-p Print the intersection sets (common, unique to A, unique to B) to screen.
Example: java -Xmx1500M -jar pathTo/Apps/IntersectLists -a /Data/geneListA.txt -b
/Data/geneListB.txt -t 28356 -n 10000
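
One way to picture the randomization is the Python sketch below. It is an assumption
about the general approach, not the app's implementation: random sets of the same
sizes as A and B are drawn from the -t universe, the p-value is the fraction of trials
whose overlap is at least the observed one, and fold enrichment is observed over the
mean random overlap. The name intersect_significance and the seed argument are
hypothetical.

```python
import random


def intersect_significance(list_a, list_b, total, permutations=1000, seed=0):
    """Return (observed overlap, empirical p-value, fold enrichment over random)."""
    a, b = set(list_a), set(list_b)  # duplicates are filtered first
    observed = len(a & b)
    rng = random.Random(seed)
    universe = range(total)
    hits = 0
    random_sum = 0
    for _ in range(permutations):
        random_a = set(rng.sample(universe, len(a)))
        random_b = set(rng.sample(universe, len(b)))
        overlap = len(random_a & random_b)
        random_sum += overlap
        if overlap >= observed:
            hits += 1
    mean_random = random_sum / permutations
    fold = observed / mean_random if mean_random > 0 else float("inf")
    return observed, hits / permutations, fold


genes_a = ["g%d" % i for i in range(50)]
genes_b = ["g%d" % i for i in range(40)] + ["h%d" % i for i in range(10)]
observed, p_value, fold = intersect_significance(genes_a, genes_b, total=1000)
print(observed, p_value, fold)
```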
**************************************************************************************
**************************************************************************************
** Intersect Key With Regions: July 2012 **
**************************************************************************************
IR intersects lists of genomicRegions (chrom start stop(inclusive)) with a key, assumes the
lists are sorted from most confident to least confident. Multiple hits to the same key
region are ignored.
-k Full path file text for the key genomicRegions file, tab delimited (chr start
stop(inclusive)).
-r Full path file text or directory containing your region files to score.
-g Max gap, defaults to -1. A max gap of 0 = genomicRegions must abut, negative values force
overlap (ie -1= 1bp overlap, be careful not to exceed the length of the smaller
region), positive values enable gaps (ie 1=1bp gap).
-s Subtract 1 from end coordinates. Use for interbase.
Example: java -Xmx1500M -jar pathTo/Apps/IntersectKeyWithRegions -k /data/key.txt
-r /data/HitLists/
**************************************************************************************
**************************************************************************************
** Intersect Regions: May 2017 **
**************************************************************************************
IR intersects lists of regions (tab delimited: chrom start stop(inclusive)). Random
regions can also be used to calculate a p-value and fold enrichment.
-f First regions files, a single file, or a directory of files.
-s Second regions files, a single file, or a directory of files.
-g Max gap, defaults to 0. A max gap of 0 = regions must at least abut or overlap,
negative values force overlap (ie -1= 1bp overlap, be careful not to exceed the
length of the smaller region), positive values enable gaps (ie 1=1bp gap).
-e Score intersections where second regions are entirely contained by first regions.
-r Make random regions matched to the second regions file(s) and intersect with the
first. Enter either a bed file or full path directory that contains chromosome
specific interrogated regions files (ie named: chr1, chr2 ...: chrom start stop).
-c Match GC content of second regions file(s) when selecting random regions, rather
slow. Provide a full path directory text containing chromosome specific genomic
sequences.
-n Number of random region trials, defaults to 1000.
-w Write intersections and differences.
-x Write paired intersections.
-p Print length distribution histogram for gaps between first and closest second.
-q Parameters for histogram, comma delimited list, no spaces:
minimum length, maximum length, number of bins. Defaults to -100, 2400, 100.
Example: java -Xmx1500M -jar pathTo/Apps/IntersectRegions -f /data/miRNAs.txt
-s /data/DroshaLists/ -g 500 -n 10000 -r /data/InterrogatedRegions/
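
The max gap convention can be checked with a small Python sketch. It assumes inclusive
start and stop coordinates on the same chromosome, as described above. The function
names gap_between and intersects are illustrative, not taken from the app.

```python
def gap_between(start1, stop1, start2, stop2):
    """Bases separating two inclusive regions; negative values are bp of overlap."""
    return max(start1, start2) - min(stop1, stop2) - 1


def intersects(region1, region2, max_gap=0):
    """True if the regions are within max_gap of each other."""
    return gap_between(*region1, *region2) <= max_gap


print(gap_between(1, 10, 11, 20))                  # 0, the regions abut
print(gap_between(1, 10, 10, 20))                  # -1, a 1bp overlap
print(gap_between(1, 10, 12, 20))                  # 1, a 1bp gap
print(intersects((1, 10), (12, 20), max_gap=0))    # False
print(intersects((1, 10), (12, 20), max_gap=1))    # True
```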
**************************************************************************************
**************************************************************************************
** Joint Genotype VCF Parser: Oct 2018 **
**************************************************************************************
Splits and filters GATK joint genotyped multi sample vcf files. Use vt to decompose
the multi alts. See https://genome.sph.umich.edu/wiki/Vt#Decompose . Replaces the AF
and DP INFO fields with the sample level values.
Required Params:
-v Path to vt decomposed GATK joint genotyped multi sample vcf file, gz/zip OK.
~/BioApps/vt decompose -s jointGenotyped.vcf.gz -o jointGenotyped.decomp.vcf.gz
-s Path to a directory to save the split files.
Optional Params:
-q Minimum QUAL value, defaults to 20
-d Minimum read depth based on the AD sample values, defaults to 10
-a Minimum AF allele freq, defaults to 0.2
-g Minimum GT genotype quality, defaults to 20
-f Print debugging output to screen
Example: java -Xmx2G -jar pathToUSeq/Apps/JointGenotypeVCFParser -d 20 -a 0.25 -g 30 -f
-v jointGenotyped.decomp.vcf.gz -s SplitFilteredVcfs/ -q 30
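
A hedged Python sketch of the per sample filtering. It assumes a biallelic record
(after vt decompose) whose AD value is "refCount,altCount", that depth is the sum of
the two counts, and that AF is altCount over that depth. The function names are
hypothetical and the app's exact arithmetic may differ.

```python
def sample_af_dp(ad_field):
    """Derive (allele frequency, depth) from an AD string such as '30,10'."""
    ref_count, alt_count = (int(x) for x in ad_field.split(",")[:2])
    depth = ref_count + alt_count
    af = alt_count / depth if depth else 0.0
    return af, depth


def sample_passes(qual, ad_field, gq,
                  min_qual=20, min_depth=10, min_af=0.2, min_gq=20):
    """Apply the -q, -d, -a and -g style thresholds to one sample genotype."""
    af, depth = sample_af_dp(ad_field)
    return (qual >= min_qual and depth >= min_depth
            and af >= min_af and gq >= min_gq)


print(sample_af_dp("30,10"))            # (0.25, 40)
print(sample_passes(50, "30,10", 99))   # True
print(sample_passes(50, "6,2", 99))     # False, depth of 8 is below 10
```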
**************************************************************************************
**************************************************************************************
** Kegg Pathway Enrichment: Aug 2009 **
**************************************************************************************
KPE looks for overrepresentation of genes from a user's list in Kegg pathways using a
random permutation test. Several files are needed from http://www.genome.jp/kegg
Gene names must be in Ensembl Gene notation and begin with ENSG.
Options:
-e Full path file text for a KeggGeneIDs : EnsemblGeneIDs file (e.g. Human
ftp://ftp.genome.jp/pub/kegg/genes/organisms/hsa/hsa_ensembl-hsa.list)
-p Full path file text for a KeggPathwayIDs : TextDescription file (e.g. Human
ftp://ftp.genome.jp/pub/kegg/pathway/map_title.tab)
-g Full path file text for a KeggGeneIDs : KeggPathwayIDs file (e.g. Human
ftp://ftp.genome.jp/pub/kegg/pathway/organisms/hsa/hsa_gene_map.tab)
-a Full path file text for your list of all interrogated Ensembl genes (e.g. ENSG00...)
One gene per line.
-s Full path file text for your select gene list.
-n Number of random iterations, defaults to 10000
Example: java -Xmx1500M -jar pathTo/USeq/Apps/KeggPathwayEnrichment -e
/Kegg/hsa_ensembl-hsa.list -p /Kegg/map_title.tab -g /Kegg/hsa_gene_map.tab
-a /HCV/ensemblGenesWith20OrMoreReads.txt -s /HCV/upRegInHCV_Norm.txt
**************************************************************************************
**************************************************************************************
** Known Splice Junction Scanner : Sept 2017 **
**************************************************************************************
Scores known splice junctions using the MaxEntScan algorithms. See Yeo and
Burge 2004, http://www.ncbi.nlm.nih.gov/pubmed/15285897 for details.
Required Options:
-r Name of a gzipped bed file to use in saving the results, will overwrite.
-f Path to the reference fasta with associated xxx.fai index
-u UCSC RefFlat or RefSeq transcript (not merged genes) file, full path. See RefSeq
http://genome.ucsc.edu/cgi-bin/hgTables, (uniqueName1 name2(optional) chrom
strand txStart txEnd cdsStart cdsEnd exonCount (commaDelimited)exonStarts
(commaDelimited)exonEnds). Example: ENSG00000183888 C1orf64 chr1 + 16203317
16207889 16203385 16205428 2 16203317,16205000 16203467,16207889 .
-m Full path directory name containing the me2x3acc1-9, splice5sequences and me2x5
splice model files. See USeqDocumentation/splicemodels/ or
http://genes.mit.edu/burgelab/maxent/download/
Example: java -Xmx10G -jar ~/USeq/Apps/KnownSpliceJunctionScanner -f ~/Hg19/hg19.fasta
-r ~/exm2.bed.gz -m ~/USeq/Documentation/splicemodels -u ~/hg19EnsTrans.ucsc.zip
**************************************************************************************
**************************************************************************************
** Lofreq VCF Parser: May 2018 **
**************************************************************************************
Parses Lofreq vcf files with options for filtering for minimum QUAL, modifying the
FILTER field, removing non SNVs, and appending FORMAT info for downstream merging.
Required Params:
-v Full path file or directory containing xxx.vcf(.gz/.zip OK) file(s)
Optional Params:
-s File path to a directory for saving the modified vcfs
-m Minimum QUAL score, defaults to 0
-d Minimum DP read depth, defaults to 0
-t Minimum AF allele freq, defaults to 0
-r Minimum Alt count, defaults to 0
-i Remove non SNV records
-f Replace the FILTER field with '.'
-a Append FORMAT NORMAL TUMOR to #CHROM line and add empty columns to records
-n Mark variants failing thresholds FAIL instead of not printing
Example: java -jar pathToUSeq/Apps/LofreqVCFParser -v VCFFiles/ -m 32 -i -f -a
-s FilteredLofreqVcfs/ -r 3
**************************************************************************************
**************************************************************************************
** Maf Parser: Sept 2016 **
**************************************************************************************
Parses and manipulates variant maf files. Provide a path to the tabix executables
(https://github.com/samtools/htslib) for TQuery lookup and IGV compatibility.
Options:
-m Path to a xxx.maf file (xxx.maf.txt and .zip/.gz OK) or directory containing such.
-o Output directory, will overwrite.
-t To tabix index the output, provide a path to the dir containing bgzip and tabix
-c Convert M chroms to MT
Example: java -Xmx4G -jar pathTo/USeq/Apps/MafParser -m MafTCGAFiles/ -o Sorted/
-t ~/BioApps/HTSlib/1.3/bin/ -c
**************************************************************************************
**************************************************************************************
** Make Splice Junction Fasta: Nov 2010 **
**************************************************************************************
DEPRECATED, don't use! See MakeTranscriptome app!
MSJF creates a multi fasta file containing sequences representing all possible linear
splice junctions. The header on each fasta is the chr_endPosExonA_startPosExonB. The
length of sequence collected from each junction is 2x the radius. A word of warning,
be very careful about the coordinate system used in the gene table to define the
start and stop of exons. UCSC uses interbase and this is assumed in this app. Check
a few of the junctions to be sure correct splices were made. All junction sequences
are from the top/ plus strand of the genome, they are not reverse complemented. Exon
sequence shorter than the radius will be appended with Ns.
Options:
-f Fasta file directory, should contain chromosome specific xxx.fasta files.
-u UCSC gene table file, full path. See, http://genome.ucsc.edu/cgi-bin/hgTables
-s Sequence length radius.
-r Results fasta file, full path.
Example: java -Xmx1500M -jar pathTo/USeq/Apps/MakeSpliceJunctionFasta -s 32
-f /Genomes/Hg18/Fastas/ -u /Anno/Hg18/ucscKnownGenes.txt -r
/Genomes/Hg18/Fastas/hg18_32_splices.fasta
************************************************************************************
**************************************************************************************
** Make Transcriptome: June 2012 **
**************************************************************************************
Takes a UCSC ref flat table of transcripts and generates two multi fasta files of
transcripts and splices (known and theoretical). All possible unique splice junctions
are created given the exons from each gene's transcripts. In some cases this is
computationally intractable and theoretical splices from these are not complete.
Read through occurs with small exons to the next up or downstream so keep the sequence
length radius to a minimum to reduce the number of junctions. Overlapping exons are
assumed to be mutually exclusive. All sequence is from the plus genomic strand, no
reverse complementation. Interbase coordinates. This app can take a very long time to
run. Break up gene table by chromosome and run on a cluster.
To incorporate additional splice-junctions, add a new annotation line containing two
exons representing the junction to the table. If needed, set the -s option to skip
duplicates.
Options:
-f Fasta file directory, one per chromosome (e.g. chrX.fasta or chrX.fa, .gz/.zip OK)
-u UCSC RefFlat gene table file, full path. See,
http://genome.ucsc.edu/cgi-bin/hgTables, (geneName transcriptName chrom strand
txStart txEnd cdsStart cdsEnd exonCount (commaDelimited)exonStarts
(commaDelimited)exonEnds). Example: ENSG00000183888 ENST00000329454 chr1 +
16203317 16207889 16203385 16205428 2 16203317,16205000 16203467,16207889 .
-r Sequence length radius. Set to the read length - 4bp.
-n Max number splices per transcript, defaults to 100000.
-m Max minutes to process each gene's splices before interrupting, defaults to 10.
-s Skip subsequent occurrences of splices with the same coordinates. Memory intensive.
Example: java -Xmx4G -jar pathTo/USeq/Apps/MakeTranscriptome -f /Genomes/Hg18/Fastas/
-u /Anno/Hg18/ensemblGenes.txt.ucsc -r 46 -s
************************************************************************************
**************************************************************************************
** Mask Exons In Fasta Files: June 2011 **
**************************************************************************************
Replaces the exonic sequence with Ns.
Options:
-f Fasta file directory, one per chromosome (e.g. chrX.fasta or chrX.fa, .gz/.zip OK)
-u UCSC RefFlat gene table file, full path. See,
http://genome.ucsc.edu/cgi-bin/hgTables, (geneName transcriptName chrom strand
txStart txEnd cdsStart cdsEnd exonCount (commaDelimited)exonStarts
(commaDelimited)exonEnds). Example: ENSG00000183888 ENST00000329454 chr1 +
16203317 16207889 16203385 16205428 2 16203317,16205000 16203467,16207889 .
-s Save directory, full path.
Example: java -Xmx4G -jar pathTo/USeq/Apps/MaskExonsInFastaFiles -f
/Genomes/Hg18/Fastas/ -u /Anno/Hg18/ensemblTranscripts.txt.ucsc -s
/Genomes/Hg18/MaskedFastas/
************************************************************************************
**************************************************************************************
** Mask Regions In Fasta Files: Aug 2016 **
**************************************************************************************
Replaces the region (or non region) sequence with Ns. Interbase coordinates.
Options:
-f Fasta file directory, one per chromosome (e.g. chrX.fasta or chrX.fa, .gz/.zip OK)
-b Bed file of regions to mask.
-s Save directory, full path.
-r Mask sequence not in regions, reverse mask.
Example: java -Xmx4G -jar pathTo/USeq/Apps/MaskRegionsInFastaFiles -f
/Genomes/Hg18/Fastas/ -b /Anno/Hg18/badRegions.bed -s
/Genomes/Hg18/MaskedFastas/
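
The masking itself is simple to illustrate in Python. This sketch works on a single
in-memory sequence with interbase (start, stop) regions. The name mask_sequence is
illustrative and file handling is omitted.

```python
def mask_sequence(seq, regions, reverse=False):
    """Replace bases inside the regions with N, or outside them if reverse is True."""
    in_region = [False] * len(seq)
    for start, stop in regions:
        # Interbase coordinates: start is included, stop is excluded.
        for i in range(max(start, 0), min(stop, len(seq))):
            in_region[i] = True
    return "".join("N" if flag != reverse else base
                   for base, flag in zip(seq, in_region))


print(mask_sequence("ACGTACGTAC", [(2, 5)]))                # ACNNNCGTAC
print(mask_sequence("ACGTACGTAC", [(2, 5)], reverse=True))  # NNGTANNNNN
```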
************************************************************************************
**************************************************************************************
** MatchMates: February 2019 **
**************************************************************************************
This app attaches mates of aligned first of pair reads to the attributes and modifies
the start position to enable sorting by unclipped start. Call Consensus to cluster and
collapse alignments with related molecular barcodes.
Options:
-s (Required) Provide a directory path for saving the modified alignments.
-b Path to a query name sorted bam/sam alignment file, defaults to reading from STDIN.
-j Write summary stats in json format to this file.
Example: myAligner | java -Xmx2G -jar pathTo/USeq/Apps/MatchMates -s ReadyForConsensus
**************************************************************************************
**************************************************************************************
** MaxEntScanScore3: Nov 2013 **
**************************************************************************************
Implementation of Max Ent Scan's score3 algorithm for human splice site detection. See
Yeo and Burge 2004, http://www.ncbi.nlm.nih.gov/pubmed/15285897
Options:
-s Full path directory name containing the me2x3acc1-9 splice model files. See
USeq/Documentation/ or http://genes.mit.edu/burgelab/maxent/download/
-t Full path file name for 23mer test sequences, GATCgatc only, one per line. Fasta OK.
Example: java -Xmx10G -jar pathTo/USeq/Apps/MaxEntScanScore3 -s ~/MES/splicemodels -t
~/MES/seqsToTest.fasta
**************************************************************************************
**************************************************************************************
** MaxEntScanScore5: Nov 2013 **
**************************************************************************************
Implementation of Max Ent Scan's score5 algorithm for human splice site detection. See
Yeo and Burge 2004, http://www.ncbi.nlm.nih.gov/pubmed/15285897
Options:
-s Full path directory containing the splice5sequences and me2x5 splice model files.
See USeq/Documentation/ or http://genes.mit.edu/burgelab/maxent/download/
-t Full path file name for 9mer test sequences, GATCgatc only, one per line. Fasta OK.
Example: java -Xmx10G -jar pathTo/USeq/Apps/MaxEntScanScore5 -s ~/MES/splicemodels -t
~/MES/seqsToTest.fasta
**************************************************************************************
**************************************************************************************
** Merge Adjacent Regions: Oct 2018 **
**************************************************************************************
Merges regions within a max bp gap and tracks the number merged. Regions must not
overlap. Best run the MergeRegions app if in doubt.
Options:
-b Path to a bed file of non overlapping regions, xxx.gz/.zip OK.
-r Path for saving the merged xxx.bed.gz file.
-m Max bp gap, defaults to 5000.
Example: java -Xmx4G -jar pathTo/USeq/Apps/MergeAdjacentRegions -b myRegions.bed.zip
-m 1000 -r mergedRegions.bed.gz
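
A minimal Python sketch of gap based merging with a running count. It assumes
interbase (start, stop) regions from a single chromosome. The function name
merge_adjacent is hypothetical and the output layout is not the app's bed format.

```python
def merge_adjacent(regions, max_gap=5000):
    """Merge regions separated by at most max_gap bp; returns (start, stop, count)."""
    merged = []
    for start, stop in sorted(regions):
        if merged and start - merged[-1][1] <= max_gap:
            prev_start, prev_stop, count = merged[-1]
            merged[-1] = (prev_start, max(prev_stop, stop), count + 1)
        else:
            merged.append((start, stop, 1))
    return merged


regions = [(100, 200), (900, 1000), (1500, 1600), (5000, 5100)]
print(merge_adjacent(regions, max_gap=1000))  # [(100, 1600, 3), (5000, 5100, 1)]
```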
**************************************************************************************
**************************************************************************************
** MergeExonMetrics : June 2013 **
**************************************************************************************
This app simply merges the output from several metrics html files.
Required:
-f Directory containing metrics dictionary files and an image directory
-o Name of the combined metrics file
Example: java -Xmx1500M -jar pathTo/USeq/Apps/MergeExonMetrics -f metrics -o 9908_metrics
**************************************************************************************
**************************************************************************************
** Merge Overlapping Genes: Feb 2015 **
**************************************************************************************
Merges transcript models that share exonic bps. Maximizes exons, minimizes introns.
Assumes interbase coordinates.
Options:
-u Path to a UCSC RefFlat or RefSeq gene table file or directory with such to merge.
See http://genome.ucsc.edu/cgi-bin/hgTables, (geneName name2(optional) chrom
strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds).
-r Path for results file.
-m Minimum fraction exonic bp overlap for merging, defaults to 0.05
Example: java -Xmx4G -jar pathTo/USeq/Apps/MergeOverlappingGenes -u
/CufflinkTranscripts/zv9Genes.ucsc.gz -m 0.25 -r merged.ucsc
**************************************************************************************
**************************************************************************************
** Merge Paired Alignments: Oct 2018 **
**************************************************************************************
Merges proper paired alignments that pass a variety of checks and thresholds. Only
unambiguous pairs will be merged. Increases base calling accuracy in overlap and helps
avoid non-independent variant observations and other double counting issues. Identical
overlapping bases are assigned the higher quality scores. Disagreements are resolved
toward the higher quality base. If too close in quality, then the quality is set to 0.
Options:
-b Path to a coordinate sorted xxx.bam file containing paired alignments.
-d Path to a directory for saving the results.
Default Options:
-s Save merged xxx.sam.gz alignments instead of binary ChromData. Either works
in Sam2USeq for read coverage analysis, the ChromData is much faster.
-e Only process and save alignments overlapping this bed format region file.
-u Remove all alignments marked as duplicates, defaults to keeping.
-a Maximum alignment score (AS:i: tag). Defaults to 300, smaller numbers are more
stringent for novoalign where each mismatch is ~30pts.
-q Minimum mapping quality score, defaults to 0, larger numbers are more stringent.
-r The second paired alignment's strand has been reversed. Defaults to not reversed.
-i Maximum acceptable base pair distance for merging, defaults to 5000.
-m Don't cross check read mate coordinates, needed for merging repeat matches. Defaults
to checking.
-o Merge all proper paired alignments. Defaults to only merging those that overlap.
-p Don't print detailed paired alignment statistics and insert size histogram.
-t Number concurrent threads to run, defaults to the max available to the jvm.
-j Write summary stats in json format to this file.
Example: java -Xmx20G -jar pathToUSeq/Apps/MergePairedBamAlignments -f /Bams/ms.bam
-p -s /Bams/MergedPairs/ms.mergedPairs.sam.gz -d 10000
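
The base resolution rules in the description can be sketched in Python as follows.
The min_diff cutoff for "too close in quality" and the choice to report the first
read's base when the call is ambiguous are assumptions. The text above does not give
the actual cutoff or tie behavior, and merge_base is a hypothetical name.

```python
def merge_base(base1, qual1, base2, qual2, min_diff=5):
    """Resolve one overlapping position from the two reads of a pair."""
    if base1 == base2:
        # Identical bases take the higher of the two quality scores.
        return base1, max(qual1, qual2)
    if abs(qual1 - qual2) < min_diff:
        # Assumption: keep the first base as a placeholder, flagged with quality 0.
        return base1, 0
    # Disagreement with a clear winner: take the higher quality base.
    return (base1, qual1) if qual1 > qual2 else (base2, qual2)


print(merge_base("A", 30, "A", 20))  # ('A', 30)
print(merge_base("A", 35, "C", 10))  # ('A', 35)
print(merge_base("A", 20, "C", 22))  # ('A', 0)
```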
**************************************************************************************
**************************************************************************************
** Merge Point Data: Jan 2011 **
**************************************************************************************
Efficiently merges PointData, collapsing by position and possibly strand. Identical
position scores are either summed or converted into counts. DO NOT use this app on
PointData that will be part of a primary chIP/RNA-seq analysis. It is only for
bis-seq and visualization purposes.
Options:
-p Point Data directories, full path, comma delimited. Should contain chromosome
specific xxx.bar.zip or xxx_-_.bar files. Alternatively, provide one directory
containing multiple PointData directories.
-s Save directory, full path.
-c Don't replace scores with hit count, just sum existing scores.
-m Merge strands
Example: java -Xmx1500M -jar pathTo/USeq/Apps/MergePointData -p
/Data/Ets1Rep1/,/Data/Ets1Rep2/ -s /Data/MergedEts1 -m
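
A short Python sketch of collapsing by position. It assumes each dataset is a list of
(position, score) pairs for one chromosome with strands already merged. The name
merge_point_data is illustrative and bar file reading is omitted.

```python
from collections import defaultdict


def merge_point_data(datasets, sum_scores=False):
    """Collapse points by position, reporting hit counts or summed scores."""
    merged = defaultdict(float)
    for points in datasets:
        for position, score in points:
            # Default behavior counts hits; sum_scores mirrors the -c option.
            merged[position] += score if sum_scores else 1
    return sorted(merged.items())


rep1 = [(100, 1.5), (200, 2.0)]
rep2 = [(100, 0.5), (300, 1.0)]
print(merge_point_data([rep1, rep2]))                   # [(100, 2.0), (200, 1.0), (300, 1.0)]
print(merge_point_data([rep1, rep2], sum_scores=True))  # [(100, 2.0), (200, 2.0), (300, 1.0)]
```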
**************************************************************************************
**************************************************************************************
** Merge Regions: July 2017 **
**************************************************************************************
Flattens tab delimited bed files (chr start stop ...). Assumes interbase coordinates.
Options:
-d Directory containing bed files.
Example: java -Xmx4000M -jar pathTo/Apps/MergeRegions -d /Anno/TilingDesign/
************************************************************************************
**************************************************************************************
** Merge Sams: May 2017 **
**************************************************************************************
Merges sam and bam files. Adds a consensus header if one is not provided. These may
not work with GATK or Picard downstream apps, good for USeq.
Options:
-d The full path to a directory containing xxx.bam or xxx.sam.gz files to merge.
Default Options:
-s Save file, must end in xxx.bam, defaults to merge.bam in -d.
-a Maximum alignment score. Defaults to 300, smaller numbers are more stringent.
Approx 30pts per mismatch.
-m Minimum mapping quality score, defaults to 0 (no filtering), larger numbers are
more stringent. Set to 13 or more to require near unique alignments. DO NOT set
for alignments parsed by the SamTranscriptomeParser!
-f Save reads failing filters, defaults to tossing them.
-h Full path to a txt file containing a sam header, defaults to autogenerating the
header from the sam/bam headers.
-t Don't delete temp xxx.sam.gz file.
-p Add program arguments to header, defaults to deleting. Note, duplicates cause
Picard apps to fail.
-q Quiet, print only errors.
Example: java -Xmx1500M -jar pathToUSeq/Apps/MergeSams -d /Novo/Run7/
-m 20 -a 120
**************************************************************************************
**************************************************************************************
** Merge UCSC Gene Table: Aug 2018 **
**************************************************************************************
Merges transcript models that share the same gene name (in column 0). Maximizes exons,
minimizes introns. Assumes interbase coordinates.
Options:
-u UCSC RefFlat or RefSeq gene table file, full path. See,
http://genome.ucsc.edu/cgi-bin/hgTables, (geneName name2(optional) chrom strand
txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds).
Example: java -Xmx4G -jar pathTo/USeq/Apps/MergeUCSCGeneTable -u
/data/zv8EnsemblGenes.ucsc.gz
**************************************************************************************
**************************************************************************************
** Methylation Array Scanner: March 2014 **
**************************************************************************************
MAS takes paired or non-paired sample PointData representing beta values (0-1) from
arrays and scores regions with enriched/reduced signal using a sliding window
approach. A B&H corrected Wilcoxon signed rank (or rank sum test for non-paired),
pseudo median of the log2(treat/control) ratios (or log2(pseT/pseC) for non-paired),
and permutation test FDR are calculated for each window. Use the EnrichedRegionMaker
to identify enriched and reduced regions by picking thresholds (e.g. -i 0,1 -s 0.2,13).
MAS generates several data tracks for visualization in IGB including paired sample bp
log2 ratios, window level Wilcoxon FDRs, and window level pseudomedian log2 ratios.
Note, non-paired analyses are very underpowered and require >30 obs/window to see
any significant FDRs.
Required Options:
-s Path to a directory for saving the results.
-d Path to a directory containing individual sample PointData directories, each of
which should contain chromosome split bar files (e.g. chr1.bar, chr2.bar, ...)
-t Names of the treatment sample directories in -d, comma delimited, no spaces.
-c Ditto but for the control samples, the ordering is critical and describes how to
pair the samples for a paired analysis.
Advanced Options:
-n Run a non-paired analysis where t and c are treated as groups and pooled.
-w Window size, defaults to 1000.
-o Minimum number observations in window, defaults to 10.
-p Minimum pseudomedian log2 ratio for estimating the permutation FDR, defaults to 0.2
-r Number permutations, defaults to 5
-e Run T-Test instead of Wilcoxon rank sum test for non-paired samples.
-v Save coefficient of variation tracks
Example: java -Xmx4G -jar pathTo/USeq/Apps/MethylationArrayScanner -s ~/MAS/Res
-d ~/MAS/Bar/ -t Early1,Early2,Early3 -c Late1,Late2,Late3
-w 1500
**************************************************************************************
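The pseudomedian of the paired log2 ratios can be sketched as the median of all pairwise (Walsh) averages, i.e. the Hodges-Lehmann estimator. This is a minimal illustration, not the MAS code: the function names are hypothetical, and dropping pairs that contain a zero beta value is an assumption made here to keep the logarithm defined.

```python
import math
from itertools import combinations_with_replacement
from statistics import median

def paired_log2_ratios(treatment, control):
    """log2(treat/control) for each pair, skipping pairs with a zero value."""
    return [math.log2(t / c) for t, c in zip(treatment, control) if t > 0 and c > 0]

def pseudomedian(values):
    """Median of all Walsh averages (x_i + x_j) / 2 with i <= j."""
    walsh = [(a + b) / 2.0 for a, b in combinations_with_replacement(values, 2)]
    return median(walsh)

ratios = paired_log2_ratios([0.8, 0.4, 0.0], [0.4, 0.4, 0.5])
print(ratios, pseudomedian(ratios))
```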
**************************************************************************************
** Methylation Array Defined Region Scanner: July 2013 **
**************************************************************************************
MADRS takes paired sample PointData representing beta values (0-1) from arrays and
a list of regions to score for differential methylation using a B&H corrected Wilcoxon
signed rank test and pseudo median of the paired log2(treat/control) ratios. Pairs
containing a zero value are ignored. It generates a spreadsheet of statistics for each
region. If a non-paired analysis is selected, a Wilcoxon rank sum test and
log2(pseT/pseC) are calculated on each region. Note this is a very underpowered test
requiring >30 observations to see any significant FDRs.
Required Options:
-b A bed file of regions to score (tab delimited: chr start stop ...)
-d Path to a directory containing individual sample PointData directories, each of
which should contain chromosome split bar files (e.g. chr1.bar, chr2.bar, ...)
-t Names of the treatment sample directories in -d, comma delimited, no spaces.
-c Ditto but for the control samples, the ordering is critical and describes how to
pair the samples for a paired analysis.
-o Minimum number paired observations in window, defaults to 3.
-z Skip printing regions with less than minimum observations.
-n Run a non-paired analysis where t and c are treated as groups and pooled. Uneven
numbers of t and c are allowed.
Example: java -Xmx4G -jar pathTo/USeq/Apps/MethylationArrayDefinedRegionScanner
-v H_sapiens_Feb_2009 -d ~/MASS/Bar/ -t Early1,Early2,Early3
-c Late1,Late2,Late3
**************************************************************************************
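For readers unfamiliar with the B&H (Benjamini and Hochberg) correction mentioned above, here is a minimal sketch of the standard step-up adjustment applied to a list of per-region p-values. The function name is hypothetical and the code is a textbook version, not taken from MADRS.

```python
def benjamini_hochberg(pvalues):
    """Return B&H adjusted p-values in the same order as the input."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```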
**************************************************************************************
** Microsatellite Counter: Jan 2014 **
**************************************************************************************
MicrosatelliteCounter identifies and counts microsatellite repeats in MiSeq fastq
files. This iteration of the software requires you to specify the primers used in the
sequencing project. It will automatically find the most likely microsatellite by
looking at all possible repeats of length 1 through length 10 and finding the longest
repeat by length, not repeat unit. There are two output files generated, the first
lists primer statistics (currently only reads with both primers are used), the
second lists repeat data. Note that the input file should contain fastq sequences that
were merged using a program like PEAR.
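The search for the longest repeat by total length, rather than by repeat unit, might look like the brute-force sketch below. It is an assumption-laden illustration and not the MicrosatelliteCounter algorithm: the name longest_repeat, the min_copies parameter, and the tie-breaking in favor of the shorter unit are all choices made for this example.

```python
def longest_repeat(seq, max_unit=10, min_copies=2):
    """Return (unit, copies, start) of the tandem repeat spanning the most bases."""
    best = ("", 0, -1)
    best_len = 0
    for unit_len in range(1, max_unit + 1):
        for start in range(len(seq) - unit_len + 1):
            unit = seq[start:start + unit_len]
            copies = 1
            pos = start + unit_len
            # Count consecutive copies of the unit.
            while seq[pos:pos + unit_len] == unit:
                copies += 1
                pos += unit_len
            span = copies * unit_len
            # Strict '>' keeps the shortest unit when spans tie.
            if copies >= min_copies and span > best_len:
                best, best_len = (unit, copies, start), span
    return best

read = "ACGT" + "CA" * 6 + "TTG"
print(longest_repeat(read))
```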
Required Arguments:
-f Merged fastq file. Path to merged fastq file. We currently suggest using PEAR to
merge fastq sequences.
-p Primer file. Path to primer reference file. This file lists each primer used in
the sequencing project in the format NAME
**************************************************************************************
** MiRNA Correlator: March 2014 **
**************************************************************************************
Generates a spreadsheet to use in comparing changing miRNA levels to changes in gene
expression.
Options:
-r Results file.
-a All miRNA name file (single column of miRNA names).
-m MiRNA data (two columns: miRNA name, miRNA log2Rto).
-t Gene target to miRNA data (two columns: gene target name, miRNA name).
-e Gene expression data (three columns: gene name, log2Rto, FDR).
-f Don't print the gene expression FDR value in the spreadsheet.
Example: java -Xmx4G -jar pathTo/USeq/Apps/MiRNACorrelator -m miRNA_CLvsMOR.txt -a
allMiRNANamesNoPs.txt -t targetGene2MiRNA.txt -e geneExp_CLvsMOR.txt -r results.xls
**************************************************************************************
**************************************************************************************
** MpileUp Parser: Sept 2015 **
**************************************************************************************
Parses a SAMTools mpileup output file for non reference bases generating bed files and
data tracks with information related to error prone bases. Multiple samples are merged.
Options:
-p Path to a mpileup file (.gz or .zip OK, use 'samtools mpileup -Q 20 -A -B *bam').
-v Versioned Genome (e.g. H_sapiens_Mar_2006), see UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases.
-s Save directory, full path, defaults to pileup file directory.
-r Minimum read coverage, defaults to 15.
-e Max nonRef base fraction, defaults to 0.05
-w Window size, defaults to 50
-f Max fraction failing bp in window, defaults to 0.05
Example: java -Xmx4G -jar pathTo/USeq/Apps/MpileupParser -p /Pileups/N2.mpileup.gz -v
H_sapiens_Feb_2009 -e 0.1 -w 25
**************************************************************************************
**************************************************************************************
** Mpileup Randomizer: May 2018 **
**************************************************************************************
Upon finding a gap in the coverage, the sample order is randomized and maintained. Use
this app to 'de-identify' a multi sample mpileup file while maintaining INDEL blocks.
Required Options:
-m Path to a Samtools mpileup file (gz/zip OK).
Default Options:
-r Minimum read depth to pass a sample, default 10
-s Minimum number of samples that must pass to save line, default 3
-g Minimum gap, defaults to 125
Example: java -Xmx4G -jar pathToUSeq/Apps/MpileupRandomizer -m normExo.mpileup.gz
-r 20 -s 4
**************************************************************************************
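The gap-triggered reordering can be sketched as follows. This is a simplified illustration, not the MpileupRandomizer source: it assumes the usual mpileup layout of three leading columns (chrom, position, reference) followed by three columns per sample, it omits the -r and -s depth filters, and the function name and seed parameter are hypothetical.

```python
import random

def randomize_samples(lines, min_gap=125, seed=None):
    """Yield mpileup lines with the per-sample column triplets reordered.

    The order is reshuffled whenever the chromosome changes or the distance
    to the previous position is at least min_gap, and is kept otherwise.
    """
    rng = random.Random(seed)
    order, prev_chrom, prev_pos = None, None, None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        samples = [fields[i:i + 3] for i in range(3, len(fields), 3)]
        if order is None or chrom != prev_chrom or pos - prev_pos >= min_gap:
            order = list(range(len(samples)))
            rng.shuffle(order)
        prev_chrom, prev_pos = chrom, pos
        shuffled = [col for i in order for col in samples[i]]
        yield "\t".join(fields[:3] + shuffled)

demo = ["chr1\t100\tA\t10\t...\tIII\t12\t,,,\tJJJ",
        "chr1\t101\tC\t11\t..\tII\t13\t,,\tJJ"]
for out in randomize_samples(demo, seed=1):
    print(out)
```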
**************************************************************************************
** Multiple Replica Scan Seqs: May 2014 **
**************************************************************************************
MRSS uses a sliding window and Anders' DESeq negative binomial pvalue -> Benjamini &
Hochberg AdjP statistics to identify enriched and reduced regions in a genome. Both
treatment and control PointData sets are required, one or more biological replicas.
MRSS generates window level differential count tracks for the AdjP and normalized
log2Ratio as well as a binary window object xxx.swi file for downstream use by the
EnrichedRegionMaker. MRSS also makes use of DESeq's variance corrected count data to
cluster your biological replicas. Given R's poor memory management, running DESeq
requires lots of RAM, 64bit R, and 1-3 hrs.
Options:
-s Save directory, full path.
-t Treatment replica PointData directories, full path, comma delimited, no spaces,
one per biological replica. Use the PointDataManipulator app to merge same
replica and technical replica datasets. Each directory should contain stranded
chromosome specific xxx_-/+_.bar.zip files. Alternatively, provide one
directory that contains multiple biological replica PointData directories.
-c Control replica PointData directories, ditto.
-r Full path to 64bit R loaded with DESeq library, defaults to '/usr/bin/R' file, see
http://www-huber.embl.de/users/anders/DESeq/ . Type 'library(DESeq)' in
an R terminal to see if it is installed.
-p Peak shift, average distance between + and - strand peaks for chIP-Seq data, see
PeakShiftFinder or set it to 100bp. For RNA-Seq set it to 0. It will be used
to shift the PointData by 1/2 the peak shift.
-w Window size, defaults to the peak shift. For chIP-Seq data, a good alternative
is the peak shift plus the standard deviation, see the PeakShiftFinder app.
For RNA-Seq data, set this to 100-250.
Advanced Options:
-m Minimum number of reads in a window, defaults to 15
-d Don't delete temp files
Example: java -Xmx4G -jar pathTo/USeq/Apps/MultipleReplicaScanSeqs -t
/Data/PolIIRep1/,/Data/PolIIRep2/ -c /Data/Input1/,/Data/Input2/ -s
/Data/PolIIResults/ -p 150 -w 250 -b
**************************************************************************************
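The half peak shift applied to the PointData (see -p) can be pictured with the short sketch below. The direction of the shift (plus strand positions moved up, minus strand positions moved down, so both converge on the binding site) is the usual chIP-Seq convention and is an assumption here, as are the function name and the clamping at zero.

```python
def shift_positions(positions, strand, peak_shift):
    """Shift stranded read positions by half the peak shift toward the peak center."""
    half = peak_shift // 2
    if strand == "+":
        return [p + half for p in positions]
    # Minus strand reads move toward lower coordinates, clamped at 0.
    return [max(0, p - half) for p in positions]

print(shift_positions([100, 200], "+", 150))
print(shift_positions([50, 200], "-", 150))
```

With a peak shift of 0, as suggested for RNA-Seq, the positions are returned unchanged.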
**************************************************************************************
** Multi Sample VCF Filter : July 2015 **
**************************************************************************************
Filters a vcf file containing multiple sample records into those that pass or fail the
tests below. This works with VCFv4.1 files created by the GATK package. Note, the
records are not modified. If the number of records in the VCF file is greater than
500000, the VCF file is intersected in chunks.
Required:
-v Full path to a sorted single or multi sample vcf file (xxx.vcf/xxx.vcf.gz). Note,
Java often fails to parse tabix compressed vcf files. Best to uncompress.
Optional:
-p Full path to an output VCF (xxx.vcf or xxx.vcf.gz). Specifying xxx.vcf.gz will
compress and index the VCF using tabix (set -t too). Defaults to input_Filt.vcf
-f Print out failing records, defaults to printing those passing the filters.
-a Fail records where no sample passes the sample thresholds.
-i Fail records where the original FILTER field is not 'PASS' or '.'
-c Fail records that don't intersect the regions in this bed file, full path.
-b Filter by genotype flags. -n, -u and -l must be set.
-n Sample names ordered by category.
-u Number of samples in each category.
-l Requirement flags for each category. All samples that pass the specified filters
must meet the flag requirements, or the variant isn't reported. At least one
sample in each group must pass the specified filters, or the variant isn't
reported.
a) 'W' : homozygous common
b) 'H' : heterozygous
c) 'M' : homozygous rare
d) '-W' : not homozygous common
e) '-H' : not heterozygous
f) '-M' : not homozygous rare
-e Strict genotype matching. If this is selected, records with no-call samples
or samples falling below either minimum sample genotype quality (-g) or
minimum sample read depth (-r) won't be reported. Only samples listed in (-n)
will be checked.
-d Minimum record QUAL score, defaults to 0, recommend >=20
-g Minimum sample genotype quality GQ, defaults to 0, recommend >= 20
-r Minimum sample read depth DP, defaults to 0, recommend >=10
-x Maximum sample read depth DP, defaults to unlimited
-y Minimum sample allele count read depth AD or DP4, defaults to 0
-s Print sample names and exit.
-t Path to tabix.
Example: java -Xmx10G -jar pathTo/USeq/Apps/MultiSampleVCFFilter
-v DEMO.passing.vcf -p DEMO.intersection.vcf -c exomeV4.bed -b
-n SRR504516,SRR776598,SRR504515,SRR504517,SRR504483 -u 2,2,1 -l M,H,-M
**************************************************************************************
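The interplay of -n, -u, and -l can be illustrated with a small sketch. It is not the MultiSampleVCFFilter code: the function names are hypothetical, 'homozygous common' is assumed to mean homozygous for the reference allele (GT 0/0), and no-call genotypes are assumed to fail, which resembles the strict mode (-e) rather than the default.

```python
def genotype_class(gt):
    """Map a diploid GT string to 'W', 'H', 'M', or None for a no-call."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return None
    if alleles[0] != alleles[1]:
        return "H"
    return "W" if alleles[0] == "0" else "M"

def meets_flag(gt, flag):
    """Check one genotype against a flag such as 'M' or '-M'."""
    cls = genotype_class(gt)
    if cls is None:
        return False
    if flag.startswith("-"):
        return cls != flag[1:]
    return cls == flag

def groups_pass(genotypes, group_sizes, flags):
    """Genotypes are ordered by category (-n); sizes come from -u, flags from -l."""
    start = 0
    for size, flag in zip(group_sizes, flags):
        if not all(meets_flag(gt, flag) for gt in genotypes[start:start + size]):
            return False
        start += size
    return True

# Mirrors the shape of the example above: -u 2,2,1 -l M,H,-M
print(groups_pass(["1/1", "1/1", "0/1", "0/1", "0/0"], [2, 2, 1], ["M", "H", "-M"]))
```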
**************************************************************************************
** Mutect VCF Parser: May 2018 **
**************************************************************************************
Parses Mutect2 VCF files, filtering for read depth, allele frequency diff ratio, etc.
Inserts AF and DP for the tumor sample into the INFO field. Changes the sample
order to Normal and Tumor and updates the #CHROM line. Replaces the QUAL with TLOD.
Options:
-v Full path file or directory containing xxx.vcf(.gz/.zip OK) file(s).
-f Directory to save the parsed files, defaults to the parent dir of the first vcf.
-t Minimum tumor allele frequency (AF), defaults to 0.
-n Maximum normal AF, defaults to 1.
-u Minimum tumor alignment depth, defaults to 0.
-a Minimum tumor alt count, defaults to 0.
-o Minimum normal alignment depth, defaults to 0.
-d Minimum T-N AF difference, defaults to 0.
-r Minimum T/N AF ratio, defaults to 0.
-t Minimum TLOD score, defaults to 0.
-p Remove non PASS filter field records.
-s Print spreadsheet variant summary.
Example: java -jar pathToUSeq/Apps/MutectVCFParser -v /VCFFiles/ -t 0.05 -n 0.5 -u 100
-o 20 -d 0.05 -r 2 -a 3
**************************************************************************************
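How the thresholds combine can be sketched as a single predicate. This is an illustration under stated assumptions, not the parser's code: the function and parameter names are hypothetical, values are assumed to be already extracted from the VCF record, and a normal AF of zero is treated as giving an unbounded T/N ratio.

```python
def passes_filters(tumor_af, normal_af, tumor_dp, normal_dp, tumor_alt,
                   min_tumor_af=0.0, max_normal_af=1.0, min_tumor_dp=0,
                   min_normal_dp=0, min_alt=0, min_diff=0.0, min_ratio=0.0):
    """Return True when a variant clears every threshold."""
    if tumor_af < min_tumor_af or normal_af > max_normal_af:
        return False
    if tumor_dp < min_tumor_dp or normal_dp < min_normal_dp:
        return False
    if tumor_alt < min_alt:
        return False
    if tumor_af - normal_af < min_diff:
        return False
    # Assumption: a zero normal AF is treated as an unbounded ratio.
    ratio = float("inf") if normal_af == 0 else tumor_af / normal_af
    return ratio >= min_ratio

# Thresholds taken from the example command above.
limits = dict(min_tumor_af=0.05, max_normal_af=0.5, min_tumor_dp=100,
              min_normal_dp=20, min_diff=0.05, min_ratio=2, min_alt=3)
print(passes_filters(0.12, 0.01, 250, 60, 30, **limits))
print(passes_filters(0.06, 0.04, 250, 60, 15, **limits))
```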
**************************************************************************************
** Mutect 4 VCF Parser: Oct 2018 **
**************************************************************************************
Parses Mutect2 VCF files from the GATK 4.0+ package, filtering for read depth, allele
frequency diff ratio, etc. Inserts AF and DP for the tumor sample into the INFO
field. Replaces the QUAL with TLOD.
Options:
-v Full path file or directory containing xxx.vcf(.gz/.zip OK) file(s). It is REQUIRED
to run 'vt decompose -s ' on these first. Recommend running decompose_blocksub
too. See https://github.com/atks/vt
-f Directory to save the parsed files, defaults to the parent dir of the first vcf.
-t Minimum tumor allele frequency (AF), defaults to 0.
-n Maximum normal AF, defaults to 1.
-u Minimum tumor alignment depth, defaults to 0.
-a Minimum tumor alt count, defaults to 0.
-o Minimum normal alignment depth, defaults to 0.
-d Minimum T-N AF difference, defaults to 0.
-r Minimum T/N AF ratio, defaults to 0.
-t Minimum TLOD score, defaults to 0.
-p Remove non PASS filter field records.
Example: java -jar pathToUSeq/Apps/Mutect4VCFParser -v /VCFFiles/ -t 0.05 -n 0.5 -u 100
-o 20 -d 0.05 -r 2 -a 3
**************************************************************************************
**************************************************************************************
** Non Reference Region Maker: Jan 2018 **
**************************************************************************************
NRRM scans a single sample mpileup file looking for non reference base pairs. If these
pass read depth, allele frequency, and non ref base count thresholds, the base is
written to a bed file. BPs with insertions are saved as a 2 BP region. Run MergeRegions
or MergeAdjacentRegions to join proximal non ref BPs.
Options:
-m Provide a path to a single sample samtools mpileup file or pipe mpileup output.
-b Path to write the bed file output, should end in xxx.bed.gz
-r Minimum read depth, default 10
-a Minimum non reference allelic frequency (SNVs + INDELS), default 0.05
-c Minimum non reference base count, default 3
-q Minimum base quality for inclusion in AF calculation, default 10
Example: samtools mpileup -B -d 1000000 -f $faIndex -l $bed $bam | java
-Xmx4G -jar pathToUSeq/Apps/NonReferenceRegionMaker -q 13 -r 20 -a 0.025 -b
0.025normExoNonRefMask.bed.gz -c 4
**************************************************************************************
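To make the thresholds concrete, the sketch below counts reference, mismatch, and indel events in one samtools mpileup read-base string and applies depth, count, and allele frequency cutoffs. It is a simplified illustration, not the NRRM code: base quality filtering (-q) is omitted, depth is taken as reference plus mismatch bases, and the function names are hypothetical.

```python
import re

INDEL = re.compile(r"[+-](\d+)")

def count_bases(read_bases):
    """Return (ref_count, snv_count, indel_count) for one mpileup base string."""
    ref = snv = indel = 0
    i = 0
    while i < len(read_bases):
        ch = read_bases[i]
        if ch == "^":            # read start marker, next char is a mapping quality
            i += 2
            continue
        m = INDEL.match(read_bases, i) if ch in "+-" else None
        if m:                    # +NNseq or -NNseq, skip the indel sequence itself
            indel += 1
            i = m.end() + int(m.group(1))
            continue
        if ch in ".,":
            ref += 1
        elif ch in "ACGTNacgtn":
            snv += 1
        i += 1                   # '$', '*', '<', '>' are not counted
    return ref, snv, indel

def is_non_ref(read_bases, min_depth=10, min_af=0.05, min_count=3):
    """Apply read depth, non-ref count, and non-ref allele frequency thresholds."""
    ref, snv, indel = count_bases(read_bases)
    depth = ref + snv
    non_ref = snv + indel
    if depth < min_depth or non_ref < min_count:
        return False
    return non_ref / depth >= min_af

print(count_bases("^I.,.+2AGa$,-1t*"))
print(is_non_ref("." * 18 + "aaA"))
```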
**************************************************************************************
** Novoalign Bisulfite Parser: May 2016 **
**************************************************************************************
Parses Novoalign -b2 and -b4 single and paired bisulfite sequence alignment files into
PointData file formats. Generates several summary statistics on converted and non-
converted C contexts. Flattens overlapping reads in a pair to call consensus bps.
Note: for paired read RNA-Seq data run through the SamTranscriptomeParser first.
Options:
-a Alignment file or directory containing non merged novoalignments in SAM/BAM
(xxx.sam(.zip/.gz OK) or xxx.bam) format. Multiple files are combined.
-f Fasta file directory, chromosome specific xxx.fa/.fasta(.zip/.gz OK) files.
-s Save directory.
-v Versioned Genome (e.g. H_sapiens_Mar_2006), see UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases.
Default Options:
-p Print bed file parsed data.
-x Maximum alignment score. Defaults to 300, smaller numbers are more stringent.
-q Minimum mapping quality score. Defaults to 13, bigger numbers are more stringent.
This is a phred-scaled posterior probability that the mapping position of the read
is incorrect. For RNASeq data, set this to 0.
-b Minimum base quality score for reporting a non/converted C, defaults to 13.
-c Minimum base quality score for reporting an overlapping non/converted C not found
in the other pair, defaults to 13.
-d Remove duplicate reads prior to generating PointData. Defaults to not removing
duplicates.
Example: java -Xmx25G -jar pathToUSeq/Apps/NovoalignBisulfiteParser -x 240 -a
/Novo/Run7/ -f /Genomes/Hg19/Fastas/ -v H_sapiens_Feb_2009 -s /Novo/Run7/NBP
**************************************************************************************
**************************************************************************************
** Novoalign Indel Parser: June 2010 **
**************************************************************************************
Parses Novoalign alignment xxx.txt(.zip/.gz) files for consensus indels, something
currently not supported by the maq apps. Generates a consensus indel allele file,
interbase coordinates, for running through the Alleler application. Also creates two
bed files for the insertions and deletions.
Options:
-f The full path directory/file text of your Novoalign xxx.txt(.zip or .gz) file(s).
-r Full path directory for saving the results.
-p Minimum alignment posterior probability (-10Log10(prob)) of being incorrect,
defaults to 13 (0.05). Larger numbers are more stringent.
-b Minimum affected indel base quality score(s), ditto, defaults to 13.
-u Minimum number of unique reads covering indel, defaults to 2.
Example: java -Xmx1500M -jar pathToUSeq/Apps/NovoalignIndelParser -f /Novo/Run7/
-r /Novo/Run7/indelAlleleTable.txt -p 20 -b 20 -u 3
**************************************************************************************
**************************************************************************************
** Novoalign Parser: Jan 2011 **
**************************************************************************************
Parses Novoalign xxx.txt(.zip/.gz) files into center position binary PointData xxx.bar
files, xxx.bed files, and if appropriate, a splice junction bed file. For the latter,
create a gene regions bed file and run it through the MergeRegions application to
collapse overlapping transcripts. We recommend using the following settings while
running Novoalign 'novoalign -r0.2 -q5 -d yourDataBase -f your_prb.txt | grep '>chr' >
yourResultsFile.txt'. NP works with native, colorspace, and miRNA novoalignments.
Options:
-v Versioned Genome (e.g. H_sapiens_Mar_2006), see UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases.
-f The full path directory/file text of your Novoalign xxx.txt(.zip or .gz) file(s).
-r Full path directory text for saving the results.
-p Posterior probability threshold (-10Log10(prob)) of being incorrect, defaults to 13
(0.05). Larger numbers are more stringent. The parsed scores are delogged and
converted to 1-prob.
-q Alignment score threshold, smaller numbers are more stringent, defaults to 60
-c Chromosome prefix, defaults to '>chr'.
-i Ignore strand when making splice junctions.
-g (Optional) Full path gene region bed file (chr start stop...) containing gene
regions to use in scaling intersecting splice junctions.
-s Just print alignment stats, don't save any data.
Example: java -Xmx1500M -jar pathToUSeq/Apps/NovoalignParser -f /Novo/Run7/
-v H_sapiens_Mar_2006 -p 20 -q 30 -r /Novo/Run7/mRNASeq/ -i -g
/Anno/Hg18/mergedUCSCKnownGenes.bed
**************************************************************************************
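The -10Log10(prob) scale used by -p is the standard phred transform, so a threshold of 13 corresponds to roughly a 0.05 probability of the alignment being incorrect. The helper names in this short sketch are hypothetical.

```python
import math

def phred_to_prob(score):
    """Convert a -10*log10(p) score back to the error probability p."""
    return 10 ** (-score / 10.0)

def prob_to_phred(prob):
    """Convert an error probability to its -10*log10(p) score."""
    return -10.0 * math.log10(prob)

for score in (13, 20, 30):
    p_wrong = phred_to_prob(score)
    # 1 - p_wrong is the 'delogged and converted to 1-prob' form described for -p.
    print(score, round(p_wrong, 4), round(1 - p_wrong, 4))
```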
**************************************************************************************
** Novoalign Paired Parser: January 2009 **
**************************************************************************************
Parses Novoalign paired alignment files xxx.txt(.zip/.gz) into xxx.bed format.
Options:
-f The full path directory/file text of your Novoalign xxx.txt(.zip or .gz) file(s).
-e Exclude half matches with a high quality unmatched pair, defaults to keeping them.
-m Maximum size for paired reads mapping to the same chromosome, defaults to 100000.
-s Splice junction radius, defaults to 34. See the MakeSpliceJunctionFasta app.
Example: java -Xmx1500M -jar pathToUSeq/Apps/NovoalignPairedParser -f /Novo/Run7/
**************************************************************************************
**************************************************************************************
** Oligo Tiler: Oct 2009 **
**************************************************************************************
OT tiles oligos across genomic regions returning their forward and reverse sequences.
Won't tile oligos with non GATC characters, case insensitive. Replaces non GATC chars
in offset regions with 'a'. Note, the defaults are set for generating a 60 mer Agilent
specific tiling microarray design where the first 10bp of the 3' end are buried in the
matrix and the effective oligo length is 50bp. Adjust accordingly for other platforms.
Options:
-f Fasta file directory, should contain chromosome specific xxx.fasta files.
-r Regions file to tile (tab delimited: chr start stop ...) interbase coordinates.
-o Effective oligo size, defaults to 50.
-s Spacing to place oligos, defaults to 25.
-t Three prime offset, defaults to 10.
-m Minimum size of region to tile, defaults to 20.
-a Print oligo FASTA instead of an Agilent eArray text seq formatted results.
-c Tile CpG (spacing not used, see max gap option).
-g Max gap between adjacent CpGs to include in same oligo, defaults to 8.
-e Split export files by strand instead of alternating strand.
-b Replace 3' end of oligos with the human 11-nullomer 'ccgatacgtcg'. The first
~10bp don't contribute to hybridization on Agilent arrays.
Example: java -Xmx4000M -jar pathTo/Apps/OligoTiler -s 40 -f /Genomes/Hg18/Fastas/
-r /Designs/cancerArray.bed -p -a
************************************************************************************
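A stripped-down view of the tiling logic: step across a region at the chosen spacing, keep only oligos made entirely of GATC characters, and report forward and reverse sequences. This sketch ignores the three prime offset, CpG tiling, and strand alternation, and the names used are hypothetical rather than taken from OligoTiler.

```python
import re

COMPLEMENT = str.maketrans("GATCgatc", "CTAGctag")
VALID = re.compile(r"[GATC]+", re.IGNORECASE)

def reverse_complement(seq):
    """Reverse complement of a GATC sequence, preserving case."""
    return seq.translate(COMPLEMENT)[::-1]

def tile_region(chrom_seq, start, stop, oligo_size=50, spacing=25):
    """Tile an interbase region, returning (position, forward, reverse) tuples."""
    oligos = []
    for pos in range(start, stop - oligo_size + 1, spacing):
        oligo = chrom_seq[pos:pos + oligo_size]
        # Skip oligos containing anything other than G, A, T, or C.
        if VALID.fullmatch(oligo):
            oligos.append((pos, oligo, reverse_complement(oligo)))
    return oligos

seq = "ACGTACGTNNACGTACGTAC"
for hit in tile_region(seq, 0, len(seq), oligo_size=8, spacing=4):
    print(hit)
```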
**************************************************************************************
** Overdispersed Region Scan Seqs: May 2012 **
**************************************************************************************
WARNING: this application is deprecated and no longer maintained, use the
DefinedRegionDifferentialSeq app instead!
ORSS takes bam alignment files and extracts reads under each region or gene's exons to
calculate several statistics. Makes use of Simon Anders' DESeq R package with its
negative binomial p-value test to control for overdispersion. A Benjamini-Hochberg FDR
correction is used to control for multiple testing. DESeq is run with and without
variance outlier filtering. A chi-square test of independence between the exon read
count distributions is used to score alternative splicing. Several read count measures
are provided including counts for each replica, FPKMs (# frags per kb of int region
per total mill mapped reads) as well as DESeq's variance adjusted counts (use these for
clustering, correlation, and other distance type analysis). If replicas are provided
either the smallest all pair log2Ratio is reported (default) or the pseudomedian.
Several results files are written: two spreadsheets containing all of the genes,
those that pass the thresholds, as well as egr, bed12, and useq region files for
visualization in genome browsers.
Required Options:
-s Save directory.
-t Treatment directory containing one xxx.bam file with xxx.bai index per biological
replica. The BAM files should be sorted by coordinate and have passed Picard
validation. Use the SamTranscriptomeParser to convert your aligned transcriptome
data to genomic coordinates.
-c Control directory, ditto.
-u UCSC RefFlat or RefSeq Gene table file, full path. See,
http://genome.ucsc.edu/cgi-bin/hgTables, (name1 name2(optional) chrom strand
txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds). WARNING!!!!!!
This table should contain only one composite transcript per gene. Use the
MergeUCSCGeneTable app to collapse Ensembl transcripts downloaded from UCSC in
RefFlat format.
-b (Or) a bed file (chr, start, stop,...), full path, See,
http://genome.ucsc.edu/FAQ/FAQformat#format1
-v Versioned Genome (e.g. H_sapiens_Mar_2006, D_rerio_Jul_2010), see UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases.
Advanced/ Default Options:
-o Don't remove overlapping exons, defaults to filtering gene annotation for overlaps.
-i Score introns instead of exons.
-a Data is stranded. Only collect reads from the same strand as the annotation.
-f Minimum FDR threshold, defaults to 10 (-10Log10(FDR=0.1))
-l Minimum absolute log2 ratio threshold, defaults to 1 (2x)
-e Minimum number mapping reads per region, defaults to 20
-d Don't delete temp files used by DESeq
-p Use a pseudo median log2 ratio in place of the smallest all pair log2 ratios for
scoring the degree of differential expression when replicas are present.
Recommended for experiments with 4 or more replicas.
-r Full path to R loaded with DESeq library, defaults to '/usr/bin/R' file, see
http://www-huber.embl.de/users/anders/DESeq/ . Type 'library(DESeq)' in
an R terminal to see if it is installed.
Example: java -Xmx4G -jar pathTo/USeq/Apps/OverdispersedRegionScanSeqs -t
/Data/PolIIRep1/,/Data/PolIIRep2/ -c /Data/Input1/,/Data/Input2/ -s
/Data/PolIIResults/ -f 30 -e 30 -u /Anno/mergedZv9EnsemblGenes.ucsc.gz
**************************************************************************************
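Two of the quantities above reduce to simple arithmetic: FPKM (fragments per kb of interrogated region per million mapped reads) and the -10Log10 transform used by the -f FDR threshold. The sketch below uses hypothetical function names and made-up counts purely to show the formulas.

```python
import math

def fpkm(fragments, region_length_bp, total_mapped_reads):
    """Fragments per kilobase of region per million mapped reads."""
    return fragments / (region_length_bp / 1e3) / (total_mapped_reads / 1e6)

def transformed_fdr(fdr):
    """-10*log10(FDR), so an FDR of 0.1 maps to 10 and 0.001 maps to 30."""
    return -10.0 * math.log10(fdr)

print(fpkm(500, 2000, 10_000_000))
print(transformed_fdr(0.1), transformed_fdr(0.001))
```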
**************************************************************************************
** Create Exon Summary Metrics : April 2013 **
**************************************************************************************
This script runs a bunch of summary metric programs and compiles the results. It uses
R and LaTex to generate a fancy pdf as an output. Can also generate html.
Required:
-a Alignment statistics from Picard's CollectAlignmentMetrics
-b Alignment counts from USeq's CountChromosomes
-c Coverage of CCDS exons from USeq's Sam2USeq
-d Duplication statistics from Picard's MarkDuplicates
-e Error rate from USeq's CalculatePerCycleErrorRate
-f Overlap Statistics from USeq's MergePaired Sam Alignment
-o Output file name
Optional:
-r Path to R
-l Path to pdflatex
-t Generate html instead
-i Generate dictionary (for pipeline)
-c Coverage file name
Example: java -Xmx1500M -jar pathTo/USeq/Apps/VCFAnnovar -v 9908R.vcf
**************************************************************************************
**************************************************************************************
** ParseIntersectingAlignments: June 2010 **
**************************************************************************************
Parses bed alignment files for intersecting reads provided another bed file of alleles.
Options:
-s Full path file text for your SNP allele six column bed file (tab delimited chr,
start,stop,text,score,strand)
-a Full path file text for your alignment bed file from the NovoalignParser.
-m Minimum base quality, defaults to 13
Example: java -Xmx1500M -jar pathToUSeq/Apps/ParseIntersectingAlignments
-s /LympAlleles/ex1.bed -a /SeqData/lymphAlignments.bed -m 13
**************************************************************************************
**************************************************************************************
** ParsePointDataContexts: Feb 2011 **
**************************************************************************************
Parses PointData for particular 5bp genomic sequence contexts.
Options:
-s Save directory, full path.
-p PointData directories, full path, comma delimited. These should
contain stranded chromosome specific xxx_-/+_.bar.zip files. One
can also provide a single directory that contains multiple PointData
directories. These will be merged before splitting by summing overlapping
position scores.
-f Fasta files for each chromosome.
-c Context java regular expression, must be 5bp long, 5'->3', case insensitive, e.g.:
'..CG.' for CG
'..C[CAT]G' for CHG
'..C[CAT][CAT]' for CHH
'..C[CAT].' for nonCG
'..C[^G].' for nonCG
Example: java -Xmx12G -jar pathTo/USeq/Apps/ParsePointDataContexts -c '..CG.' -s
/Data/PointData/CG -f /Genomes/Hg18/Fastas -p /Data/PointData/All/
**************************************************************************************
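The context expressions listed for -c behave the same way under Python's re module, which makes it easy to try them out. The sketch below scans a sequence and reports positions whose surrounding 5bp window matches. It assumes the scored base is the third character of the window (as the '..C' prefix suggests) and only considers the forward strand; the function name is hypothetical.

```python
import re

def positions_in_context(sequence, context_regex):
    """Return 0-based positions whose centered 5bp window matches the context."""
    pattern = re.compile(context_regex, re.IGNORECASE)
    hits = []
    for center in range(2, len(sequence) - 2):
        if pattern.fullmatch(sequence[center - 2:center + 3]):
            hits.append(center)
    return hits

seq = "AACGTTCAGACCTA"
for name, regex in [("CG", "..CG."), ("CHG", "..C[CAT]G"), ("CHH", "..C[CAT][CAT]")]:
    print(name, positions_in_context(seq, regex))
```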
**************************************************************************************
** PeakShiftFinder: May 2010 **
**************************************************************************************
PeakShiftFinder estimates the bp difference between sense and antisense proximal chIP-
seq peaks. It calculates the shift in two ways: by generating a composite peak from a
set of the top peaks in a dataset and by taking the median shift for the top peaks.
The latter appears more reliable for some datasets. Inspect the results in IGB by
loading the xxx.bar graphs. When in doubt, run ScanSeqs with just your
treatment data setting the peak shift to 0 and window size to 50 and manually inspect
the shift in IGB.
Options:
-t Treatment Point Data directories, full path, comma delimited. These should
contain stranded chromosome specific xxx_+_.bar.zip and xxx_-_.bar.zip files.
-c Control Point Data directories, ditto.
-s Save directory, full path.
Advanced Options:
-e Two chIP samples are provided, no input, scan for reduced peaks too.
-w Window size in bps, defaults to 50.
-a Minimum number window reads, defaults to 10
-d Minimum normalized window score, defaults to 2.5
-r Minimum fold of treatment to control window reads, defaults to 5
-n Number of peaks to merge for composite, defaults to 100
-p Distance off peak center to collect from 5' end, defaults to 500
-m Distance off peak center to collect from 3' end, defaults to 1000
Example: java -Xmx1500M -jar pathTo/USeq/Apps/PeakShiftFinder -t
/Data/Ets1Rep1/,/Data/Ets1Rep2/ -c /Data/Input1/,/Data/Input2/ -s
/Results/Ets1PeakShiftResults -w 25 -d 5
**************************************************************************************