%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%                                                                              %
%     Gene Prediction with the Command-Line Version of AUGUSTUS                %
%                                                                              %
%     Mario Stanke                                                             %
%     Institut für Mikrobiologie und Genetik                                   %
%     Abteilung Bioinformatik                                                  %
%     Goldschmidtstraße 1                                                      %
%     37077 Göttingen                                                          %
%     Fon: +49 551 3914926                                                     %
%     mario@soe.ucsc.edu                                                       %
%                                                                              %
%     Date: May 10th, 2006                                                     %
%                                                                              %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 1. INTRODUCTION
 2. INSTALLATION
 3. RUNNING AUGUSTUS
 4. ALTERNATIVE TRANSCRIPTS AND POSTERIOR PROBABILITIES
 5. USING HINTS (AUGUSTUS+)
 6. RETRAINING AUGUSTUS
 7. WEB-SERVER
 8. CONTACT ME
 9. REFERENCES

1. INTRODUCTION
---------------

AUGUSTUS is a gene prediction program for eukaryotes by Mario Stanke and Stephan Waack.
It can be used as an ab initio program, which means it bases its prediction purely on the
sequence. AUGUSTUS may also incorporate hints on the gene structure coming from extrinsic
sources such as BLAST search results.
Currently, AUGUSTUS has been trained for predicting genes in:
- Homo sapiens (human)
- Drosophila melanogaster (fruit fly)
- Arabidopsis thaliana (plant)
- Brugia malayi (nematode)
- Aedes aegypti (mosquito)
- Coprinus cinereus (fungus)
- Tribolium castaneum (bug)
- Schistosoma mansoni (worm)
- Tetrahymena thermophila (ciliate)
- Galdieria sulphuraria (red algae)
- Zea mays (maize)

The species parameter files of the following species are a courtesy of Jason Stajich
(see also http://fungal.genome.duke.edu/).
- Caenorhabditis elegans (worm)
- Saccharomyces cerevisiae (baker's yeast)
- Ustilago maydis (fungus)
- Phanerochaete chrysosporium (fungus)
- Neurospora crassa (bread mold)
- Histoplasma capsulatum (fungus)
- Fusarium graminearum (fungus)
- Cryptococcus neoformans (fungus)
- Aspergillus nidulans (fungus)

2. INSTALLATION
---------------

1.
unpack
>tar -xzf augustus.1.8.2.tar.gz
The tar-archive contains one directory 'augustus' with the following sub-directories:
bin
src
include
config
examples
scripts

2. set environment variable AUGUSTUS_CONFIG_PATH
> export AUGUSTUS_CONFIG_PATH=/my_path_to_AUGUSTUS/augustus/config/
The program requires that the environment variable AUGUSTUS_CONFIG_PATH is set to the config
directory that contains the configuration and parameter files. This is the directory
'augustus/config'. You probably want to add this line to a startup script (like ~/.bashrc).
Alternatively, you can specify this directory on the command line when you run augustus:
--AUGUSTUS_CONFIG_PATH=/my_path_to_AUGUSTUS/augustus/config/

3. make augustus/bin/augustus executable (if necessary)
> chmod +x augustus/bin/augustus
You may want to add the path of the executable to the PATH environment variable or copy
augustus into a common directory (e.g. /usr/bin/).

In case you recompile the source code on a platform other than linux (e.g. with g++), I would be
very grateful if you could email me the executables 'augustus' and 'etraining'. Then we could
share them with other users.

3. RUNNING AUGUSTUS
-------------------

AUGUSTUS has 2 mandatory arguments: the query file and the species. The query file contains the
DNA input sequence and must be in uncompressed (multiple) fasta format, e.g. the file may look
like this

>name of sequence 1
agtgctgcatgctagctagct
>name of sequence 2
gtgctngcatgctagctagctggtgtnntgaaaaatt

Every letter other than a,c,g,t,A,C,G and T is interpreted as an unknown base. Digits and white
spaces are ignored. The number of characters per line is not restricted.
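The input rules above can be illustrated with a small Python sketch. It is not part of AUGUSTUS
and does not claim to mirror its internal parser; the function names read_fasta and normalize,
the lower-casing of known bases and the use of 'n' for an unknown base are assumptions made
for this illustration only.

```python
def normalize(seq):
    """Apply the stated rules to one raw sequence string.

    Digits and white space are dropped; a,c,g,t (any case) are kept
    (lower-cased here, an assumption); every other character becomes
    'n' (assumed symbol for an unknown base).
    """
    out = []
    for ch in seq:
        if ch.isdigit() or ch.isspace():
            continue  # digits and white space are ignored
        out.append(ch.lower() if ch in "acgtACGT" else "n")
    return "".join(out)


def read_fasta(text):
    """Parse (multiple) fasta text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # a header line starts a new record; store the previous one
            if name is not None:
                records.append((name, normalize("".join(chunks))))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)  # line length is not restricted
    if name is not None:
        records.append((name, normalize("".join(chunks))))
    return records
```

Called on the two-sequence example above, read_fasta would return one (name, sequence) pair per
header line, with the 'n' characters of the second sequence left as unknown bases.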
usage:
augustus [parameters] --species=SPECIES queryfilename

SPECIES is one of the following identifiers

identifier            | species
----------------------|----------------------
human                 | Homo sapiens
fly                   | Drosophila melanogaster
arabidopsis           | Arabidopsis thaliana
brugia                | Brugia malayi
aedes                 | Aedes aegypti
coprinus              | Coprinus cinereus
tribolium             | Tribolium castaneum
schistosoma           | Schistosoma mansoni
tetrahymena           | Tetrahymena thermophila
galdieria             | Galdieria sulphuraria
maize                 | Zea mays
elegans               | Caenorhabditis elegans
saccharomyces         | Saccharomyces cerevisiae
ustilago              | Ustilago maydis
pchrysosporium        | Phanerochaete chrysosporium
neurospora            | Neurospora crassa
histoplasma           | Histoplasma capsulatum
fusarium              | Fusarium graminearum
cryptococcus          | Cryptococcus neoformans
anidulans             | Aspergillus nidulans

'queryfilename' is the filename (including relative path) to the file containing the query
sequence(s) in fasta format.

important parameters:

--strand=both, --strand=forward or --strand=backward
  report predicted genes on both strands, just the forward or just the backward strand.
  default is 'both'
--genemodel=partial, --genemodel=complete, --genemodel=atleastone or --genemodel=exactlyone
  partial      : allow prediction of incomplete genes (default)
  complete     : only predict complete genes
  atleastone   : predict at least one complete gene
  exactlyone   : predict exactly one complete gene
--singlestrand=true
  predict genes independently on each strand, allow overlapping genes on opposite strands.
  This option is turned off by default.
--hintsfile=hintsfilename
  When this option is used, the prediction considering hints (extrinsic information) is turned on.
  hintsfilename contains the hints in gff format.
--extrinsicCfgFile=cfgfilename
  Optional. This file contains the list of used sources for the hints and their bonuses and
  maluses. If not specified, the file "extrinsic.cfg" in the config directory
  $AUGUSTUS_CONFIG_PATH is used.
--maxDNAPieceSize=n
  When --singlestrand=false (default), AUGUSTUS uses a little more than 1 MB of memory per
  kilobase of sequence length. When run with --singlestrand=true it uses about half the memory.
  If the sequence is too long for the memory of your computer, you can specify the maximal
  length of the pieces that the sequence should be cut into.
  Default is --maxDNAPieceSize=200000 (< 250MB RAM required).
  AUGUSTUS tries to place the boundaries of these pieces in the intergenic region, which is
  inferred by a preliminary prediction. GC-content dependent parameters are chosen for each
  piece of DNA. This is why this value should not be set very large, even if you have plenty
  of memory.
--protein=on/off
--introns=on/off
--start=on/off
--stop=on/off
--cds=on/off
--codingseq=on/off
  Output options. Output the predicted protein sequence, introns, start codons and stop codons,
  or use 'cds' in addition to 'initial', 'internal', 'terminal' and 'single' exon. The CDS
  excludes the stop codon (unless stopCodonExcludedFromCDS=false) whereas the terminal and
  single exon include the stop codon.
--alternatives=true/false
--sample=n
--minexonintronprob=p
--minmeanexonintronprob=p
--maxtracks=n
  For a description of these parameters see section 4 below.

For example, you may type in the 'bin' directory

>augustus --species=human --strand=forward --introns=off ../examples/example.fa

The output format is gtf, which is similar to the General Feature Format (gff), see
http://www.sanger.ac.uk/Software/formats/GFF/. It contains one line per predicted exon.
Example:

HS04636	AUGUSTUS	initial	966	1017	.	+	0	transcript_id "g1.1"; gene_id "g1";
HS04636	AUGUSTUS	internal	1818	1934	.	+	2	transcript_id "g1.1"; gene_id "g1";

The columns (fields) contain:

seqname source feature start end score strand frame transcript and gene name

AUGUSTUS also accepts files in annotated GENBANK format as input. This is needed for training.
Also, when predicting on a genbank file, AUGUSTUS compares its prediction to the annotation and
prints out accuracy statistics.
Example genbank file format accepted by AUGUSTUS:

LOCUS       HS04636   9453 bp  DNA
FEATURES             Location/Qualifiers
     source          1..9453
     CDS             join(966..1017,1818..1934,2055..2198,2852..2995,3426..3607,
                     4340..4423,4543..4789,5072..5358,5860..6007,6494..6903)
BASE COUNT     2937 a   1716 c  1710 g   3090 t
ORIGIN
        1 gagctcacat taactattta cagggtaact gcttaggacc agtattatga ggagaattta
       61 cctttcccgc ctctctttcc aagaaacaag gagggggtga aggtacggag aacagtattt
      121 cttctgttga aagcaactta gctacaaaga taaattacag ctatgtacac tgaaggtagc
...
     9421 aaaaaaaaaa aaaaatcgat gtcgactcga gtc
//

Another example that is important for training the UTR models: the following genbank file will
be interpreted as having three genes. One gene ('A') has both a 5' and a 3' UTR; the other two
are single UTRs without a matching coding sequence. Gene 'B' consists just of the 5' UTR, gene
'C' just of the 3' UTR.

LOCUS       example2   9453 bp  DNA
FEATURES             Location/Qualifiers
     source          1..9453
     mRNA            join(100..200,900..1017,1818..2000,2100..2200)
                     /gene="A"
     CDS             join(966..1017,1818..1934)
                     /gene="A"
     mRNA            join(3100..3200,3500..>3600)
                     /gene="B"
     mRNA            join(<4100..4200,4500..4600)
                     /gene="C"
BASE COUNT     2937 a   1716 c  1710 g   3090 t
ORIGIN
        1 gagctcacat taactattta cagggtaact gcttaggacc agtattatga ggagaattta
       61 cctttcccgc ctctctttcc aagaaacaag gagggggtga aggtacggag aacagtattt
      121 cttctgttga aagcaactta gctacaaaga taaattacag ctatgtacac tgaaggtagc
...
     9421 aaaaaaaaaa aaaaatcgat gtcgactcga gtc
//

4. ALTERNATIVE TRANSCRIPTS AND POSTERIOR PROBABILITIES
------------------------------------------------------

Alternative transcripts
-----------------------
When you specify on the command line
--alternatives=true
or edit the appropriate line in the configuration file for your species to
alternatives  true
then AUGUSTUS may report multiple transcripts per gene. A gene is then defined as a set of
transcripts whose coding sequences (indirectly) overlap.
The number of alternatives AUGUSTUS reports for a gene depends on which ones are likely
alternatives. If just one transcript is likely in that region, then just one transcript is
reported. The behavior of AUGUSTUS can be adjusted with the parameters
--minexonintronprob=p
--minmeanexonintronprob=p
--maxtracks=n (default -1, no limit)

The posterior probability of every exon and every intron in a transcript must be at least
'minexonintronprob', otherwise the transcript is not reported. minexonintronprob=0.1 is a
reasonable value. In addition, the geometric mean of the probabilities of all exons and introns
must be at least 'minmeanexonintronprob'. minmeanexonintronprob=0.4 is a reasonable value.
The maximum number of tracks when displayed in a genome browser is 'maxtracks' (unless
maxtracks=-1, then it is unbounded). In cases where all transcripts of a gene overlap at some
position, this is also the maximal number of transcripts for that gene. To improve sensitivity,
I recommend increasing 'maxtracks'. To improve specificity, I recommend setting 'maxtracks' to 1
and increasing minmeanexonintronprob and/or minexonintronprob.

Posterior probabilities
-----------------------
AUGUSTUS reports the posterior probabilities of exons, introns, transcripts and genes. The
posterior probability of an exon is the conditional probability that the random gene structure
has some exon with these coordinates on this strand, given the input sequence. Unlike an exon
score, it depends not only on the sequence in the range of the exon itself but is also
influenced, for example, by the possibilities of compatible neighboring exons. The intron
probability is defined similarly. The reported probability of a transcript is the probability
that a splice variant is exactly like the given transcript. The reported probability of a gene
is the probability that SOME coding sequence is in the reported range on the reported strand,
regardless of the exact transcript.
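As a rough illustration of the two thresholds described under "Alternative transcripts" above,
the following Python sketch checks whether one transcript would be reported. It is not taken
from the AUGUSTUS source code; the function names are hypothetical, and the values 0.1 and 0.4
are used as defaults only because they are the "reasonable values" mentioned above.

```python
import math


def geometric_mean(probs):
    """Geometric mean of a non-empty list of probabilities."""
    if any(p <= 0.0 for p in probs):
        return 0.0  # a single zero factor makes the whole product zero
    # average in log space to avoid underflow for long transcripts
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


def passes_thresholds(probs, minexonintronprob=0.1, minmeanexonintronprob=0.4):
    """Decide whether a transcript passes both probability thresholds.

    probs holds the posterior probabilities of all exons and introns of
    one transcript (hypothetical input representation).
    """
    if not probs:
        return False  # nothing to evaluate (assumption of this sketch)
    # every exon and every intron must reach minexonintronprob
    if min(probs) < minexonintronprob:
        return False
    # the geometric mean over all exons and introns must reach
    # minmeanexonintronprob
    return geometric_mean(probs) >= minmeanexonintronprob
```

With these defaults, a transcript with probabilities 0.2, 0.2 and 0.9 passes the first check but
fails the second, because its geometric mean is roughly 0.33.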
The posterior probabilities are estimated using a sampling algorithm. The parameter --sample=n
adjusts the number of sampling iterations. The higher 'n' is, the more accurate the estimation,
but it usually isn't important that the posterior probability is very accurate. Every 30 sample
iterations take about the same time as one run without sampling, e.g. --sample=60 takes about
3 times as much time as --sample=0 (which was standard up to version 1.6). The default is
--sample=100
If you do not need the posterior probabilities or alternative transcripts, say
--sample=0

There are 3 common scenarios for the above parameters, depending on what you want:

Just output the most likely gene structure as in previous versions. No posterior probabilities,
no alternatives:
--sample=0 --alternatives=false

Output the most likely gene structure and report posterior probabilities:
--sample=100 --alternatives=false

Output alternative transcripts and report posterior probabilities:
--sample=100 --alternatives=true

Be aware that sampling is pseudorandom and that the results may vary from machine to machine.

5. USING HINTS (AUGUSTUS+)
--------------------------

AUGUSTUS can take hints on the gene structure. It accepts 6 types of hints:
start, stop, ass, dss, exonpart, exon.
In addition, hints of type intron are allowed for manual hints, intended as a user constraint on
the gene structure.
The hints must be stored in a file in gff format containing one hint per line.
Example of a hintsfile:

HS04636	mario	exonpart	500	506	.	-	.	source=M
HS04636	mario	exon	966	1017	.	+	0	source=P
HS04636	AGRIPPA	start	966	968	6.3e-239	+	0	gb|AAA35803.1 source=P
HS04636	AGRIPPA	dss	2199	2199	1.3e-216	+	.	gb|AAA35803.1 source=C
HS04636	mario	stop	7631	7633	.	+	0	source=M
HS08198	AGRIPPA	intron	2000	2000	0	+	.	ref|NP_000597.1 source=M
HS08198	AGRIPPA	ass	757	757	1.4e-52	+	.	ref|NP_000597.1 source=E

The fields must be separated by a tabulator. In the first column (field) the sequence name is
given.
In this example the hints refer to two sequences. The second field is the name of the program
that produced the hint. It is ignored here. The third column specifies the type of the hint.
The 4th and 5th columns specify the begin and end position of the hint. Positions start at 1.
The 6th column gives a score, the 7th the strand and the 8th the reading frame as defined in
the GFF standard. The 9th column contains arbitrary extra information, but it must contain a
string 'source=X' where X is the source identifier of the hint. Which values for X are possible
is specified in the file augustus/config/extrinsic.cfg, e.g. X=M, E, or P.

start    : Refers to the start codon at the translation start.
stop     : Refers to the stop codon at the translation stop.
ass      : Refers to the last (most 3') position of an intron.
dss      : Refers to the first (most 5') position of an intron.
exonpart : Refers to an interval that is thought to be coding.
exon     : Refers to an interval that is thought to be an exon (an initial, internal, terminal
           or single exon).
intron   : Refers to an interval that is thought to be part of an intron. Only a manual
           (source=M) intron hint is possible.

AUGUSTUS can follow a hint, i.e. predict a gene structure that is compatible with it, or
AUGUSTUS can ignore a hint, i.e. predict a gene structure that is not compatible with it. The
more reliable the hints of a type are, the smaller the probability that AUGUSTUS ignores a hint
of that type. This reliability must be determined beforehand, using a set of annotated sequences
and the hints for that set. The result is a table of bonuses and maluses that is stored in
augustus/config/extrinsic.cfg.

For hints that are set manually or generated with AGRIPPA using BLAST searches in an EST or
protein database, the bonuses and maluses have been determined for Homo sapiens.
They are stored in the files
augustus/config/extrinsic.M.cfg
augustus/config/extrinsic.human.ME.cfg
augustus/config/extrinsic.human.MC.cfg
augustus/config/extrinsic.human.MP.cfg
augustus/config/extrinsic.human.MPEC.cfg
The characters in the filename before .cfg show which sources have been used.

Run AUGUSTUS using the --hintsfile option and the appropriate .cfg file.
Example:

>augustus --species=human --hintsfile=../examples/hints.gff --extrinsicCfgFile=../config/extrinsic.human.MPEC.cfg ../examples/example.fa

As an alternative to giving the option --extrinsicCfgFile, you can replace
augustus/config/extrinsic.cfg with the appropriate file, as this file is read by default when
the option --extrinsicCfgFile is not given.

WARNING: The rest of section 5. is intended for internal use.

(Retraining the hint parameters)
--------------------------------
When the process (the program) that generates the hints is new, the bonuses and maluses must be
adapted. Take some training set of annotated genes in genbank format (multiple genes per
sequence are possible, but they must be sorted). Obtain the hints for this training set. Create
or copy the file augustus/config/extrinsic.cfg, so that it contains the sources of only those
hints that have been obtained for the training set. The numbers in the table do not matter,
they are ignored. Nevertheless, the format must be correct. Run augustus on the training set
with the hints as parameters, e.g.

>augustus --species=HUMAN --checkExAcc=true ../config/h178.test --hintsfile=~/human/extrinsic/h178est/agrippa.h178.E.gff

At the top of the output, a summary is given of how many hints are compatible with the annotated
gene structure (good) and how many hints are incompatible with the annotated gene structure
(bad). This is just for information. After a line reading
'-----------configfile-------------'
a table of bonuses and maluses is output. Copy these 6 lines to the extrinsic.cfg file and
replace the corresponding 6 lines.
When there are hints that can occur multiple times, for example a dss hint from an EST search
and the same dss hint from a protein search, there is an additional step. When several hints are
identical except for the source, AUGUSTUS uses only the hint that is most reliable. The others
are considered redundant and are deleted. When there are redundant hints in your data, you must
first run AUGUSTUS with the option --checkExAcc=true. Then no redundant hints are deleted. Take
the resulting table (even if some values are negative), replace the one in extrinsic.cfg with it
and run AUGUSTUS another time without turning this option on. Again replace the table in
extrinsic.cfg with the one from the program output.
Remark: In the second run, AUGUSTUS knows which source's hints should be deleted in case of
collisions.

6. RETRAINING AUGUSTUS
----------------------

This documentation is under development. See the file retraining.html.

AUGUSTUS uses species-specific parameters, such as the Markov chain transition probabilities of
coding and non-coding regions. These parameters can be trained on training sets of annotated
genes in genbank format. They are stored in the config directory in 3 files containing the
exon-related, intron-related and intergenic-region-related parameters, e.g.
human_exon_probs.pbl, human_intron_probs.pbl, human_igenic_probs.pbl.

For each species there are also parameters like the order of the Markov chain or the size of
the window used for the splice site models. Let's call these meta parameters. These meta
parameters are stored in a separate file, e.g. human_parameters.cfg. Which meta parameters work
best depends on the species and on the training set, in particular on the size of the training
set. Using the meta parameters of another species or of another training set is likely to
result in poor prediction performance. The meta parameters are not documented sufficiently.
However, when optimizing the meta parameters for a new species it helps to know their meaning.
Please contact me in case you want me to do the training.

The program 'etraining' reads the meta parameters from the .cfg file and a genbank file with
annotated genes and writes the other species-specific parameters into the 3 .pbl files.

usage:
etraining --species=SPECIES trainfilename

'trainfilename' is the filename (including relative path) to the file in genbank format
containing the training sequences. These can be multi-gene sequences and genes on the reverse
strand. However, the genes must not overlap.

7. WEB-SERVER
-------------

AUGUSTUS can also be run through a web interface available on the AUGUSTUS home page:
http://augustus.gobics.de.

8. CONTACT ME
-------------

Please don't hesitate to contact me in case you find a bug, miss a desirable feature, need
executables for a different platform, need detailed info about the model, ....
In case you need to run AUGUSTUS on a different organism and have at least 200 annotated genes
(genbank or gff) as training data for AUGUSTUS, I could make another specially trained version
of AUGUSTUS.

9. REFERENCES
-------------

Mario Stanke, Oliver Schöffmann, Burkhard Morgenstern and Stephan Waack (2006),
"Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from
external sources", BMC Bioinformatics, 7:62

Mario Stanke and Burkhard Morgenstern (2005),
"AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints",
Nucleic Acids Research, 33, W465-W467

Mario Stanke, Rasmus Steinkamp, Stephan Waack and Burkhard Morgenstern (2004),
"AUGUSTUS: a web server for gene finding in eukaryotes",
Nucleic Acids Research, Vol. 32, W309-W312

Mario Stanke (2003),
Gene Prediction with a Hidden-Markov Model. Ph.D.
thesis, Universität Göttingen,
http://webdoc.sub.gwdg.de/diss/2004/stanke/

Mario Stanke and Stephan Waack (2003),
Gene Prediction with a Hidden-Markov Model and a new Intron Submodel.
Bioinformatics, Vol. 19, Suppl. 2, pages ii215-ii225