###########################
# README FOR PHYLONET-v2b #
###########################

Copyright 2004--2005 Ting Wang and Gary Stormo
May be copied for noncommercial purposes.

Authors:
  Ting Wang and Gary Stormo
  Department of Genetics
  Washington University in St. Louis
  Campus Box 8232
  St. Louis, MO 63110
  stormo@ural.wustl.edu
  twang@ural.wustl.edu

PhyloNet (version 2b)

#################
# BASIC OPTIONS #
#################

Usage: ./PhyloNet-v2b
Use the "-h" option for detailed directions
  -f -d [-h ] [-a ] [-q ] [-w ] [-ez ] [-iq ] [-id ]
  [-c0 ] [-c1 ] [-c2 ]
  [-s 4: approximate ALLR -> accurate ALLR>]
  [-th ] [-m ] [-u0 ] [-u1 ] [-u2 ] [-v ] [-r1 ] [-r2 ]
  [-pf ] [-o1 ] [-o2 ]

#######################
# GENERAL INFORMATION #
#######################

PhyloNet is a motif discovery program. PhyloNet stands for
"Phylogenetic Regulatory Network". It represents a new paradigm for
motif discovery: based on sequences of several evolutionarily related
genomes, PhyloNet predicts a near complete set of conserved motifs of
the organism of interest, as well as gene clusters that share these
motifs, without relying on additional data such as gene regulation.
The algorithm takes advantage of two important features of a
regulatory network: phylogenetic conservation and network topology.

The architecture of the program follows that of BLAST. The input
sequences are divided into two parts: query and database. The query is
the "gene of interest", or the promoter sequence of the gene of
interest; the database contains all the genes/promoters of the genome.
For each promoter, a few orthologous promoters are needed, but the
orthologs need not come from the same set of genomes, and the number
of orthologs may differ from promoter to promoter. Basically, each
"unit" is a group of orthologous sequences, with the first sequence
being the promoter from the genome of interest.

Just like BLAST, the algorithm compares the promoter of interest to
all the promoters of the genome and determines local similarities
between the query and database promoters. The output of the program
contains motifs of the promoter of interest, together with the set of
genes that share each motif. The program determines the width of the
pattern being sought. For whole genome motif discovery analysis, one
can simply use every promoter in turn as the query, with all promoters
as the database, run PhyloNet once per query, and consolidate the
predictions.

Before running PhyloNet, a separate step of "phylogenetic
footprinting" needs to be performed. Bundled with PhyloNet, we use the
algorithm "wconsensus" for phylogenetic footprinting. This algorithm
locally aligns sequences of multiple genomes, producing multiple,
suboptimal ungapped alignments. By replacing the input module of
PhyloNet one can use other algorithms for this step.

Including phylogenetic footprinting, the algorithm has these
components:

1) Phylogenetic footprinting of the promoters: the wconsensus
   algorithm is used to extract conserved regions of the promoters
   based on reference genome sequences.

2) Promoter profile construction: multiple, suboptimal sequence
   alignments from phylogenetic footprinting are converted to sequence
   profiles.

3) Profile space partition: the continuous profile space is
   partitioned into discrete profile clusters. Each partition of the
   profile space is represented by a single profile. Distances among
   the partitions are calculated by the ALLR statistic. An ALLR
   scoring matrix is constructed for profile comparison.

4) Query hashing: the query promoter profiles are converted into a
   collection of formatted "seeds (or words)" of flexible length.
   Neighborhood words of each seed are generated via a branch and
   bound algorithm. A hash (or index) is built for the query promoter.

5) Motif BLAST: the entire database (all promoter profiles) is
   searched against the query hash to locate word hits, then each hit
   is extended via local dynamic programming to a high scoring pair
   (HSP). The significance of these HSPs is estimated by the
   Karlin-Altschul statistic.

6) HSP clustering: significant HSPs are mapped back to the query
   promoter, and are clustered by applying a maximum clique finding
   algorithm from graph theory, based on the overlapping relations
   among HSPs.

7) Motif construction: clustered HSPs are converted to motifs using a
   greedy approach. The final significance of a motif is estimated
   based on the sum of p-values.

8) Background control: the algorithm has options to shuffle either the
   query promoter, or the database, or both, while conserving the
   sequence identity, sequence length and length of conserved blocks.
   The program will run on the shuffled datasets to generate a
   background score distribution.

###########################
# FORMAT OF THE SEQUENCES #
###########################

Sequences of an orthologous group are formatted according to the
Consensus conventions and put into one separate file. For example, if
gene1 has 2 orthologs, then there will be a file called "gene1.cons"
and the sequences are formatted as:

[ modifiers.. ] gene1 ; any description of the seq
\ AACC.... the actual sequence \
[ modifiers.. ] gene1_2 ; any description of the seq
\ AACC.... the actual sequence \
[ modifiers.. ] gene1_3 ; any description of the seq
\ AACC.... the actual sequence \

The rules are:

1) Each sequence has two components: a description line, where you can
   add modifiers, and the actual sequence, wrapped by "\".

2) The first sequence is the reference sequence, i.e. the sequence
   from the genome of interest. The order of the rest of the sequences
   is not important.

3) The name of the sequence from the genome of interest must be
   unique. The names of the sequences from the other genomes need not
   be.

4) Sequence modifiers appear in front of the name of the relevant
   sequence.
They are:

   -s integer-integer integer-integer:
      the positions in the sequence indicated by the integer pairs,
      inclusive, are seed sequences. If the "-s" modifier is used
      anywhere in the input file, then the initial set of matrices
      will only be constructed (i.e., seeded) from the sequences
      within the marked regions. If this modifier is not used anywhere
      in the input file, then all the sequences will be used to seed
      matrices. One or more integer pairs can be indicated for a
      single sequence. However, if no integer pairs are given, the
      whole sequence will be used for seeding matrices.

   -i integer-integer integer-integer:
      the positions in the sequence indicated by the integer pairs,
      inclusive, are the only positions to be analyzed.

   -e integer-integer integer-integer:
      the positions in the sequence indicated by the integer pairs,
      inclusive, are to be excluded from the analysis.

   When both the "-i" and "-e" modifiers are used, the intersection of
   permissible positions is analyzed. When a sequence name is not
   marked by either the "-i" or "-e" modifier, then the whole sequence
   is included in the analysis.

Do not explicitly give the complements of nucleic acid sequences. The
complementary sequence is determined by the program.

Whitespace, periods, dashes (unless part of an integer when the "-i"
option is used), and comments beginning with ';', '%', or '#' are
ignored. When using letter characters (i.e., with the "-a" and "-A"
alphabet options), integers are also ignored so that the sequence file
can contain positional information.

In the database, each gene group has its own subfolder, named
"gene_name", inside which the sequence file "gene_name.cons" and the
alignment file "gene_name.wout" reside. At the top level, there is a
database description file, which contains the unique gene names, one
gene per line.
For example, if the genome has 5 genes and their names are aa, bb, cc,
dd, ee, then the structure of the database is:

>ls (current dir)
aa bb cc dd ee database

aa through ee are folders, and database is a plain text file that
contains the names:

>more database
aa
bb
cc
dd
ee

Now go to each of the subfolders:

>cd aa
>ls
aa.cons aa.wout

Under the aa subfolder, there is a sequence file which contains the
formatted sequences of gene aa and its orthologs, and an alignment
file which is the output of wconsensus on aa.cons. The other folders
should contain similarly formatted data.

#########################
# COMMAND LINE OPTIONS: #
#########################

0) -h: print these directions.

1) General information

   -f queryname
      "queryname.cons" contains sequences formatted as described
      above. "queryname.wout" contains the wconsensus alignments of
      those sequences, as described above.

   -d database
      the "database" file contains the description of the database,
      i.e. all the gene names.

   -q integer
      the maximum number of HSPs saved for a query when comparing
      query profiles to database profiles (default: save 1000 HSPs).

   -w format
      the format of the seeds: a string of 0s and 1s. The default
      value is 111111. The length of the string is the length of the
      seed; the number of 1s in the string is the "weight" of the
      seed. "1" means a "match" state, and "0" means a "don't care"
      state.

   -iq integer
      number of initial profiles to import for the query promoter.

   -id integer
      number of initial profiles to import for the database promoters.

2) Alphabet options

   -a filename: file containing the alphabet and normalization
      information.

      Each line contains a letter (a symbol in the alphabet) followed
      by an optional normalization number (default: 1.0). The
      normalization is based on the relative prior probabilities of
      the letters. For nucleic acids, this might be the genomic
      frequency of the bases; however, if the "-d" option is not used,
      the frequencies observed in your own sequence data are used.
      In nucleic acid alphabets, a letter and its complement appear on
      the same line, separated by a colon (a letter can be its own
      complement, e.g. when using a dimer alphabet). Complementary
      letters may use the same normalization number. Only the standard
      26 letters are permissible; however, when the "-CS" option is
      used, the alphabet is case sensitive so that a total of 52
      different characters are possible.

      POSSIBLE LINE FORMATS WITHOUT COMPLEMENTARY LETTERS:
         letter
         letter normalization

      POSSIBLE LINE FORMATS WITH COMPLEMENTARY LETTERS:
         letter:complement
         letter:complement normalization
         letter:complement normalization:complement's_normalization

      Example alphabet file 1:
         A:T
         C:G

      Example alphabet file 2:
         A:T 0.3
         C:G 0.2

4) Options for handling the complement of nucleic acid sequences. The
   3 options in this section are mutually exclusive.

   -c0: ignore the complement (the default option)
   -c1: include both strands as separate sequences
   -c2: include both strands as a single sequence (i.e., orientation
        unknown)

   These options are inherited from Consensus.

5) Algorithm options

   -ez: once turned on, the algorithm runs in a faster but less
      sensitive mode. Use it only when the conservation is very high.

   -s integer (1->4)
      scoring options. Recommended value: 4.
      The scoring system of the algorithm is the ALLR statistic. To
      increase speed, an approximation of the ALLR substitution table
      is implemented. Scoring options 1, 2, 3, 4 gradually reduce the
      level of approximation. With 4 the final scores are real ALLRs,
      so it is the recommended option, but it is also the slowest.

   -th double
      the threshold for saving an HSP.

   -m integer
      the minimal number of sites/promoters of a motif.

   -u integer (0,1,2)
      options for handling unrecognized characters in the input
      sequences.
      0: Unrecognized characters are errors
      1: Unrecognized characters are discontinuities, but print a
         warning
      2: Unrecognized characters are discontinuities, and print NO
         warning (default)

   -v "Verbose" bit. If turned on, the program reports detailed
      progress status.

   -o (1, 2)
      order of motifs.
      1: order by tollr score (total ALLR)
      2: order by p-value (recommended option)

   -r (1, 2)
      control the shuffling procedure.
      1: shuffle the query
      2: shuffle both the query and the database (recommended option)

6) Output options

   -pf integer [default: 5]
      the number of motifs to print at the end of the program.
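
Before launching a run it can help to confirm that the database
directory matches the layout described in this README. The Python
sketch below is not part of PhyloNet. The function name check_database
and its messages are hypothetical, and it assumes only what is stated
above: a description file listing one unique gene name per line, and
one folder per gene holding gene_name.cons and gene_name.wout.

```python
import os
import tempfile


def check_database(root, description_file="database"):
    """Return a list of problems found in a PhyloNet-style database folder.

    Hypothetical helper: it checks the documented layout only and does not
    inspect the contents of the .cons or .wout files.
    """
    problems = []
    with open(os.path.join(root, description_file)) as handle:
        # One gene name per line; blank lines are skipped (an assumption).
        genes = [line.strip() for line in handle if line.strip()]

    seen = set()
    for gene in genes:
        if gene in seen:
            # Gene names in the description file must be unique.
            problems.append(f"duplicate gene name: {gene}")
            continue
        seen.add(gene)
        for ext in (".cons", ".wout"):
            relative = os.path.join(gene, gene + ext)
            if not os.path.isfile(os.path.join(root, relative)):
                problems.append(f"missing file: {relative}")
    return problems


if __name__ == "__main__":
    # Build a throwaway two-gene database in which bb lacks its alignment.
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "database"), "w") as handle:
            handle.write("aa\nbb\n")
        for gene, exts in (("aa", (".cons", ".wout")), ("bb", (".cons",))):
            os.mkdir(os.path.join(root, gene))
            for ext in exts:
                open(os.path.join(root, gene, gene + ext), "w").close()
        for problem in check_database(root):
            print(problem)
```

An empty result means every gene named in the description file has
both of its files in place. The check says nothing about whether the
files themselves are well formed.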