MotifRegressor Release 1. April 2003 A DNA sequence motif finding program MotifRegressor is free to academia. Please do not distribute the executable of the program to others without a license or consent of the owners. MotifRegressor requires Splus Version 5.1 or higher. MotifRegressor is available for Sun and Linux systems. MotifRegressor is a program written in the Perl programming language that automates the motif finding program "MDmodule" and the regression program "regressExpr.spl". It finds a group of motifs that are involved in the regulation of genes in one microarray experiment. MDmodule is provided as an executable program. MotifRegressor.pl and regressExpr.spl are provided as source code. Users can change this source code to specific needs, but cannot license any edited code for sale. We provide code as is, and will not add additional features. MDmodule improves on the program MDscan. MDscan is very useful for a group of sequences that might contain a transcription factor binding motif, if biologists are more confident about a subgroup of the sequences. The criteria for the "subgroup of sequences" is: 1. They are more likely to be the TF's binding targets 2. TF binding sites appear more frequently here than the rest of the sequences MDscan is mostly useful for two cases: 1. For ChIP-array (ChIP-on-chip) experiments. E.g. Biologists identified 200 TF targets, but they are most confident about the top 50. In this case, the input sequence will have all 200 target sequences, with the best 50 in the front, and specify to search for candidate motifs from the top 50 sequences first. 2. For gene expression experiments. E.g. Under certain condition, biologists observed 300 genes whose expression increase more than 2 fold, among which 35 have expression increase more than 5 fold. In this case, the input sequence will have the upstream sequence of all 300 genes, and the 35 sequences with 5 fold change are in the front. Specify to search for candidate motifs from the top 35 sequences first. The basic strategy of MDscan is to search for motifs from high confident sequences first because in these sequences signal noise ratio is higher. MDmodule improved MDscan in the following ways: 1. If several final motifs share the same consensus, only the one with the highest score is kept. This way, you won't see the same motif coming up multiple times. 2. Once a motifs is found, MDmodule scans the whole input sequence file (say your input have 6000 sequences, you can -t 10 -c 100, and then scan the 6000 sequences using the motifs emerged) for more hits. 3. MDmodule does Monte Carlo simulation to give some statistics on the motif significance. This is very similar to the Monte Carlo simulation used in BioProspector. 4. The motif width does not need to be specified a priori. MotifRegressor combines MDmodule and regression to find motifs that are significantly correlated with either gene expression or ChIP-array data. NOTE: If your input sequence (especially the top ones) have some repeat sequences such as TTTTTTTTTTTTTTT or ACACACACACACACACACACA, the program will fail. Actually the program will converge on these kind of non-sense motifs. So, you should remove these simple repeats before you run the program. A sequence cleaning procedure named RepeatMasker (Smit, AFA & Green, P) is available from the University of Washington at: http://ftp.genome.washington.edu/RM/RepeatMasker.html Synopsis There are 2 methods for running MotifRegressor. a) Usage: ./MotifRegressor.pl Follow onscreen prompts b) Usage: ./MotifRegressor.pl (options) This option requires 17 input parameters. 1) File containing expression data (e.g. yeast.txt). This file is in the format: Column 0: gene name Column 1 and higher: gene expression 2) File containing upstream sequence of all the spots on the array (e.g. yeast.seq). The gene names must match those in 1) Right now, the program recognizes FASTA format 3) File containing the background sequence distribution (e.g. yeast.int, specify "null" if unknown). If there is no background file specified, the background values will be calculated based on the input sequences, and a file named "bgfreq" is created. Included in the package is the pre-computed yeast intergenic sequence distribution file "yeast.int". 4) The column from 1) that is used to rank all of the sequences (e.g. 15). This can be either expression values or p-values of expression values. 5) The column from 1) that is used to perform regression (e.g. 15). The column with gene names is column 0. Typically, 4) and 5) are the same and it is the column of expression values. An exception is when ranking by p-value but performing regression based on expression. 6) The format of the data in 5): 0 - fold change; 1 - natural log; 2 - base 2 log; 3 - base 10 log 7) Where do you look for motifs in 5): 1 - high values; -1 - low values; 0 - both high and low values 8) Do you use count or value to define top sequences and sequences to confirm motifs? When using count, the top e.g. 100 genes are used to find motifs, and the top e.g. 500 genes are used to confirm the motifs. When using value, the genes with expression fold-change greater than e.g. 2.0 will be used to find motifs, and genes with expression fold-change greater than e.g. 1.5 will be used to confirm motifs. 0 - count; 1 - value 9) This field must be entered even if low values (-1) was specified in 7). If low values (-1) was specified in 7), enter any value here. It will be ignored, but it is necessary. For "count" specified in 8): enter the number of sequences with high values that are defined as top. Limits are [10, 200] (e.g. 50) For "value" specified in 8): enter the value threshold for high values that are defined as top (e.g. 2.0) 10) This field must be entered even if low values (-1) was specified in 7). If low values (-1) was specified in 7), enter any value here. It will be ignored, but it is necessary. For "count" specified in 8): enter the number of sequences with high values that are defined as confirm. Limits are [30, 1000] (e.g. 250) For "value" specified in 8): enter the value threshold in high values that are defined as confirm (e.g. 1.5) 11) This field must be entered even if high values (1) was specified in 7). If high values (1) was specified in 7), enter any value here. It will be ignored, but it is necessary. For "count" specified in 8): enter the number of sequences with low values that are defined as top. Limits are [10, 200] (e.g. 50) For "value" specified in 8): enter the value threshold in low values that are defined as top (e.g. 0.7) 12) This field must be entered even if high values (1) was specified in 7). If high values (1) was specified in 7), enter any value here. It will be ignored, but it is necessary. For "count" specified in 8): enter the number of sequences with low values that are defined as confirm. Limits are [30, 1000] (e.g. 250) For "value" specified in 8): enter the value threshold in low values that are defined as confirm (e.g. 0.5) 13) What is the maximum motif width? The limits are [5, 20] (e.g. 15) 14) What is the minimum motif width? The limits are [5, 20] (e.g. 5) 15) How many seed candidate motifs to find? The limits are [10, 150] (e.g. 50) 16) How many motifs to report before regression? The limits are [3, 100] (e.g. 30) 17) Please give this experiment a name ([a-zA-Z0-9], e.g. CellCycle) Example 1 (this example uses the files "Example1Expression.txt" and "Example1Sequence.txt" that are included in the package) ./MotifRegressor.pl Example1Expression.txt Example1Sequence.txt null 1 1 2 0 0 50 100 40 80 5 7 50 30 Example1 ==> Find motifs from the file "Example1Expression.txt" that contains the expression data and the file "Example1Sequence.txt" that contains the sequence data. No background file is specified, so the background values are calculated from the input sequence data, and a file named "bgfreq" is created. The motifs will be widths from 5 to 7, ranked by column 1 of "Example1Expression.txt" and with expression performed on column 1 of "Example1Expression.txt", which contains the expression fold-change values. The format of the data is base 2 log fold-change values. This looks for motifs in both the high and low-value genes. This uses counts to define the top and confirming sequences. For high values, this uses the top 50 genes to find motifs, and the top 100 genes to confirm the motifs. For low values, this uses the top 40 genes to find motifs, and the top 80 genes to confirm the motifs. The program finds 50 seed candidate motifs for each width, and reports 30 motifs before regression. The output files begin with "Example1". Example 2 ./MotifRegressor.pl yeast.txt yeast.seq yeast.int 20 21 2 1 1 2.0 1.5 0 0 10 20 40 20 Example2 ==> Find motifs from the file "yeast.txt" that contains the expression data and the file "yeast.seq" that contains the sequence data, with the background sequence distribution based on the file "yeast.int" (included in the package). The motifs will be widths from 10 to 20, ranked by column 20 of "yeast.txt" which contains the p-values of the expression values, and with expression performed on column 21 of "yeast.txt", which contains the expression fold-change values. The format of column 21 of "yeast.txt" is base 2 log fold-change values. This looks for motifs in the high-value genes only. This uses values to define the top and confirming sequences. For high values, the top genes are those with log base 2 fold-change in expression greater than 2.0, and the confirming genes are those with log base 2 fold-change in expression greater than 1.5. For low values, the 2 options are specified as 0 but not used. The program finds 50 seed candidate motifs for each width, and reports 30 motifs before regression. The output files begin with "Example2". Output format 6 Files for experiment name "Example1": 1) Example1.txt This contains a summary of the input files and parameters. 2) Example1.allmtfs This contains the motif matrices of all motifs produced by MDmodule, with the following format. a) Motif name, Motif score (see MDscan paper, referenced below), Motif z-score if a Monte Carlo simulation is performed, average probability of the most probable base at each position of the motif, number of aligned segments, Consensus, Reverse Consensus. b) Motif probability matrix (one line per motif column), Consensus, Reverse Consensus, Degenerate, Reverse Degenerate, all in TRANSFAC format. 3) Example1.data This contains the sequence scores for each gene for each motif. 4) Example1.goodmtf This contains a list of all motif matrices that are significant in multiple regression. Note that the motifs do not appear in the order of significance. See file format of "2) Example.allmtfs" above. 5) Example1.single This file contains all motifs that are significant in simple linear regression, with the following fields. The motifs appear in the order of significance. a) motif consensus b) reverse motif consensus c) motif name: e.g. Motif.P147.5.11 "P" if found from high value genes "N" if found from low value genes "147" indicates the motif was based on data in column 147 of the expression data file "5" indicates the motif is width 5 "11" indicates it is the 11th motif of width 5 d) F-value This is the F-value from the simple linear regression e) P-value This is the p-value from the simple linear regression f) S-Coeff This is the simple linear regression coefficient value g) R-sq This is the r-squared value for the simple linear regression 6) Example1.multi This file contains all motifs that are significant in multiple regression, with the following fields. The motifs appear in the order of significance. a) motif consensus b) reverse motif consensus c) motif name (see 6) above) d) T-value This is the T-value from multiple regression e) P-value This is the p-value from multiple regression f) M-Coeff This is the multiple regression coefficient value g) S-Coeff This is the simple linear regression coefficient value h) M-r-sq This is the r-squared value for multiple regression i) S-r-sq This is the r-squared value for the simple linear regression Description of options for MDmodule -w -t -c You can have a input with 6000 sequences, but still look for candidate motif from top 50 sequences (-t 50), refine the motif with top 250 sequences (-c 250) and ignore the rest. -e If you expect to see one site every 1000 bases, then you can specify -e 1000. Of course, most of the time we don't know how frequent the motif occurs (we don't even know what it is), in which case don't specify -e at all. -f Precomputed background distribution (to speed up the program), which can be obtained by running the included program genomebg. Included are pre-computed yeast whole genome and yeast intergenic sequence distribution. To run genomebg, type: ./genomebg -i inputSequenceFile -o outputDistributionFile InputSequenceFile contains whole genome (or just intergenic) sequences in restricted FASTA format. The outputDistributionFile is the one you can use on MotifRegressor by specifying -f. -b If you want to use another sequence file which contains sequences that represent the background. Should be the same format as seqfile. -s Many candidate motifs will be generated, and only the good ones will be kept for refinement step. This number specifies how many the program should keep for refinement. -r After refinement, the best motifs are reported to the user as a TF binding motifs. This number specify how many final motifs will be reported. -n The candidate motif refinement step is done by iterations, and the motif usually converges within 10 iterations. If you want it to run longer till convergence, then you can specify another number. -o -g 1 During the run, the program prints out messages once in a while to report the progress of the program. If you don't want to see this (e.g. if you are running the program in the background), then specify -g 1. References Liu XS, Brutlag DL, Liu JS. An algorithm for finding protein-DNA binding sites with applications to chromatin immunoprecipitation microarray experiments. Nat Biotechnol 2002, 20:835-839. Conlon EM, Liu XS, Lieb JD, Liu JS. Integrating regulatory motif discovery and genome-wide expression analysis. PNAS 2003, 100:3339-3344. For questions, bug report, and request for source code, please contact X. Shirley Liu at xsliu@jimmy.harvard.edu or Erin M. Conlon at econlon@fas.harvard.edu