MDscan
Release 1, February 2003

A DNA sequence motif finding program.

For a group of sequences that might contain a transcription factor (TF) binding motif, MDscan is most useful when biologists are more confident about a subgroup of the sequences. The criteria for this subgroup of sequences are:

1. They are more likely to be the TF's binding targets.
2. TF binding sites appear more frequently in them than in the rest of the sequences.

MDscan is mostly useful in two cases:

1. ChIP-array (ChIP-on-chip) experiments. E.g. biologists identified 200 TF targets, but they are most confident about the top 50. In this case, the input sequence file will have all 200 target sequences, with the best 50 at the front, and you specify that candidate motifs be searched for in the top 50 sequences first.

2. Gene expression experiments. E.g. under a certain condition, biologists observed 300 genes whose expression increased more than 2-fold, among which 35 increased more than 5-fold. In this case, the input sequence file will have the upstream sequences of all 300 genes, with the 35 sequences showing the 5-fold change at the front. Specify that candidate motifs be searched for in the top 35 sequences first.

The basic strategy of MDscan is to search for motifs in the high-confidence sequences first, because the signal-to-noise ratio is higher in these sequences.

MDscan is free to academia. Please do not distribute the executable of the program to others without a license or the consent of the owners. For details about obtaining this program, please refer to:
http://motif.stanford.edu/distributions/

Synopsis

Usage: ./MDscan -i seqfile (options)

Typing ./MDscan without any parameters prints a short list of how to set the parameters.

Right now, the program only recognizes restricted FASTA format:

    >sequence1 name
    ATGGTGACGAC    (sequence1 as ONE LINE)
    >sequence2 name
    GTAGCCTCATG    (sequence2 as ONE LINE)

The input sequence order matters here: the more confident sequences should be put at the front.
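Sequence files downloaded from databases usually wrap each sequence over several lines, which does not match the restricted FASTA format described above. The following is a minimal sketch, not part of the MDscan distribution, of how such a file could be rewritten so that every sequence occupies one line. The function name to_restricted_fasta is hypothetical. Record order is preserved, so the most confident sequences must already be first in the input.

```python
def to_restricted_fasta(lines):
    """Join each record's sequence lines so every sequence occupies ONE line.

    Hypothetical helper, not part of MDscan. Records keep their input order.
    """
    records = []                      # (header, sequence) pairs in input order
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)       # sequence text belonging to the current record
    if header is not None:
        records.append((header, "".join(chunks)))
    return "".join(f"{h}\n{s}\n" for h, s in records)


example = """>sequence1 name
ATGGTG
ACGAC
>sequence2 name
GTAGCC
TCATG
"""
restricted = to_restricted_fasta(example.splitlines())
print(restricted, end="")
# >sequence1 name
# ATGGTGACGAC
# >sequence2 name
# GTAGCCTCATG
```

To convert a real file, read it with open(path) and pass the file object to the function, then write the returned text to the file given to -i.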
MDscan automatically considers both strands of each sequence.

NOTE: If your input sequences (especially the top ones) contain simple repeats such as TTTTTTTTTTTTTTT or ACACACACACACACACACACA, the program will fail: it will converge on these nonsense motifs. So you should remove these simple repeats before you run the program.

Description of options

-w
    Motif width.

-t, -c
    You can have an input with 6000 sequences, but still look for candidate motifs in the top 50 sequences (-t 50), refine the motifs with the top 250 sequences (-c 250), and ignore the rest.

-e
    If you expect to see one site every 1000 bases, you can specify -e 1000. Of course, most of the time we don't know how frequently the motif occurs (we don't even know what it is), in which case don't specify -e at all.

-f
    Precomputed background distribution (to speed up the program), which can be obtained by running the included program genomebg. Precomputed yeast whole-genome and yeast intergenic sequence distributions are included. To run genomebg, type:

        ./genomebg -i inputSequenceFile -o outputDistributionFile

    inputSequenceFile contains whole-genome (or just intergenic) sequences in restricted FASTA format. outputDistributionFile is the file you then pass with -f.

-b
    Another sequence file containing sequences that represent the background. It should be in the same format as seqfile.

-s
    Many candidate motifs will be generated, and only the good ones will be kept for the refinement step. This number specifies how many the program should keep for refinement.

-r
    After refinement, the best motifs are reported to the user as TF binding motifs. This number specifies how many final motifs will be reported.

-n
    The candidate motif refinement step is done by iterations, and the motif usually converges within 10 iterations. If you want it to run longer until convergence, you can specify another number.
-o
    Output file. With -o out the final motifs are written to the file out; without -o they are written to stdout.

-g 1
    During the run, the program prints messages once in a while to report its progress. If you don't want to see these (e.g. if you are running the program in the background), specify -g 1.

Example 1

    ./MDscan -i inseq -b backseq -o out -g 1 &

==> Find motifs of width 10 from inseq. Use the sequences in backseq to calculate the background distribution. Search for the best 30 candidate motifs from the top 5 sequences first, and refine them with all the sequences in inseq. Report the best 5 motifs to the file out at the end. Run the program in the background, and don't print progress messages.

Example 2

    ./MDscan -i inseq -w 15 -f yeast_all.bg -t 10 -c 80 -r 10 -n 5

==> Find motifs of width 15 from the sequence file inseq, using the yeast genome as the background distribution. Find candidate motifs from the top 10 sequences, and refine them for 5 iterations with the top 80 sequences. Report the final best 10 motifs to stdout, and print progress messages along the way.

Output format

1. M-match calculated from the background distribution (refer to the paper; this is non-essential to biologists and only useful if you want to modify the program).

2. The best -r motifs reported:
   1) Motif number, width, motif score, number of aligned segments, Consensus, and Degenerate.
   2) Motif probability matrix (one line per motif column), Consensus, Reverse Consensus, Degenerate, Reverse Degenerate, all in TRANSFAC format.
   3) Sequence alignment: sequence name, starting position of the alignment (b53 means position 53 in the reverse complement direction; f47 means position 47 in the forward direction), and sequence of the aligned segment.

Reference

Liu XS, Brutlag DL, Liu JS. An algorithm for finding protein-DNA binding sites with applications to chromatin immunoprecipitation microarray experiments. Nat Biotechnol 2002 Aug;20(8):835-9.

For questions, bug reports, and requests for source code, please contact X. Shirley Liu at xsliu@jimmy.harvard.edu.
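When post-processing the sequence alignment part of the output, the position codes described under Output format (f47, b53) have to be split into a strand and a number. The following is a small sketch of that step only. The function name is hypothetical, and the code makes no assumption about the rest of the output layout or about how positions on the reverse strand are counted.

```python
import re


def parse_alignment_position(code):
    """Split a position code such as 'f47' or 'b53' into (strand, position).

    Hypothetical helper: 'f' is read as the forward direction and 'b' as the
    reverse complement direction, as described under Output format.
    """
    match = re.fullmatch(r"([fb])(\d+)", code.strip())
    if match is None:
        raise ValueError(f"unrecognized position code: {code!r}")
    strand = "forward" if match.group(1) == "f" else "reverse complement"
    return strand, int(match.group(2))


print(parse_alignment_position("f47"))   # ('forward', 47)
print(parse_alignment_position("b53"))   # ('reverse complement', 53)
```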
NOTE: We have a newer version of MDscan called MDmodule, which is used in our new paper: Integrating Regulatory Motif Discovery and Genome-wide Expression Analysis (to appear in PNAS). MDmodule improves on MDscan in the following ways:

1. MDmodule automatically deals with simple repeats in the input sequences.
2. If several final motifs share the same consensus, only the one with the highest score is kept. This way, you won't see the same motif coming up multiple times.
3. Once a motif is found, MDmodule scans the whole input sequence file for more hits (say your input has 6000 sequences: you can use -t 10 -c 100, and then scan all 6000 sequences using the motifs that emerged).
4. MDmodule does Monte Carlo simulation to give some statistics on motif significance. This is very similar to the Monte Carlo simulation used in BioProspector.

MDmodule plus its automation and regression code will be available once our paper is published.
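With MDscan itself, simple repeats such as TTTTTTTTTTTTTTT or ACACACACACACACACACACA still have to be found and removed by hand before a run. The following is a minimal sketch of one way to locate them. It is not MDmodule's method, which is not described here. The function name and the thresholds (repeat units of 1 or 2 bases, runs of at least 12 bases) are assumptions to adjust for your data. The function only reports where the repeats are; whether to trim the stretch or drop the sequence is left to you.

```python
import re


def find_simple_repeats(seq, max_unit=2, min_len=12):
    """Return sorted (start, end, unit) tuples for simple repeats in seq.

    A hit is a unit of 1..max_unit bases repeated back to back for at least
    min_len bases. Thresholds are illustrative assumptions, not MDscan values.
    """
    seq = seq.upper()
    hits = []
    for unit_len in range(1, max_unit + 1):
        copies = -(-min_len // unit_len)          # ceiling division
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, copies - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units that are themselves repeats of a shorter unit
            # (e.g. "TT"), since the shorter unit already reports that run.
            if unit in (unit + unit)[1:-1]:
                continue
            hits.append((match.start(), match.end(), unit))
    return sorted(hits)


sequences = {
    "clean": "ATGGTGACGACGTAGCCTCATG",
    "polyT": "ACGA" + "T" * 15 + "GCA",
    "dinuc": "GG" + "AC" * 10 + "ATT",
}
for name, seq in sequences.items():
    print(name, find_simple_repeats(seq))
# clean []
# polyT [(4, 19, 'T')]
# dinuc [(2, 22, 'AC')]
```

The start and end values are 0-based, with end exclusive, so seq[start:end] is the repeat stretch.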