MatCompare README

Contact
=======

Please address any questions or comments to Dustin Schones at
dschones@cshl.edu

Basic Information
==================

For usage help, execute matcompare with no options or run

    matcompare --help

MatCompare was designed to quantify the similarity between position
frequency matrices (PFMs). The software is described in:

Schones DE, Sumazin P, Zhang MQ, Similarity of position frequency
matrices for transcription factor binding sites. Bioinformatics.
2005 Feb 1;21(3):307-13.

Basic Explanation
==================

The original publication describing the MatCompare software explains how
the similarity between PFMs is quantified using the Fisher contingency
table test and the chi squared approximation of this test. A PFM is built
from a set of binding site occurrences, and two PFMs that model the same
binding site motif will both contain occurrences of the true PFM. This
true PFM can be described via molecular machinery, but in statistical
terms it is made of the entire population of occurrences. The Fisher
contingency test can be used to obtain the exact probability that the
occurrences of 2 PFMs are occurrences of a 3rd PFM, and the chi squared
test normally gives a very good approximation for this probability.

The statistical construct is less practical when the PFMs are made of
very few or very many occurrences, or when the number of counts is
unknown, as in a PWM. When the number of occurrences is small (for
example, less than 5), there is not enough data to reject the hypothesis
that the occurrences of the two PFMs originate from a single PFM. When
the number of occurrences is very large (for example, 500), there is
enough data to reject the hypothesis even when the PFMs are made of
binding sites of the same factor. This is due to the flexibility of
factors and the fact that PFMs are not precise models of factor-DNA
binding. To help overcome these problems we include several options that
were not described in the original publication.
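The sample-size effect described above can be illustrated with a small
sketch. It is not MatCompare's implementation: it is a plain Pearson chi
squared test on a single pair of PFM columns, and the function names
(chi2_column, chi2_sf_df3) and the example counts are hypothetical. It
assumes each column is a list of counts in A, C, G, T order, that neither
column is empty, and that all four nucleotides occur in the pooled counts
so the test has 3 degrees of freedom.

```python
import math


def chi2_column(col_a, col_b):
    """Pearson chi squared statistic for a 2x4 table of nucleotide counts.

    Assumes both columns contain at least one count.
    """
    total_a, total_b = sum(col_a), sum(col_b)
    grand = total_a + total_b
    stat = 0.0
    for a, b in zip(col_a, col_b):
        pooled = a + b
        if pooled == 0:
            continue  # nucleotide absent from both columns
        # Expected counts if both columns were drawn from one distribution.
        exp_a = total_a * pooled / grand
        exp_b = total_b * pooled / grand
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
    return stat


def chi2_sf_df3(x):
    """Upper-tail p-value of the chi squared distribution with 3 degrees
    of freedom (closed form, valid for df = 3 only)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Hypothetical columns with the same proportions at two sample sizes.
few = ([6, 2, 1, 1], [4, 3, 2, 1])               # 10 occurrences each
many = ([300, 100, 50, 50], [200, 150, 100, 50])  # 500 occurrences each

for label, (col_a, col_b) in (("few", few), ("many", many)):
    stat = chi2_column(col_a, col_b)
    print(label, "statistic = %.3f" % stat, "p-value = %.3g" % chi2_sf_df3(stat))
```

With identical proportions, the statistic grows in proportion to the
number of occurrences, so the same modest difference that is
unremarkable at 10 occurrences becomes highly significant at 500.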

Running the Program and Options
================================

Executing

    matcompare -l library motif-file

compares each motif in motif-file to each motif in library (and its
reverse complement) using the Kullback-Leibler divergence. If the two
motifs are of the same length, each column in the first matrix is
compared to the corresponding column in the second matrix. If one of the
motifs is shorter, that motif is compared to all possible starting
positions in the other motif. By default, any divergence less than 1.0
is called a match.

To adjust the divergence threshold, use the -t flag. For example,

    matcompare -t 1.5 -l library motif-file

will output all motif pairs with divergence scores less than 1.5.

To compare motif fragments (allowing overhangs), use the -h flag. For
example,

    matcompare -l library -h 2 motif-file

This will allow each matrix to overhang the matrix it is being compared
to by 2 positions.

To use matcompare in "list mode", use the -L flag, as in

    matcompare -L -l library motif-file

This will only return the names of the matrices compared and their
comparison values. The default mode is "annotation mode", which adds
Match scores for each matrix to the motif-file.

To print results to an output file, use the -o flag, as in

    matcompare -o OutputFile -l library motif-file

This will print results to OutputFile.

Optional Tests
==============

The default test in MatCompare_1.1 is the Kullback-Leibler divergence.
We consider motifs with divergence per column less than 1.0 as very
similar and motifs with divergence per column greater than 1.5 as not
similar. For more information about using the divergence test for PFM
similarity quantification, see:

Smith AD, Sumazin P and Zhang MQ, Identifying tissue-specific
transcription factor binding sites in vertebrate promoters, PNAS 2005.

The other tests available for comparisons are the chi squared test and
the Fisher contingency table test.
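As an illustration of the default divergence comparison (column-wise
comparison, sliding the shorter motif along the longer one, and checking
the reverse complement), here is a minimal sketch. It is not MatCompare's
code. This README does not state the exact form of the divergence, so the
symmetrized Kullback-Leibler divergence, the base-2 logarithm and the
pseudocount of 0.5 are all assumptions, and the function names are
hypothetical. A PFM is represented as a list of columns, each a list of
counts in A, C, G, T order. Overhangs (the -h flag) are not modelled.

```python
import math

PSEUDO = 0.5  # assumed pseudocount, avoids log(0) for unseen nucleotides


def freqs(column):
    """Convert a column of A, C, G, T counts to smoothed frequencies."""
    total = sum(column) + 4 * PSEUDO
    return [(count + PSEUDO) / total for count in column]


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))


def column_divergence(col_a, col_b):
    """Symmetrized KL divergence between two columns (an assumption)."""
    p, q = freqs(col_a), freqs(col_b)
    return 0.5 * (kl(p, q) + kl(q, p))


def reverse_complement(pfm):
    """Reverse the positions and swap A<->T and C<->G in every column."""
    return [column[::-1] for column in reversed(pfm)]


def best_divergence(pfm_a, pfm_b):
    """Lowest per-column divergence over all ungapped placements of the
    shorter motif (and its reverse complement) inside the longer one."""
    short, long_ = sorted((pfm_a, pfm_b), key=len)
    best = math.inf
    for candidate in (short, reverse_complement(short)):
        for start in range(len(long_) - len(short) + 1):
            window = long_[start:start + len(short)]
            total = sum(column_divergence(x, y) for x, y in zip(candidate, window))
            best = min(best, total / len(short))
    return best


# Hypothetical motifs: the second contains the first plus flanking columns.
motif = [[8, 1, 0, 1], [0, 9, 1, 0], [1, 0, 0, 9]]
library_motif = [[3, 3, 2, 2]] + motif + [[0, 0, 0, 10]]
print(best_divergence(motif, library_motif))
```

Under these assumptions, a pair would be reported as a match when the
returned value falls below the chosen threshold, mirroring the role of
the -t flag.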

To use the chi squared test, use the -C flag:

    matcompare -C -t 0 -l library motif-file

When comparing matrices with very few counts, we recommend using the
Fisher contingency table test with the -F flag:

    matcompare -F -t 0 -l library motif-file

When using the chi squared or Fisher contingency table tests, remember
to set the threshold, as the default threshold is set for divergence
tests. In the above examples the threshold is set to 0, allowing all
comparisons with p-values greater than 0 to be reported.

To compare module files to a library, use the -M flag:

    matcompare -M -l library module-file

An example of the format for modules can be seen in the example
directory in the file ExampleMod.mod.

Examples
========

To compare the PFM in the file 'ExampleMat.mat' to the library of PFMs
in the file 'ExampleLib.mat':

    matcompare -o output_file -L -t 1 -l ExampleLib.mat ExampleMat.mat

produces the file 'output_file', which contains the comparison:

    TestMat-1 LibMat-2 0

Looking at the matrices in ExampleMat.mat and ExampleLib.mat, one can
see that the PFMs TestMat-1 and LibMat-2 are identical, hence the
divergence value of 0.

~~~~~~~~~~~~~~~

To view the alignments of the matrix pairs that match, use the -a flag,
as in

    matcompare -a AlignmentFile -t 1 -l ExampleLib.mat ExampleMat.mat

The file AlignmentFile displays the alignment between TestMat-1 in
ExampleMat.mat and LibMat-2 in ExampleLib.mat.

Compiling
==========

To compile the matcompare program, type:

    make

in the directory with the source files. If the chi squared test is
desired, the GNU Scientific Library (GSL) must be installed, and when
compiling type:

    make GSL=1

Licensing
=========

matcompare uses the GNU Scientific Library, which is distributed under
the GNU General Public License.
For more information refer to:

    http://www.gnu.org/copyleft/gpl.html

===============================================================
MatCompare -- Copyright (C) 2005 Cold Spring Harbor Laboratory
===============================================================