Software Tools :: Motifs, Patterns and Profiles
The programs in this section let you look for simple motifs that can be
expressed in terms of a text expression (for example, TATAAT), simple
patterns ("the amino acids DLL and the amino acids PD, separated by 10 amino
acids of any type"), patterns characteristic of known functional regions (signal
peptide cleavage sites, helix-turn-helix motifs), and complex
patterns that can be adequately expressed only by means of weight matrices or
profiles. You can search a single sequence or a database of sequences to find
the locations of motifs, or you can compare a sequence to a database of
motifs to see which motifs are present in the sequence.
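
As a rough illustration of the simplest case, the sketch below looks for an exact text motif such as TATAAT, first in one sequence and then in a small "database". The function names, the example sequences, and the dictionary standing in for a database are all hypothetical; real tools read sequence files and usually allow mismatches.

```python
def find_motif(sequence, motif):
    """Return the 1-based start position of every exact match (overlaps allowed)."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start + 1)          # report 1-based coordinates
        start = sequence.find(motif, start + 1)
    return positions


def scan_database(database, motif):
    """Search each named sequence and keep only those that contain the motif."""
    hits = {}
    for name, sequence in database.items():
        positions = find_motif(sequence, motif)
        if positions:
            hits[name] = positions
    return hits


# Hypothetical example data, not taken from any real database.
dna = "GGCTATAATGCGTATAATCC"
toy_database = {"seqA": dna, "seqB": "GGGGCCCC", "seqC": "TATAATGG"}

print(find_motif(dna, "TATAAT"))
print(scan_database(toy_database, "TATAAT"))
```

The reverse question (which known motifs occur in a given sequence) amounts to the same loop run over a collection of motifs instead of a collection of sequences.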
There are two commonly used "languages" for expressing simple patterns:
regular expressions (derived from computer programs used to process text data)
and PROSITE format (the pattern language used by the PROSITE Dictionary of
Protein Sites and Patterns). Either lets you express ambiguity at a
given position and specify that parts of a motif may be separated by a variable
number of residues. For example, you can use either language to describe a
pattern consisting of an acidic amino acid (D or E) followed by two hydrophobic
amino acids (selected from L, I, V, M, or F) followed by between 10 and 12
amino acids of any type, followed by a small amino acid (P, A, or G).
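
The pattern just described can be written in both notations. The sketch below gives one possible spelling of each and a small converter that handles only a subset of PROSITE syntax; the converter name `prosite_to_regex` and the test sequence are made up for illustration, and the code is not the parser used by any particular program.

```python
import re

# The example pattern in the two notations.
PROSITE_PATTERN = "[DE]-[LIVMF](2)-x(10,12)-[PAG]"
REGEX_PATTERN = "[DE][LIVMF]{2}.{10,12}[PAG]"


def prosite_to_regex(pattern):
    """Translate a subset of PROSITE pattern syntax into a Python regex string.

    Handled: '-' separators, x (any residue), [..] allowed sets,
    {..} forbidden sets, (n) and (n,m) repeats, '<' and '>' anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # forbidden residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence ends
        parts.append(element)
    return "".join(parts)


# Hypothetical protein fragment: DLL, ten serines, then P.
protein = "MKDLL" + "S" * 10 + "PKK"
match = re.search(prosite_to_regex(PROSITE_PATTERN), protein)
if match:
    print(match.start() + 1, match.group())   # 1-based start and matched text
```

The translation order matters: forbidden sets in braces are rewritten before repeat counts, because the regex form of a repeat count also uses braces.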
A number of databases exist that contain known patterns of this simple
type. The best known are the PROSITE and PRINTS databases of protein
motifs and the TRANSFAC database of transcription factor binding sites.
Next in complexity are motifs that can't be expressed easily by simple
textual patterns. Specialized programs exist for finding regions in sequences
that have patterns characteristic of known functional domains, such as signal
sequence cleavage sites, CpG islands, and MAR/SAR sites.
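
To give a feel for why such regions need a dedicated program rather than a text pattern, here is a minimal sketch of the statistics often used to flag CpG islands: GC content and the ratio of observed to expected CpG dinucleotides in a window. The function names are hypothetical, and the default thresholds (GC content above 0.5, ratio above 0.6) are commonly quoted values used here as assumptions, not the settings of any specific program.

```python
def cpg_stats(window):
    """Return (GC content, observed/expected CpG ratio) for one DNA window."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")
    gc_content = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C count * G count) / length.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(window, min_gc=0.5, min_ratio=0.6):
    """Apply assumed thresholds to one window; real tools also require a minimum length."""
    gc_content, obs_exp = cpg_stats(window)
    return gc_content > min_gc and obs_exp > min_ratio


print(cpg_stats("CGCGCGCG"))
print(looks_like_cpg_island("ATATATAT"))
```

A real search would slide this window along the sequence and merge adjacent windows that pass.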
The most complex and sensitive methods for expressing patterns are profile
methods. Profiles are derived from regions of aligned sequences that contain
conserved residues. The simplest form is a weight matrix (frequency matrix);
more complex forms are Gribskov profiles and hidden Markov model (HMM)
profiles. Databases exist that contain Gribskov or HMM profiles characteristic
of known protein families; the most useful of these is the Pfam (Protein
Families) database of HMM profiles.
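
The simplest profile, a weight matrix, can be sketched in a few lines. The example below counts residues in each column of a small made-up DNA alignment, converts the frequencies to log-odds scores against a uniform background, and scans a sequence for the best-scoring window. The function names, the pseudocount of 1, the uniform background, and the example data are all assumptions chosen for illustration; Gribskov and HMM profiles add position-specific gap handling that this sketch does not attempt.

```python
import math


def build_weight_matrix(aligned, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds weight matrix (one dict per column) from aligned sequences."""
    background = 1.0 / len(alphabet)          # assumed uniform background
    matrix = []
    for col in range(len(aligned[0])):
        column = [seq[col] for seq in aligned]
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for letter in alphabet:
            freq = (column.count(letter) + pseudocount) / total
            scores[letter] = math.log2(freq / background)
        matrix.append(scores)
    return matrix


def best_hit(sequence, matrix):
    """Return (1-based start, score) of the highest-scoring window, or None."""
    width = len(matrix)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(matrix[i][sequence[start + i]] for i in range(width))
        if best is None or score > best[1]:
            best = (start + 1, score)
    return best


# Hypothetical aligned sites and target sequence.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATATT"]
matrix = build_weight_matrix(sites)
print(best_hit("GGCTATAATGC", matrix))
```

Unlike a text pattern, the matrix scores every window, so a site that differs from the consensus at one position still receives a high score instead of being missed outright.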
There are also programs that will examine a set of unaligned sequences to
see if they contain any motifs in common. These programs are computationally
intensive and can run for hours if they are examining a large set of
sequences.
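
A deliberately naive sketch of the idea is shown below: it lists every exact substring of a fixed length that occurs in all of the input sequences. The function name and the example sequences are hypothetical, and this exhaustive exact-match approach is a simplification; practical motif-discovery programs tolerate mismatches and use statistical models, which is a large part of why they take so long on big inputs.

```python
def shared_kmers(sequences, k):
    """Return the sorted substrings of length k that occur in every sequence."""
    common = None
    for seq in sequences:
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        common = kmers if common is None else common & kmers
    return sorted(common or [])


# Hypothetical unaligned sequences that share one hexamer.
unaligned = ["GGTATAATCC", "ATATAATG", "CCCTATAATA"]
print(shared_kmers(unaligned, 6))
```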