Software Tools :: Motifs, Patterns and Profiles

The programs in this section let you look for simple motifs that can be expressed in terms of a text expression (for example, TATAAT), simple patterns ("the amino acids DLL and the amino acids PD, separated by 10 amino acids of any type"), patterns characteristic of known functional regions (signal peptide cleavage sites, helix-turn-helix motifs), and complex patterns that can be adequately expressed only by means of weight matrices or profiles. You can search a single sequence or a database of sequences to find the locations of motifs, or you can compare a sequence to a database of motifs to see which motifs are present in the sequence.
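As a rough illustration of the simplest case, the sketch below locates the literal motif TATAAT in a DNA string and the spaced pattern "DLL, then 10 residues of any type, then PD" in a protein string. It uses Python's standard re module rather than any program from this section; the function name find_motif and both example sequences are made up for illustration.

```python
import re

def find_motif(sequence, motif):
    """Return the 0-based start of every match of a literal motif,
    including matches that overlap one another."""
    # Wrapping the motif in a lookahead makes each match zero-width,
    # so the scan can report overlapping occurrences.
    pattern = re.compile("(?=" + re.escape(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence)]

# Hypothetical DNA sequence containing two copies of TATAAT.
dna = "GGTATAATCCTATAATG"
dna_hits = find_motif(dna, "TATAAT")

# Hypothetical protein: DLL, exactly 10 residues of any type, then PD.
protein = "MK" + "DLL" + "A" * 10 + "PD" + "K"
spaced = re.search(r"DLL.{10}PD", protein)
spaced_start = spaced.start() if spaced else None
```

Positions are 0-based here; many sequence analysis programs report 1-based coordinates instead.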

There are two commonly used "languages" for expressing simple patterns: regular expressions (derived from computer programs used to process text data) and PROSITE format (the pattern language used by the PROSITE Dictionary of Protein Sites and Patterns). Either will allow you to express ambiguity at a given position and specify that specific motifs may be separated by a variable number of residues. For example, you can use either language to describe a pattern consisting of an acidic amino acid (D or E) followed by two hydrophobic amino acids (selected from L, I, V, M, or F) followed by between 10 and 12 amino acids of any type, followed by a small amino acid (P, A, or G).
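The example pattern above can be written in PROSITE format as [DE]-[LIVMF](2)-x(10,12)-[PAG] and as the regular expression [DE][LIVMF]{2}.{10,12}[PAG]. The sketch below converts between the two for a small subset of the PROSITE syntax. The function prosite_to_regex is a hypothetical helper written for this illustration, not part of any particular package, and it deliberately ignores the anchors (< and >) and other less common features of the format.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression.

    Supported subset (an assumption of this sketch): single residues,
    x for any residue, [..] for allowed residues, {..} for forbidden
    residues, and repeat counts written as (n) or (n,m).
    """
    parts = []
    # Elements are separated by hyphens; a trailing period is optional.
    for element in pattern.rstrip(".").split("-"):
        # Split the element into its core and an optional repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(core + repeat)
    return "".join(parts)

regex = prosite_to_regex("[DE]-[LIVMF](2)-x(10,12)-[PAG]")

# Hypothetical protein: D, then L and V, then 11 residues, then G.
protein = "MK" + "DLV" + "S" * 11 + "G" + "K"
hit = re.search(regex, protein)
```

Here the 11-residue spacer falls inside the allowed range of 10 to 12, so the pattern matches starting at the D.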

Several databases contain known patterns of this simple type. The best known are the PROSITE and PRINTS databases of protein motifs and the TRANSFAC database of transcription factor binding sites.

Next in complexity are motifs that can't be expressed easily by simple textual patterns. Specialized programs exist for finding regions in sequences that have patterns characteristic of known functional domains, such as signal sequence cleavage sites, CpG islands, and MAR/SAR sites.
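CpG islands are a good example of why a textual pattern is not enough: they are recognized by composition statistics over a window rather than by a fixed string. The sketch below computes the two statistics most often used for this, GC content and the ratio of observed to expected CpG dinucleotides. The function names and the thresholds (GC content above 0.5, ratio above 0.6) are assumptions based on a widely used rule of thumb, not the settings of any specific program in this section, and real CpG island finders also apply a minimum length.

```python
def cpg_stats(window):
    """Return (GC content, observed/expected CpG ratio) for a DNA window."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")
    gc_content = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C count * G count) / length.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp

def looks_like_cpg_island(window, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed thresholds to one window (length is not checked)."""
    gc_content, obs_exp = cpg_stats(window)
    return gc_content > min_gc and obs_exp > min_ratio

# Toy windows, far shorter than a real island, used only to show the arithmetic.
rich = "CGCGCGAT"
poor = "ATATCATG"
rich_stats = cpg_stats(rich)
```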

The most complex and sensitive methods for expressing patterns are profile methods. Profiles are derived from regions of aligned sequences that contain conserved residues. The simplest form is a weight matrix (frequency matrix); more complex forms are Gribskov profiles and hidden Markov model (HMM) profiles. Databases exist that contain Gribskov or HMM profiles characteristic of known protein families; the most useful of these is the Pfam (Protein Families) database of HMM profiles.
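To make the simplest profile form concrete, the sketch below builds a log-odds weight matrix from a few aligned sites and slides it along a sequence to find the best-scoring window. It is a minimal illustration, not the method of any program listed here: the function names, the pseudocount of 1, the uniform background frequencies, and the example sites are all assumptions, and the input is assumed to contain only the letters A, C, G, and T.

```python
import math

def build_weight_matrix(aligned, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds weight matrix from equal-length aligned sites.

    Returns one dict per alignment column mapping each letter to
    log2(column frequency / background frequency).
    """
    background = 1.0 / len(alphabet)   # assumed uniform background
    matrix = []
    for col in range(len(aligned[0])):
        column = [seq[col] for seq in aligned]
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for letter in alphabet:
            # The pseudocount keeps unseen letters from scoring -infinity.
            freq = (column.count(letter) + pseudocount) / total
            scores[letter] = math.log2(freq / background)
        matrix.append(scores)
    return matrix

def best_hit(sequence, matrix):
    """Return (start, score) of the highest-scoring window, or None."""
    width = len(matrix)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][ch] for i, ch in enumerate(window))
        if best is None or score > best[1]:
            best = (start, score)
    return best

# Hypothetical aligned sites resembling TATAAT.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
pwm = build_weight_matrix(sites)
hit = best_hit("GGCGTATAATGCC", pwm)
```

Unlike a regular expression, the matrix gives every window a score, so near matches such as TATGAT still rank highly instead of being rejected outright.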

There are also programs that will examine a set of unaligned sequences to see if they contain any motifs in common. These programs are computationally intensive and can run for hours if they are examining a large set of sequences.
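The idea can be shown in a deliberately naive form: list every substring of a fixed length that occurs in all of the input sequences. This is only a toy, with a made-up function name and made-up sequences. Real motif discovery programs tolerate mismatches and use statistical models to decide which shared patterns are significant, which is where most of the computing time goes.

```python
def shared_kmers(sequences, k):
    """Return the set of length-k substrings present in every sequence."""
    def kmers(seq):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    common = kmers(sequences[0])
    for seq in sequences[1:]:
        common &= kmers(seq)   # keep only k-mers seen in this sequence too
    return common

# Hypothetical unaligned sequences that share the exact word TTGACA.
seqs = ["ACGTTGACAGG", "TTGACATCC", "GGTTGACAA"]
common = shared_kmers(seqs, 6)
```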

