fasta - scan a protein or DNA sequence library for similar sequences
tfasta - compare a protein sequence to a DNA sequence library, translating the DNA sequence library `on-the-fly'.
lfasta - compare two protein or DNA sequences for local similarity and show the local sequence alignments
plfasta - compare two sequences for local similarity and plot the local sequence alignments
fasta [ -a -b # -c # -d # -f # -g # -l FASTLIBS -r STATFILE -m # -o -p # -Q -s SMATRIX -w # -x "# #" -y # -z -1 ] query-sequence-file library-file [ ktup ]
fasta [-Qabcdfghiklmnoprswxyz] query-file @library-name-file
fasta [-Qabcdfghiklmnoprswxyz] query-file "%PRMVI"
fasta [-abcdglmnoprswxy] - interactive mode
tfasta [-abcdfgkmoprsw3] protein-query-file DNA-library [ ktup ]
lfasta [-afgmnpswx] sequence-file-1 sequence-file-2 [ ktup ]
plfasta [-afgmnpsxv] sequence-file-1 sequence-file-2 [ ktup ]
fasta is used to compare a protein or DNA sequence to all of the entries in a sequence library. For example, fasta can compare a protein sequence to all of the sequences in the NBRF PIR protein sequence database. fasta automatically decides whether the query sequence is DNA or protein by reading the query sequence as protein and determining whether the `amino-acid composition' is more than 85% A+C+G+T. fasta uses an improved version of the rapid sequence comparison algorithm of Lipman and Pearson (Science, (1985) 227:1427); the improved version is described in Pearson and Lipman, Proc. Natl. Acad. Sci. USA, (1988) 85:2444. The program can be invoked either with command line arguments or in interactive mode. The optional third argument, ktup, sets the sensitivity and speed of the search. If ktup=2, similar regions in the two sequences being compared are found by looking at pairs of aligned residues; if ktup=1, single aligned amino acids are examined. ktup can be set to 2 or 1 for protein sequences, or from 1 to 6 for DNA sequences. If ktup is not specified, the default is 2 for proteins and 6 for DNA.
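The composition test and the ktup defaults described above can be sketched in a few lines of Python. This is only an illustration, not the program's actual routine: the function names are hypothetical, "more than 85%" is read as a strict inequality, and non-letter characters are dropped before counting.

```python
def looks_like_dna(sequence: str, threshold: float = 0.85) -> bool:
    """Guess whether a sequence is DNA from its A+C+G+T fraction."""
    # Keep letters only (assumption: blanks, digits, etc. do not count).
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        raise ValueError("empty sequence")
    acgt = sum(1 for c in residues if c in "ACGT")
    # "more than 85%" is interpreted here as a strict inequality.
    return acgt / len(residues) > threshold


def default_ktup(sequence: str) -> int:
    """Default ktup stated in the text: 6 for DNA, 2 for protein."""
    return 6 if looks_like_dna(sequence) else 2
```

For instance, default_ktup("MWRTCGPPYT") gives 2, because only 4 of the 10 residues are A, C, G or T.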
fasta compares a query sequence to a sequence library, which consists of sequence data interspersed with comments (see below). Normally fasta and tfasta search the libraries listed in the file pointed to by the environment variable FASTLIBS. The format of this file is described in the file FASTA.DOC. tfasta compares a protein sequence to a DNA sequence database, translating the DNA sequence library in 6 frames `on-the-fly' (3 frames with the -3 option). The search uses the standard BLOSUM50 scoring matrix and ktup=2 by default. tfasta searches a DNA sequence database in the standard text format described below.
The lfasta and plfasta programs compare two sequences, looking for local sequence similarities. While fasta and tfasta report only the best alignment between the query sequence and the library sequence, lfasta and plfasta report all of the alignments between the two sequences with scores greater than a cut-off value. lfasta shows the actual local alignments between the two sequences and their scores, while plfasta produces a plot of the alignments that looks similar to a `dot-matrix' homology plot. On Unix systems, plfasta generates Tektronix output that can either be displayed on a Tektronix terminal or piped through the tek2ps program for output on a laser printer. On MS-DOS systems, plfasta uses the graphics capabilities of the computer screen together with the *.BGI graphics device drivers supplied by Borland with Turbo `C'.
The fasta programs use a standard text format sequence file. Lines beginning with '>' or ';' are considered comments and ignored. Sequences can be upper or lower case; blanks, tabs and unrecognizable characters are ignored. fasta expects sequences to use the single letter amino acid codes; see protcodes(1). Library files for fasta should have the form shown below.
With version 2.0 of the FASTA program distribution, FASTA, TFASTA, and SSEARCH now provide estimates of statistical significance for library searches. Work by Altschul, Arratia, Karlin, Mott, Waterman, and others (see Altschul et al. (1994) Nature Genetics 6:119 for an excellent review) suggests that local sequence similarity scores follow the extreme value distribution, so that P(s > x) = 1 - exp(-exp(-lambda(x-u))), where u = ln(Kmn)/lambda and m,n are the lengths of the query and library sequence. This formula can be rewritten as: 1 - exp(-Kmn exp(-lambda x)), which shows that the average score for an unrelated library sequence increases with the logarithm of the length of the library sequence. FASTA and SSEARCH use simple linear regression against the log of the library sequence length to calculate a normalized "z-score" with mean 50, regardless of library sequence length, and variance 10. These z-scores can then be used with the extreme value distribution and the Poisson distribution (to account for the fact that each library sequence comparison is an independent test) to calculate the number of library sequences expected to obtain a score greater than or equal to the score obtained in the search. The original idea and routines to do the linear regression on library sequence length were provided by Phil Green, U. Washington. This version of FASTA and SSEARCH uses a slightly different strategy for fitting the data than those originally provided by Dr. Green.
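The two forms of the extreme value formula above can be checked numerically with a small Python sketch. It is not the fitting code used by FASTA or SSEARCH: the function names are hypothetical, K and lambda must be supplied by the caller, and the step from a per-sequence probability to an expected count (library size times p) is one simple reading of the Poisson argument in the text.

```python
import math


def evd_pvalue(x: float, m: int, n: int, K: float, lam: float) -> float:
    """P(s > x) = 1 - exp(-K*m*n*exp(-lambda*x))."""
    return 1.0 - math.exp(-K * m * n * math.exp(-lam * x))


def evd_pvalue_location_form(x: float, m: int, n: int, K: float, lam: float) -> float:
    """The same probability written with u = ln(K*m*n)/lambda."""
    u = math.log(K * m * n) / lam
    return 1.0 - math.exp(-math.exp(-lam * (x - u)))


def expected_hits(p: float, library_size: int) -> float:
    """Expected number of library sequences reaching the score by chance
    (assumption: every comparison is an independent test with probability p)."""
    return library_size * p


def prob_at_least_one(expected: float) -> float:
    """Poisson probability of one or more chance hits, given the expected count."""
    return 1.0 - math.exp(-expected)
```

With illustrative, made-up parameters such as K=0.1 and lam=0.25, both functions return the same probability for any score, and the probability grows with the library sequence length n, which is the length dependence the regression is meant to remove.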
The expected number of sequences is plotted in the histogram using an "*". Since the parameters for the extreme value distribution are not calculated directly from the distribution of similarity scores, the pattern of "*" symbols in the histogram gives a qualitative view of how well the statistical theory fits the similarity scores calculated by FASTA and SSEARCH. For FASTA, if optimized scores are calculated for each sequence in the database (-o option), the agreement between the actual distribution of "z-scores" and the expected distribution based on the length dependence of the score and the extreme value distribution is usually very good. Likewise, the distribution of SSEARCH Smith-Waterman scores typically agrees closely with the actual distribution of "z-scores." The agreement with unoptimized scores, ktup=2, is often not very good, with too many high scoring sequences and too few low scoring sequences compared with the predicted relationship between sequence length and similarity score. In those cases, the expectation values may be overestimates.
The statistical routines assume that the library contains a large sample of unrelated sequences. If this is not the case, then the expectation values are meaningless. Likewise, if there are fewer than 20 sequences in the library, the statistical calculations are not done.
For protein searches, library sequences with E() values < 0.01 for searches of a 10,000 entry protein database are almost always homologous. Frequently sequences with E()-values from 1 - 10 are related as well. Remember, however, that these E() values also reflect differences between the amino acid composition of the query sequence and that of the "average" library sequence. Thus, when searches are done with query sequences with "biased" amino-acid composition, unrelated sequences may have "significant" scores because of sequence bias. The programs below, PRDF and PRSS, can address this problem by calculating similarity scores for random sequences with the same length and amino acid composition.
Unless optimization (-o) is used, E() values for DNA sequences overestimate the significance of the scores that are obtained, and unrelated sequences frequently have E() values < 0.0005. With optimization, the accuracy of the E() values compares favorably with that for protein sequence comparison. This is in part due to the use of more stringent gap penalties for DNA sequence comparison: -16, -4 rather than -12, -2. With the latter penalties, many unrelated sequences appear to have significant similarity. Nevertheless, since protein sequence comparison is much more sensitive, DNA sequence comparison should not be used to identify sequences that encode protein. Even with ktup=6, optimization rarely increases run-times more than 50% with mRNA-size query sequences. Optimization should be used whenever possible.
Similar comments apply to TFASTA, where higher gap penalties (-16, -4) are required for accurate statistical estimates. Because TFASTA produces so many artificial "coding" sequences with atypical amino acid compositions, the statistical estimates with TFASTA are often overestimates. With optimized scores, ktup=1, and gap penalties of -16, -4, unrelated sequences will sometimes have E() values of 0.1. If initn scores are used, unrelated sequences may have E() values < 0.01.
fasta and the other programs can be directed to change the scoring matrix, search parameters, output format, and default search directories by entering options on the command line (preceded by a `-', or a `/' for MS-DOS). All of the options should precede the file name and ktup arguments. Alternatively, these options can be changed by setting environment variables. The options and environment variables are:
MARKX=0     MARKX=1     MARKX=2     MARKX=3     MARKX=4
MWRTCGPPYT  MWRTCGPPYT  MWRTCGPPYT  MWRTCGPPYT
::..:: :::    xx  X     ..KS..Y...  MWKSCGYPYT  ----------
MWKSCGYPYT  MWKSCGYPYT
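The first three MARKX styles in the table can be reproduced with a short Python sketch. It rests on assumptions read off the example itself: MARKX=0 marks identities with ':' and conservative replacements with '.', MARKX=1 marks conservative and non-conservative differences with 'x' and 'X', and MARKX=2 shows '.' for identities and the library residue otherwise. The CONSERVATIVE set is hypothetical and holds only the two pairs needed here; the programs themselves would rely on the scoring matrix.

```python
# Hypothetical set of "conservative" pairs, chosen only to match the table.
CONSERVATIVE = {frozenset("RK"), frozenset("TS")}


def is_conservative(a: str, b: str) -> bool:
    """True if a and b differ but form a listed conservative pair."""
    return frozenset((a, b)) in CONSERVATIVE


def markx0(query: str, library: str) -> str:
    """':' for identity, '.' for conservative replacement, ' ' otherwise."""
    return "".join(
        ":" if q == s else "." if is_conservative(q, s) else " "
        for q, s in zip(query, library)
    )


def markx1(query: str, library: str) -> str:
    """' ' for identity, 'x' for conservative, 'X' for other differences."""
    return "".join(
        " " if q == s else "x" if is_conservative(q, s) else "X"
        for q, s in zip(query, library)
    )


def markx2(query: str, library: str) -> str:
    """'.' for identity, otherwise the library residue itself."""
    return "".join("." if q == s else s for q, s in zip(query, library))
```

Calling these with "MWRTCGPPYT" and "MWKSCGYPYT" yields the middle rows of the first three columns.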
(1) fasta musplfm.aa $AABANK
Compare the amino acid sequence in the file musplfm.aa with the complete PIR protein sequence library using ktup=2. Each "library" sequence (there need only be one) should start with a comment line which starts with a '>', e.g.
>LCBO bovine preprolactin
WILLLSQ ...
>LCHU human ...
...
(2) fasta -a -w 80 musplfm.aa lcbo.aa 1
Compare the amino acid sequence in the file musplfm.aa with the sequences in the file lcbo.aa using ktup=1. Show both sequences in their entirety, with 80 residues on each output line.
(3) fasta
Run the fasta program in interactive mode. The program will prompt for the file name for the query sequence, list alternative libraries to be searched (if FASTLIBS is set), and prompt for the ktup.
This version of fasta prompts for the library file to be searched from a list of file names that are saved in the file pointed to by the environment variable FASTLIBS. If FASTLIBS = fastgb.list, then the file fastgb.list might have the entries:
NBRF Protein$0P/u/lib/aabank.lib 0
GB Primate$1P@/u/lib/gpri.nam
GB Rodent$1R@/u/lib/grod.nam
GB Mammal$1M@/u/lib/gmammal.nam
Each line in this file has 4 fields: (1) the library name, separated from the remaining fields by a '$'; (2) a 0 or a 1 indicating protein or DNA library respectively; (3) a single letter that will be used to choose the library; (4) the location of the library file itself (the library file name can contain an optional library format specifier). fasta recognizes the following library formats:
</usr/slib/genbank (the directory for the library files)
>glocus.idx (index file for GENBANK binary files)
gpri1.seq 9
gpri2.seq 9
gpri3.seq 9
...
grod1.seq 9
...
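The four-field FASTLIBS entry format described above can be split apart as in the following Python sketch. The names LibraryEntry and parse_fastlibs_line are hypothetical, and two points are assumptions rather than documented behavior: a leading '@' on the location is taken to mark a file of library file names (by analogy with the @library-name-file form in the synopsis), and a trailing space-separated token is taken to be the optional format specifier.

```python
from typing import NamedTuple, Optional


class LibraryEntry(NamedTuple):
    name: str                   # field 1: text before the '$'
    is_dna: bool                # field 2: '0' = protein, '1' = DNA
    key: str                    # field 3: single selection letter
    location: str               # field 4: path, without any leading '@'
    is_name_file: bool          # assumption: '@' marks a list of files
    format_code: Optional[str]  # assumption: optional trailing token


def parse_fastlibs_line(line: str) -> LibraryEntry:
    """Split one FASTLIBS entry into its fields."""
    name, sep, rest = line.rstrip("\n").partition("$")
    if not sep or len(rest) < 3 or rest[0] not in "01":
        raise ValueError(f"not a FASTLIBS entry: {line!r}")
    parts = rest[2:].split()
    if not parts:
        raise ValueError(f"missing library location: {line!r}")
    location = parts[0]
    is_name_file = location.startswith("@")
    if is_name_file:
        location = location[1:]
    format_code = parts[1] if len(parts) > 1 else None
    return LibraryEntry(name, rest[0] == "1", rest[1], location,
                        is_name_file, format_code)
```

Applied to the first sample entry, this gives the name "NBRF Protein", a protein library selected with "P", the location /u/lib/aabank.lib, and format code "0".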
This version of fasta can also distinguish between normal text library files (as shown above in EXAMPLE (2)), and DNA libraries in the GENBANK compressed floppy disk format. These latter files are binary files that are distributed by Intelligenetics on floppy disks. Earlier versions of fasta (and fastn before it) used different programs to read the text library files (old fasta or ifastn) and the compressed files (old fastgb and gfastn). These routines have been combined in the current fasta.
You can use your own sequence files for fasta; just be certain to put a '>' and comment as the first line before the sequence. Only one library file type, the standard NBRF library format, is supported by the VAX/VMS programs. lfasta and plfasta do not require the '>' and comment line; fasta does.
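A reader for the standard text library format might look like the following Python sketch. It is an illustration rather than the programs' own reader: the function name read_library is hypothetical, "unrecognizable characters" is approximated as anything that is not a letter, and sequences are folded to upper case for convenience.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_library(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (comment, sequence) pairs from a '>'-delimited text library."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        if line.startswith(">"):
            # A '>' line closes the previous entry and opens a new one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif line.startswith(";"):
            continue  # ';' lines are comments and carry no sequence data
        elif header is not None:
            # Assumption: keep letters only; blanks, tabs, digits are dropped.
            chunks.append("".join(c for c in line if c.isalpha()).upper())
    if header is not None:
        yield header, "".join(chunks)
```

It accepts any iterable of lines, so an open file object can be passed directly, for example list(read_library(open("lcbo.aa"))).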
AUTHOR: William R. Pearson
Department of Biochemistry
Box 440, Jordan Hall
U. of Virginia
Charlottesville, VA
wrp@virginia.EDU
"As always, please inform me of bugs as soon as possible."
This HTML document was originally created by Tod M. Klingler, Stanford University