Genetic Methods of Species Identification

How BLAST Works
What's the Chance of That Happening?
Making an Identification using BLAST

Are these Methods Reliable?

BLAST

BLAST stands for Basic Local Alignment Search Tool. Its purpose is to determine whether there is a local region (of either nucleotides or amino acids) where one sequence aligns with another. When BLAST is used in species identification, one of the two sequences is the unknown, and the other is usually drawn from large public databases of genetic sequences. These databases hold millions of sequences, which complicates the problem: where among all of these sequences is there one with a region of good alignment with our unknown sequence? This is not a trivial task, and it is achievable only with computers. Scientists then examine the local alignment to decide whether the unknown sequence may have come from the same species as the sequence from the database.


How BLAST Works

Elsewhere we showed a sequence alignment where the sequences were from the same region and were approximately the same length. But what if one sequence is several thousands or millions of nucleotides long while the other is only a few hundred? That is, can we find where in

. . . AACATTCGAAAGTCCCACCCACTAATAAAAATTGTAAACAATGCATTCATCGAC
CTTCCAGCCCCATCAAACATTTCATCATGATGAAATTTCGGTTCCCTCCTGGGAATCTGC
CTAATCCTACAAATCCTCACAGGCCTATTCCTAGCAATACACTACACATCCGACACAACA
ACAGCATTCTCCTCTGTTACCCATATCTGCCGAGACGTGAACTACGGCTGAATCATCCGA
TACATACACGCAAACGGAGCTTCAATGTTTTTTATCTGCTTATATATGCACGTAGGACGA
GGCTTATATTACGGGTCTTACACTTTTCTAGAAACATGAAATATTGGAGTAATCCTTCTG
CTCACAGTAATAGCCACAGCATTTATAGGATACGTCCTACCATGAGGACAAATATCATTC
TGAGGAGCAACAGTCATCACCAACCTCTTATCAGCAATCCCATACATCGGCACAAATTTA
GTCGAATGAATCTGAGGCGGATTCTCAGTAGACAAAGCAACCCTTACCCGATTCTTCGCT
TTCCATTTTATCCTTCCATTTATCATCATAGCAATTGCCATAGTCCACCTACTATTCCTC
CACGAAACAGGCTCCAACAACCCAACAGGAATTTCCTCAGACGTAGACAAAATCCCATTC
CACCCCTACTATACCATTAAGGACATCTTAGGGGCCCTCTTACTAATTCTAGCTCTAATA
CTACTAGTACTATTCGCACCCGACCTCCTCGGAGACCCAGATAACTACACCCCAGCCAAT
CCACTCAACACACCCCCTCACATCAAACCCGAGTGATACTTCTTATTTGCATACGCAATC
TTACGATCAATCCCCAACAAACTAGGAGGAGTACTAGCCCTAGCCTTCTCTATCCTAATT
CTTGCTCTAATCCCCCTACTACACACCTCCAAACAACGAAGCATAATATTCCGACCACTC
AGCCAATGCCTATTCTGAGCCCTAGTAGCAGACCTACTGACACTCACATGAATTGGAGGA
CAACCAGTCGAACACCCATATATCACCATCGGACAACTAGCATCTGTCCTATACTTTCTC
CTCATCCTAGTGCTAATACCAACGGCCGGCACAATCGAAAACAAATTACTAAAA . . .

that the sequence

ATGAATCTGAGGCGGATTCTCA

occurs?

If we are looking for an exact match, there are many ways to scan the sequence for a particular string of characters. In this case the sequence is found in the following line of the longer sequence:

GTCGAATGAATCTGAGGCGGATTCTCAGTAGACAAAGCAACCCTTACCCGATTCTTCGCT
     ||||||||||||||||||||||
     ATGAATCTGAGGCGGATTCTCA
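
A minimal sketch of exact-match scanning in Python, using the built-in `str.find`. For brevity it searches only the one 60-nucleotide line of the longer sequence that contains the match. The names `REFERENCE_LINE`, `QUERY`, and `find_all` are illustrative and are not part of BLAST.

```python
# One line of the longer sequence shown above (an illustrative subset).
REFERENCE_LINE = "GTCGAATGAATCTGAGGCGGATTCTCAGTAGACAAAGCAACCCTTACCCGATTCTTCGCT"
QUERY = "ATGAATCTGAGGCGGATTCTCA"


def find_all(query, reference):
    """Return the 0-based start position of every exact occurrence of query."""
    positions = []
    start = reference.find(query)
    while start != -1:
        positions.append(start)
        # Resume one position later so overlapping occurrences are also found.
        start = reference.find(query, start + 1)
    return positions


print(find_all(QUERY, REFERENCE_LINE))  # [5]
```

An exact search like this finds nothing as soon as a single nucleotide differs, which is why a more tolerant method is needed.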

However, sequences vary for many reasons, such as technical errors or sequence mutations, so we want to relax the search or alignment criteria a bit so that we can find any of the following highly similar regions:

ATGAATCTGAGGCGGATTCTCA
ATCAATCTGAGGCGATTCTCA
ATGAATATGAGGCGGATTTTCA

So, how does BLAST find those local alignments?

In the first step, BLAST searches the reference for an exact match with a small length of the unknown sequence. So if the minimum length were 11 nucleotides, and our unknown was

ATGAATATGAGGCGGATTTTCA

then BLAST would look for exact matches for each of

ATGAATATGAG
 TGAATATGAGG
  GAATATGAGGC
   AATATGAGGCG
    ATATGAGGCGG
etc.

An exact match will be found for

TGAGGCGGATT
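
The seeding step can be sketched in Python as follows. This is a simplified illustration, not BLAST's actual implementation: it cuts the unknown into overlapping words of a fixed length and looks each one up in an index of the reference. The word size of 11 comes from the text; the function names and the use of a single reference line are assumptions made for the example.

```python
# One line of the reference from the example above, and the unknown sequence.
REFERENCE = "GTCGAATGAATCTGAGGCGGATTCTCAGTAGACAAAGCAACCCTTACCCGATTCTTCGCT"
UNKNOWN = "ATGAATATGAGGCGGATTTTCA"
WORD_SIZE = 11


def words(sequence, size):
    """Return (position, word) for every overlapping word of the given size."""
    return [(i, sequence[i:i + size]) for i in range(len(sequence) - size + 1)]


def seed_hits(query, reference, size):
    """Return (query_pos, reference_pos, word) for every exact word match."""
    # Index the reference once so each query word is a dictionary lookup.
    index = {}
    for ref_pos, word in words(reference, size):
        index.setdefault(word, []).append(ref_pos)
    hits = []
    for query_pos, word in words(query, size):
        for ref_pos in index.get(word, []):
            hits.append((query_pos, ref_pos, word))
    return hits


print(seed_hits(UNKNOWN, REFERENCE, WORD_SIZE))
# [(7, 12, 'TGAGGCGGATT')]
```

Only one of the twelve words in the unknown matches exactly, because every other word overlaps one of the two mismatched positions.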

BLAST then tries to extend the match in either direction. It assesses the quality of the match by calculating an alignment score: a fixed number of points is added for each nucleotide that matches and a fixed number subtracted for each mismatch. BLAST can also introduce gaps, each with its own penalty. As the extension reaches the edge of the region where an alignment is possible, mismatches accumulate and the score falls rapidly. When the score has dropped a set amount below the best score reached so far, the extension stops and the alignment is trimmed back to the best-scoring point. The result is that the local region of alignment has been identified.

In our example, the extension reaches both ends of the unknown sequence, passing two mismatches along the way without the score dropping far enough to stop. In the end, the entire fragment can be aligned with a region of the reference sequence, even though it is not a perfect match.

GTCGAATGAATCTGAGGCGGATTCTCAGTAGACAAAGCAACCCTTACCCGATTCTTCGCT
     |||||| ||||||||||| |||
     ATGAATATGAGGCGGATTTTCA
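
The extension step can be sketched in Python as below. This is a simplified, ungapped version: it does not try to insert gaps, and the scoring values (`MATCH`, `MISMATCH`) and the stopping margin (`X_DROP`) are illustrative assumptions rather than settings taken from this page. The seed position (7 in the unknown, 12 in the reference line) is where the word TGAGGCGGATT occurs in each sequence.

```python
REFERENCE = "GTCGAATGAATCTGAGGCGGATTCTCAGTAGACAAAGCAACCCTTACCCGATTCTTCGCT"
UNKNOWN = "ATGAATATGAGGCGGATTTTCA"

# Illustrative scoring values (assumptions for this sketch).
MATCH = 2
MISMATCH = -3
X_DROP = 5  # stop once the score falls this far below the best seen


def extend(query, ref, q, r, step):
    """Extend one way from (q, r); return (best score gain, length kept)."""
    score = best = 0
    length = best_length = 0
    while 0 <= q < len(query) and 0 <= r < len(ref):
        score += MATCH if query[q] == ref[r] else MISMATCH
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score > X_DROP:
            break  # the score has dropped too far; give up on this direction
        q += step
        r += step
    return best, best_length


def extend_seed(query, ref, q_pos, r_pos, size):
    """Return (score, query_start, query_end, ref_start) after extension."""
    left_gain, left_len = extend(query, ref, q_pos - 1, r_pos - 1, -1)
    right_gain, right_len = extend(query, ref, q_pos + size, r_pos + size, +1)
    score = size * MATCH + left_gain + right_gain
    return score, q_pos - left_len, q_pos + size + right_len, r_pos - left_len


print(extend_seed(UNKNOWN, REFERENCE, 7, 12, 11))  # (34, 0, 22, 5)
```

With these values the alignment covers all 22 nucleotides of the unknown, starting at position 5 of the reference line: 20 matches and 2 mismatches give 20 x 2 - 2 x 3 = 34.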


What's the Chance of That Happening?

But, you ask, couldn't that local alignment have happened by chance? Think "Monkeys at Typewriters", or nowadays, keyboards. If we look at enough DNA sequence over millions of sequences perhaps we will get a match just by chance alone. Does it really mean that these sequences have anything more in common, like ancestry or function?

To answer this question, the expect value (E value) is calculated. This is the number of matches as good as the one you observed that would be expected to occur at random, given the length of your sequence and the total size of the database. Short sequence alignments have a higher chance of occurring at random. Also, the more sequences you compare, the greater the chance of getting a random match.

Strictly speaking, the E value is an expected count rather than a probability, so it ranges from 0 upward and can exceed 1. When it is much smaller than 1, it is very close to the probability of seeing at least one such match by chance. It is usually written in exponential (scientific) notation:
0.0 = a number too small to display
e-15 = 10^-15 = 0.000000000000001
5e-3 = 5 x 10^-3 = 0.005

Alignments that have a very small E value are very unlikely to be due to chance alone, and are probably the result of some biological process, such as evolution.
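
How the E value depends on the score and on the amount of sequence searched can be sketched with the standard formula E = m x n x 2^(-S'), where S' is the bit score, m the query length, and n the total database length. The sketch below is a simplification: BLAST itself applies length corrections that are omitted here, and the function names are illustrative. The second function converts an E value into the probability of at least one chance match, assuming chance matches follow a Poisson distribution.

```python
import math


def e_value_from_bit_score(bit_score, query_length, database_length):
    """Expected number of chance matches scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value_from_e(e_value):
    """Probability of at least one chance match: 1 - exp(-E)."""
    # expm1 keeps precision when e_value is extremely small.
    return -math.expm1(-e_value)


# Doubling the amount of sequence searched doubles the E value.
small_db = e_value_from_bit_score(40, 420, 10**9)
large_db = e_value_from_bit_score(40, 420, 2 * 10**9)
print(small_db, large_db)

# Scientific notation as used in BLAST output.
print(float("5e-3"))          # 0.005
print(p_value_from_e(5e-3))   # slightly below 0.005
```

The query length of 420 matches the example alignment on this page; the database length is a made-up figure chosen only for illustration.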


Making an Identification using BLAST

To use BLAST to make a species identification, you need a reference database of sequences. Although it is possible to set up your own reference database, many scientists use the main public genetic repository, known as GenBank. This database is a joint effort of scientists worldwide, and contains genetic sequences determined for many different reasons by both large and small research groups. It forms the basis of most modern molecular genetic and bioinformatics research, and can be accessed through Web sites managed in the US and in Europe.

To perform a BLAST search:

Go to the BLAST website at www.ncbi.nlm.nih.gov/blast/
Select the nucleotide blast program to run.
Paste your unknown sequence into the Query Sequence box.
Select the Nucleotide collection (nr/nt) database to search.
Click the BLAST button.

After a while you will get a set of formatted results as illustrated below. The hits are listed in order, with the best match at the top.

Accession   Description                                                 Max   Total   Query      E     Max 
                                                                       score  score  coverage  value  ident

EF061237.1  Bos frontalis isolate Dulong 14 cytochrome b (cytb) . . .   776    776     100%     0.0   100%
AY676873.1  Bos taurus isolate 32027 mitochondrion, complete genome     776    776     100%     0.0   100%
DQ186268.1  Bos taurus isolate Liping 3 cytochrome b gene, comp . . .   776    776     100%     0.0   100%

with the details of each alignment further down the page . . .

>gb|EF061237.1|  Bos frontalis isolate Dulong 14 cytochrome b (cytb) gene, complete 
cds; mitochondrial
Length=1140

 Score =  776 bits (420),  Expect = 0.0
 Identities = 420/420 (100%), Gaps = 0/420 (0%)
 Strand=Plus/Plus

Query  1    CTAATCCTACAAATCCTCACAGGCCTATTCCTAGCAATACACTACACATCCGACACAACA  60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  121  CTAATCCTACAAATCCTCACAGGCCTATTCCTAGCAATACACTACACATCCGACACAACA  180

Query  61   ACAGCATTCTCCTCTGTTACCCATATCTGCCGAGACGTGAACTACGGCTGAATCATCCGA  120
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  181  ACAGCATTCTCCTCTGTTACCCATATCTGCCGAGACGTGAACTACGGCTGAATCATCCGA  240

Now you have to weigh the evidence to make an identification. The list gives several statistics on the quality of each match.

The two scores are open-ended measures of the quality of the alignment: Max score is the score of the best single aligned segment for that reference, and Total score is the sum over all of its aligned segments. There is no external standard, but higher is better and, all other things being equal, two alignments with the same scores are equally good.
The Query coverage is the proportion of the unknown or query sequence that was aligned with the reference. Greater coverage indicates a more meaningful alignment.
The E value, introduced above, is the number of matches that good expected to occur by chance; the smaller it is, the less likely the match is a coincidence.
Max Ident indicates what proportion of the aligned positions have identical nucleotides.
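
As a rough illustration of the last two statistics, the sketch below computes percent identity and query coverage for the short alignment from "How BLAST Works". It is a simplified calculation over a single aligned segment, with gaps written as "-"; the function names are illustrative and BLAST's own bookkeeping over multiple segments is not reproduced.

```python
# The aligned region of the reference and the unknown, from the earlier example.
REFERENCE_REGION = "ATGAATCTGAGGCGGATTCTCA"
UNKNOWN = "ATGAATATGAGGCGGATTTTCA"


def percent_identity(aligned_query, aligned_subject):
    """Percentage of alignment columns holding identical nucleotides."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(
        q == s and q != "-" for q, s in zip(aligned_query, aligned_subject)
    )
    return 100.0 * identical / len(aligned_query)


def query_coverage(aligned_query, query_length):
    """Percentage of the query's nucleotides that take part in the alignment."""
    aligned_bases = sum(base != "-" for base in aligned_query)
    return 100.0 * aligned_bases / query_length


print(round(percent_identity(UNKNOWN, REFERENCE_REGION), 1))  # 90.9
print(query_coverage(UNKNOWN, len(UNKNOWN)))                  # 100.0
```

Here 20 of the 22 aligned positions are identical, and the whole unknown takes part in the alignment.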

Generally the best match ("top hit") is the best estimate of species identity. You should be looking for nearly complete query coverage with very high sequence identity, and a very low E value.

Scroll down the list of hits to see if the statistics deteriorate as you move away from the top hit. If they do, then the top hit is probably the best identification.

If the statistics don't deteriorate as you move from the top hit, then you may not be able to make a definite identification. Remember that genetic sequences are inherited by species from their ancestors. This means that the starting state for two related species is to have identical genetic sequences at many places in their genomes. So the problem can be whether two species can be distinguished from one another at all. In the example above, the matches with sequences from two species of cattle (Bos frontalis and Bos taurus) are equally good. This result tells us that our unknown is a type of cattle, but not which species. Making species identifications using BLAST requires careful weighing of the evidence.
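
The check described above can be sketched in Python: list every species whose hit is statistically tied with the top hit. The hit values are copied from the results table on this page, with descriptions kept as displayed. The function names, and the assumption that each description begins with the genus and species name, are illustrative.

```python
# Hits from the example results table, best first.
HITS = [
    {"accession": "EF061237.1",
     "description": "Bos frontalis isolate Dulong 14 cytochrome b (cytb) . . .",
     "max_score": 776, "coverage": 100, "e_value": 0.0, "identity": 100},
    {"accession": "AY676873.1",
     "description": "Bos taurus isolate 32027 mitochondrion, complete genome",
     "max_score": 776, "coverage": 100, "e_value": 0.0, "identity": 100},
    {"accession": "DQ186268.1",
     "description": "Bos taurus isolate Liping 3 cytochrome b gene, comp . . .",
     "max_score": 776, "coverage": 100, "e_value": 0.0, "identity": 100},
]


def species_of(description):
    """Assume the first two words of the description are genus and species."""
    return " ".join(description.split()[:2])


def statistics(hit):
    """The values used to judge whether two hits are equally good."""
    return (hit["max_score"], hit["coverage"], hit["e_value"], hit["identity"])


def species_tied_with_top(hits):
    """Return the sorted species names whose hits match the top hit's statistics."""
    top = statistics(hits[0])
    return sorted({species_of(h["description"]) for h in hits if statistics(h) == top})


print(species_tied_with_top(HITS))  # ['Bos frontalis', 'Bos taurus']
```

More than one species in the result means the top hit alone does not settle the identification.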
