DNA-surveillance - Species Identification with DNA


How do I use What Rat is That?
Submitting a Sequence
Reference Datasets
IUPAC Nucleotide Codes
Sequence Alignment
Calculation of Evolutionary Distances
Building the Phylogenetic Tree
Advanced search and bootstrapping
- Bootstrapping
- Emailed response
Maximum Likelihood Analysis
The Results
Issues of Interpretation
References

How do I use What Rat is That?

If you have a tissue sample from a rat, you can obtain an identification of the species in two steps:

Use standard molecular laboratory techniques to obtain a nucleotide sequence from the 5' end of the mtDNA control region, mtDNA cytochrome b, or mtDNA cytochrome oxidase I.
Submit the sequence to this site and select the appropriate reference sequence dataset for comparison. The advanced cluster search option lets you perform a bootstrap analysis, while the maximum likelihood option applies a more rigorous statistical analysis when placing your query sequence on the tree. Both the advanced cluster and maximum likelihood options will send you the results by email.

Submitting a Sequence

To submit a sequence for analysis:

click on the Simple search link
paste your sequence into the Data Entry window
select the reference dataset and the genomic locus
click on the Submit button

Your sequence must be either in FASTA format or a plain-text nucleotide sequence, in upper or lower case. For example:

>mysample
ACCATAATAGTACAGCTGAAGGAATCTGTAGAAATTAAACCATAATAGTACAGCTGAAGGAATC
GTAGAAATTAAACCATAATAGTACAGCTGAAGGAATCTGTAGAAATTAAACCATAATAGTACAG
CTGAAGGAATCTGTAGAAATTAA

ACCATAATAGTACAGCTGAAGGAATCTGTAGAAATTAAACCATAATAGTACAGCTGAAGGAATC
GTAGAAATTAAACCATAATAGTACAGCTGAAGGAATCTGTAGAAATTAAACCATAATAGTACAG
CTGAAGGAATCTGTAGAAATTAA

Only one sequence may be submitted at a time.

If your sequence contains illegal characters (those not included in the IUPAC nucleotide codes), it will be rejected with an error message. If your sequence contains any of the ambiguity codes, they will be used both in aligning the sequence and in calculating evolutionary distances.
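The input rules above can be checked before submitting. The following is a minimal sketch, not the site's own code: the function name read_single_sequence is hypothetical, and rejecting gap characters such as "-" is an assumption based on the statement that only IUPAC codes are legal.

```python
# Symbols listed in the IUPAC Nucleotide Codes table.
IUPAC_CODES = set("GATCURYMKSWHBVDN")


def read_single_sequence(text):
    """Return (name, sequence) from FASTA or plain-text input.

    Raises ValueError for empty input, for more than one FASTA record,
    or for characters outside the IUPAC nucleotide codes.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("no sequence supplied")
    name = "query"  # default label for plain-text input (an assumption)
    if lines[0].startswith(">"):
        name = lines[0][1:].strip() or "query"
        lines = lines[1:]
    if any(ln.startswith(">") for ln in lines):
        raise ValueError("only one sequence may be submitted at a time")
    # Join wrapped lines, drop internal whitespace, and ignore case.
    seq = "".join("".join(ln.split()) for ln in lines).upper()
    if not seq:
        raise ValueError("no sequence supplied")
    illegal = sorted(set(seq) - IUPAC_CODES)
    if illegal:
        raise ValueError("illegal characters: " + "".join(illegal))
    return name, seq
```

Calling read_single_sequence on the FASTA example above would return the name "mysample" and the three wrapped lines joined into one upper-case string.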

Your sequence will be analysed automatically. Please wait about 15 seconds and then click the Retrieve Results button to view your results. It will take longer for results to become available if full alignment and/or bootstrap resampling are requested.

Reference Datasets

Domain	Cytochrome Oxidase I (v2) = COI_v2	Cytochrome b (v2) = CytB_v2	Control Region (v2) = DLoop_v2	Cytochrome Oxidase I (v3) = COI_v3	Cytochrome b (v3) = CytB_v3	Control Region (v3) = DLoop_v3
Rattus	Link	Link	Link	Link	Link	Link
Pos 1-250	N/A	N/A	N/A	N/A	N/A	Link
Pos 1-200	N/A	N/A	N/A	Link	N/A	N/A
Pos 101-300	N/A	N/A	N/A	Link	N/A	N/A
Pos 201-400	N/A	N/A	N/A	Link	N/A	N/A
Pos 301-500	N/A	N/A	N/A	Link	N/A	N/A
Pos 401-600	N/A	N/A	N/A	Link	N/A	N/A
Pos 501-end	N/A	N/A	N/A	Link	N/A	N/A

IUPAC Nucleotide Codes

Ambiguous	Symbol	Meaning	Origin of designation
	G	G	Guanine
	A	A	Adenine
	T	T	Thymine
	C	C	Cytosine
	U	U	Uracil
X	R	G or A	puRine
X	Y	T or C	pYrimidine
X	M	A or C	aMino
X	K	G or T	Keto
X	S	G or C	Strong interaction (3 H bonds)
X	W	A or T	Weak interaction (2 H bonds)
X	H	A or C or T	not-G, H follows G in the alphabet
X	B	G or T or C	not-A, B follows A
X	V	G or C or A	not-T (not-U), V follows U
X	D	G or A or T	not-C, D follows C
X	N	G or A or T or C	aNy
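The table above can be written as a lookup for use in scripts. This is an illustrative sketch; the names IUPAC_BASES, possible_bases, is_ambiguous and compatible are hypothetical, and the sketch does not claim to reproduce how the site itself scores ambiguity codes.

```python
# Each symbol maps to the bases it may stand for, as in the table.
IUPAC_BASES = {
    "G": "G", "A": "A", "T": "T", "C": "C", "U": "U",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT", "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GAT", "N": "GATC",
}


def possible_bases(symbol):
    """Return the set of bases a symbol may represent (case-insensitive)."""
    return set(IUPAC_BASES[symbol.upper()])


def is_ambiguous(symbol):
    """True for the symbols marked X in the Ambiguous column."""
    return len(IUPAC_BASES[symbol.upper()]) > 1


def compatible(a, b):
    """True when two symbols could stand for the same base."""
    return bool(possible_bases(a) & possible_bases(b))
```

For example, compatible("R", "A") is true because R covers G or A, while compatible("R", "Y") is false because purines and pyrimidines do not overlap.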

Sequence Alignment

The sequence input by the user is aligned with the chosen reference set of sequences by a simple profile alignment (Gribskov et al. 1987, 1990; Gribskov and Veretnik 1996). By comparison, Clustal X implements a more sophisticated method, which allows the user to specify local gap costs and other parameter values (Thompson et al. 1997). To optimise system performance, the reference sequences have been prealigned.

The parameters used in the alignment are displayed with the dataset information.

Calculation of Evolutionary Distances

The evolutionary distances among all of the aligned sequences, reference and submitted, are then calculated using the F84 model (Felsenstein 1984; Kishino and Hasegawa 1989). The parameter values used are those displayed with the dataset information.
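To give a feel for what an F84 distance measures, the sketch below computes a closed-form estimate that is commonly given for F84-type models from the proportions of transitions and transversions and a set of base frequencies. It is an assumption that this resembles the site's calculation; the site may instead estimate the distance iteratively with the parameter values shown in the dataset information. The function name f84_distance is hypothetical, and sites with ambiguity codes or gaps are simply skipped here, which differs from the site's stated handling of ambiguity codes.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def f84_distance(seq1, seq2, freqs):
    """Closed-form F84-type distance between two aligned sequences.

    freqs maps "A", "C", "G", "T" to base frequencies summing to 1.
    Sites where either sequence is not A, C, G or T are skipped
    (a simplification made for this sketch).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    n = transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        n += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if n == 0:
        raise ValueError("no comparable sites")
    p = transitions / n    # proportion of transitional differences
    q = transversions / n  # proportion of transversional differences
    pa, pc, pg, pt = freqs["A"], freqs["C"], freqs["G"], freqs["T"]
    pr, py = pa + pg, pc + pt
    a = pc * pt / py + pa * pg / pr
    b = pc * pt + pa * pg
    c = pr * py
    # math.log raises ValueError if the sequences are too divergent.
    return (-2 * a * math.log(1 - p / (2 * a) - (a - b) * q / (2 * a * c))
            + 2 * (a - b - c) * math.log(1 - q / (2 * c)))
```

With equal base frequencies this expression reduces to the Kimura two-parameter distance, which is a convenient check on the arithmetic.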

Building the Phylogenetic Tree

A phylogenetic tree is built to include the members of the reference set of sequences chosen by the user and the sequence that the user has submitted. The tree is built using the Neighbor-Joining (NJ) algorithm (Saitou and Nei 1987) and rooted using an outgroup appropriate for each dataset.
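The Neighbor-Joining procedure can be sketched in a few lines. This is a minimal illustration of the algorithm, not the site's implementation: the function name neighbor_joining and the dict-of-dicts distance matrix are assumptions, the result is unrooted (outgroup rooting is not shown), and negative branch lengths are not corrected.

```python
def neighbor_joining(names, dist):
    """Return an unrooted NJ tree as a Newick string.

    names: list of tip labels.
    dist:  symmetric dict of dicts, dist[a][b] = distance between a and b.
    """
    # Work on a copy so the caller's matrix is left untouched.
    d = {a: dict(dist[a]) for a in names}
    nodes = list(names)
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all the others.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Pick the pair that minimises the NJ criterion Q.
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the new internal node to a and to b.
        la = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        lb = d[a][b] - la
        new = "(%s:%.4f,%s:%.4f)" % (a, la, b, lb)
        # Distances from the new node to every remaining node.
        d[new] = {}
        for k in nodes:
            if k not in (a, b):
                d[new][k] = d[k][new] = 0.5 * (d[a][k] + d[b][k] - d[a][b])
        nodes = [k for k in nodes if k not in (a, b)] + [new]
    a, b = nodes
    # Join the last two nodes; the whole remaining length is put on one side.
    return "(%s:%.4f,%s:%.4f);" % (a, d[a][b], b, 0.0)
```

In practice the names list would hold the reference sequences plus the query, and dist would hold the pairwise evolutionary distances calculated in the previous step.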

Advanced search and bootstrapping

The Advanced search window adds further functions to the search process:

Bootstrapping

To perform a bootstrap analysis:

click on the Advanced search link
paste your sequence into the Data Entry window
select the reference dataset and genomic locus
select the number of bootstrap replicates you require
optionally enter an email address to which the results will be sent
click on the Submit button

A bootstrap analysis will take longer than a simple search. The length of time will depend on the number of pseudoreplicates you have chosen and on the load on our server. Your screen will be refreshed about every 10 seconds, or you can choose to have the results sent to you by email.
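Each bootstrap pseudoreplicate is made by resampling alignment columns with replacement, which is why the running time grows with the number of replicates. A minimal sketch follows; the function names and the dict-based alignment are hypothetical, not the site's own code.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one pseudoreplicate of an alignment.

    alignment: dict mapping sequence name to aligned sequence string.
    Columns are drawn with replacement, and each drawn column is kept
    intact across all sequences.
    """
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must have the same aligned length")
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols)
            for name, seq in alignment.items()}


def bootstrap_replicates(alignment, n_replicates, seed=None):
    """Return a list of n_replicates pseudoreplicate alignments."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]
```

A tree would then be built from each pseudoreplicate in the same way as for the original alignment.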

Emailed response

You can choose to have the results sent to you by email. If you enter an optional email address, you can close your browser once the search has been submitted.

Maximum Likelihood Analysis

The reference alignment, and the associated phylogenetic tree, are considered to be prior knowledge about the relationships among the reference organisms. Potentially the query sequence can be joined to that tree on any branch. We seek the connection point that has the highest statistical likelihood, thereby giving the maximum likelihood estimate of the relationship between the query and reference sequences. The maximum likelihood connection point is represented in the output by a dashed branch. For a particular connection point the determined likelihood score is the maximum likelihood estimate under the associated topology (that is, all the branch lengths are re-optimised for each connection point).

The Shimodaira-Hasegawa (SH) test is used to assess a confidence limit on the connection point with the highest expected likelihood. The expected likelihood of a connection point is the expectation of its likelihood (treated as a random variable) under the true process of evolution. The SH test calculates such a confidence limit by simulating replicate datasets under an approximation of the least favourable configuration (LFC), in which all connection points have equivalent expected likelihoods, and comparing the observed differences in likelihood with the distribution expected under the LFC.

The implementation of the SH test used here simulates 1000 non-parametric bootstrap replicates and uses the RELL approximation (Shimodaira and Hasegawa 1999). Branches that represent connection points within the confidence limit are coloured red. A critical value of α = 0.05 is used (95% confidence limit).
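The SH procedure with RELL resampling can be sketched as follows, assuming per-site log-likelihoods are already available for each candidate connection point. This follows one common formulation of the test (resample sites, centre each candidate's replicate scores, compare with the observed differences) and is not the site's code; the function names and input layout are hypothetical.

```python
import random


def sh_test_rell(site_lnl, n_boot=1000, seed=None):
    """Return an SH p-value for each candidate connection point.

    site_lnl: dict mapping a candidate label to its list of per-site
    log-likelihoods (all lists the same length).
    """
    rng = random.Random(seed)
    labels = list(site_lnl)
    n_sites = len(site_lnl[labels[0]])
    total = {k: sum(site_lnl[k]) for k in labels}
    best = max(total.values())
    # Observed statistic: how far each candidate falls below the best one.
    observed = {k: best - total[k] for k in labels}
    # RELL: resample site log-likelihoods rather than re-optimising.
    boot = {k: [] for k in labels}
    for _ in range(n_boot):
        sites = [rng.randrange(n_sites) for _ in range(n_sites)]
        for k in labels:
            boot[k].append(sum(site_lnl[k][s] for s in sites))
    # Centring makes all candidates equal in expectation (the LFC).
    centred = {}
    for k in labels:
        mean = sum(boot[k]) / n_boot
        centred[k] = [v - mean for v in boot[k]]
    counts = {k: 0 for k in labels}
    for b in range(n_boot):
        top = max(centred[k][b] for k in labels)
        for k in labels:
            if top - centred[k][b] >= observed[k]:
                counts[k] += 1
    return {k: counts[k] / n_boot for k in labels}


def confidence_set(p_values, alpha=0.05):
    """Candidates that are not rejected at the given critical value."""
    return [k for k in p_values if p_values[k] >= alpha]
```

Under this sketch, the connection points returned by confidence_set correspond to the branches that would fall within the confidence limit, and the candidate with the highest likelihood always has a p-value of 1.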

The Results

The results will be displayed first as a phylogenetic tree in which the lengths of the horizontal branches separating the tips are proportional to the differences between the sequences. The names of the reference species are colour-coded to help you identify close relatives. To save a copy of the tree as a PNG-format file, right-click (PC) or control-click (Mac) on the image and choose Download Image to Disk, or similar, from the pop-up menu.

If you have performed a bootstrap analysis, the resulting phylogenetic tree will display numbers at some of the nodes. These numbers are the percentage of bootstrap pseudoreplicates that contain the clade formed by the subtree starting at that node. This measure of bootstrap support is displayed only when at least 50% of the pseudoreplicates contain the clade. The phylogenetic tree displayed is the estimated tree, and not the consensus of the bootstrap pseudoreplicate trees.
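The support values described above amount to counting, for each clade of the estimated tree, how many pseudoreplicate trees contain it. A minimal sketch, assuming each tree has already been reduced to a collection of clades written as frozensets of tip names (the function name and this representation are hypothetical):

```python
from collections import Counter


def bootstrap_support(estimated_clades, replicate_clades, threshold=50.0):
    """Return {clade: percent support} for clades at or above threshold.

    estimated_clades: iterable of frozensets of tip names, one per clade
    in the estimated tree.
    replicate_clades: list with one collection of frozensets per
    bootstrap pseudoreplicate tree.
    """
    n = len(replicate_clades)
    if n == 0:
        raise ValueError("at least one pseudoreplicate is required")
    counts = Counter()
    for clades in replicate_clades:
        counts.update(set(clades))  # count each clade once per replicate
    support = {}
    for clade in estimated_clades:
        pct = 100.0 * counts[clade] / n
        if pct >= threshold:
            support[clade] = pct
    return support
```

Only clades of the estimated tree are reported, which matches the statement that the displayed tree is the estimated tree and not a consensus.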

If you scroll down past the tree, you will also find a table showing the evolutionary distances between the user-submitted sequence and each of the sequences in the reference set. Sites having IUPAC ambiguity codes are included in the calculation of evolutionary distances. To save the contents of the table to disk, select all of the table, copy it, open a text file document on your computer (e.g. Notepad or SimpleText) and then paste it in.

If you scroll down further again, there is a text version of the phylogenetic tree in Newick format. To save this to disk, select the contents of the text box in which it is displayed, copy it, open a text file document on your computer (e.g. Notepad or SimpleText) and then paste it in.
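Once saved, the Newick text can be read by most tree-viewing software or by a short script. As a small illustration, the sketch below lists the tip names in a Newick string. It assumes unquoted labels, and the function name newick_tip_names is hypothetical.

```python
import re


def newick_tip_names(newick):
    """Return the tip labels of a Newick string, in order of appearance.

    A tip label directly follows "(" or ","; labels that follow ")"
    (such as bootstrap values on internal nodes) are ignored.
    Quoted labels are not handled in this sketch.
    """
    found = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return [name.strip() for name in found if name.strip()]
```

For a tree such as ((A:0.1,B:0.2)85:0.05,(C:0.3,query:0.4):0.0); this gives the four tip names and skips the support value 85.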

You can fine-tune your analysis by clicking on the Submit a sequence link to return to the Data Entry page, where you can choose a different reference set.

Issues of Interpretation

What species is it?

One of our motivations in establishing this site is to provide a phylogenetic definition of the rats in the region. The morphology-based taxonomy has undergone many changes, and the assignment of a name to a specimen can be very confusing. We report both the species name as found on the specimen label, and also give the name of the clade in which that reference sequence fell in our reference phylogeny. We have found strong incongruence between the nominal species identities of specimens and the phylogenetic clades in which they fall. We have not chosen the reference sequences to represent the nominal diversity of a clade, just its genetic diversity. Consequently we strongly suggest that the name of the clade be used to make an operational identification of the specimen.

Is it a rat?

What Rat is That? is an online service for the identification of Rattus spp. by phylogenetic analysis. Its scope is limited to the members of this genus in Southeast Asia and the Pacific region, and any submitted sequence will be treated as if it were derived from a rat. A simple system has been implemented to flag sequences which might give unreliable results. Nevertheless, it remains the responsibility of the user to decide whether a phylogenetic analysis is appropriate in their individual case. The user should also seek other evidence to corroborate that any DNA sequence which they submit is actually of rat origin, perhaps by searching GenBank.

Poorly Resolved Species Groups

Some of the phylogenetic clades, especially PNG I, PNG II, and PNG III, are not well structured. Our subdivision of this clade into these smaller clades is unsatisfactory, and will be the subject of ongoing sampling and study so that a better resolution may be found.

Missing Species

The tables summarising the species represented for each locus (mtDNA control region, cytochrome b, and cytochrome oxidase I), found on The Science page, should be consulted before drawing any conclusions about the identity of a sample.

References

See The Science page.
