PhyDesign FAQ

FAQ:

What does PhyDesign compute?

PhyDesign uses the phylogenetic informativeness method to quantitatively measure the informativeness (power) of a gene (or a set of characters) to resolve branching order in a particular epoch in a phylogenetic tree. Phylogenetic informativeness has successfully recapitulated the qualitative utility of genes in known data sets:

Townsend, J. P. 2007. Profiling phylogenetic informativeness. Systematic Biology 56(2): 222-231.[Original paper]

Townsend, J.P., F. López-Giráldez, R. Friedman, 2008. The phylogenetic informativeness of nucleotide and amino acid sequences for reconstructing the vertebrate tree. Journal of Molecular Evolution 67(5): 437-447.

Schoch, C.L., G-H Sung, F. López-Giráldez, J.P. Townsend, (5 authors), Z. Wang, (53 authors), and J.W. Spatafora, 2009. The Ascomycota Tree of Life: A Phylum-wide Phylogeny Clarifies the Origin and Evolution of Fundamental Reproductive and Ecological Traits. Systematic Biology 58(2):224-239.

What kind of questions can one answer by measuring phylogenetic informativeness?

The main questions one can answer include the informativeness of different classes of phylogenetic characters, the utility of increased taxonomic versus character sampling, the ability to differentiate between lack of signal and adaptive radiation, and the design of taxonomically broad studies optimized by taxonomically sparse genome-scale data. The goal of this project is to develop practical analytical methods for predicting the informativeness of taxonomic characters for specific historical eras.

This method needs prior data to get the phylogenetic informativeness profiles. Where can I obtain this data?

To estimate phylogenetic informativeness, prior data on the molecular evolutionary pattern of a gene is required. This prior information may be derived from three potential sources:
1) preliminary data on the candidate genes from a well-studied subset of the taxa of interest; 2) data on the candidate genes from a well-studied sister clade; or 3) comparative genomic data from sequenced genomes within and/or outside the clade of interest. Sequence alignments and known topologies may be obtained from published data or Tree of Life project databases.

Are there known issues with this method of profiling?

Many chronic issues in phylogenetic analysis, such as nonstationarity of base frequencies and rate variation among lineages, are not specifically accounted for in the phylogenetic informativeness measure. In general, the profile of phylogenetic informativeness gives an idea of the predicted phylogenetic signal for a given gene, but it does not quantify noise; thus, when interpreting the profiles, be sure to consider how noise from fast-evolving sites may or may not affect the results achieved. We are currently working on a way to quantify noise with PhyDesign, but it is not yet ready for use.

Can this method of profiling be improved?

We are still advancing the theoretical framework of phylogenetic informativeness, with the goals of extending current theoretical work on phylogenetic experimental design, expanding the current phylogenetic informativeness methodology to quantitate the effects of noise (parallelism and convergence), and prioritizing taxon addition by topological location relative to expanding the number of markers sequenced.

Do I have to upload gene by gene, or can I analyze more than one gene at a time?

You can analyze and get profiles for more than one gene at a time by using the Nexus format and setting partitions as in a MrBayes block. Three commands are needed in a BEGIN SETS block: i) "set partition"; ii) listing partition names with "partition"; and iii) listing the coordinates for each partition with "charset".
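As a rough illustration of the three commands, the sketch below generates such a block from a table of partition coordinates. The helper name build_sets_block, the scheme name "by_gene", and the exact keyword layout are assumptions modeled on MrBayes conventions; the precise syntax that PhyDesign's parser accepts is not confirmed here.

```python
def build_sets_block(charsets, scheme_name="by_gene"):
    """Return the text of a Nexus sets block for the given partitions.

    charsets maps a partition name to its (start, end) alignment
    coordinates, 1-based and inclusive. Layout is an assumption that
    follows MrBayes conventions.
    """
    names = list(charsets)
    lines = ["begin sets;"]
    # iii) one charset line per partition, giving its coordinates
    for name, (start, end) in charsets.items():
        lines.append(f"    charset {name} = {start}-{end};")
    # ii) the partition command: scheme name, number of divisions, names
    lines.append(f"    partition {scheme_name} = {len(names)}: {', '.join(names)};")
    # i) select the partitioning scheme
    lines.append(f"    set partition = {scheme_name};")
    lines.append("end;")
    return "\n".join(lines)


# Hypothetical two-gene alignment
print(build_sets_block({"gene1": (1, 500), "gene2": (501, 1200)}))
```

Generating the block from a single table keeps the number of partition names, the declared count, and the charset lines in agreement.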

Can I get phylogenetic informativeness profiles for morphological data?

Yes; you can get the rates for each character by using the BayesTraits program. However, please note that morphological characters are usually chosen a priori to be informative and relevant regarding the phylogenetic problem at hand. Thus, their signal is likely to be higher than predicted by "rate" alone.

Can informativeness values of DNA and protein be directly compared?

In comparing profiles from AA and DNA sequences, keep in mind that AA have a greater state space (see Simmons et al. 2004), so that they tend to be less subject to noise (unmodeled by Townsend, 2007) than DNA. We are currently working on a way to quantify noise with PhyDesign, but it is not yet ready for use.

How to cite PhyDesign?

Please cite the following article:

Lopez-Giraldez F., and J.P. Townsend, 2010. PhyDesign: a webapp for profiling phylogenetic informativeness. [unpublished]

and the program used to calculate the rates:

Pond, S.L.K., Frost, S.D.W., and S.V. Muse, 2005. HyPhy: hypothesis testing using phylogenies. Bioinformatics, 21(5), 676–9.

Olsen, G. J., unpublished. DNArates.

Mayrose, I., Graur, D., Ben-Tal, N., and T. Pupko, 2004. Comparison of Site-Specific Rate-Inference Methods for Protein Sequences: Empirical Bayesian Methods Are Superior. Mol. Biol. Evol., 21(9), 1781-91.

What alignment formats does PhyDesign accept?

Currently, PhyDesign accepts three different formats: Fasta, Nexus, and Phylip.

What tree formats does PhyDesign accept?

Currently, PhyDesign only accepts ultrametric trees in Newick format.
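
Before uploading, it may help to confirm that a tree is ultrametric, that is, that every tip lies at the same distance from the root. The sketch below is a minimal check under simplifying assumptions: it handles plain Newick strings with branch lengths but not quoted labels or bracketed comments, and the function names and tolerance are hypothetical, not part of PhyDesign.

```python
def tip_depths(newick):
    """Return the root-to-tip distance of every tip in a Newick tree."""
    text = newick.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            depths = []
            pos += 1  # skip "("
            while True:
                depths.extend(parse_node())
                if text[pos] == ",":
                    pos += 1  # another child follows
                    continue
                pos += 1  # skip ")"
                break
        else:
            depths = [0.0]  # a tip contributes one path
        # Read the optional "label:length" that follows a tip or a ")"
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        _label, _, length = text[start:pos].partition(":")
        branch = float(length) if length else 0.0
        return [d + branch for d in depths]

    return parse_node()


def is_ultrametric(newick, rel_tol=1e-6):
    """True if all tips are equidistant from the root, within rel_tol."""
    depths = tip_depths(newick)
    return max(depths) - min(depths) <= rel_tol * max(depths)


print(is_ultrametric("((A:1,B:1):2,C:3);"))  # all tips at distance 3
print(is_ultrametric("((A:1,B:2):2,C:3);"))  # tip B is farther from the root
```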

What formats are acceptable for the site rate form?

If you already know the rates for each site in your alignment, you can input these rates in the site rate form by clicking on "Instead, input a site rate file". The site rate file format consists of the name of a locus, followed by a colon, and the rates, separated by commas. Each locus should be entered on its own line.

Example:

gene1:0.026,0.265,1.236,.......,0.698
gene2:0.046,0.002,0.014,.......,0.972
gene3:0.667,0.665,0.748,.......,0.987

A site rate file with the right format can also be obtained from the PhyDesign web page after uploading an ultrametric tree and an alignment. Thus, once you run the analysis for the first time, you do not have to run it again; instead, you can input the site rate file in the second form and obtain the phylogenetic informativeness profiles directly.
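
For readers who prepare or inspect site rate files with a script, the following is a minimal sketch of reading and writing the format described above (locus name, colon, comma-separated rates, one locus per line). The function names are hypothetical and the rate values are made up for illustration.

```python
def parse_site_rates(text):
    """Parse site rate file text into {locus_name: [rate, ...]}."""
    rates = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        name, _, values = line.partition(":")
        rates[name.strip()] = [float(v) for v in values.split(",") if v.strip()]
    return rates


def format_site_rates(rates):
    """Write {locus_name: [rate, ...]} back out, one locus per line."""
    return "\n".join(
        f"{name}:{','.join(str(r) for r in values)}"
        for name, values in rates.items()
    )


# Made-up rates for two short loci
example = "gene1:0.026,0.265,1.236\ngene2:0.046,0.002,0.014"
parsed = parse_site_rates(example)
print(len(parsed["gene1"]), "sites in gene1")
print(format_site_rates(parsed))
```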

Why do you offer different programs?

After uploading the alignment and the tree files, the user can choose a program from the drop-down menu with which to obtain the substitution rates. Once a program has been chosen, advanced options become accessible, offering the user different evolutionary models and parameters. For DNA sequences, we recommend using HyPhy, for which a HyPhy batch file was created to implement all time-reversible models. Unlike DNArates, HyPhy also accepts multifurcating trees. For amino acid sequences, rate4site is provided.

What do the Y-axis values on the phylogenetic informativeness profile represent?

It is a normalized, asymptotic likelihood density for a true synapomorphy occurring in an asymptotically short, deep internode at historical time T of a quartet of taxa under an infinite-states Poisson model of character evolution.

Helpful? Let's expand on it a bit more. The key words are "normalized" and "likelihood density". Because it is a normalized likelihood density, the integral of the informativeness is always one. What this means is that the height of the y-axis is not important except for comparisons between partitions for the same time period.

That's still probably pretty abstruse. Practically, what this means is that the heights of your profiles depend linearly on the unit you use to measure time in your chronogram. If the branch lengths of your chronogram are numerically high (say, you quantify time in minutes or seconds), then your informativeness y-axis will have tiny magnitudes. If the branch lengths are numerically low (say, you quantify time in units of billions of years), then your informativeness y-axis will have enormous magnitudes. If you use units of molecular evolution that are chronometric but are not calibrated to an "absolute" time scale, the same rule holds: if the branch lengths are numerically low, the informativeness y-axis will have large magnitude, and if the branch lengths are numerically high, the informativeness y-axis will have low magnitude. It is perfectly fine to change or scale the unit of your chronometric tree so that the y-axis adopts a range that suits your preference.
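
The dependence on the time unit can be checked numerically. The sketch below assumes the per-site informativeness formula from Townsend (2007), 16 * rate^2 * T * exp(-4 * rate * T), which is not spelled out in this FAQ; the site rate, the time point, and the function names are hypothetical.

```python
import math


def informativeness(t, rate):
    """Per-site informativeness at time t (formula assumed from Townsend 2007)."""
    return 16.0 * rate ** 2 * t * math.exp(-4.0 * rate * t)


def rescale(t, rate, factor):
    """Express the same time point and rate in a unit `factor` times larger."""
    return t / factor, rate * factor


# Hypothetical site: 0.01 substitutions per million years, at 25 million years
t_my, rate_my = 25.0, 0.01
# The same point with time measured in billions of years
t_by, rate_by = rescale(t_my, rate_my, 1000.0)

height_my = informativeness(t_my, rate_my)
height_by = informativeness(t_by, rate_by)

# Branch lengths became 1000 times smaller numerically, so the profile
# height grows by about the same factor.
print(height_by / height_my)
```

Only the numerical scale of the y-axis changes; the shape of the profile and the ranking of partitions at a given time are unaffected.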

What is the difference between 'net' and 'per site' phylogenetic informativeness?

For each locus, one can calculate the net and per site phylogenetic informativeness. The net phylogenetic informativeness is normally the prediction of interest, as it should correlate with empirical results, such as the degree of support of a node. However, when noise is an issue, signal density is conveyed by the phylogenetic informativeness per site. Phylogenetic informativeness per site is relevant because the cost vs. benefit of sequencing and analysis may be quantified with such a measure and because it compares the relative power of genes without the confounding influence of gene length. For example, one gene may show "good" net profiles, but there may be shorter genes (requiring less sequencing effort) which may show better per site profiles. Combining shorter genes with a total sequencing effort equal to that of a longer gene can yield greater phylogenetic informativeness.
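
To make the distinction concrete, here is a small sketch in which net informativeness is the sum over sites and per site informativeness is that sum divided by the number of sites. It assumes the per-site formula from Townsend (2007), 16 * rate^2 * T * exp(-4 * rate * T); the two genes, their rates, and the function names are invented for illustration.

```python
import math


def site_informativeness(t, rate):
    """Informativeness of one site at time t (formula assumed from Townsend 2007)."""
    return 16.0 * rate ** 2 * t * math.exp(-4.0 * rate * t)


def net_informativeness(t, rates):
    """Sum of the informativeness of every site in a locus."""
    return sum(site_informativeness(t, r) for r in rates)


def per_site_informativeness(t, rates):
    """Net informativeness divided by the number of sites."""
    return net_informativeness(t, rates) / len(rates)


# Invented loci: a long, slowly evolving gene and a short, faster one
long_gene = [0.002] * 5000
short_gene = [0.01] * 300
t = 25.0  # time of interest, in the same unit as the rates

for name, rates in (("long", long_gene), ("short", short_gene)):
    print(name, net_informativeness(t, rates), per_site_informativeness(t, rates))
```

With these made-up values the long gene has the higher net informativeness at T = 25, while the short gene has the higher value per site.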

What kind of information can I extract from integrating over a specific epoch?

By integrating over specific epochs, one obtains the area below the profiles. The areas below the profiles allow us to rank genes based on the phylogenetic signal for that epoch. Integration values will be largest for the genes that have the highest probability of exhibiting substitutions in the given epoch that will not be obscured in subsequent branches. Integration does not account for noise.
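
A ranking of this kind can be sketched as follows. The example assumes the per-site formula from Townsend (2007), 16 * rate^2 * T * exp(-4 * rate * T), whose antiderivative in T is -(4 * rate * T + 1) * exp(-4 * rate * T); the loci, rates, epoch, and function name are hypothetical.

```python
import math


def integrated_informativeness(rates, t_start, t_end):
    """Area under a locus profile between t_start and t_end.

    Uses the closed-form antiderivative of the per-site formula
    assumed from Townsend (2007), summed over sites.
    """
    def antiderivative(t, rate):
        return -(4.0 * rate * t + 1.0) * math.exp(-4.0 * rate * t)

    return sum(antiderivative(t_end, r) - antiderivative(t_start, r) for r in rates)


# Hypothetical loci of equal length with different site rates
genes = {
    "slow": [0.002] * 300,
    "medium": [0.01] * 300,
    "fast": [0.2] * 300,
}
epoch = (20.0, 40.0)  # same time unit as the rates

ranking = sorted(
    genes,
    key=lambda g: integrated_informativeness(genes[g], *epoch),
    reverse=True,
)
print(ranking)
```

Integrating a single site from time zero to a very large time returns one under this formula, which matches the normalization described for the y-axis.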

Why are the graphics in SVG format?

The SVG plots produced by this website give the user high-quality graphics ready for publication in any format. In addition, they can be easily modified or improved with any vector graphics program, such as Adobe Illustrator or the open-source editor Inkscape.

How can I test if there is a significant difference between phylogenetic informativeness values?

We have not yet developed such a test. In principle, however, it is the same question as whether the sites exhibit different rates, which may be addressed by asking whether partitions should have different site rate distributions (e.g., different gamma distributions).

What is the high spike close to time 0 present in some profiles? How to interpret those results?

Those really recent "phantom" spikes arise because the maximum likelihood estimate for the rate of a few sites has its peak at infinity (rather unrealistic). The software that PhyDesign calls to estimate the rates runs up against its hard-coded limit, so those few sites are all estimated to evolve at one very fast rate, leading to a spike that has little biological meaning. The spikes can be caused by sites with indels or ambiguous sequence calls. Generally speaking, greater taxon sampling will help to better estimate the rate at those sites and thus draw down that peak until it disappears. In our experience with subsampling from a larger dataset, the profile obtained with greater sampling is essentially the same as the one obtained with less, except that the really recent peak is diminished or gone. Thus, it seems that the best thing to do is to exclude those sites on the justifiable grounds that their rate is simply not well estimated by maximum likelihood. One can identify the poorly estimated rates in the rates file supplied by PhyDesign.
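
One possible way to spot such sites in a list of rates is sketched below. It is only a heuristic built on the description above, namely that the affected sites all share one very fast rate at the estimator's ceiling; the function name, the min_shared parameter, and the example rates are assumptions, and the actual ceiling value used by each rate program is not given here.

```python
def flag_ceiling_sites(rates, min_shared=2):
    """Return 0-based indices of sites tied at the maximum rate.

    Heuristic (an assumption, not PhyDesign's own procedure): when at
    least min_shared sites share exactly the same maximum rate, they
    have probably hit the rate estimator's upper limit.
    """
    ceiling = max(rates)
    flagged = [i for i, r in enumerate(rates) if r == ceiling]
    return flagged if len(flagged) >= min_shared else []


# Made-up rates for one locus; two sites sit at the same extreme value
rates = [0.02, 0.31, 1000.0, 0.07, 1000.0]
suspect = flag_ceiling_sites(rates)
kept = [r for i, r in enumerate(rates) if i not in suspect]
print(suspect, kept)
```

Any site flagged this way should still be checked against the alignment (for indels or ambiguous calls) before it is excluded.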

What does the error "No partition name list was found for [partition_name]." mean?

This error occurs when you upload a Nexus alignment with a 'Set' or MrBayes block describing character sets (charset command) to analyze multiple data partitions at the same time. It arises when a partition is set but a list with the partition names is not present or the names have been misspelled.

What does the error "# of partitions [#] different from # of partition names[#]." mean?

This error occurs when you upload a Nexus alignment with a 'Set' or MrBayes block to analyze multiple data partitions at the same time. It arises when a partition is set, but the number of partition names doesn't match the number of partitions indicated.

As indicated in the MrBayes wiki, "The elements of the partition command are: (1) the name of the partitioning scheme; (2) an equal sign (=); (3) the number of character divisions in the scheme; (4) a colon (:); and (5) a list of the characters in each division, separated by commas."

What does the error "# of partitions [#] different from # of charset found [#]." mean?

This error occurs when you upload a Nexus alignment with a 'Set' or MrBayes block to analyze multiple data partitions at the same time. It arises when a partition is set, but the number of partition names doesn't match the number of coordinates found using the 'charset' command. It can also arise from misspelled partition names.
