What kind of questions can one answer by measuring phylogenetic informativeness?
The main questions one can answer include the informativeness of different classes of phylogenetic characters, the utility of increased taxonomic versus character sampling, the ability to differentiate between lack of signal and adaptive radiation, and the design of taxonomically broad studies optimized by taxonomically sparse genome-scale data. The goal of this project is to develop practical analytical methods for predicting the informativeness of taxonomic characters for specific historical eras.
This method needs prior data to get the phylogenetic informativeness profiles. Where can I obtain this data?
To estimate phylogenetic informativeness, prior data on the molecular evolutionary
pattern of a gene is required. This prior information may be derived from three potential sources:
1) preliminary data on the candidate genes from a well-studied subset of the taxa of interest;
2) data on the candidate genes from a well-studied sister clade; or
3) comparative genomic data from sequenced genomes within and/or outside the clade of interest.
Sequence alignments and known topologies may be obtained from published data or Tree of Life project databases.
Are there known issues with this method of profiling?
Many chronic issues in phylogenetic analysis, such as nonstationarity of base frequencies and rate variation among lineages, are not specifically accounted for in the phylogenetic informativeness measure. In general, the profile of phylogenetic informativeness gives an idea of the predicted phylogenetic signal for a given gene, but it does not quantify noise; thus, when interpreting the profiles, be sure to consider how noise from fast-evolving sites may or may not affect the results achieved. We are currently working on a way to quantify noise with PhyDesign, but it is not yet ready for use.
Can this method of profiling be improved?
We are still advancing the theoretical framework of phylogenetic informativeness. Our goals are to extend current theoretical work on phylogenetic experimental design, to expand the current phylogenetic informativeness methodology to quantify the effects of noise (parallelism and convergence), and to prioritize between adding taxa at particular topological locations and expanding the number of markers sequenced.
Do I have to upload gene by gene, or can I analyze more than one gene at a time?
You can analyze and obtain profiles for more than one gene at a time by using the Nexus format and defining partitions as in a MrBayes block. Three commands are needed in a BEGIN SETS block: i) "set partition"; ii) "partition", which lists the partition names; and iii) "charset", which lists the coordinates for each partition.
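As an illustration, the three commands can be generated by a short script so that the partition count and names always agree with the character sets. This is only a sketch: the command syntax follows the MrBayes conventions referred to in this FAQ, the function name, scheme name, gene names, and coordinates are hypothetical, and the exact layout PhyDesign expects should be checked against an alignment that is known to upload correctly.

```python
def sets_block(charsets, scheme="by_gene"):
    """Build a SETS block from {partition name: (first site, last site)}.

    Hypothetical helper, not part of PhyDesign.
    """
    lines = ["BEGIN SETS;"]
    # One charset command per partition, giving its alignment coordinates.
    for name, (first, last) in charsets.items():
        lines.append(f"    charset {name} = {first}-{last};")
    # The partition command: scheme name, number of divisions, list of names.
    names = ", ".join(charsets)
    lines.append(f"    partition {scheme} = {len(charsets)}: {names};")
    # Finally, select the partitioning scheme just defined.
    lines.append(f"    set partition = {scheme};")
    lines.append("END;")
    return "\n".join(lines)


# Made-up coordinates for two concatenated genes.
print(sets_block({"gene1": (1, 500), "gene2": (501, 1200)}))
```

Generating the block from a single dictionary keeps the number declared in "partition" equal to the number of "charset" lines.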
Can I get phylogenetic informativeness profiles for morphological data?
Yes; you can obtain the rates for each character by using the BayesTraits program. However, please note that morphological characters are usually chosen a priori to be informative and relevant to the phylogenetic problem at hand. Thus, their signal is likely to be higher than predicted by "rate" alone.
Can informativeness values of DNA and protein be directly compared?
When comparing profiles from amino acid (AA) and DNA sequences, keep in mind that amino acids have a larger state space (see Simmons et al. 2004), so they tend to be less subject to noise (which is not modeled in Townsend, 2007) than DNA. We are currently working on a way to quantify noise with PhyDesign, but it is not yet ready for use.
What alignment formats does PhyDesign accept?
Currently, PhyDesign accepts three different formats: Fasta, Nexus, and Phylip.
What tree formats does PhyDesign accept?
Currently, PhyDesign only accepts ultrametric trees in Newick format.
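Before uploading, it may help to confirm that a tree is ultrametric, that is, that every tip is the same distance from the root. The following is a minimal sketch under stated assumptions: it handles plain Newick strings with unquoted labels and branch lengths only, the function names are hypothetical, and the tolerance is an arbitrary choice rather than the criterion PhyDesign itself applies.

```python
import re


def tip_depths(newick):
    """Map each tip name to its root-to-tip distance (Newick with branch lengths)."""
    stack = []       # one list of finished child clades per open parenthesis
    current = None   # clade being read: {tip name: distance to that tip}
    for token in re.findall(r"[(),;]|[^(),;]+", newick):
        if token == "(":
            stack.append([])
            current = None
        elif token in ",)":
            if current is not None:
                stack[-1].append(current)
            current = None
            if token == ")":
                # Merge the children of the clade that just closed.
                current = {}
                for child in stack.pop():
                    current.update(child)
        elif token == ";":
            break
        elif token.strip():
            name, _, length = token.strip().partition(":")
            if current is None:
                current = {name: 0.0}   # a new tip
            # Add this branch length to every tip below the branch.
            step = float(length) if length else 0.0
            current = {tip: d + step for tip, d in current.items()}
    return current


def is_ultrametric(newick, rel_tol=1e-6):
    """True if all root-to-tip distances agree within a relative tolerance."""
    depths = tip_depths(newick).values()
    return max(depths) - min(depths) <= rel_tol * max(depths)


print(is_ultrametric("((A:1,B:1):1,C:2);"))
print(is_ultrametric("((A:1,B:1.5):1,C:2);"))
```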
What formats are acceptable for the site rate form?
If you already know the rates for each site in your alignment,
you can input these rates in the site rate form by clicking on "Instead, input a site rate file".
The site rate file format consists
of the name of a locus, followed by a colon, and the rates, separated by commas.
Each locus should be entered on its own line.
Example:
gene1:0.026,0.265,1.236,.......,0.698
gene2:0.046,0.002,0.014,.......,0.972
gene3:0.667,0.665,0.748,.......,0.987
A site rate file with the right format can also be obtained from the PhyDesign
web page after uploading an ultrametric tree and an alignment. Thus, once you
run the analysis for the first time, you do not have to run it again; instead,
you can input the site rate file in the second form and obtain the phylogenetic
informativeness profiles directly.
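The one-locus-per-line format described above is simple to read or write with a script. Here is a minimal sketch, assuming the file contains only lines of the form "name:rate,rate,..."; the function names are hypothetical and are not part of PhyDesign.

```python
def parse_site_rates(text):
    """Parse 'locus:rate,rate,...' lines into {locus: [rates]}."""
    rates = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        name, sep, values = line.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"line {line_no}: expected 'locus:rate,rate,...'")
        rates[name.strip()] = [float(v) for v in values.split(",") if v.strip()]
    return rates


def format_site_rates(rates):
    """Write {locus: [rates]} back out, one locus per line."""
    return "\n".join(
        f"{name}:{','.join(str(r) for r in values)}"
        for name, values in rates.items()
    )


# Made-up rates for two short loci.
example = "gene1:0.026,0.265,1.236\ngene2:0.046,0.002,0.014"
parsed = parse_site_rates(example)
print(parsed["gene1"])
print(format_site_rates(parsed))
```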
Why do you offer different programs?
After uploading the alignment and tree files, the user can choose a program from the drop-down menu with which to estimate the substitution rates. Once a program has been chosen, advanced options become available, offering different evolutionary models and parameters. For DNA sequences, we recommend HyPhy, for which a HyPhy batch file was created to implement all time-reversible models. Unlike DNArates, HyPhy also accepts multifurcating trees. For amino acid sequences, rate4site is provided.
What do the Y-axis values on the phylogenetic informativeness profile represent?
It is a normalized, asymptotic likelihood density for a true synapomorphy
occurring in an asymptotically short, deep internode at historical time T of a quartet of taxa
under an infinite-states Poisson model of character evolution.
Helpful? Let's expand on it a bit more. The key
words are "normalized" and "likelihood density". Because it is a
normalized likelihood density, the integral of the informativeness
is always one. What this means is that the height of the y-axis
is not important except for comparisons between partitions for the
same time period.
That's still probably pretty abstruse.
Practically, what this means is that the heights of your profiles
depend linearly on the unit you use to measure time in your
chronogram. If the branch lengths of your chronogram are
numerically high (say, you quantify time in minutes or seconds),
then your informativeness y-axis will have tiny magnitudes. If
the branch lengths are numerically low (say, you quantify time in
units of billions of years), then your informativeness y-axis will
have enormous magnitudes. If you use units of molecular evolution
that are chronometric but are not calibrated to an "absolute" time
scale, the same rule holds -- if the branch lengths are
numerically low, the informativeness y-axis will have large
magnitude, and if the branch lengths are numerically high, the
informativeness y-axis will have low magnitude. It is perfectly
fine to change or scale the unit of your chronometric tree so that
the y-axis adopts a range that suits your preference.
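The scaling rule above can be checked numerically. The sketch below assumes the per-site profile published in Townsend (2007), 16 * rate^2 * T * exp(-4 * rate * T), which is not restated in this FAQ; the rate value is made up and the function names are hypothetical.

```python
import math


def informativeness(t, rate):
    """Per-site profile assumed from Townsend (2007)."""
    return 16.0 * rate**2 * t * math.exp(-4.0 * rate * t)


def area(rate, t_max, steps=20_000):
    """Trapezoidal integral of the profile from 0 to t_max."""
    h = t_max / steps
    total = 0.5 * (informativeness(0.0, rate) + informativeness(t_max, rate))
    total += sum(informativeness(i * h, rate) for i in range(1, steps))
    return total * h


# The same site and the same history, expressed in two time units.
rate_per_myr = 0.01                # per million years (made-up value)
rate_per_yr = rate_per_myr / 1e6   # the identical rate, per year

# Under this formula the profile peaks at t = 1 / (4 * rate).
peak_myr = informativeness(1 / (4 * rate_per_myr), rate_per_myr)
peak_yr = informativeness(1 / (4 * rate_per_yr), rate_per_yr)

print("peak height, time in Myr:  ", peak_myr)
print("peak height, time in years:", peak_yr)
print("area, time in Myr:  ", area(rate_per_myr, 5_000))
print("area, time in years:", area(rate_per_yr, 5_000 * 1e6))
```

Measuring time in years rather than millions of years makes the branch lengths a million times larger and the peak a million times lower, while the area under the curve stays close to one in both cases.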
What is the difference between 'net' and 'per site' phylogenetic informativeness?
For each locus, one can calculate the net and per site phylogenetic informativeness. The net phylogenetic informativeness is normally the prediction of interest, as it should correlate with empirical results, such as the degree of support of a node. However, when noise is an issue, signal density is conveyed by the phylogenetic informativeness per site. Phylogenetic informativeness per site is relevant because the cost vs. benefit of sequencing and analysis may be quantified with such a measure, and because it compares the relative power of genes without the confounding influence of gene length. For example, one gene may show "good" net profiles, but there may be shorter genes (requiring less sequencing effort) that show better per site profiles. Combining shorter genes with a total sequencing effort equal to that of a longer gene can lead to better phylogenetic informativeness.
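A small numerical illustration of the distinction, offered as a sketch rather than as PhyDesign's implementation: it assumes the per-site profile of Townsend (2007), takes net informativeness as the sum over sites, and takes per site informativeness as that sum divided by the number of sites. The two loci, their rates, and the function names are invented for the example.

```python
import math


def site_informativeness(t, rate):
    """Per-site profile assumed from Townsend (2007)."""
    return 16.0 * rate**2 * t * math.exp(-4.0 * rate * t)


def net_informativeness(t, rates):
    """Sum of the site profiles of a locus at time t."""
    return sum(site_informativeness(t, r) for r in rates)


def per_site_informativeness(t, rates):
    """Net informativeness divided by locus length (assumed definition)."""
    return net_informativeness(t, rates) / len(rates)


# Two invented loci: a long, slowly evolving gene and a short, faster one.
long_gene = [0.002] * 1500
short_gene = [0.01] * 100
t = 25.0  # time of interest, in the same units as the rates

for label, rates in [("long", long_gene), ("short", short_gene)]:
    print(label, net_informativeness(t, rates), per_site_informativeness(t, rates))
```

With these invented values the long gene has the higher net informativeness at time t, while the short gene is higher per site, which is the situation described above.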
What kind of information can I extract from integrating over a specific epoch?
By integrating over specific epochs, one obtains the area below the profiles. The areas below the profiles allow us to rank genes based on the phylogenetic signal for that epoch. Integration values will be largest for the genes that have the highest probability of exhibiting substitutions in the given epoch that will not be obscured in subsequent branches. Integration does not account for noise.
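To make the ranking idea concrete, the sketch below integrates an assumed per-site profile over an epoch and sorts loci by the resulting area. It assumes the profile of Townsend (2007), 16 * rate^2 * T * exp(-4 * rate * T), whose antiderivative is -(4 * rate * T + 1) * exp(-4 * rate * T). The loci, rates, and function names are hypothetical, and, as noted above, noise is not accounted for.

```python
import math


def epoch_integral(rate, t_start, t_end):
    """Area under the assumed per-site profile between t_start and t_end."""
    a = 4.0 * rate

    def antiderivative(t):
        return -(a * t + 1.0) * math.exp(-a * t)

    return antiderivative(t_end) - antiderivative(t_start)


def rank_loci(site_rates, t_start, t_end):
    """Return (locus, area) pairs, largest area first."""
    scores = {
        locus: sum(epoch_integral(r, t_start, t_end) for r in rates)
        for locus, rates in site_rates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Two invented loci of equal length with different rates.
site_rates = {"fast": [0.05] * 300, "slow": [0.001] * 300}

print(rank_loci(site_rates, 0.0, 10.0))     # a recent epoch
print(rank_loci(site_rates, 200.0, 400.0))  # a deeper epoch
```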
Why are the graphics in SVG format?
The SVG plots produced by this website give the user high-quality graphics ready for publication in any format. In addition, they can easily be modified or improved with any vector graphics program, such as Adobe Illustrator or the open-source editor Inkscape.
How can I test if there is a significant difference between phylogenetic informativeness values?
We have not yet developed such a test. In principle, however, it is the same question as whether the sites exhibit different rates, which may be addressed by asking whether partitions should have different site rate distributions (e.g., different gamma distributions).
What is the high spike close to time 0 present in some profiles? How should I interpret those results?
Those very recent "phantom" spikes arise because the maximum likelihood estimate for the rate of a few sites has its peak at infinity (rather unrealistic). The software that PhyDesign calls to estimate the rates smacks up against its hard-coded limit, so those few sites are all estimated to evolve at one very fast rate, leading to a spike that has little biological meaning. The spikes can be caused by sites with indels or ambiguous sequence calls. Generally speaking, greater taxon sampling will help to better estimate the rate at those sites and thus draw down that peak until it disappears. In our experience with subsampling from a larger dataset, the profile from the larger dataset is essentially the same as the one from the subsample, except that the very recent peak is diminished or gone. Thus, it seems that the best thing to do is to exclude those sites on the justifiable grounds that their rate is simply not well estimated by maximum likelihood. One can identify the poorly estimated rates in the rates file supplied by PhyDesign.
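One way to locate such sites is to flag every rate at or above a cutoff chosen by inspecting the rates file. This is a minimal sketch: the cutoff, the example rates, and the function name are all assumptions, since the hard-coded limit depends on the rate-estimation program used.

```python
def split_by_cutoff(rates, cutoff):
    """Separate sites into kept and flagged lists of (1-based position, rate)."""
    kept, flagged = [], []
    for position, rate in enumerate(rates, start=1):
        if rate >= cutoff:
            flagged.append((position, rate))
        else:
            kept.append((position, rate))
    return kept, flagged


# Invented rates: two sites sit at an implausibly high, identical value.
rates = [0.02, 0.31, 100.0, 0.07, 100.0, 0.15]
kept, flagged = split_by_cutoff(rates, cutoff=100.0)

print("flagged positions:", [position for position, _ in flagged])
print("rates kept:", [rate for _, rate in kept])
```

The flagged positions can then be excluded from the alignment, or from the site rate file, before the profiles are recomputed.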
What does the error "No partition name list was found for [partition_name]." mean?
This error occurs when you upload a Nexus alignment with a 'Set' or MrBayes block describing character sets (charset command) to analyze multiple data partitions at the same time. It arises when a partition is set but a list with the partition names is not present or the names have been misspelled.
It can also arise when a partition is set but the number of partition names doesn't match the number of partitions indicated.
As indicated in the MrBayes wiki, "The elements of the partition command are:
(1) the name of the partitioning scheme; (2) an equal sign (=); (3) the number of
character divisions in the scheme; (4) a colon (:); and (5) a list of the characters
in each division, separated by commas."
What does the error "# of partitions [#] different from # of charset found [#]." mean?
This error occurs when you upload a Nexus alignment with a 'Set' or MrBayes block to analyze multiple data partitions at the same time. It arises when a partition is set, but the number of partition names doesn't match the number of character sets found using the 'charset' command. It can also arise from misspelled partition names.
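Both partition errors come down to a mismatch between the "partition" command and the "charset" commands, which can be checked before uploading. The sketch below is not PhyDesign's parser: it is a hypothetical pre-check that assumes MrBayes-style command syntax and simple alphanumeric names, and its messages are its own.

```python
import re


def check_partitions(nexus_text):
    """Return a list of problems found; an empty list means none were found."""
    charsets = re.findall(r"\bcharset\s+(\w+)\s*=", nexus_text, flags=re.IGNORECASE)
    # Matches "partition <scheme> = <number>: <name>, <name>, ...;"
    match = re.search(
        r"\bpartition\s+\w+\s*=\s*(\d+)\s*:\s*([^;]+);",
        nexus_text,
        flags=re.IGNORECASE,
    )
    if match is None:
        return ["no 'partition' command with a name list was found"]
    declared = int(match.group(1))
    names = [name.strip() for name in match.group(2).split(",")]
    problems = []
    if declared != len(names):
        problems.append(f"partition declares {declared} divisions but lists {len(names)} names")
    if len(names) != len(charsets):
        problems.append(f"{len(names)} partition names but {len(charsets)} charset commands")
    for name in names:
        if name not in charsets:
            problems.append(f"partition name '{name}' has no matching charset")
    return problems


# An invented block in which "gene2" is misspelled in its charset command.
block = """
BEGIN SETS;
    charset gene1 = 1-500;
    charset gen2 = 501-1200;
    partition by_gene = 2: gene1, gene2;
    set partition = by_gene;
END;
"""
print(check_partitions(block))
```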