Introduction to treesurgeon

Hi all. Welcome to my blog! I'm planning on posting a variety of stuff here; including bits and bobs I have been coding, new papers, stats tutorials and regular tangents into other stuff I'm interested (expect a dnd post in the near future). I thought I'd start off by taking you through a couple of useful functions from my R package 'treesurgeon', currently in development.

I'd love to tell you that treesurgeon was developed with some altruistic goal in mind, but really it started out as a place to store useful functions. Nonetheless, it has evolved into something that is probably useful to anyone interested in trees, morphological data sets and/or ancestral state estimation. Like virtually all phylogenetic R packages, treesurgeon builds on ape. It also leans heavily on phytools, utilising the fitMk() function.

One of the reasons you might use treesurgeon is if you are running combined, or 'total-evidence' phylogenetic analyses. With treesurgeon, nexus-format data matrices can be imported into R using read_nexdat(), which is a wrapper for ape::read.nexus.data(). This wrapper adds two useful things. Firstly, it can read partitioned nexus files containing more than one data type (datatype=mixed), such as those used by MrBayes. The user can specify whether to read these files as a single list of concatenated sequences or a list comprising separate sequence partitions. These are stored as objects of class 'nexdat' or 'multi.nexdat' respectively. The user can then extract summary statistics of the imported data set using summary().

if (!require(remotes)){
  install.packages("remotes")
}
remotes::install_github("evo-palaeo/treesurgeon")
data(Lavoue2016)
summary(Lavoue2016)

$data

Ntax Nchar

26 21019

$partitions

Ntax Nchar

standard 25 84

dna 13 20935

$total

missing gaps

Arapaima 0.10 0.00

Chauliopareion 0.00 0.00

Chitala 0.15 0.03

Elops 0.10 0.00

Eohiodon 0.00 0.00

Gnathonemus 0.03 0.02

Heterotis 0.13 0.00

Hiodon 0.04 0.01

Joffrichthys 0.00 0.00

Lycoptera 0.00 0.00

Notopterus 0.37 0.00

Ostariostoma 0.00 0.00

Osteoglossum 0.07 0.00

Palaeonotopterus 0.00 0.00

Pantodon 0.17 0.01

Papyrocranus 0.41 0.00

Paralycoptera 0.00 0.00

Petrogymn 0.00 0.01

Phareodus 0.00 0.00

Scleropages 0.00 0.02

Shuleichthys 0.00 0.00

Singida 0.00 0.00

Sinoglossus 0.00 0.00

Tanolepis 0.00 0.00

Xenomystus 0.07 0.01

Xixiaichthys 0.00 0.00

$standard

missing gaps ambiguous unambiguous

Elops 0.11 0 0 0.89

Hiodon 0.04 0 0 0.96

Pantodon 0.01 0 0 0.99

Heterotis 0.02 0 0 0.98

Arapaima 0.02 0 0 0.98

Osteoglossum 0.00 0 0 1.00

Scleropages 0.01 0 0 0.99

Petrogymn 0.15 0 0 0.85

Gnathonemus 0.19 0 0 0.81

Xenomystus 0.11 0 0 0.89

Papyrocranus 0.11 0 0 0.89

Chitala 0.11 0 0 0.89

Lycoptera 0.33 0 0 0.67

Paralycoptera 0.63 0 0 0.37

Tanolepis 0.58 0 0 0.42

Sinoglossus 0.65 0 0 0.35

Eohiodon 0.20 0 0 0.80

Joffrichthys 0.56 0 0 0.44

Phareodus 0.25 0 0 0.75

Singida 0.32 0 0 0.68

Ostariostoma 0.45 0 0 0.55

Palaeonotopterus 0.76 0 0 0.24

Chauliopareion 0.38 0 0 0.62

Xixiaichthys 0.33 0 0 0.67

Shuleichthys 0.27 0 0 0.73

$dna

missing gaps ambiguous unambiguous

Elops 0.10 0.00 0 0.90

Hiodon 0.04 0.01 0 0.95

Pantodon 0.17 0.01 0 0.82

Heterotis 0.13 0.00 0 0.87

Arapaima 0.10 0.00 0 0.90

Osteoglossum 0.07 0.00 0 0.92

Scleropages 0.00 0.02 0 0.98

Petrogymn 0.00 0.01 0 0.99

Gnathonemus 0.03 0.02 0 0.94

Xenomystus 0.07 0.01 0 0.92

Papyrocranus 0.41 0.00 0 0.59

Notopterus 0.37 0.00 0 0.62

Chitala 0.15 0.03 0 0.83

attr(,"class")

[1] "summary_nexdat"

As an aside, you can visualise mixed datasets using the ComplexHeatmap package.

data(Lavoue2016)
combined_data <- cat_data(Lavoue2016$standard, Lavoue2016$dna, use.part.info = F) 

if(!require("BiocManager", quietly = TRUE)){
	install.packages("BiocManager")
}
BiocManager::install("ComplexHeatmap")
library(ComplexHeatmap)
df <- t(as.data.frame(combined_data))
df <- tolower(df)
df[df == "?"] <- NA
df[df == "-"] <- NA
df[df == "n"] <- NA
states <- as.character(na.omit(unique(as.character(df))))
cols <- rep("x", length(states))
names(cols) <- states
morph_states <- suppressWarnings(which(is.na(as.numeric(names(cols))) == F))
mol_states <- which(names(cols) %in% c("a", "c", "g", "t"))
cols[morph_states] <- hcl.colors(n = length(morph_states), palette = "Hawaii")
cols[mol_states] <- hcl.colors(n = length(mol_states), palette = "zissou")
cols[-c(mol_states, morph_states)] <- hcl.colors(n = length(cols[-c(mol_states, morph_states)]), palette = "Cividis")
Heatmap(df[,1:1500], row_names_side = "left", col = cols, na_col = "white", name = "states")

The other improvement over read.nexus.data() is that read_nexdat() handles spaces between polymorphic or uncertain character states. The way I've coded this is a bit of a bodge job and as a result the function can't handle spaces in taxon names. I have thought of a more elegant solution which fixes this, which I will get round to implementing at some point.

Another useful function relates to specifying node calibrations for clock analyses. These calibrations are typically based on the fossil record and consist of a hard minimum age—usually determined by the oldest fossil reliably assigned to the group being constrained—and a soft maximum, which reflects prior knowledge of the group's maximum age while accounting for uncertainties in phylogenetic relationships, fossil sampling, stratigraphic correlation, and absolute dating. By applying different distributions, we can represent different prior beliefs about the age of a group based on the available evidence. The term 'soft' indicates that these distributions can extend beyond the maximum age, with the size of the tail reflecting our confidence in the reliability of the maximum constraint. HOWEVER, in practice, specifying these tails using phylogenetic software is not straightforward. For example, if I wanted to specify an exponential distribution for a node, MrBayes asks for the minimum age (fine) and the mean age (!!!). The relationship between the soft maximum age of the calibration and the mean age of the prior distribution is not straightforward and will depend on the group you are trying to calibrate and your prior beliefs. To this end, I have made two functions; exp_calib() and gamma_calib(). These functions allow the user to specify a hard minimum, a soft maximum, a shape parameter (for gamma calibrations) and crucially a percentile specifying where the soft maximum should occur in the distribution. The function then plots the calibration and prints the MrBayes command to the console.

##vertebrate calibration from Benton et al (2015): hard min = 457.5, soft max = 636.1.
par(mar = c(2.2,2.2,1.5,1.5))
par(mfrow=c(3,1))
##plot calibration assuming a decreasing probability the age of the node is older than the soft maximum age and a low probability the age of the node is older than the hard-minimum age.
gamma_calib(457.5, 636.1, shape = 1, position = 0.99, xlim = c(0, 1000), ylim = c(0, 0.025))
##plot calibration assuming a high probability the age of the node is older than the soft maximum age and a low probability the age of the node is older than the hard-minimum age.
gamma_calib(457.5, 636.1, shape = 3, position = 0.99, xlim = c(0, 1000), ylim = c(0, 0.025))
##plot calibration assuming a high probability the age of the node is older than the soft maximum age and a relatively high probability the age of the node is older than the hard-minimum age.
gamma_calib(457.5, 636.1, shape = 3, position = 0.75, xlim = c(0, 1000), ylim = c(0, 0.025))

Treesurgeon can also be used to visulise the results of a Bayesian phylogenetic analysis in treespace - more on that next time.

Comments