The PCMBase package is designed as a computational engine to calculate the likelihood for a user specified model evolving on a phylogenetic tree. The package comes with some pre-implemented models, e.g. Brownian Motion or Ornstein-Uhlenbeck, both allowing for multivariate correlated evolution. However, the mathematical theory behind the method for calculating the likelihood (Mitov et al. 2019) is not limited to these two models. In fact any stochastic process that belongs to the so-called GLInv family of models is implementable in PCMBase. From a mathematical perspective one needs to be able to implement functions that calculate the compound parameters \(\vec{\omega}\), \(\mathbf{\Phi}\) and \(\mathbf{V}\). Furthermore, from the implementation side one has to be able to handle the model defining parameters (default values, allowable parametrizations, etc.).
PCMBase’s interface is based on the S3 object system (an excellent introduction by Hadley Wickham is available at http://adv-r.had.co.nz/S3.html). Each implemented model will be an S3 object and below we will describe how users can implement new model classes by themselves.
We will show how to create one’s own model class with the example of Brownian motion with drift. The model is a simple generalization of the Brownian motion model by including a deterministic drift. It is defined by the stochastic differential equation (SDE) \[d\vec{X}(t) = \vec{h} dt + \mathbf{\Sigma}_{x} d W(t),\] where \(W(t)\) is the standard Wiener process, \(\vec{h}\) is a constant \(k\)-dimensional vector and \(\mathbf{\Sigma}_{x}\) is a an upper-triangular matrix with positive diagonal elements. Setting \(\vec{h}=\vec{0}\) reduces the model to a Brownian motion process. The expectation of the process at any time \(t\), given initial value at \(t=0\) is \(E[\vec{X}(t)] = \vec{X}(0)+t\vec{h}\) and the variance is \(Var[\vec{X}(t)] = t\mathbf{\Sigma}_{x}\mathbf{\Sigma}_{x}^{T}\). Here, Based on this we can calculate the the compound parameters \(\vec{\omega}\), \(\mathbf{\Phi}\) and \(\mathbf{V}\) defining the model in the GLInv family (Eq. 30, (Mitov et al. 2019)) \[ \vec{\omega} = \vec{h}t, \] \[ \mathbf{\Phi} = \mathbf{I}, \] \[ \mathbf{V} = \mathbf{\Sigma}_{x}\mathbf{\Sigma}_{x}^{T}t. \] The above equations hold for any internal node without measured trait data. For any tip or internal node with available trait values the variance is given by \[ \mathbf{V} = \mathbf{\Sigma}_{x}\mathbf{\Sigma}_{x}^{T}t + \mathbf{\Sigma}_{e,x}\mathbf{\Sigma}_{e,x}^{T} + \mathbf{S}_{e,i}\mathbf{S}_{e,i}^{T}, \] where \(\mathbf{\Sigma}_{e,x}\) is a regime-specific or global upper triangular matrix parameter with a non-negative diagonal specifying an unknown non-phylogenetic component of the trait variance covariance matrix (e.g. an unknown measurement error), and \(\mathbf{S}_{e,i}\) is a species-specific upper triangular matrix with a non-negative diagonal specifying a known standard error of the measured trait data for each species.
We will now show how to implement the Brownian motion with drift
model in a class called “BM_drift that inherits from the”GaussianPCM”
and “PCM” classes. It is easiest if one takes an .R file from the
PCMBase package that already implements a model class and then modifies
it accordingly. We will work here with the BM.R file (that implements
the BM model). We need to redefine the generic functions of the “PCM”
and “GaussianPCM” classes to suit the Brownian motion with drift model.
We remind the reader that the function name is composed of two parts
separated by the dot. The first part is the generic function, the second
the name of the class in which it is implemented. For example
PCMCond.BM_drift
is the PCMCond
function’s
instance in the “BM_drift” class.
PCMParentClasses.BM_drift
: function returning the
parental classes. of “BM_drift”:<- function(model) {
PCMParentClasses.BM_drift c("GaussianPCM", "PCM")
}
PCMDescribe.BM_drift
: function returning a custom
description of the model.<- function(model, ...) {
PCMDescribe.BM_drift "Brownian motion model with drift"
}
PCMCond.BM_drift
: the key function when implementing a
new model class. This function returns a list with functions for
calculating the \(\vec{\omega}\), \(\mathbf{\Phi}\) and \(\mathbf{V}\) compound parameters that
define the conditional distribution of the daughter node given the value
of the trait at its parent node (\(\vec{x}_{parent}\). Following (Mitov et al. 2019) we know that in the GLInv
family a daughter node conditional on its parent is normally distributed
with expectation \(\vec{\omega}+\mathbf{\Phi}\vec{x}_{parent}\)
and variance-covariance matrix \(\mathbf{V}\). In particular the \(\vec{\omega}\), \(\mathbf{\Phi}\) and \(\mathbf{V}\) compound parameters can depend
on time (i.e. the length of the branch from the parental to the daughter
node). In the case of the BM with drift model the \(\vec{\omega}=t\cdot \vec{h}\), \(\mathbf{\Phi}\) is the identity matrix and
\(\mathbf{V}\) is the same as in the
Brownian motion model (which in itself is a limit of the OU model, hence
the reuse of the PCMConcVOU function).<- function(
PCMCond.BM_drift r = 1, metaI = PCMInfo(NULL, tree, model, verbose = verbose),
tree, model, verbose=FALSE) {
<- if(is.Global(model$Sigma_x)){as.matrix(model$Sigma_x)}
Sigma_x else{as.matrix(model$Sigma_x[,, r])}
<- Sigma_x %*% t(Sigma_x)
Sigma if(!is.null(model$Sigmae_x)) {
<- if(is.Global(model$Sigmae_x)){as.matrix(model$Sigmae_x)}
Sigmae_x else{as.matrix(model$Sigmae_x[,,r])}
<- Sigmae_x %*% t(Sigmae_x)
Sigmae else {
} <- NULL
Sigmae
}
if(!is.null(model$h_drift)) {
<- if(is.Global(model$h_drift)) as.vector(model$h_drift) else model$h_drift[, r]
h_drift else{
}<- rep(0,nrow(Sigma_x))
h_drift
}
<- PCMCondVOU(matrix(0, nrow(Sigma), ncol(Sigma)), Sigma, Sigmae)
V <- function(t, edgeIndex, metaI) {
omega *h_drift
t
}<- function(t, edgeIndex, metaI, e_Ht = NULL) {
Phi diag(nrow(Sigma))
}list(omega = omega, Phi = Phi, V = V)
}
PCMDescribeParameters.BM_drift
: a function that
returns a list with a custom description of each model parameter.<- function(model, ...) {
PCMDescribeParameters.BM_drift list(
X0 = "trait values at the root",
h_drift = "drift vector modifying the expectation",
Sigma_x = "Upper triangular factor of the unit-time variance rate",
Sigmae_x = "Upper triangular factor of the non-heritable variance
or the variance of the measurement error")
}
PCMListParameterizations.BM_drift
: a function that
returns all possible parametrizations for the implemented model class.
These parametrizations correspond to how each parameter defining the
model can be parametrized. Probably from the perspective of just
calculating the likelihood given some parameters this function does not
seem that useful. However, one should not forget that PCMBase is
designed to be a computational engine providing the likelihood that will
be optimized over some other code. Here, \(\vec{X}(0)\) and \(\vec{h}\) are vectors
(i.e. “VectorParameter”). The “_AllEqual” parametrization means all the
entries of the vector should be equal. “_Global” means that it is the
same for all regimes (notice that \(\vec{X}(0)\) is the value at the root, so
it has to be common for all), “_Omitted” means not present (for \(\vec{h}\) this means that the model will
correspond to a Brownian motion with \(0\) drift). Finally “_Fixed” means that the
parameter is “known”, i.e. it is not to be optimized over. Matrix
parametrizations are more involved. For example one will not optimize
over a covariance matrix but e.g. over its decomposition as a product of
an upper triangular matrix with its transpose. Here we just have the
\(\mathbf{\Sigma}_{x}\) and \(\mathbf{\Sigma}_{e,x}\) matrices (however
see the “OU” class for more involved cases with the \(H\) matrix). Both of them enter the
likelihood (through \(\mathbf{V}\)) as
a product of themselves and their transposition i.e. \(\mathbf{\Sigma}_{x}\mathbf{\Sigma}_{x}^{T}\)
so decomposition into a triangular matrix (with non-negative diagonal)
suffices for unique identification of the matrix. Note that this
parametrization guarantees that the matrix \(\mathbf{\Sigma}_{x}\mathbf{\Sigma}_{x}^{T}\)
is a symmetric semi- positive-definite matrix. If the diagonal elements
of \(\mathbf{\Sigma}_{x}\) are strictly
non-zero \(\mathbf{\Sigma}_{x}\mathbf{\Sigma}_{x}^{T}\)
will be positive-definite. A detailed description of the different
possible parametrizations is given in the The
PCMBase Parametrization API guide.<- function(model, ...) {
PCMListParameterizations.BM_drift list(
X0 = list(
c("VectorParameter", "_Global"),
c("VectorParameter", "_Fixed", "_Global"),
c("VectorParameter", "_AllEqual", "_Global"),
c("VectorParameter", "_Omitted")),
h_drift = list(
c("VectorParameter"),
c("VectorParameter", "_Fixed"),
c("VectorParameter", "_AllEqual"),
c("VectorParameter", "_Omitted")),
Sigma_x = list(
c("MatrixParameter", "_UpperTriangularWithDiagonal", "_WithNonNegativeDiagonal"),
c("MatrixParameter", "_Diagonal", "_WithNonNegativeDiagonal"),
c("MatrixParameter", "_ScalarDiagonal", "_WithNonNegativeDiagonal")),
Sigmae_x = list(
c("MatrixParameter", "_UpperTriangularWithDiagonal", "_WithNonNegativeDiagonal"),
c("MatrixParameter", "_Diagonal", "_WithNonNegativeDiagonal"),
c("MatrixParameter", "_ScalarDiagonal", "_WithNonNegativeDiagonal"),
c("MatrixParameter", "_UpperTriangularWithDiagonal", "_WithNonNegativeDiagonal", "_Global"),
c("MatrixParameter", "_Diagonal", "_WithNonNegativeDiagonal", "_Global"),
c("MatrixParameter", "_ScalarDiagonal", "_WithNonNegativeDiagonal", "_Global"),
c("MatrixParameter", "_Omitted"))
) }
PCMListDefaultParameterizations.BM_drift
: this
function is optional to define but can be useful if only a subset of the
parametrizations defined in the PCMListParametrizations
function will actually be used in practice.<- function(model, ...) {
PCMListDefaultParameterizations.BM_drift list(
X0 = list(
c("VectorParameter", "_Global"),
c("VectorParameter", "_Omitted")
),h_drift = list(
c("VectorParameter")),
Sigma_x = list(
c("MatrixParameter", "_UpperTriangularWithDiagonal", "_WithNonNegativeDiagonal"),
c("MatrixParameter", "_Diagonal", "_WithNonNegativeDiagonal"),
c("MatrixParameter", "_ScalarDiagonal", "_WithNonNegativeDiagonal")
),
Sigmae_x = list(
c("MatrixParameter", "_Omitted"))
) }
PCMSpecify.BM_drift
: generate default model
parameters. Notice that here we obtain a singular model with 0 mean and
0 variance.<- function(model, ...) {
PCMSpecify.BM_drift <- list(
spec X0 = structure(0.0, class = c('VectorParameter', '_Global'),
description = 'trait values at the root'),
h_drift = structure(0.0, class = c('VectorParameter'),
description = 'drift vector modifying the expectation'),
Sigma_x = structure(0.0, class = c('MatrixParameter', '_UpperTriangularWithDiagonal',
'_WithNonNegativeDiagonal'),
description = 'Cholesky factor of the unit-time variance rate'),
Sigmae_x = structure(0.0, class = c('MatrixParameter', '_UpperTriangularWithDiagonal',
'_WithNonNegativeDiagonal'),
description = 'Upper triangular factor of the non-heritable variance
or the variance of the measurement error'))
attributes(spec) <- attributes(model)
if(is.null(names(spec))) names(spec) <- c('X0', 'h_drift', 'Sigma_x', 'Sigmae_x')
if(any(sapply(spec, is.Transformable))) class(spec) <- c(class(spec), '_Transformable')
spec }
Now that we have defined all the code necessary for the class let us demonstrate it. After running all the above code defining the “BM_drift” class we create a model instance. We do this for a two regimes model.
<- c(5, 2, 1) ## root state
X0
## in regime a traits evolve independently
<- rbind(c(1.6, 0.0, 0.0),c(0.0, 2.4, 0.0),c(0.0, 0.0, 2.0))
a.Sigma_x ## no jumps at the end of a branch
<- rbind(c(0.0, 0.0, 0.0),c(0.0, 0.0, 0.0),c(0.0, 0.0, 0.0))
a.Sigmae_x <-c(4, 5, 6)
a.h_drift
## in regime b evolution is correlated
<- rbind(c(1.6, 0.3, 0.3), c(0.0, 0.3, 0.4),c(0.0, 0.0, 2.0))
b.Sigma_x ## no jumps at the end of a branch
<- rbind(c(0.0, 0.0, 0.0),c(0.0, 0.0, 0.0),c(0.0, 0.0, 0.0))
b.Sigmae_x <-c(1, 2, 3)
b.h_drift
<- PCMParamBindRegimeParams(a = a.Sigma_x, b = b.Sigma_x)
Sigma_x <- PCMParamBindRegimeParams(a = a.Sigmae_x, b = b.Sigmae_x)
Sigmae_x <- PCMParamBindRegimeParams(a = a.h_drift, b = b.h_drift)
h_drift
<- PCM("BM_drift", k = 3, regimes = c("a", "b"),
PCMBase_model_BM_drift params = list(X0 = X0,h_drift = h_drift[,,drop=FALSE],
Sigma_x = Sigma_x[,,,drop=FALSE],Sigmae_x = Sigmae_x[,,,drop=FALSE]))
Now we simulate a random phylogeny using the same example code as in the Getting started guide.
# make results reproducible
set.seed(2, kind = "Mersenne-Twister", normal.kind = "Inversion")
# number of regimes
<- 2
R
# number of extant tips
<- 100
N
<- PCMTree(rtree(n=N))
tree.a PCMTreeSetLabels(tree.a)
PCMTreeSetPartRegimes(tree.a, part.regime = c(`101` = "a"), setPartition = TRUE)
<- PCMTreeListDescendants(tree.a)
lstDesc <- names(lstDesc)[which(sapply(lstDesc, length) > N/2 & sapply(lstDesc, length) < 2*N/3)][1]
splitNode
<- PCMTreeInsertSingletons(
tree.ab nodes = as.integer(splitNode),
tree.a, positions = PCMTreeGetBranchLength(tree.a, as.integer(splitNode))/2)
PCMTreeSetPartRegimes(
tree.ab,part.regime = structure(c("a", "b"), names = as.character(c(N+1, splitNode))),
setPartition = TRUE)
<- PCMColorPalette(2, c("a", "b"))
palette
# Plot the tree with branches colored according to the regimes.
# The following code works correctly only if the ggtree package is installed,
# which is not on CRAN.
<- PCMTreePlot(tree.ab)
plTree if(requireNamespace("ggtree")) {
<- plTree + ggtree::geom_nodelab(size = 2)
plTree
} plTree
We simulate the traits using PCMBase’s functionality.
<-PCMSim(tree.ab, PCMBase_model_BM_drift, X0)[,1:N] ## we only want the tip data
mData## NOTE that observations from different species are in the columns NOT in the rows as
## in other software
Finally we calculate the likelihood under the BM with drift model.
<- PCMLik(mData, tree.ab, PCMBase_model_BM_drift)
log_likprint(log_lik[1]) ## we just want to print the log-likelihood without the attributes
## [1] -564
If PCMBase is used as a computational engine for some inference package, then the above code can be one way of obtaining the likelihood under a “GaussianPCM” model object. However, in such a setup it is recommended to use the mechanism of creating a likelihood function for a particular dataset through PCMCreateLikelihood. This will speed up the calculations as it avoids re-creating some internal data objects for the tree every time the likelihood value is required. One just needs to update the parameters to obtain a new likelihood value.
## create an vector of appropriate length to store the vectorized model parameters
<- double(PCMParamCount(PCMBase_model_BM_drift))
v_param
# load the current model parameters into param
PCMParamLoadOrStore(PCMBase_model_BM_drift, v_param, offset=0, load=FALSE)
## [1] 33
print(v_param)
## [1] 5.0 2.0 1.0 4.0 5.0 6.0 1.0 2.0 3.0 1.6 0.0 2.4 0.0 0.0 2.0 1.6 0.3 0.3 0.3
## [20] 0.4 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## now create a likelihood function for the particular model and observed data
<- PCMCreateLikelihood(mData, tree.ab, PCMBase_model_BM_drift)
likFun
<-likFun(v_param)
log_lik_from_likFunprint(log_lik_from_likFun[1])
## [1] -564
print(log_lik_from_likFun[1]==log_lik[1])
## [1] TRUE
# modify slightly the model parameters
<- jitter(v_param)
v_param_2
print(v_param_2)
## [1] 4.9865 2.0053 1.0026 3.9961 4.9818 6.0110 0.9975 1.9800 2.9915
## [10] 1.5923 -0.0188 2.3946 -0.0055 0.0087 2.0048 1.6134 0.2837 0.2993
## [19] 0.2829 0.4137 2.0150 0.0012 0.0155 -0.0119 0.0025 -0.0003 -0.0184
## [28] 0.0090 0.0041 0.0028 0.0026 0.0060 -0.0174
# set the new parameter vector
<-PCMBase_model_BM_drift
PCMBase_model_BM_drift_2PCMParamLoadOrStore(PCMBase_model_BM_drift_2, v_param_2, offset = 0, load=TRUE)
## [1] 33
print(PCMBase_model_BM_drift_2)
## Brownian motion model with drift
## S3 class: BM_drift, GaussianPCM, PCM; k=3; p=33; regimes: a, b. Parameters/sub-models:
## X0 (VectorParameter, _Global, numeric; trait values at the root):
## [1] 5 2 1
## h_drift (VectorParameter, matrix; drift vector modyfing the expectation):
## a b
## [1,] 4 1
## [2,] 5 2
## [3,] 6 3
## Sigma_x (MatrixParameter, _UpperTriangularWithDiagonal, _WithNonNegativeDiagonal; Upper triangular factor of the unit-time variance rate):
## , , a
##
## [,1] [,2] [,3]
## [1,] 1.6 -0.019 -0.0055
## [2,] 0.0 2.395 0.0087
## [3,] 0.0 0.000 2.0048
##
## , , b
##
## [,1] [,2] [,3]
## [1,] 1.6 0.28 0.28
## [2,] 0.0 0.30 0.41
## [3,] 0.0 0.00 2.01
##
## Sigmae_x (MatrixParameter, _UpperTriangularWithDiagonal, _WithNonNegativeDiagonal; Upper triangular factor of the non-heritable variance or the variance of the measurement error):
## , , a
##
## [,1] [,2] [,3]
## [1,] 0.0012 0.015 0.0025
## [2,] 0.0000 -0.012 -0.0003
## [3,] 0.0000 0.000 -0.0184
##
## , , b
##
## [,1] [,2] [,3]
## [1,] 0.009 0.0041 0.0026
## [2,] 0.000 0.0028 0.0060
## [3,] 0.000 0.0000 -0.0174
##
##
<-likFun(v_param_2)
log_lik_from_likFun_2print(log_lik_from_likFun_2[1])
## [1] -564