\documentclass[12pt]{article}
\usepackage{Sweave}
\usepackage{myVignette}
\usepackage[authoryear,round]{natbib}
\bibliographystyle{plainnat}
\DefineVerbatimEnvironment{Sinput}{Verbatim}
{formatcom={\vspace{-2.5ex}},fontshape=sl,
  fontfamily=courier,fontseries=b, fontsize=\scriptsize}
\DefineVerbatimEnvironment{Soutput}{Verbatim}
{formatcom={\vspace{-2.5ex}},fontfamily=courier,fontseries=b,%
  fontsize=\scriptsize}
%%\VignetteIndexEntry{Implementation Details}
%%\VignetteDepends{Matrix}
%%\VignetteDepends{lme4}
\begin{document}
\SweaveOpts{engine=R,eps=FALSE,pdf=TRUE,strip.white=TRUE}
\SweaveOpts{prefix=TRUE,prefix.string=figs/Example,include=FALSE}
\setkeys{Gin}{width=\textwidth}
\title{Linear mixed model implementation in lme4}
\author{Douglas Bates\\Department of Statistics\\University of
  Wisconsin -- Madison\\\email{Bates@wisc.edu}}
\maketitle
\begin{abstract}
  Expressions for the evaluation of the profiled log-likelihood or
  profiled log-restricted-likelihood of a linear mixed model, the
  gradients and Hessians of these criteria, and update steps for an
  ECME algorithm to optimize these criteria are given in Bates and
  DebRoy (2004).  A representation of linear mixed models using
  positive semidefinite symmetric matrices and dense matrices is
  given in Bates (2004).  In this paper we present details of that
  representation and those computational methods in the \code{lme4}
  package for \RR{}.
\end{abstract}
<<preliminaries,echo=FALSE,results=hide>>=
library(lattice)
library(Matrix)
library(latticeExtra)
library(lme4)
data("Early", package = "mlmRev")
options(width = 80, show.signif.stars = FALSE,
        lattice.theme = function() canonical.theme("pdf", color = FALSE))
@
\section{Introduction}
\label{sec:Intro}

General formulae for the evaluation of the profiled log-likelihood
and profiled log-restricted-likelihood in a linear mixed model are
given in \citet{bate:debr:2004} and the use of a sparse matrix
representation for such models is described in \citet{bate:2004}.
The purpose of this paper is to describe the details of the
implementation of this representation and those computational
methods in the \code{lme4} package for \RR{}.

Because we concentrate on the computational methods and the
representation, the order and style of presentation will be based on
the sequence of calculations, not on the sequence in which the
results would be derived.  We will emphasize ``what'' and not
``why''.  For the ``why'', refer to the papers cited above.

In \S\ref{sec:Form} we describe the form and representation of the
model.  The calculation of the criteria to be optimized by the
parameter estimates and related quantities is discussed in
\S\ref{sec:Cholesky}.  Details of the calculation of the ECME step
and the evaluation of the gradients of the criteria are given in
\S\ref{sec:ECME} and those of the Hessian in \S\ref{sec:Hessian}.
In \S\ref{sec:Unconstrained} we give the details of an unconstrained
parameterization for the model and the transformation of our results
to this parameterization.
\section{Form and representation of the model}
\label{sec:Form}

We consider linear mixed models of the form
\begin{equation}
  \label{eq:lmeGeneral}
  \by=\bX\bbeta+\bZ\bb+\beps\quad
  \beps\sim\mathcal{N}(\bzer,\sigma^2\bI),
  \bb\sim\mathcal{N}(\bzer,\sigma^2\bOmega^{-1}),
  \beps\perp\bb
\end{equation}
where $\by$ is the $n$-dimensional response vector, $\bX$ is an
$n\times p$ model matrix for the $p$ dimensional fixed-effects
vector $\bbeta$, $\bZ$ is the $n\times q$ model matrix for the $q$
dimensional random-effects vector $\bb$ that has a Gaussian
distribution with mean $\bzer$ and relative precision matrix
$\bOmega$ (i.e., $\bOmega$ is the precision of $\bb$ relative to the
precision of $\beps$), and $\beps$ is the random noise assumed to
have a spherical Gaussian distribution.  The symbol $\perp$
indicates independence of random variables.  The matrix $\bX$ has
full column rank and $\bOmega$ is positive definite.

The random effects vector $\bb$ and the columns of $\bZ$ are divided
into $k$ outer blocks associated with grouping factors
$\bff_i, i=1,\dots,k$ in the data.  Because so many of the
expressions that we will use involve the grouping factors and
quantities associated with them, we will reserve the letter $i$ to
index the grouping factors and related quantities.  We will not
explicitly state the range of $i$, which will always be
$i=1,\dots,k$.

The outer blocks in $\bb$ and the columns of $\bZ$ are further
subdivided into $m_i$ inner blocks of size $q_i$ where $m_i$ is the
number of levels of $\bff_i$ (i.e.{} the number of distinct values
in $\bff_i$).  Each grouping factor has associated with it an
$n\times q_i$ model matrix $\bZ_i$.  The full model matrix $\bZ$ is
derived from the grouping factors $\bff_i$ and these submodel
matrices $\bZ_i$.

Random effects in different outer blocks are independent.  Within
each of these blocks, the inner blocks of random effects are
independent and identically distributed with mean $\bzer$ and
$q_i\times q_i$ variance-covariance matrix $\sigma^2\bOmega_i\inv$.
Thus $\bOmega$ is block diagonal in $k$ blocks of size
$m_i q_i\times m_i q_i$ and each of these blocks is itself block
diagonal in $m_i$ blocks, each of which is the positive definite
symmetric $q_i\times q_i$ matrix $\bOmega_i$.  We call this a
\emph{repeated block/block diagonal} matrix.

\subsection{Model specification and representation}
\label{sec:Representation}

Like most model fitting functions in the S language, the \code{lmer}
function in the \code{lme4} package uses a formula/data
specification.  The formula specifies how to evaluate the response
$\by$ and the fixed-effects model matrix $\bX$ in the data frame
provided as the \code{data} argument.  Random-effects terms in the
formula, such as \code{(tos|id)} in the examples below, specify the
grouping factors and the formulae for evaluation of the model
matrices $\bZ_i$.  Standard optional arguments, such as
\code{na.action} and \code{subset}, are passed through to the
\code{model.frame} function that returns the model frame in which to
evaluate all the formulae of the model.  The
\code{drop.unused.levels} optional argument is set to \code{TRUE} so
that unused levels in any factors are dropped during creation of the
model frame.  Thus there is no ambiguity regarding the number of
levels $m_i$ of each of the $\bff_i$.  Every level in each of the
$\bff_i$ must occur at least once in the model frame.
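To make the structure of $\bZ$ concrete, here is a small
illustration (not part of the \code{lme4} code; the factor \code{f},
covariate \code{x} and submodel matrix \code{Zi} are invented for
the example) that builds the outer block of columns of $\bZ$ for a
single grouping factor with a random intercept and a random slope,
so that $q_i=2$.
<<Zexample>>=
f  <- factor(c("a", "a", "b", "b", "c"))  # grouping factor with m = 3 levels
x  <- c(0.5, 1, 0.5, 1, 0.5)              # within-group covariate
Zi <- cbind(1, x)                         # n x q submodel matrix, q = 2
n <- length(f); m <- nlevels(f); q <- ncol(Zi)
Z <- matrix(0, n, m * q)                  # outer block: m inner blocks of q columns
for (obs in seq_len(n)) {                 # row obs lies in the block for level f[obs]
    Z[obs, (as.integer(f)[obs] - 1L) * q + seq_len(q)] <- Zi[obs, ]
}
Z
@
Each row of \code{Zi} is placed in the inner block of columns
corresponding to the level of \code{f} for that observation; all
other entries in that row of \code{Z} are zero.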
\subsubsection{Pairwise crosstabulation}

We begin by ordering the grouping factors so that
$m_1\ge m_2\ge\dots\ge m_k$ and, if $k>1$, forming their
\emph{pairwise crosstabulation}.  This is the crossproduct matrix,
$\bT=\tilde{\bZ}\trans\tilde{\bZ}$, for the \emph{variance
  components model} determined by these grouping factors.  (In a
variance components model $q_1=q_2=\dots=q_k=1$ and each of the
$\tilde{\bZ}_i$ is the $n\times 1$ matrix of $1$s.  The $i$th outer
block of columns in $\tilde{\bZ}$ is the set of indicator columns
for the levels of $\bff_i$.  The name ``variance components''
reflects the fact that, in this model, each of the $\bOmega_i$,
which are scalars, is the relative precision of the component of the
variation in the response attributed to factor $\bff_i$.)

In fact it is not necessary to form and store $\bT$.  All that we
need are the locations of the nonzeros in the off-diagonal blocks of
the representation
\begin{equation}
  \label{eq:PairwiseCrosstab}
  \bT=\begin{bmatrix}
    \bT_{(1,1)}&\bT_{(2,1)}\trans&\dots&\bT_{(k,1)}\trans\\
    \bT_{(2,1)}&\bT_{(2,2)}&\dots&\bT_{(k,2)}\trans\\
    \vdots&\vdots&\ddots&\vdots\\
    \bT_{(k,1)}&\bT_{(k,2)}&\dots&\bT_{(k,k)}
  \end{bmatrix} .
\end{equation}
(Because the $i$th outer block of columns in $\tilde{\bZ}$ is a set
of indicators, the diagonal blocks will themselves be diagonal and
their patterns of nonzeros are trivial.)  The blocks in the strict
lower triangle, $\bT_{(i,j)}, i>j$, are stored as a list of
compressed, sparse, column-oriented matrices (see
Appendix~\ref{app:Sparse} for details).  Only the column pointers
and the row indices of these sparse matrices are used.  These
off-diagonal blocks are easily calculated because the integer
representations of the factors $\bff_i$ and $\bff_j$ form the row
and column indices in a (possibly redundant) triplet representation
of $\bT_{(i,j)}$.  All that we need to do is convert the triplet
representation to the compressed, sparse, column-oriented
representation, a common conversion that is easily accomplished with
standard software.

We check whether each of the matrices
$\bT_{(j,j-1)},j=2,\dots,k$ has exactly one nonzero in each column.
If so, the grouping factors form a \emph{nested sequence} in that
each level of factor $\bff_i$ occurs with exactly one level of
factor $\bff_j$, $i<j$, and several steps in the subsequent
calculations can be skipped.  If $k>1$ we form the symbolic
factorization of $\bT$; that is, we determine the positions of the
nonzeros in the unit lower triangular matrix $\tL$ of an LDL
factorization of $\bT$.  (We cannot use the Cholesky decomposition
of $\bT$ when $k>1$ because $\bT$ is only positive semidefinite, not
positive definite.  It is composed of multiple blocks of indicators
and the row sums within each block are always unity.)  Because
$\bT_{(1,1)}$ is diagonal, the block in the $(1,1)$ position of
$\tL$ will be both diagonal and unit, which is to say that it is the
identity matrix of size $m_1$.  A consequence of the $(1,1)$ block
being the identity is that the structure of the first column of
blocks of $\tL$ is the same as the corresponding block of $\bT$.
That is, the nonzeros in $\tL_{(j,1)}$ are in the same positions as
those in $\bT_{(j,1)}$ for $1<j\le k$.

\subsection{An example}
\label{ssec:Example}

To illustrate the model specification we use the \code{Early} data
on the cognitive development of infants from the \code{mlmRev}
package, loaded in the preliminaries chunk above.  We first convert
the age at which each measurement was taken to time on study,
\code{tos}.
<<EarlyTos>>=
Early$tos <- Early$age - 0.5
str(Early)
@
<<EarlyData,fig=TRUE,echo=FALSE>>=
print(xyplot(cog ~ age | id, Early, type = 'b', aspect = 'xy',
             layout = c(23,5), between = list(y = c(0,0,0.5)),
             skip = rep(c(0,1), c(58,11)), xlab = "Age (yr)",
             ylab = "Cognitive development score",
             scales = list(x = list(tick.number = 3, alternating = TRUE,
                                    labels = c("1","","2"), at = c(1,1.5,2))),
             par.strip.text = list(cex = 0.7)))
@
A lattice plot of the longitudinal scores is given in
Figure~\ref{fig:EarlyData}.
\begin{figure}[tbp]
  \centering
  \includegraphics[width=\textwidth]{figs/Example-EarlyData}
  \caption[Cognitive development scores]{Cognitive development
    scores for 103 infants; 45 in the control group (identification
    numbers above 900) and 58 in the treatment group.  The data from
    each infant are displayed in separate panels.  Those for the
    infants in the treatment group are in the lower three rows of
    panels; the control group are in the upper two rows.  Within
    each group the panels are ordered by increasing initial
    cognitive score.}
  \label{fig:EarlyData}
\end{figure}

These data are grouped by subject, as is common in longitudinal
data.  The only grouping factor for random effects is \code{id}.
(The subjects are themselves grouped into a treatment group and a
control group but that dichotomy would be modeled with fixed
effects, not random effects.)  As an initial model an analyst may
fit an \emph{unconditional growth}
model~\citep[ch.~4]{Sing:Will:2003}
<<fm1>>=
fm1 <- lmer(cog ~ tos + (tos|id), Early)
@
or proceed immediately to a model that allows for the effects of the
treatment
<<fm2>>=
fm2 <- lmer(cog ~ tos * trt + (tos|id), Early)
@
The structure of the resulting object is
<<strfm2>>=
str(fm2)
@

\section{Fr\'echet derivatives}
\label{sec:Frechet}

We will denote the Fr\'echet derivative of the matrix $\bOmega$ with
respect to $\bOmega_i$ by $\der_{\bOmega_i}\bOmega$.  This is a
four-dimensional array that can be regarded as an indexed set of
$q_i^2$ matrices, each the same size as $\bOmega$.  These matrices,
which we call the $q_i^2$ ``faces'' of the array, are indexed by the
rows $r$ and the columns $c$, $r,c=1,\dots,q_i$ of $\bOmega_i$.  The
$(r,c)$th face, which we write as $\der_{i:r,c}\bOmega$, is the
derivative of $\bOmega$ with respect to the $(r,c)$th element of
$\bOmega_i$.  It is a repeated block/block diagonal matrix where all
of the outer diagonal blocks are zero except for the $(i,i)$
diagonal block, which is block diagonal in $m_i$ blocks of
$\bbe_r\bbe_c\trans$, where $\bbe_j$ is the $j$th column of the
identity matrix of size $q_i$.

An expression of the form
$\tr\left[\left(\der_{\bOmega_i}\bOmega\right)\bM\right]$ where
$\bM$ is a matrix of the same size as $\bOmega$ evaluates to $q_i^2$
scalars indexed by $r$ and $c$, $r,c=1,\dots,q_i$, which we convert
to a $q_i\times q_i$ matrix in the obvious way.  This type of
expression can be simplified as
\begin{equation}
  \label{eq:derOmegai}
  \tr\left[(\der_{\bOmega_i}\bOmega)\bM\right]
  = \sum_{j=1}^{m_i}\tr\left[(\der_{\bOmega_i}\bOmega_i)\bM_{i,i,j,j}\right]
  = \sum_{j=1}^{m_i}\bM_{i,i,j,j}\trans
\end{equation}
where $\bM_{i,i,j,j}$ is the $j$th inner diagonal block in the $i$th
outer diagonal block of $\bM$.  To establish the last equality in
(\ref{eq:derOmegai}) note that for any $q_i\times q_i$ matrix $\bA$
the entry in row $r$ and column $c$ of
$\tr\left[(\der_{\bOmega_i}\bOmega_i)\bA\right]$ is
\begin{equation}
  \label{eq:rowRcolCentry}
  \left\{\tr\left[(\der_{\bOmega_i}\bOmega_i)\bA\right]\right\}_{r,c}
  = \tr\left[\bbe_r\bbe_c\trans\bA\right]
  = \tr\left[\bbe_c\trans\bA\bbe_r\right]
  = \bbe_c\trans\bA\bbe_r
  = \bA_{c,r} .
\end{equation}
If $\bM$ is symmetric, as is the case in all the expressions of this
form that we consider, then
$\tr\left[\left(\der_{\bOmega_i}\bOmega\right)\bM\right]$ will be
symmetric.
Expressions of this form that we will require include
\begin{equation}
  \label{eq:derOmegaInv}
  \begin{aligned}
    \tr\left[\left(\der_{\bOmega_i}\bOmega\right)\bOmega^{-1}\right]
    &= \sum_{j=1}^{m_i}\left(\bOmega_{i,i,j,j}\right)\inv\\
    &= m_i\bOmega_i\inv
  \end{aligned} ,
\end{equation}
\begin{equation}
  \label{eq:bbDer}
  \tr\left[\left(\der_{\bOmega_i}\bOmega\right)
    \left(\frac{\widehat{\bb}}{\ryy}
      \frac{{\widehat{\bb}}\trans}{\ryy}\right)\right]
  = \bB_i\bB_i\trans
\end{equation}
where $\bB_i$ is the $q_i\times m_i$ matrix whose $j$th column is
$\widehat{\bb}_{i,j}/\ryy, j=1,\dots,m_i$ (recall that these vectors
are in the $(p+1)$st column of the \code{RZX} slot after the
inversion step), and
$\tr\left[\left(\der_{\bOmega_i}\bOmega\right)
  (\bZ\trans\bZ+\bOmega)^{-1}\right]$.  The last term requires
evaluation of the inner diagonal blocks of
\begin{equation}
  \label{eq:ZtZinv}
  \left(\bZ\trans\bZ+\bOmega\right)^{-1}
  =\left(\RZZ\trans\RZZ\right)\inv
  =\bL\invtrans\bD^{-1/2}\bD^{-\mathsf{T}/2}\bL^{-1} .
\end{equation}
When $k=1$, $\bL$ is an identity matrix and the result is simply
\begin{equation*}
  \tr\left[\left(\der_{\bOmega_i}\bOmega\right)
    \bD^{-1/2}\bD^{-\mathsf{T}/2}\right]=
  \sum_{j=1}^{m_1}\bD_{1:j}^{-1/2}\bD_{1:j}^{-\mathsf{T}/2} .
\end{equation*}
When $k>1$ we must evaluate the crossproduct of each of the $m_i$
inner blocks of $q_i$ columns in the $i$th outer block of columns of
$\bD^{-\mathsf{T}/2}\bL^{-1}=\RZZ\invtrans$, which we do using
blocked, sparse matrix techniques.  See
Appendix~\ref{ssec:InverseBlocks} for details.

For REML results we also need
\begin{multline}
  \label{eq:Vbterm}
  \tr\left[\left(\der_{\bOmega_i}\bOmega\right)\vb\right] = \\
  \tr\left[\left(\der_{\bOmega_i}\bOmega\right)
    \RZZ\inv\RZZ\invtrans\right]
  + \tr\left[\left(\der_{\bOmega_i}\bOmega\right)
    \RZZ\inv\RZX\RXX\inv\RXX\invtrans\RZX\trans\RZZ\invtrans\right]
\end{multline}
We have just described how to calculate the first term in
(\ref{eq:Vbterm}).  The second term is evaluated as the
crossproducts of the $m_i$ inner blocks of $q_i$ rows in the $i$th
outer block of the rows of $-\RZZ\inv\RZX\RXX\inv$, which is
calculated and stored in the \code{RZX} slot during the inversion of
the factorization, as described in \S\ref{ssec:Inverting}.

\section{ECME updates}
\label{sec:ECME}

ECME iterations provide a stable, but only linearly convergent,
optimization algorithm for the objective functions $f$ and $f_R$.
We use a moderate number of these iterations (the default is 20) to
refine our initial estimates $\bOmega_i^{(0)}$ and bring us into the
region of the optimum before switching to Newton iterations on the
unconstrained parameter vector $\btheta$.  The ECME iterations can
be defined in terms of the $\bOmega_i$.
At the $\nu$th iteration, $\bOmega^{(\nu+1)}$ is defined to be the
repeated block/block diagonal, symmetric, positive definite matrix
that satisfies the $k$ systems of equations
\begin{equation}
  \label{eq:theta1}
  \tr\left[\der_{\bOmega_i}\bOmega\left(
      \frac{\widehat{\bb}^{(\nu)}}{\widehat{\sigma}^{(\nu)}}
      \frac{{\widehat{\bb}^{(\nu)}}\trans}{\widehat{\sigma}^{(\nu)}}+
      \left(\bZ\trans\bZ+\bOmega^{(\nu)}\right)^{-1}
      -\left(\bOmega^{(\nu+1)}\right)^{-1}\right)\right]=\bzer
\end{equation}
for ML estimates or the $k$ systems
\begin{equation}
  \label{eq:theta1R}
  \tr\left[\der_{\bOmega_i}\bOmega\left(
      \frac{\widehat{\bb}^{(\nu)}}{\widehat{\sigma}_R^{(\nu)}}
      \frac{{\widehat{\bb}^{(\nu)}}\trans}{\widehat{\sigma}_R^{(\nu)}}+
      \vb^{(\nu)}
      -\left(\bOmega^{(\nu+1)}\right)^{-1}\right)\right]=\bzer
\end{equation}
for REML.  The results of \S\ref{sec:Frechet} provide
\begin{equation}
  \label{eq:theta1soln}
  \bOmega_i^{(\nu+1)}=m_i\left\{
    \tr\left[\left(\der_{\bOmega_i}\bOmega\right)
      (\bZ\trans\bZ+\bOmega^{(\nu)})^{-1}\right]
    +n\bB_i^{(\nu)}{\bB_i^{(\nu)}}\trans\right\}^{-1}
\end{equation}
for ML and
\begin{equation}
  \label{eq:theta1Rsoln}
  \bOmega_i^{(\nu+1)}=m_i\left\{
    \tr\left[\left(\der_{\bOmega_i}\bOmega\right)\vb^{(\nu)}\right]
    +(n-p)\bB_i^{(\nu)}{\bB_i^{(\nu)}}\trans\right\}^{-1}
\end{equation}
for REML.

\subsection{Gradient evaluations}
\label{ssec:Gradient}

The gradients of (\ref{eq:ProfiledLogLik}) and
(\ref{eq:ProfiledLogRestLik}) with respect to $\bOmega_i$ are
\begin{align}
  \label{eq:gradDev}
  \nabla_{\bOmega_i}f&=\tr\left[\der_{\bOmega_i}\bOmega\left(
      (\bZ\trans\bZ+\bOmega)^{-1}-\bOmega^{-1}+
      \frac{\widehat{\bb}}{\widehat{\sigma}}
      \frac{{\widehat{\bb}}\trans}{\widehat{\sigma}}\right)\right]\\
  \label{eq:gradDevRest}
  \nabla_{\bOmega_i} f_R&=\tr\left[\der_{\bOmega_i}\bOmega\left(
      \vb-\bOmega^{-1}+
      \frac{\widehat{\bb}}{\widehat{\sigma}_R}
      \frac{{\widehat{\bb}}\trans}{\widehat{\sigma}_R}\right)\right]
\end{align}
and we have already described how to calculate all of those terms.

\section{Evaluation of the Hessian}
\label{sec:Hessian}

The Hessians of the scalar functions $f$ and $f_R$ are symmetric
matrices of second derivatives.  As for the calculation of the
gradient, we first evaluate the Hessian with respect to the
$Q=\sum_{i=1}^k q_i^2$ dimensional vector
$\bomega=\left[\bomega_1\trans,\dots,\bomega_k\trans\right]\trans$
where $\bomega_i=\ovec\bOmega_i$ and `$\ovec$' is the vectorization
function that produces a column vector from a matrix by
concatenating the columns.  That is, we temporarily ignore the fact
that the $\bOmega_i$ must be positive definite and symmetric and we
consider each of the elements in these matrices separately.  In
\S\ref{sec:Unconstrained} we describe an unconstrained parameter
vector $\btheta$ of length $\sum_{i=1}^k\binom{q_i+1}{2}$ and the
conversion of the gradient and Hessian for $\bomega$ to the
corresponding quantities for $\btheta$.  For convenience we index
the $Q$ elements of $\bomega$ by triplets $i:r,c$ denoting the
element at row $r$ and column $c$ of $\bOmega_i$.
\citet{bate:debr:2004} show that the element in row $i:r,c$ and
column $j:s,t$ of the Hessian is
\begin{equation}
  \label{eq:HessianML}
  \begin{aligned}
    \left\{\nabla_{\bomega}^2 f\right\}_{i:r,c;j:s,t}=
    &\tr\left[\der_{i:r,c}\left(\der_{j:s,t}\bOmega\right)
      \left(\left(\bZ\trans\bZ+\bOmega\right)\inv-\bOmega\inv+
        \frac{\widehat{\bb}}{\widehat{\sigma}}
        \frac{{\widehat{\bb}}\trans}{\widehat{\sigma}}\right)\right]\\
    &-\tr\left[\der_{i:r,c}\bOmega\left(\bZ\trans\bZ+\bOmega\right)\inv
      \der_{j:s,t}\bOmega\left(\bZ\trans\bZ+\bOmega\right)\inv\right]\\
    &+\tr\left[\left(\der_{i:r,c}\bOmega\right)\bOmega\inv
      \left(\der_{j:s,t}\bOmega\right)\bOmega\inv\right]\\
    &-2\frac{{\widehat{\bb}}\trans}{\widehat{\sigma}}
    \left(\der_{i:r,c}\bOmega\right)\vb
    \left(\der_{j:s,t}\bOmega\right)\frac{\widehat{\bb}}{\widehat{\sigma}}\\
    &-\frac{1}{n}
    \left(\frac{{\widehat{\bb}}\trans}{\widehat{\sigma}}
      \left(\der_{i:r,c}\bOmega\right)
      \frac{\widehat{\bb}}{\widehat{\sigma}}\right)
    \left(\frac{{\widehat{\bb}}\trans}{\widehat{\sigma}}
      \left(\der_{j:s,t}\bOmega\right)
      \frac{\widehat{\bb}}{\widehat{\sigma}}\right)
  \end{aligned}
\end{equation}
for ML estimates, and
\begin{equation}
  \label{eq:HessianREML}
  \begin{aligned}
    \left\{\nabla_{\bomega}^2 f_R\right\}_{i:r,c;j:s,t}=
    &\tr\left[\der_{i:r,c}\left(\der_{j:s,t}\bOmega\right)
      \left(\vb-\bOmega\inv+
        \frac{\widehat{\bb}}{\widehat{\sigma}}
        \frac{{\widehat{\bb}}\trans}{\widehat{\sigma}}\right)\right]\\
    &-\tr\left[\der_{i:r,c}\bOmega\vb
      \der_{j:s,t}\bOmega\vb\right]\\
    &+\tr\left[\left(\der_{i:r,c}\bOmega\right)\bOmega\inv
      \left(\der_{j:s,t}\bOmega\right)\bOmega\inv\right]\\
    &-2\frac{{\widehat{\bb}}\trans}{\widehat{\sigma}_R}
    \left(\der_{i:r,c}\bOmega\right)\vb
    \left(\der_{j:s,t}\bOmega\right)\frac{\widehat{\bb}}{\widehat{\sigma}_R}\\
    &-\frac{1}{n-p}
    \left(\frac{{\widehat{\bb}}\trans}{\widehat{\sigma}_R}
      \left(\der_{i:r,c}\bOmega\right)
      \frac{\widehat{\bb}}{\widehat{\sigma}_R}\right)
    \left(\frac{{\widehat{\bb}}\trans}{\widehat{\sigma}_R}
      \left(\der_{j:s,t}\bOmega\right)
      \frac{\widehat{\bb}}{\widehat{\sigma}_R}\right)
  \end{aligned}
\end{equation}
for REML.  The matrix $\der_{j:s,t}\bOmega$ is constant so the first
term in both (\ref{eq:HessianML}) and (\ref{eq:HessianREML}) is
zero.

\section{Unconstrained parameterization}
\label{sec:Unconstrained}

The vector
$\btheta=\left[\btheta_1\trans,\dots,\btheta_k\trans\right]\trans$
is an unconstrained parameterization of $\bOmega$ based on the LDL
form of the Cholesky decomposition of the $\bOmega_i$
\begin{equation}
  \label{eq:OmegaLDL}
  \bOmega_i=\bL_i\bD_i\bL_i\trans
\end{equation}
where $\bL_i$ is a $q_i\times q_i$ unit lower triangular matrix and
$\bD_i$ is a $q_i\times q_i$ diagonal matrix with positive diagonal
elements.  We use the $q_i$ logarithms of the diagonal elements of
$\bD_i$, written $\bdelta_i$, and, when $q_i>1$, the row-wise
concatenation of the $\binom{q_i}{2}$ elements in the strict lower
triangle of $\bL_i$, written $\blambda_i$, as the
$\binom{q_i+1}{2}$ dimensional unconstrained parameter
$\btheta_i=\left[\bdelta_i\trans,\blambda_i\trans\right]\trans$.

Let $g$ be any scalar function of the $\bOmega_i$ such that the
$q_i\times q_i$ gradient matrices $\nabla_{\bOmega_i}g$ are
symmetric.  (Both $f$, the objective for ML estimation, and $f_R$,
the objective for REML estimation, are such functions.)
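To make the parameterization concrete, the following sketch (for
illustration only; the function \code{thetaToOmega} is ours and not
part of the \code{lme4} internals) reconstructs $\bOmega_i$ from
$\btheta_i$ for a given $q_i$.
<<thetaToOmega>>=
thetaToOmega <- function(theta, q) {
    delta  <- theta[seq_len(q)]   # delta_i: logarithms of the diagonal of D_i
    lambda <- theta[-seq_len(q)]  # lambda_i: strict lower triangle of L_i, row-wise
    Lt <- diag(q)                 # build t(L_i): filling its upper triangle
    Lt[upper.tri(Lt)] <- lambda   # column-wise is row-wise in the lower triangle of L_i
    L <- t(Lt)
    L %*% (exp(delta) * t(L))     # Omega_i = L_i D_i L_i' with D_i = diag(exp(delta_i))
}
thetaToOmega(c(0.3, -0.2, 0.5), q = 2)  # theta_i has choose(q+1, 2) = 3 elements
@
Because the diagonal of $\bD_i$ enters through its logarithm and the
elements of $\blambda_i$ are unrestricted, every value of
$\btheta_i$ produces a symmetric positive definite $\bOmega_i$.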
The components $\nabla_{\bdelta_i}g$ and $\nabla_{\blambda_i}g$ of
the gradient vector $\nabla_{\btheta_i}g$ can be evaluated from the
symmetric gradient matrix $\nabla_{\bOmega_i}g$ and the derivative
of the product (\ref{eq:OmegaLDL}).  For example,
\begin{equation}
  \label{eq:deltaPartial}
  \begin{aligned}
    \left\{\nabla_{\bdelta_i}g\right\}_j
    &=\tr\left[\left(\nabla_{\bOmega_i}g\right)
      \frac{\partial \bOmega_i}{\partial \left\{\bdelta_i\right\}_j}\right]\\
    &=\tr\left[\left(\nabla_{\bOmega_i}g\right)\bL_i\frac{\partial{\bD_i}}
      {\partial\left\{\bdelta_i\right\}_j}\bL_i\trans\right]\\
    &=\exp\left(\left\{\bdelta_i\right\}_j\right)\bbe_j\trans
    \bL_i\trans\left(\nabla_{\bOmega_i}g\right)\bL_i\bbe_j\\
    &=\left\{\bR_i\left(\nabla_{\bOmega_i}g\right)\bR_i\trans\right\}_{j,j}
  \end{aligned}
\end{equation}
for $j=1,\dots,q_i$, where $\bR_i$ is the upper Cholesky factor of
$\bOmega_i$.  The partial derivative with respect to element $t$ of
$\blambda_i$, which determines the $(r,c)$th element of $\bL_i$, is
\begin{equation}
  \label{eq:lambdaPartial}
  \begin{aligned}
    \left\{\nabla_{\blambda_i}g\right\}_t
    &=\tr\left[\left(\nabla_{\bOmega_i}g\right)
      \frac{\partial\bOmega_i}{\partial\left\{\blambda_i\right\}_t}\right]\\
    &=\tr\left[\left(\nabla_{\bOmega_i}g\right)
      \bbe_r\bbe_c\trans\bD_i\bL_i\trans
      +\left(\nabla_{\bOmega_i}g\right)
      \bL_i\bD_i\bbe_c\bbe_r\trans\right]\\
    &=\bbe_c\trans\bD_i\bL_i\trans\left(\nabla_{\bOmega_i}g\right)\bbe_r
    +\bbe_r\trans\left(\nabla_{\bOmega_i}g\right)\bL_i\bD_i\bbe_c\\
    &=2\left\{\left(\nabla_{\bOmega_i}g\right)\bL_i\bD_i\right\}_{r,c}\\
    &=2\left\{\left(\nabla_{\bOmega_i}g\right)\bR_i\trans\bD_i^{1/2}\right\}_{r,c}
  \end{aligned}
\end{equation}

\section{Acknowledgements}
\label{sec:Ack}

This work was supported by U.S.{} Army Medical Research and Materiel
Command under Contract No.{} DAMD17-02-C-0119.  The views, opinions
and/or findings contained in this report are those of the authors
and should not be construed as an official Department of the Army
position, policy or decision unless so designated by other
documentation.

\appendix

\section{Sparse matrix formats}
\label{app:Sparse}

The basic form of sparse matrix storage, called the \emph{triplet}
form, represents the matrix as three arrays of length \code{nz},
which is the number of nonredundant nonzero elements in the matrix.
The \code{i} array contains the row indices, the \code{j} array the
column indices, and the \code{x} array the values of the nonzero
elements.  A \emph{column-oriented} triplet form requires that the
elements of these arrays be such that \code{j} is nondecreasing.  A
sorted, column-oriented triplet form is such that elements in
\code{i} corresponding to the same \code{j} index are in increasing
order.  That is, the arrays are ordered by column and, within
column, by row.

In column-oriented storage, multiple elements in the same column
produce runs in \code{j}.  A \emph{compressed}, column-oriented
storage form replaces \code{j} by \code{p}, an array of pointers to
the indices (in \code{i} and \code{x}) of the first element in each
column.  If there are $n$ columns in the matrix, this array is of
length $n+1$, with the last element of \code{p} being one greater
than the index of the last element in the last column.  The
differences of successive elements of \code{p} then give the number
of nonzeros in each column.
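As a small illustration of the compression (with invented data,
using the 0-based pointer convention of the \code{Matrix} package
discussed below), the pointers \code{p} can be computed by counting
the nonzeros in each column.
<<tripletCompress>>=
i <- c(1, 3, 2, 3, 4)       # row indices
j <- c(1, 1, 2, 2, 4)       # nondecreasing column indices
x <- c(10, 20, 30, 40, 50)  # nonzero values
nc <- 4                     # number of columns in the matrix
(p <- c(0L, cumsum(tabulate(j, nbins = nc))))  # 0-based column pointers
## the same pointers as stored by the Matrix package
M <- sparseMatrix(i = i, j = j, x = x, dims = c(4, 4))
M@p
@
Differencing successive elements of \code{p} recovers the counts
$2, 2, 0, 1$ of nonzeros in the four columns.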
In the implementation in the \code{Matrix} package for \RR{}, the
indices \code{i}, \code{j}, and \code{p} are \code{0}-based, as in
the C language (i.e.{} the first element of an array has index
\code{0}), not \code{1}-based, as in the S language.  Thus the first
element of \code{p} is always zero.

\section{Blocked sparse matrix operations}
\label{app:blocked}

We take advantage of the blocked sparse structure of
$\bZ\trans\bZ$, $\bOmega$, $\bL$ and $\bD$ in operations involving
these matrices.  Consider, for example, the operation of solving for
$\tRZX$ in the system (\ref{eq:RZX}),
$\bL\bD^{\mathsf{T}/2}\tRZX=\bZ\trans\tilde{\bX}$.  Let $\bu$ be a
column of $\bD^{\mathsf{T}/2}\tRZX$ and $\bv$ be the corresponding
column of $\bZ\trans\tilde{\bX}$.  The corresponding column of the
result, which is $\bD^{-\mathsf{T}/2}\bu$, is easily evaluated from
$\bu$.  (Recall that $\bD^{-1/2}$ is block/block diagonal, i.e.{}
block diagonal in $k$ outer blocks of sizes
$m_i q_i\times m_i q_i$, each of which is itself block diagonal in
$m_i$ blocks of size $q_i\times q_i$, and that $\bD^{-1/2}$ is
calculated and stored during the inversion step.)

Both $\bu$ and $\bv$ can be divided into $k$ outer blocks of sizes
$m_i q_i$, which we denote $\bu_i$ and $\bv_i$, and each of the
outer blocks can be divided into $m_i$ inner blocks of size $q_i$,
which we denote $\bu_{i:j}$ and $\bv_{i:j}$.  Because $\bL$ is lower
triangular, the blocks of $\bu$ can be evaluated in a blockwise
forward solve.  That is, the blocked equations are solved in the
order
\begin{equation}
  \label{eq:blockwiseForward}
  \begin{aligned}
    \bL_{(1,1)}\bu_1 &= \bv_1\quad\Longrightarrow\quad
    \bI\bu_1 = \bu_1 = \bv_1\\
    \bL_{(2,2)}\bu_2 &= \bv_2-\bL_{(2,1)}\bu_1\\
    &\vdots\\
    \bL_{(k,k)}\bu_k &= \bv_k-\bL_{(k,k-1)}\bu_{k-1}-\dots-\bL_{(k,1)}\bu_1
  \end{aligned}
\end{equation}
Notice that the solution in the first block does not require any
calculation because $\bL_{(1,1)}$ is always the identity.  In many
applications $k=1$ and the operation of solving $\bL\bu=\bv$ can be
skipped entirely.  In most other applications the size of $\bu_1$
dominates the size of the other components so that being able to set
$\bu_1=\bv_1$ represents a tremendous savings in computational
effort.

We take advantage of the blocked, sparse columns of the outer blocks
of $\bL$ in evaluations of the form $\bv_2-\bL_{(2,1)}\bu_{1}$.
That is, we evaluate $\bL_{(2,1)}\bu_{1}$ using the blocks of size
$q_2\times q_1$, skipping blocks that we know will be zero, as we
update the inner blocks of $\bu_2$ in place.  The solutions of
systems of the form $\bL_{(i,i)}\bu_i=\tilde{\bv}_i$, where
$\tilde{\bv}_i$ is the updated $\bv_i$, are also done in inner
blocks taking advantage of the sparsity of $\bL_{(i,i)}^{-1}$.
Recall that in the case of nested grouping factors the
$\bL_{(i,i)}$ are all identity matrices and this step is skipped.

\subsection{Accumulating diagonal blocks of the inverse}
\label{ssec:InverseBlocks}

The ECME steps and the gradient evaluation require
\begin{equation}
  \label{eq:InverseBlocks}
  \tr\left[\der_{\bOmega_i}\bOmega\left(\bZ\trans\bZ+\bOmega\right)^{-1}\right]
  = \tr\left[\left(\der_{\bOmega_i}\bOmega\right)
    \left(\bD^{-\mathsf{T}/2}\bL\inv\right)\trans
    \left(\bD^{-\mathsf{T}/2}\bL\inv\right)\right]
\end{equation}
which will be the sum of the crossproducts of the $m_i$ inner blocks
of $q_i$ columns in the $i$th outer block of columns of
$\bD^{-\mathsf{T}/2}\bL\inv$.
When $i=k=1$ the result is
$\sum_{j=1}^{m_1}\bD^{-1/2}_{1:j}\bD^{-\mathsf{T}/2}_{1:j}$, which
can be evaluated in a single call to the level-3 BLAS routine
\code{dsyrk}.  (Having the three-dimensional array containing the
$m_i$ inner blocks of the $i$th outer block of $\bD^{-1/2}$ in the
form that allows this to be done in a single, level-3 BLAS call is
the reason that we store and invert the upper triangular factors of
inner blocks of $\bD$.)  When $i=1$ and $k>1$ we initialize the
result from the $(1,1)$ block as above, then add the crossproducts
of inner blocks of columns in the outer $(2,1)$ block, the outer
$(3,1)$ block, and so on.  These blocks of columns are calculated
sparsely.  During the initial decomposition we evaluate and store
(in blocked, sparse format) $\bL^{-1}_{(i,i)}$ for $1<i\le k$, and
these stored inverses are used in the sparse evaluation of the
blocks of columns for $i>1$.

\section{Conversion to the unconstrained parameterization}
\label{app:Unconstrained}

Expressions for the derivatives of $\ovec(\bOmega_i)$ with respect
to $\btheta_i$ when $q_i>1$, including the
$\binom{q_i+1}{2}\times q_i^2$ Jacobian
$\bJ_i=\nabla_{\btheta_i}\ovec(\bOmega_i)$ and the
$\binom{q_i+1}{2}\times\binom{q_i+1}{2}\times q_i^2$ second
derivative array $\nabla_{\btheta_i}^2\ovec(\bOmega_i)$, are given
in \citet{bate:debr:2004}.

For $g_i$ a scalar function of $\bOmega_i$ we denote by
$\nabla_{\bOmega_i}g_i$ the $q_i\times q_i$ gradient matrix with
elements
\begin{equation}
  \label{eq:GradientMatrix}
  \left\{\nabla_{\bOmega_i}g_i\right\}_{s,t}=
  \frac{\partial g_i}{\partial\left\{\bOmega_i\right\}_{s,t}}\quad
  s,t=1,\dots,q_i .
\end{equation}
Although $\bOmega_i$ is required to be symmetric, we do not impose
this restriction on $\nabla_{\bOmega_i}g_i$.  (However, for all the
functions $g_i(\bOmega_i)$ that we will consider this gradient is
indeed symmetric.)  The $\binom{q_i+1}{2}$ dimensional gradient is
\begin{equation}
  \label{eq:Chainrule}
  \nabla_{\btheta_i}g_i(\bOmega_i)=\bJ_i\,\ovec(\nabla_{\bOmega_i}g_i)
\end{equation}
and the $\binom{q_i+1}{2}\times\binom{q_i+1}{2}$ Hessian is
\begin{equation}
  \label{eq:ChainHessian}
  \nabla_{\btheta_i}^2 g_i(\bOmega_i)
  =\bJ_i\left(\nabla_{\ovec(\bOmega_i)}^2 g_i\right)\bJ_i\trans
  +\left[\nabla_{\btheta_i}^2\ovec(\bOmega_i)\right]
  \left[\ovec(\nabla_{\bOmega_i} g_i)\right]
\end{equation}
where the ``square bracket'' multiplication in the last term
represents an inner product over the last index in
$\nabla_{\btheta_i}^2\ovec(\bOmega_i)$.

Eventually we want to produce the $\sum_{i=1}^k\binom{q_i+1}{2}$
dimensional vector $\tr\left[(\der_{\btheta}\bOmega)\bM\right]$ for
some matrix $\bM$ of the same size as $\bOmega$.  We obtain the
gradient with respect to $\btheta_i$ as
\begin{equation}
  \label{eq:derthetai}
  \tr\left[(\der_{\btheta_i}\bOmega)\bM\right]=
  \bJ_i\,\ovec\left(\tr\left[(\der_{\bOmega_i}\bOmega)\bM\right]\right)
\end{equation}
and concatenate these vectors to produce
$\tr\left[(\der_{\btheta}\bOmega)\bM\right]$.
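As a numerical check of (\ref{eq:Chainrule}) we can approximate
$\bJ_i$ by finite differences and compare the two sides.  This
sketch reuses the illustrative \code{thetaToOmega} function from
\S\ref{sec:Unconstrained} and takes $g_i$ to be the log-determinant,
whose gradient matrix, with the elements treated as unrestricted, is
$\bOmega_i^{-\mathsf{T}}$.
<<chainRuleCheck>>=
g  <- function(Omega) c(determinant(Omega, logarithm = TRUE)$modulus)
dg <- function(Omega) t(solve(Omega))   # gradient matrix of the log-determinant
q <- 2; theta <- c(0.3, -0.2, 0.5); eps <- 1e-7
## finite-difference Jacobian J_i of ovec(Omega_i) with respect to theta_i
J <- t(sapply(seq_along(theta), function(l) {
    th <- theta; th[l] <- th[l] + eps
    (as.vector(thetaToOmega(th, q)) - as.vector(thetaToOmega(theta, q))) / eps
}))                                     # choose(q+1, 2) rows, q^2 columns
J %*% as.vector(dg(thetaToOmega(theta, q)))   # gradient via (eq:Chainrule)
sapply(seq_along(theta), function(l) {        # direct finite differences of g
    th <- theta; th[l] <- th[l] + eps
    (g(thetaToOmega(th, q)) - g(thetaToOmega(theta, q))) / eps
})
@
The two gradient evaluations agree to within the accuracy of the
finite-difference approximation.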