Likelihood and MLE

To summarize the previous two chapters, in NONMEM nomenclature, a model consists of a structural component $$ y = f(x, \phi, t) + h(x, \phi, t)\epsilon, $$ and a statistical component $$ \phi = \mu(z, \theta, \eta). $$ This formulation in fact describes the likelihood. To simplify the presentation, let us assume homoscedasticity so that \(h(\cdot)\equiv 1\). Let \(p(y_i|\phi_i, \Sigma_i)\) be the likelihood for individual i, as a function of the individual parameters \(\phi_i\) and the diagonal matrix \(\Sigma_i\). Then the NONMEM assumption that \(y_i\) is multivariate normal can be expressed as $$ p(y_i|\phi_i, \Sigma_i) \propto \det(\Sigma_i)^{-1/2}\exp\left[ -\frac{1}{2}(y_i-f(x, \phi_i, t))^T\Sigma_i^{-1}(y_i-f(x, \phi_i, t)) \right]. $$ Assume \(\phi_i=\mu(z_i, \theta)+\eta_i\) and recall that \(\eta_i\) also has a multivariate normal distribution, so that $$ p(\phi_i | \theta, \Omega) \propto \det(\Omega)^{-1/2}\exp\left[ -\frac{1}{2}\eta_i^T\Omega^{-1}\eta_i \right]; $$ then the individual likelihood is

\begin{equation} p(y_i, \eta_i|\Sigma_i, \theta, \Omega)=p(y_i|\phi_i, \Sigma_i)p(\phi_i | \theta, \Omega). \end{equation}

In classical NONMEM methods, we fit the model by estimating the population parameters, with \(\eta_i\) marginalized out, based on the individual likelihood (keep in mind that \(\phi_i\) is a function of \(\eta_i\))

\begin{equation} L_i\equiv p(y_i|\Sigma_i, \theta, \Omega) = \int_{-\infty}^{\infty} p(y_i|\phi_i, \Sigma_i)p(\phi_i | \theta, \Omega)\,d\phi_i = \int_{-\infty}^{\infty} p(y_i|\phi_i, \Sigma_i)p(\eta_i | \Omega)\,d\eta_i. \end{equation}
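The integral above rarely has a closed form, which is why NONMEM methods approximate it. As a concrete illustration, a naive Monte Carlo estimate of \(L_i\) draws \(\eta\) samples from \(\mathcal{N}(0, \Omega)\) and averages the conditional likelihoods. The following sketch is not NONMEM's algorithm; the function names and the toy decay model are hypothetical, and a homoscedastic residual \(\Sigma_i=\sigma^2 I\) is assumed:

```python
import numpy as np

def marginal_likelihood_mc(y_i, f, mu_i, omega, sigma2, n_samples=10_000, seed=0):
    """Naive Monte Carlo estimate of L_i = E_eta[ p(y_i | phi_i = mu_i + eta, Sigma_i) ],
    assuming a homoscedastic residual error, Sigma_i = sigma2 * I."""
    rng = np.random.default_rng(seed)
    etas = rng.multivariate_normal(np.zeros(len(mu_i)), omega, size=n_samples)
    n_obs = len(y_i)
    norm_const = (2 * np.pi * sigma2) ** (-n_obs / 2)  # normalizing coefficient of p(y_i|phi_i)
    lik = 0.0
    for eta in etas:
        resid = y_i - f(mu_i + eta)
        lik += norm_const * np.exp(-0.5 * resid @ resid / sigma2)
    return lik / n_samples

# hypothetical toy model: one-parameter exponential decay observed at three times
t = np.array([1.0, 2.0, 4.0])
f = lambda phi: np.exp(-phi[0] * t)
y = np.array([0.60, 0.38, 0.14])
L_i = marginal_likelihood_mc(y, f, mu_i=np.array([0.5]),
                             omega=np.array([[0.04]]), sigma2=0.01)
```

In practice the classical NONMEM methods avoid this sampling cost by linearizing the model, which is what distinguishes FO, FOCE, and related approximations.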

The total likelihood for the population is $$ L=\prod_iL_i=L_1\times L_2\times\dots \times L_{N}. $$ The reason that we use "proportional to" (\(\propto\)) instead of "equal to" (\(=\)) is that we fit the model by maximizing the likelihood \(L\), and the maximizer is not affected by the normalizing coefficients. In practice we actually maximize the log-likelihood \(\mathcal{L}\), for its superior numerical stability and because it is simpler to work with for common distributions (from the exponential family): $$ \mathcal{L}=\sum_i\mathcal{L}_i=\sum_i\log(L_i). $$ NONMEM's point estimation methods aim to adjust the parameters to maximize \(\mathcal{L}\). These methods are called maximum likelihood estimation (MLE) methods, and the corresponding "optimal" parameter values maximum likelihood estimates, denoted \(\hat{\theta}\), \(\hat{\Omega}\), and \(\hat{\Sigma}\). In fact, NONMEM goes one step further by minimizing an objective function \(\mathcal{T}\), a certain approximation of \(-2\mathcal{L}\). Different NONMEM MLE methods define \(\mathcal{T}\) differently, but they all aim to simplify the evaluation of \(\mathcal{L}\) while keeping the estimates "close" to the MLE solution:

$$ \text{arg}\,\text{min}_{\theta, \Omega, \Sigma}\,\mathcal{T}\approx \text{arg}\,\text{min}_{\theta, \Omega, \Sigma}\,(-2\mathcal{L})=\text{arg}\,\text{max}_{\theta, \Omega, \Sigma}\,\mathcal{L}. $$ The approximation is unbiased for the predicted individual mean responses based on \(\hat{\eta}=0\) and \(\epsilon=0\). That is, the prediction is for the "typical" responses of individuals with the same fixed effects. In addition, one can also use \(\hat{\theta}\), \(\hat{\Omega}\), and \(\hat{\Sigma}\) to calculate the covariance matrix (hence the standard errors) of individual predictions.

See Bayesian methods for how the individual parameters \(\eta\) are calculated in MLE, using empirical Bayes estimation (EBE) (aka posthoc estimation).

Least-squares method

It is helpful to view the extended least-squares (ELS) method used by classical NONMEM methods through the lens of MLE. To simplify the presentation, here we only consider the individual model $$ y_{ij} = f(x_{ij}, \phi_i) + \epsilon_{ij},\quad j=1,\dots,N_i, $$ and assume uncorrelated normal residual errors: $$ \epsilon_{ij}\sim \mathcal{N}(0, \sigma_j^2). $$

The individual OFV for ELS is then $$ \mathcal{T}_i(\phi_i, \upsilon, \xi)=\sum_j\left( \frac{(y_{ij}-f(x_{ij},\phi_i))^2}{\sigma_j^2} + \log(\sigma^2_{j}) \right), $$ where \(\upsilon\) and \(\xi\) parameterize the variance model for \(\sigma_j\) (defined below), and the model OFV is $$ \mathcal{T}(\theta, \Omega, \Sigma) = \sum_{i=1}^N \mathcal{T}_i(\phi_i, \upsilon, \xi). $$ This is a weighted least-squares objective function with a penalty:

  • for each observation, the squared difference between the observation \(y\) and the model prediction \(f()\) is weighted by the inverse of the variance (the "precision");
  • the log term penalizes large variances, so that the weights cannot be driven arbitrarily small.

One can see that the above ELS formulation is exactly \(-2\mathcal{L}_i\) up to an additive constant, with \(\sigma_j^2\) the diagonal elements of \(\Sigma_i\). The likelihood function \(p(\phi_i|\theta,\Omega)\) in the previous section shows that the dependence on \((\theta, \Omega)\) enters through the \(\mu()\) model for \(\phi_i\).

In fact, the original ELS uses a particular form of \(\sigma_j\): $$ \sigma_{j} = \upsilon f(x_{ij}, \phi_i)^{\xi}, $$ with \(\upsilon\) and \(\xi\) both unknown parameters (which can be modeled as elements of THETA). We can see that this is equivalent to the power error model.
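As a concrete illustration, the individual ELS objective with the power variance model takes only a few lines. The following sketch is hypothetical (names and toy data are not from NONMEM); it evaluates \(\mathcal{T}_i\) for one subject, given the model predictions:

```python
import numpy as np

def els_ofv_i(y_i, f_i, upsilon, xi):
    """ELS objective T_i for one subject, with the power error model
    sigma_j = upsilon * f(x_ij, phi_i)**xi, i.e. Var(eps_ij) = (upsilon * f_i**xi)**2."""
    var = (upsilon * f_i ** xi) ** 2
    return np.sum((y_i - f_i) ** 2 / var + np.log(var))

# hypothetical toy predictions and observations
y = np.array([1.10, 0.52, 0.26])
pred = np.array([1.00, 0.50, 0.25])
ofv_additive = els_ofv_i(y, pred, upsilon=0.1, xi=0.0)      # xi = 0: constant variance
ofv_proportional = els_ofv_i(y, pred, upsilon=0.1, xi=1.0)  # xi = 1: proportional error
```

Note how the special cases fall out of the power model: \(\xi=0\) recovers an additive (constant-variance) error, and \(\xi=1\) a proportional error.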

Covariance (standard error) estimation

Let the vector \(p=(p_1,p_2,\dots, p_n)\) be the collection of all the entries of \( (\theta, \Omega, \Sigma) \), and \(\hat{p}=(\hat{p}_1,\hat{p}_2,\dots,\hat{p}_n)\) the corresponding MLE solution. NONMEM provides the control record $COVARIANCE to estimate the accuracy of \(\hat{p}\), based on an approximation of \(\mathcal{I}^{-1}\), the inverse of the Fisher information matrix (FIM) \(\mathcal{I}\). To do so, NONMEM calculates two matrices.

  • Matrix \(R\), $$ R_{kl}=\frac{\partial^2 \mathcal{T}}{\partial p_k\,\partial p_l}, \quad k,l=1,\dots,n, $$ evaluated at \(\hat{p}\). \(R\) approximates the observed Fisher information matrix up to a constant factor (recall that \(\mathcal{T}\) approximates \(-2\mathcal{L}\)).
  • Matrix \(S=\sum_iS_i\), where $$ S_i=\nabla_i \nabla_i^T,\quad \nabla_i=\frac{\partial \mathcal{T}_i}{\partial p}, $$ evaluated at \(\hat{p}\). That is, \(S_i\) is the outer product of the gradient of the individual OFV with respect to \(p\). \(S\) approximates the empirical Fisher information matrix up to a constant factor.

Under the assumption that the random effects are normally distributed, \(R/N\) and \(S/N\) converge to the same matrix as \(N\rightarrow \infty\), and the inverse of either matrix serves as a covariance matrix estimate. Without the normality assumption, the "sandwich" matrix \(R^{-1}SR^{-1}\) is a covariance matrix estimate. This is the default estimate in NONMEM, but users have the option to choose \(R^{-1}\) or \(S^{-1}\) in the $COVARIANCE record.
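As an illustration of how \(R\), \(S\), and the sandwich estimate fit together, the following sketch builds them from a per-subject OFV function. It is hypothetical: the names and toy example are not NONMEM's, and finite differences stand in for NONMEM's internal derivative computations:

```python
import numpy as np

def covariance_estimates(ofv_i, p_hat, h=1e-5):
    """Compute the sandwich estimate R^{-1} S R^{-1} at the MLE p_hat.

    ofv_i : callable mapping a parameter vector p -> the (N,) vector of
            per-subject OFV contributions T_i(p); the total OFV is their sum.
    """
    n = len(p_hat)
    n_subjects = len(ofv_i(p_hat))
    # per-subject gradients by central differences -> rows grad_i of shape (N, n)
    grads = np.empty((n_subjects, n))
    for k in range(n):
        e = np.zeros(n); e[k] = h
        grads[:, k] = (ofv_i(p_hat + e) - ofv_i(p_hat - e)) / (2 * h)
    S = grads.T @ grads  # S = sum_i grad_i grad_i^T
    # Hessian R of the total OFV by a central four-point formula
    total = lambda p: ofv_i(p).sum()
    R = np.empty((n, n))
    for k in range(n):
        for l in range(n):
            ek = np.zeros(n); ek[k] = h
            el = np.zeros(n); el[l] = h
            R[k, l] = (total(p_hat + ek + el) - total(p_hat + ek - el)
                       - total(p_hat - ek + el) + total(p_hat - ek - el)) / (4 * h * h)
    R_inv = np.linalg.inv(R)
    return R_inv @ S @ R_inv, R_inv, np.linalg.inv(S)

# toy check: per-subject OFV (p - y_i)^2, whose minimizer is the sample mean
data = np.array([1.0, 1.2, 0.8, 1.1, 0.9])
ofv = lambda p: (p[0] - data) ** 2
sandwich, r_inv, s_inv = covariance_estimates(ofv, np.array([data.mean()]))
```

For this toy quadratic OFV the sandwich estimate recovers the familiar \(\text{var}/N\) variance of the sample mean.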

Goodness-of-Fit

NONMEM always outputs three Goodness-of-Fit (GoF) quantities: prediction, residual, and weighted residual.

The prediction ("PRED" in NONMEM output) is the result of \(f()\) based on the population estimates. In other words, it is computed at the estimated population mean by setting \(\eta=0\): $$ \text{PRED}_{ij} = f(x_{ij}, \mu(z_i, \hat{\theta})). $$

The residual ("RES" in NONMEM output) is the difference between the observation and PRED: $$ \text{RES}_{ij} = y_{ij} - \text{PRED}_{ij}. $$

Since in practice the elements of RES are inevitably correlated, the weighted residuals ("WRES" in NONMEM output) often serve as an alternative: $$ \text{WRES}_{i}={\text{cov}^{-1/2}(y_i)}\,{\text{RES}_i}. $$ Here \(\text{cov}^{1/2}\) denotes the matrix square root of the covariance matrix, and \(\text{cov}^{-1/2}\) the inverse of the matrix square root. Note that here we only have the individual subscript; the WRES and \(y\) in the definition are therefore in general vectors (for the multiple observations of subject i). Note also that the structure of \(\text{cov}\) depends on the approximation \(\mathcal{T}\).
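The inverse matrix square root \(\text{cov}^{-1/2}\) is typically computed from an eigendecomposition of the (symmetric, positive definite) covariance matrix. A minimal sketch, with hypothetical names and toy data:

```python
import numpy as np

def weighted_residuals(y_i, pred_i, cov_i):
    """WRES_i = cov(y_i)^{-1/2} (y_i - PRED_i), where cov^{-1/2} is the
    symmetric inverse matrix square root from an eigendecomposition."""
    vals, vecs = np.linalg.eigh(cov_i)  # cov_i is symmetric positive definite
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return inv_sqrt @ (y_i - pred_i)

# hypothetical example: two correlated observations of one subject
cov = np.array([[0.04, 0.01],
                [0.01, 0.09]])
wres = weighted_residuals(np.array([1.1, 2.3]), np.array([1.0, 2.0]), cov)
```

If the model is correct, the decorrelated residuals should behave like independent standard normal draws, which is what makes WRES useful as a diagnostic.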

One can request additional quantities with $TABLE.

  • OBJI. These are objective function values for each individual. The sum of the individual objective function values is equal to the total objective function.
  • NPRED, NRES, NWRES. These are the non-conditional, no eta-epsilon interaction, PRED, RES, and WRES values. They are identical to those issued by NONMEM V as PRED, RES, and WRES.
  • PREDI, RESI, WRESI. These are the non-conditional, with eta-epsilon interaction, PRED, RES, and WRES values. They are identical to those issued by NONMEM VI as PRED, RES, and WRES. WRESI will not differ from NWRES if INTERACTION was not selected in the previous $EST command.
  • CPRED, CRES, CWRES [2]. These are the conditional, no eta-epsilon interaction, PRED, RES, and WRES values. Let \(\hat{\eta}\) be the conditional mode ETAs (from FOCE or ITS)¹, or the conditional mean ETAs (from Monte Carlo EM methods). If \(\hat{\eta}\) is available from a previous $EST estimation, then for subject \(i\), the conditional prediction (CPRED), the conditional residual (CRES), and the conditional weighted residual (CWRES) are $$ \text{CPRED}_i = f|_{\hat{\theta}, \hat{\eta}_i} - \left.\frac{\partial f}{\partial \eta_i}\right|_{\hat{\theta},\hat{\eta}_i}\hat{\eta}_i, $$ $$ \text{CRES}_i = y_i - \text{CPRED}_i, $$ $$ \text{CWRES}_i = {\text{cov}^{-1/2}(y_i)}|_{\hat{\eta}_i}\,\text{CRES}_i. $$ The subscripts after "|" denote that the functions and derivatives are evaluated at the estimates. Note that CPRED values can be negative. Users may therefore prefer to request NPRED CRES CWRES, or NPRED RES CWRES. The conditional weighted residual will not differ from the non-conditional weighted residual if FO was selected in the previous $EST command.
  • CPREDI, CRESI, CWRESI. The counterparts of CPRED, CRES and CWRES with ETA-EPSILON interaction.
  • EPRED, ERES, EWRES. These are the Monte Carlo generated (expected) PRED, RES, and WRES values, without the linearization found in the other diagnostics. For subject i, let \(n_i\) be the number of \(\eta_i\) samples \(\eta_i^{(1)},\dots,\eta_i^{(n_i)}\); then $$ \text{EPRED}_i = \frac{1}{n_i}\sum_{k=1}^{n_i}f(x_{ij}, \mu(z_i, \hat{\theta}, \eta_i^{(k)})), $$ $$ \text{ERES}_i= y_i-\text{EPRED}_i, $$ $$ \text{EWRES}_i = {\text{cov}^{-1/2}(y_i)}\,\text{ERES}_i. $$ Note that here the residual covariance matrix \({\text{cov}(y_i)}\) is also based on the \(\eta_i\) samples: $$ \text{cov}(y_i) = \frac{1}{n_i}\sum_{k=1}^{n_i}\left[h|_{\eta_i^{(k)}}\hat{\Sigma}h|_{\eta_i^{(k)}}^T + (f|_{\eta_i^{(k)}}-\text{EPRED}_i)(f|_{\eta_i^{(k)}}-\text{EPRED}_i)^T\right] $$
  • ECWRES. Similar to EWRES, but with the covariance matrix for the residual error evaluated at \(\hat{\eta}_i\) instead of averaged over the \(\eta_i\) samples. $$ \text{cov}(y_i) = h|_{\hat{\eta}_i}\hat{\Sigma}h|_{\hat{\eta}_i}^T + \frac{1}{n_i}\sum_{k=1}^{n_i}\left[(f|_{\eta_i^{(k)}}-\text{EPRED}_i)(f|_{\eta_i^{(k)}}-\text{EPRED}_i)^T\right] $$
  • NPD, NPDE [1,3]. These are based on the fact that, should the observations be generated from the estimated parameters (in other words, if the estimation were perfect), the cumulative distribution function (CDF) of the model evaluated at each observation would follow the uniform distribution on [0, 1]. We can estimate the CDF by generating predictions through Monte Carlo simulation and test its uniformity. NPD takes the approach of transforming the estimate back through the normal quantile function and testing the result for normality: $$ \text{PD}_i = \frac{1}{n_i}\sum_k P\!\left(\text{cov}(y_i)^{-1/2}|_{\eta_i^{(k)}}(y_i-f|_{\eta_i^{(k)}})\right), $$ $$ \text{NPD}_i = P^{-1}(\text{PD}_i). $$ Here \(P()\) is the CDF of the standard normal distribution, and \(P^{-1}\) the corresponding quantile function. NPDE is the decorrelated version of NPD, which takes within-subject correlations into account.
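To make the PD/NPD recipe concrete in the simplest possible case, the following sketch handles a subject with a single observation and \(h\equiv 1\), so the decorrelation step reduces to division by \(\sigma\). Names and the toy model are hypothetical; the standard normal CDF and quantile function come from the Python standard library:

```python
import numpy as np
from statistics import NormalDist  # standard normal CDF and quantile function

def npd_scalar(y_obs, f, mu_i, omega, sigma2, n_samples=2000, seed=0):
    """NPD for a subject with one scalar observation:
    PD = (1/K) sum_k Phi((y - f(mu_i + eta_k)) / sigma),  NPD = Phi^{-1}(PD)."""
    rng = np.random.default_rng(seed)
    etas = rng.multivariate_normal(np.zeros(len(mu_i)), omega, size=n_samples)
    phi = NormalDist()
    sigma = sigma2 ** 0.5
    pd = sum(phi.cdf(float(y_obs - f(mu_i + eta)) / sigma) for eta in etas) / n_samples
    return phi.inv_cdf(pd)

# hypothetical toy model: an observation near the typical prediction exp(-0.5)
f = lambda phi: np.exp(-phi[0])
npd_val = npd_scalar(0.62, f, mu_i=np.array([0.5]),
                     omega=np.array([[0.04]]), sigma2=0.01)
```

Under a correct model, NPD values computed across all observations should look like draws from a standard normal distribution, which is what the normality test checks.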

References

[1] E. Comets, K. Brendel, and F. Mentré, Computing normalised prediction distribution errors to evaluate nonlinear mixed-effect models: The NPDE add-on package for R, Computer Methods and Programs in Biomedicine, 90, pp. 154–166.

[2] A. C. Hooker, C. E. Staatz, and M. O. Karlsson, Conditional weighted residuals (CWRES): A model diagnostic for the FOCE method, Pharmaceutical Research, 24, pp. 2187–2197.

[3] T. H. T. Nguyen, E. Comets, and F. Mentré, Extension of NPDE for evaluation of nonlinear mixed effect models in presence of data below the quantification limit with applications to HIV dynamic model, Journal of Pharmacokinetics and Pharmacodynamics, 39, pp. 499–518.


¹ The conditional mode ETAs are also known as conditional parametric ETAs (CPE), empirical Bayes estimates (EBE), posthoc estimates of ETAs, or maximum a posteriori (MAP) estimates.