Model fitting

All fitting methods aim to adjust model parameters so that model predictions match the observed data in an 'optimal' way. This set of "optimal" parameter values is called the parameter estimates. We use \(\hat{\theta}\), \(\hat{\Omega}\), and \(\hat{\Sigma}\) to denote estimates to the correspoding NONMEM parameters. Different methods define “optimal” in different ways. For example, NONMEM's classical methods such as FO use a least-squares type criterion, the exact form of which depends on the chosen error model.

In such a method for population model, NONMEM fits an approximate model, linearized with respect to the random effects for each individual. The total objective function (likelihood function) is the weighted sum of individual contributions. The approximate model is unbiased in predicted individual mean responses. These predictions are obtained setting \(\hat{\eta}=0\) and \(\epsilon=0\). In other words, the prediction is for the mean reponses of individuals with same fixed effects. In addition, one can also use \(\hat{\theta}\), \(\hat{\Omega}\), and \(\hat{\Sigma}\) to calculate the covariance matrix (hence the standard error) of individual predictions.

Objective function

Recall the model formulation for subject \(i\)'s \(N_i\) observations: $$ y_{ij} = f(x_{ij}, \phi_i) + \epsilon_{ij},\quad j=1,\dots,N_i. $$ Classic NONMEM methods assume $$ \epsilon_{ij} = \epsilon_{i,0} f(x_{ij}, \phi_i)^{\xi} $$ with unknown parameter \(\xi\) and normally-distributed scalar \(\epsilon_{i,0}\): $$ \epsilon_{i,0} \sim \mathcal{N}(0, \upsilon^2). $$ Then subject \(i\)'s individual objective function is $$ \mathcal{O}(\phi_i, \upsilon, \xi)=\sum_j\left( \frac{(f(x_{ij},\phi_i)-y_{ij})^2}{\upsilon^2 f(x_{ij},\phi_i)^{2\xi}} + \log(\upsilon^2 f(x_{ij}, \phi_i)^{2\xi}) \right) $$ This forms a weighted least square objective function with penalty. NONMEM refers this approach as the Extended Least-Square (ELS) method:

  • for each observation, the squared difference between observation \(y\) and model prediction \(f()\) is weighted by the inverse of the variance;
  • the Log term penalizes the weight so that the weight would not become arbitrarily small.

The objective function for a population of \(N\) is the sum of the individual contributions $$ \mathcal{O}(\theta, \Omega, \Sigma) = \sum_{i=1}^N \mathcal{O}_i(\phi_i, \upsilon, \xi). $$

Its depedence on \( (\theta, \Omega, \Sigma) \) is through the depedence of \(\phi\) on \(\theta\) and \(\Omega\), and that \(\Sigma\) can describe \(\upsilon\) and \(\xi\). NONMEM fits a model by searching for \( (\hat{\theta}, \hat{\Omega}, \hat{\Sigma}) \) that minimizes the objective function value (OFV).

Covariance (error) estimation

Assume all the unknown entries of \( (\theta, \Omega, \Sigma) \) are stored in vector \(p=(p_1,p_2,\dots, p_n)\), and their corresponding ELS estimate in \(\hat{p}=(\hat{p}_1,\hat{p}_2,\dots,\hat{p}_n)\). To obtain error estimation of \(\hat{p}\), NONMEM calculates two matrices:

  • the Hessian matrix \(R\) such that $$ R_{kl}=\frac{\partial \mathcal{O}^2}{\partial p_kp_l}, \quad k,l=1,\dots,n. $$ evaluated at \(\hat{p}\), and
  • matrix \(S\) such that \(S=\sum_iS_i\), where $$ S_i=\nabla_i \nabla_i^T,\quad \nabla_i=\frac{\partial \mathcal{O}_i}{\partial p}. $$ That is, \(S_i\) is the outer product of the gradient of individual OFV with respect to \(p\), evaluated at \(\hat{p}\).

Under the assumption that the random effects are normally distributed, \(R/N\) and \(S/N\) converge to a same matrix when \(N\rightarrow \infty\). In this case the inverse of either matrix serves as an estimate of the covariance matrix. When the normality assumption is not made, matrix \(R^{-1}SR^{-1}\) estimates the true covariance matrix. NONMEM uses this as the default but allows user to choose alternatives in the $COVARIANCE record.

Goodness-of-Fit

Three Goodness-of-Fit (GoF) quantities are always among the NONMEM output: prediction, residual, and weighted residual. One can also request additional quantities.

The prediction ("PRED" in NONMEM output) is the result of \(f()\) based on population estimates. In other words, it is computed at the estimated population mean by setting \(\eta=0\): $$ \text{PRED}_{ij} = f(x_{ij}, \mu(z_i, \hat{\theta})) $$

The residual ("RES" in NONMEM output) is difference between observation and PRED: $$ \text{RES}_{ij} = y_{ij} - \text{PRED}_{ij} $$

Since in application RES is inevitably correlated, often the weighted residuals ("WRES" in NONMEM output) serves as an attempted alternative. $$ \text{WRES}_{i}={\text{cov}^{-1/2}(y_i)}{\text{RES}_i} $$ Here \(\text{cov}^{1/2}\) denotes the matrix square root of the covariance matrix, and \(\text{cov}^{-1/2}\) the inverse of the matrix square root. Note here we only have individual subscript, therefore the WRES and \(y\) in the definition are in general vectors (for multiple observation of subject i).