Bayesian methods
Recall that the classical NONMEM MLE methods seek the parameters \((\theta, \Omega, \Sigma)\) that maximize the log-likelihood \(\mathcal{L}=\log(L)\). With Bayesian methods we instead seek the parameters using prior information in addition to the likelihood, based on Bayes' theorem: $$ p(\theta, \Omega, \Sigma | y) \propto L(\theta, \Omega, \Sigma)\,p_0(\theta, \Omega, \Sigma). $$ Here the left-hand side \(p(\cdot|y)\) is called the posterior distribution of the parameters, and \(p_0(\cdot)\) the prior distribution. Note that, as in the MLE presentation, the \(\propto\) sign is used because the normalization factor is omitted.¹
In practice, Bayesian inference is mostly carried out on the log scale: $$ \log(p(\theta, \Omega, \Sigma | y)) = C + \mathcal{L} + \log(p_0(\theta, \Omega, \Sigma)), $$ where the constant \(C\) comes from the normalization factor.
The prior distribution is often elicited from the literature and previous studies.
Maximum a posteriori estimation
Just as MLE seeks the parameters that maximize the log-likelihood $$ (\hat{\theta}, \hat{\Omega}, \hat{\Sigma})_{\text{MLE}} = \text{arg}\,\text{max}_{\theta, \Omega, \Sigma}\,\mathcal{L}, $$ the maximum a posteriori (MAP) estimation seeks those that maximize the log posterior distribution: $$ (\hat{\theta}, \hat{\Omega}, \hat{\Sigma})_{\text{MAP}} = \text{arg}\,\text{max}_{\theta, \Omega, \Sigma}\,\log(p(\theta, \Omega, \Sigma|y)) = \text{arg}\,\text{max}_{\theta, \Omega, \Sigma}\,(\mathcal{L}+\log(p_0(\theta, \Omega, \Sigma))). $$
In other words, the approach seeks the mode of \(p(\cdot|y)\), which is why MAP can also be read as "mode a posteriori" estimation. The above equation shows that MAP is in effect MLE with an additional term in the objective function, representing prior knowledge of the parameters. To see its action, consider a case with fixed \((\Omega, \Sigma)\) and only one unknown \(\theta\), where \(p_0(\theta)\) is a normal distribution $$ \theta \sim \mathcal{N}(\xi, \sigma_0^2). $$ Then MAP is equivalent to maximizing the objective function $$ \mathcal{L} - \frac{1}{2} \left(\frac{\theta-\xi}{\sigma_0}\right)^2. $$ That is, the prior term penalizes the objective function, decreasing as \(\theta\) moves away from the prior mean \(\xi\); to maximize the objective function we must therefore balance the prior term against the likelihood term. This is why the prior term is also referred to as a penalty or regularization for MLE.
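To make the penalization concrete, here is a small numerical sketch (in Python, with made-up numbers unrelated to any NONMEM run) of the one-parameter normal case above. For this conjugate case the MAP estimate has a closed form, a precision-weighted average of the data mean and the prior mean, which the grid search reproduces:

```python
import numpy as np

# Toy sketch of MAP as penalized MLE (hypothetical numbers, not NONMEM output).
# Model: y_j ~ N(theta, sigma^2) with sigma known; prior: theta ~ N(xi, sigma0^2).
rng = np.random.default_rng(0)
sigma, xi, sigma0 = 1.0, 0.0, 0.5
y = rng.normal(2.0, sigma, size=20)

def log_lik(theta):
    # log-likelihood up to an additive constant
    return -0.5 * np.sum((y - theta) ** 2) / sigma**2

def log_post(theta):
    # MAP objective: log-likelihood plus the Gaussian prior penalty term
    return log_lik(theta) - 0.5 * ((theta - xi) / sigma0) ** 2

grid = np.linspace(-1.0, 4.0, 100001)
theta_mle = grid[np.argmax([log_lik(t) for t in grid])]
theta_map = grid[np.argmax([log_post(t) for t in grid])]

# Conjugate closed form: precision-weighted average of data mean and prior mean
w = (len(y) / sigma**2) / (len(y) / sigma**2 + 1 / sigma0**2)
theta_map_exact = w * y.mean() + (1 - w) * xi

print(theta_mle, theta_map, theta_map_exact)
```

Because \(w < 1\), the MAP estimate is pulled from the MLE toward the prior mean \(\xi\), which is exactly the balancing act described above.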
NONMEM provides the $PRIOR control record to specify the prior. For
example, one can use
|
|
to specify that prior information be taken from a previously run model's model specification file (MSF). The model example below obtains parameter estimates from population data, using a two-compartment PK model and incorporating a prior of transformed normal form for all of THETA, OMEGA and SIGMA. The prior information is found in the MSF msf1.
|
|
In the above model, TNPRI (PROBLEM 2) indicates that the prior based
on msf1 will be imposed on PROBLEM 2, the second problem in the
control stream, which uses the default FO method for estimation.
Empirical Bayes estimation of \(\eta\)
Recall that during MLE the individual parameters \(\eta\) are marginalized out. We can use MAP estimation to "recover" \(\eta\). This is done after (FO method) or during (FOCE method) population parameter estimation, and is referred to as posthoc estimation.
Specifically, for subject \(i\)'s \(\eta_i\), we use the estimated \(\hat{\Omega}\) for the prior (multivariate normal) distribution and the estimated \((\hat{\theta}, \hat{\Sigma})\) for the likelihood; the posterior probability density for \(\eta_i\) is then $$ p(\eta_i | y_i) \propto p(y_i|\phi_i(\hat{\theta}, \eta_i, \hat{\Sigma}))\,p_0(\eta_i; \hat{\Omega}). $$ The MAP estimate of \(\eta_i\) is
$$ \hat{\eta}_i = \text{arg}\,\text{max}_{\eta_i}\log (p(\eta_i | y_i)) = \text{arg}\,\text{max}_{\eta_i} [\mathcal{L}_i(\eta_i)|_{\hat{\theta}, \hat{\Sigma}} + \log p_0(\eta_i; \hat{\Omega})]. $$
Because this \(\eta_i\) prior is based on the \(\Omega\) estimate, which in turn is based on the current observations, it is called an empirical prior, and \(\hat{\eta}_i\) is called the empirical Bayes estimate (EBE). In NONMEM the terms posthoc estimate and EBE are used interchangeably.
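The posthoc step can be sketched numerically. The following Python toy (all numbers hypothetical, and the structural model deliberately trivial compared to a real PK model) maximizes one subject's penalized objective over \(\eta_i\):

```python
import numpy as np

# Empirical Bayes (posthoc) sketch for one subject (hypothetical numbers).
# Individual model: y_ij = theta_hat*exp(eta_i) + eps_ij, eps_ij ~ N(0, sigma_hat^2),
# with eta_i ~ N(0, omega_hat^2) playing the role of the empirical prior.
rng = np.random.default_rng(1)
theta_hat, omega_hat, sigma_hat = 1.0, 0.3, 0.1  # stand-ins for population estimates
eta_true = 0.25
y_i = theta_hat * np.exp(eta_true) + rng.normal(0.0, sigma_hat, size=5)

def log_post_eta(eta):
    pred = theta_hat * np.exp(eta)
    log_lik_i = -0.5 * np.sum((y_i - pred) ** 2) / sigma_hat**2  # individual likelihood
    log_prior = -0.5 * (eta / omega_hat) ** 2                    # empirical prior N(0, omega^2)
    return log_lik_i + log_prior

grid = np.linspace(-1.0, 1.0, 20001)
eta_ebe = grid[np.argmax([log_post_eta(e) for e in grid])]
print(eta_ebe)
```

With few observations and a finite \(\omega\), the estimate is shrunk from the individual fit toward the prior mean 0; the sparser or noisier the individual data, the stronger the shrinkage.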
MCMC sampling
As another Bayesian approach, sampling does not perform point estimation. Instead, it uses a sampling algorithm, aka a sampler, to draw from the posterior distribution \(p(\cdot|y)\). With enough samples, we obtain an approximation of the posterior probability density function (PDF) of \((\theta, \Omega, \Sigma)\), and can calculate any statistics of interest from it.
Markov Chain Monte Carlo (MCMC) samplers are the most commonly used samplers.
NONMEM provides three flavours of them:
- Metropolis-Hastings (M-H) algorithm;
- Gibbs algorithm;
- No-U-Turn Sampler (NUTS) algorithm [1].
Specifically,
|
|
uses M-H and/or Gibbs samplers, and
|
|
uses the NUTS algorithm.
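To show what an M-H sampler actually does, here is a bare-bones random-walk Metropolis-Hastings loop in Python (hypothetical numbers, a sketch of the idea rather than NONMEM's implementation), targeting the posterior of a normal mean:

```python
import numpy as np

# Random-walk Metropolis-Hastings sketch (hypothetical numbers, not NONMEM's sampler).
# Target: posterior of theta for y ~ N(theta, 1) with a weak prior theta ~ N(0, 10^2).
rng = np.random.default_rng(2)
y = rng.normal(1.5, 1.0, size=50)

def log_post(theta):
    return -0.5 * np.sum((y - theta) ** 2) - 0.5 * (theta / 10.0) ** 2

draws = []
theta = 0.0                                # arbitrary starting point
for _ in range(20000):
    prop = theta + rng.normal(0.0, 0.5)    # symmetric random-walk proposal
    # accept with probability min(1, posterior ratio); log scale avoids underflow
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    draws.append(theta)

draws = np.asarray(draws[2000:])           # discard burn-in
print(draws.mean(), draws.std())
```

The empirical mean and standard deviation of the retained draws approximate the posterior mean and standard deviation; the same accept/reject idea, combined with Gibbs updates where conditional distributions are tractable, underlies METHOD=BAYES.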
Compared to point estimation, the MCMC approach in general obtains more information (the entire distribution rather than a mean/mode estimate) at a computational premium. This trade-off should be examined on a case-by-case basis.
Regardless of which sampler is used, the MCMC methods still require prior information. Consider the following two-compartment model example
|
|
We use NWPRI in $PRIOR to indicate that the priors will be given
manually through prior-specification control records. Specifically,
$THETAP specifies that each element of \(\theta\) has a
normally-distributed prior with mean 2.0. Note that this is not the
prior mean of the individual PK parameters, as they follow a
log-normal distribution. The $THETAPV record specifies a diagonal
covariance matrix for the \(\theta\) prior, with a very large prior
variance for each \(\theta\),² reflecting the lack of prior information on \(\theta\).
Similarly, $OMEGAP and $OMEGAPD, and $SIGMAP and $SIGMAPD, specify the
prior matrices and degrees of freedom for OMEGA and SIGMA,
respectively. The priors for the two matrices follow an inverse Wishart distribution.
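To build intuition for how the degrees of freedom control the informativeness of an inverse-Wishart prior, the following Python sketch (hypothetical values, numpy only) draws from an inverse Wishart by inverting integer-degrees-of-freedom Wishart draws. Scaling the scale matrix by \(\nu - p - 1\) keeps the prior mean fixed, so larger \(\nu\) only tightens the prior:

```python
import numpy as np

def invwishart_draws(df, psi, n, rng):
    # If W ~ Wishart_p(df, psi^{-1}) then W^{-1} ~ inverse-Wishart(df, psi).
    # For integer df, W = X.T @ X with the df rows of X drawn from N(0, psi^{-1}).
    L = np.linalg.cholesky(np.linalg.inv(psi))
    draws = []
    for _ in range(n):
        X = rng.standard_normal((df, psi.shape[0])) @ L.T
        draws.append(np.linalg.inv(X.T @ X))
    return np.array(draws)

p = 2
target_mean = 0.1 * np.eye(p)  # desired prior mean for a hypothetical 2x2 OMEGA
rng = np.random.default_rng(3)
stats = {}
for df in (6, 50):
    # mean of inverse-Wishart(df, psi) is psi / (df - p - 1); scale psi accordingly
    psi = target_mean * (df - p - 1)
    d = invwishart_draws(df, psi, 3000, rng)
    stats[df] = (d[:, 0, 0].mean(), d[:, 0, 0].std())
print(stats)
```

Both settings center on the same prior OMEGA, but the draws with \(\nu = 50\) scatter far less than those with \(\nu = 6\); in NONMEM, $OMEGAPD plays the role of \(\nu\), with values near the matrix dimension giving a weakly informative prior.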
Note that before using NUTS, we first run a short (NITER=0) Iterative
Two Stage (ITS) MLE step. This positions the sampler so that sampling
starts in reasonable proximity to the posterior mode (recall that
concatenated $EST records mean each estimation step uses the previous
step's results as its starting point).
For the $EST METHOD=NUTS record, we see that there are 1000 samples
(NITER); that is, each unknown parameter will be sampled 1000 times.
These samples are stored in the ".ext" file. BAYES_PHI_STORE=1 means
that in addition to \((\theta, \Omega, \Sigma)\) we also request
\(\phi\) samples, which NONMEM outputs in ".iph" files. These files
can be large because 1000 samples are requested for each subject
\(\phi_i\) in the population. See postprocessing MCMC draws on how to
examine and interpret the sampling results.
Reference
[1] M. D. Hoffman and A. Gelman, The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo, J. Machine Learning Research 15 (2014), pp. 1593–1623.
¹ The normalization factor is a constant with respect to the parameters that makes the posterior distribution integrate to 1. In Bayes' theorem it is in fact \(p(y)\), the marginal distribution of the data, aka the "evidence".
² In practice one may want to consider different variances for different \(\theta\) elements, as they correspond to different PK parameters, incorporating the transformation used. For example, with the given prior mean and variance, CL would have an unrealistically large prior mean of \(e^{2.0+10.0/2}\).