Bayesian methods

Recall that the classical NONMEM MLE methods seek parameters \((\theta, \Omega, \Sigma)\) that maximize the log-likelihood \(\mathcal{L}=\log(L)\). With Bayesian methods we instead seek the parameters using prior information in addition to the likelihood, based on Bayes' theorem: $$ p(\theta, \Omega, \Sigma | y) \propto L(\theta, \Omega, \Sigma)\,p_0(\theta, \Omega, \Sigma). $$ Here the left-hand side \(p(\cdot|y)\) is called the posterior distribution of the parameters, and \(p_0(\cdot)\) the prior distribution. Note that, as in the MLE presentation, the \(\propto\) sign is used because the normalization factor is omitted 1.

In practice we mostly work with Bayesian inference on the log scale: $$ \log(p(\theta, \Omega, \Sigma | y)) = C + \mathcal{L} + \log(p_0(\theta, \Omega, \Sigma)), $$ where the constant \(C\) comes from the normalization factor.
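The irrelevance of the normalization constant can be seen numerically. The following is a toy one-parameter sketch (Python/NumPy, not NONMEM; the data, prior, and grid search are all assumptions for illustration): because \(C\) only shifts the whole curve, the unnormalized log posterior peaks at the same place as the true posterior.

```python
import numpy as np

# Hypothetical 1-D illustration: observations y ~ N(theta, 1),
# prior theta ~ N(0, 2^2).  All values are made up for the demo.
rng = np.random.default_rng(0)
y = rng.normal(1.5, 1.0, size=20)

def log_posterior_unnorm(theta):
    # log p(theta|y) up to the constant C: log-likelihood + log-prior
    loglik = -0.5 * np.sum((y - theta) ** 2)   # N(theta, 1), constants dropped
    logprior = -0.5 * (theta / 2.0) ** 2       # N(0, 4), constants dropped
    return loglik + logprior

# Locate the posterior mode by a simple grid search; omitting C does
# not move the argmax.
grid = np.linspace(-1.0, 3.0, 4001)
mode = grid[np.argmax([log_posterior_unnorm(t) for t in grid])]
# For this conjugate-normal setup the exact mode is sum(y) / (n + 1/4).
```

The grid-search mode matches the closed-form conjugate-normal answer, confirming that the omitted factor never needs to be computed for estimation.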

The prior distribution is often elicited from the literature and previous studies.

Maximum a posteriori estimation

Just as the MLE seeks the parameters that maximize the log-likelihood $$ (\hat{\theta}, \hat{\Omega}, \hat{\Sigma})_{\text{MLE}} = \text{arg}\,\text{max}_{\theta, \Omega, \Sigma}\,\mathcal{L}, $$ the maximum a posteriori (MAP) estimation seeks the parameters that maximize the log posterior distribution: $$ (\hat{\theta}, \hat{\Omega}, \hat{\Sigma})_{\text{MAP}} = \text{arg}\,\text{max}_{\theta, \Omega, \Sigma}\,\log(p(\theta, \Omega, \Sigma|y)) = \text{arg}\,\text{max}_{\theta, \Omega, \Sigma}\,(\mathcal{L}+\log(p_0(\theta, \Omega, \Sigma))). $$

In other words, the approach seeks the mode of \(p(\cdot|y)\); this is why MAP can also be read as "mode a posteriori" estimation. The above equation shows that MAP is in effect MLE with an additional term in the objective function, representing prior knowledge of the parameters. To see its action, consider a case with fixed \((\Omega, \Sigma)\) and only one unknown \(\theta\), where \(p_0(\theta)\) is a normal distribution $$ \theta \sim \mathcal{N}(\xi, \sigma_0^2). $$ Then MAP is equivalent to maximizing the objective function $$ \mathcal{L} - \frac{1}{2} \left(\frac{\theta-\xi}{\sigma_0}\right)^2. $$ The prior term penalizes the objective function: it decreases as \(\theta\) moves away from the prior mean \(\xi\), so maximizing the objective function requires a balance between the prior term and the likelihood term. This is why the prior term is also referred to as a penalty or regularization for MLE.
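The shrinkage effect of the penalty can be sketched numerically. In this toy setup (Python/NumPy, values and grid search assumed for illustration only) the MLE is the sample mean, while the MAP estimate lands between the prior mean \(\xi\) and the MLE:

```python
import numpy as np

# Toy penalized-MLE setup: y_i ~ N(theta, sigma^2) with sigma known,
# prior theta ~ N(xi, sigma0^2).  All numbers are illustrative.
rng = np.random.default_rng(1)
sigma, xi, sigma0 = 1.0, 0.0, 0.5
y = rng.normal(2.0, sigma, size=10)

def objective(theta):
    loglik = -0.5 * np.sum(((y - theta) / sigma) ** 2)  # log-likelihood term
    penalty = -0.5 * ((theta - xi) / sigma0) ** 2       # prior term = penalty
    return loglik + penalty

grid = np.linspace(-1.0, 4.0, 5001)
theta_mle = y.mean()                                    # maximizes likelihood alone
theta_map = grid[np.argmax([objective(t) for t in grid])]
# The penalty pulls theta_map away from theta_mle toward the prior mean xi;
# the conjugate closed form here is (n*ybar/sigma^2 + xi/sigma0^2) /
# (n/sigma^2 + 1/sigma0^2).
```

A stronger prior (smaller \(\sigma_0\)) pulls the MAP estimate harder toward \(\xi\); a diffuse prior leaves it near the MLE.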

NONMEM provides the $PRIOR control record to specify the prior. For example, one can use

$PRIOR TNPRI

to specify using prior information from a previously run model's MSF (model specification file). The example below obtains parameter estimates from population data, using a two-compartment PK model and incorporating a prior of transformed normal form for all of THETA, OMEGA and SIGMA. The prior information is read from the MSF msf1.

   $PROB   READ THE MODEL SPECIFICATION FILE
   $DATA data
   $INPUT ID DOSE TIME DV WT
   $PRIOR TNPRI (PROBLEM 2)
   $MSFI msf1 ONLYREAD
   $PRED
         IF (NEWIND.NE.2) THEN
            AMT=DOSE*WT
            W=WT
         ENDIF
         T0=THETA(1)*EXP(ETA(1))
         T2=THETA(2)*EXP(ETA(2))
         T1=T2+T0
         T3=THETA(3)*W*EXP(ETA(3))
         A=EXP(-T2*TIME)
         B=EXP(-T1*TIME)
         C=T1-T2
         D=A-B
         E=T3*C
         Y=AMT*T1*T2/E*D+EPS(1)

   $PROB POPULATION DATA WITH PRIOR ON THETA, OMEGA AND SIGMA
   $INPUT ID DOSE TIME DV WT
   $DATA data  REWIND

   $THETA  (0,4,5) (0,.09,.5) (.004,.01,.9)
   $OMEGA BLOCK (3) .7 .04 .05 .02 .06 .08
   $SIGMA  .4

   $EST

In the above model, TNPRI (PROBLEM 2) indicates that the prior based on msf1 is imposed on PROBLEM 2, the second problem in the control stream, whose $EST record uses the default FO estimation method.

Empirical Bayes estimation of \(\eta\)

Recall that during MLE the individual parameters \(\eta\) are marginalized out. We can use MAP estimation to "recover" \(\eta\). This is done after (FO method) or during (FOCE method) population parameter estimation, and is referred to as posthoc estimation.

Specifically, for subject \(i\)'s \(\eta_i\), we use the estimated \(\hat{\Omega}\) for the (multivariate normal) prior distribution and the estimated \((\hat{\theta}, \hat{\Sigma})\) for the likelihood; the posterior probability density for \(\eta_i\) is then $$ p(\eta_i | y_i) \propto p(y_i|\phi_i(\hat{\theta}, \eta_i, \hat{\Sigma}))p_0(\eta_i,\hat{\Omega}), $$ and the MAP estimate of \(\eta_i\) is

$$ \hat{\eta}_i = \text{arg}\,\text{max}_{\eta_i}\log (p(\eta_i | y_i)) = \text{arg}\,\text{max}_{\eta_i} [\mathcal{L}_i(\eta_i)|_{\hat{\theta}, \hat{\Sigma}} + \log p_0(\eta_i,\hat{\Omega})] $$

Because the \(\eta_i\) prior is based on the \(\Omega\) estimate, which in turn is based on the current observations, it is called an empirical prior, and \(\hat{\eta}_i\) is called the empirical Bayes estimate (EBE). In NONMEM the terms posthoc estimate and EBE are used interchangeably.
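The EBE machinery can be sketched on a toy one-parameter individual model (Python/NumPy; the model form, population estimates, and grid search are all assumptions for illustration, not a NONMEM run). Note how the \(\Omega\)-based prior shrinks the individual estimate toward 0:

```python
import numpy as np

# Toy individual model: y_ij = theta_hat * exp(eta_i) + eps_ij, with the
# population estimates (theta_hat, omega_hat, sigma_hat) held fixed.
rng = np.random.default_rng(2)
theta_hat, omega_hat, sigma_hat = 10.0, 0.09, 1.0   # assumed population estimates
eta_true = 0.4
y_i = theta_hat * np.exp(eta_true) + rng.normal(0.0, sigma_hat, size=5)

def log_posterior(eta):
    pred = theta_hat * np.exp(eta)
    loglik = -0.5 * np.sum(((y_i - pred) / sigma_hat) ** 2)  # L_i(eta) at fixed estimates
    logprior = -0.5 * eta ** 2 / omega_hat                   # eta ~ N(0, omega_hat)
    return loglik + logprior

# MAP over eta_i = the empirical Bayes estimate, found here by grid search.
grid = np.linspace(-1.0, 1.0, 4001)
eta_ebe = grid[np.argmax([log_posterior(eta) for eta in grid])]
# eta_ebe sits between 0 (prior mean) and the individual's unpenalized fit
# log(mean(y_i) / theta_hat).
```

With sparse individual data the prior dominates and the EBE shrinks strongly toward 0; with rich data it approaches the individual's own fit.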

MCMC sampling

As another Bayesian approach, sampling does not perform point estimation. Instead, it uses an algorithm known as a sampler to draw samples from the posterior distribution \(p(\cdot|y)\). With enough samples, we obtain an approximation of the posterior probability density function (PDF) of \((\theta, \Omega, \Sigma)\) and can compute any statistics of interest from it.

Markov Chain Monte Carlo (MCMC) samplers are the most commonly used samplers.

NONMEM provides three flavours of them:

  • Metropolis-Hastings (M-H) algorithm;
  • Gibbs algorithm;
  • No-U-Turn Sampler (NUTS) algorithm [1].

Specifically,

$EST BAYES

uses M-H and/or Gibbs samplers, and

$EST NUTS

uses the NUTS algorithm.
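To make the sampler idea concrete, here is a minimal random-walk Metropolis-Hastings sketch in Python/NumPy for a single parameter with a normal likelihood and normal prior. This is a toy illustration only; NONMEM's samplers are far more elaborate (tuning, blocking, MU-referencing), and the data, proposal width, and burn-in length below are all assumptions:

```python
import numpy as np

# Toy posterior: y ~ N(theta, 1), prior theta ~ N(0, 2^2).
rng = np.random.default_rng(3)
y = rng.normal(1.5, 1.0, size=20)

def log_post(theta):
    return -0.5 * np.sum((y - theta) ** 2) - 0.5 * (theta / 2.0) ** 2

draws, theta = [], 0.0
for _ in range(20000):
    proposal = theta + rng.normal(0.0, 0.5)     # symmetric random-walk proposal
    # Accept with probability min(1, posterior ratio); work on the log scale.
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal
    draws.append(theta)                          # rejected moves repeat theta

samples = np.array(draws[5000:])                 # discard burn-in draws
# The sample mean approximates the posterior mean sum(y) / (n + 1/4).
```

The retained draws approximate the whole posterior, so means, modes, and credible intervals all come from the same output, which is exactly the extra information MCMC buys over point estimation.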

Compared to point estimation, the MCMC approach in general obtains more information (the entire distribution rather than a mean/mode estimate) at a computational premium. This trade-off should be examined on a case-by-case basis.

Regardless of which sampler is used, prior information is still required for the MCMC methods. Consider the following two-compartment model example

$PROB Two compartment Model
$INPUT C SET ID JID TIME  DV=CONC AMT=DOSE RATE EVID MDV CMT
$DATA w15.csv IGNORE=@

$SUBROUTINES ADVAN3 TRANS4

$PK
MU_1=THETA(1)
MU_2=THETA(2)
MU_3=THETA(3)
MU_4=THETA(4)
;# Etas are always normally distributed with mean 0, variance OMEGA().
;# phis (mu+eta) are normally distributed with mean MU and variance OMEGA().
;# Individual parameters will be whatever the transformation is.  In this case, the transformation is the exponential of a
;# normally distributed parameter (phi), so individual parameters are log-normally distributed.
CL=DEXP(MU_1+ETA(1))
V1=DEXP(MU_2+ETA(2))
Q=DEXP(MU_3+ETA(3))
V2=DEXP(MU_4+ETA(4))
S1=V1

$ERROR
;# Data DV are normally distributed, with residual variance F*F*SIGMA(1,1), and its SD is F*SQRT(SIGMA(1,1))
;#  This is because EPS() is scaled with sqrt(SIGMA)
Y = F + F*EPS(1)

;# Initial values of THETA
$THETA
 2.0 ;#[LN(CL)]
 2.0 ;#[LN(V1)]
 2.0 ;#[LN(Q)]
 2.0 ;#[LN(V2)]
;#INITIAL values of OMEGA
$OMEGA BLOCK(4)
0.15   ;#[P]
0.01  ;#[F]
0.15   ;#[P]
0.01  ;#[F]
0.01  ;#[F]
0.15   ;#[P]
0.01  ;#[F]
0.01  ;#[F]
0.01  ;#[F]
0.15   ;#[P]
;#Initial value of SIGMA
$SIGMA
(0.6 )   ;#[P]

$PRIOR NWPRI
;# Prior information of THETAS.  Normal distribution of thetas assumed unless $TTDF is set, in which case a t distribution is used
$THETAP (2.0 FIX) (2.0 FIX) (2.0 FIX) (2.0 FIX)
;# Variance to prior information of THETAS.
$THETAPV BLOCK(4) FIXED VALUES(10.0,0.0)
;# Inverse Wishart distribution assumed, unless $OLKJDF, $OVARF are set.
$OMEGAP BLOCK(4) FIXED VALUES(0.15,0.0)
;# Low degrees of freedom, equivalent to block dimension, means low information.
$OMEGAPD (4 FIX)
;# Prior information to the SIGMAS.
;# Inverse Wishart distribution assumed, unless $SLKJDF, $SVARF
$SIGMAP (0.06 FIXED)
$SIGMAPD (1 FIXED)

$EST METHOD=ITS INTERACTION NITER=0 NOABORT NOPRIOR=1 file=w15_its.ext
$EST METHOD=NUTS AUTO=1 PRINT=100 NITER=1000 NOPRIOR=0 BAYES_PHI_STORE=1 file=w15.ext

We use NWPRI in $PRIOR to indicate that the priors will be given manually through prior-specification control records. Specifically, $THETAP specifies that all elements of \(\theta\) have a normally distributed prior with mean 2.0. Note that this is not the prior mean of the individual PK parameters, which follow log-normal distributions. The $THETAPV record specifies a diagonal covariance matrix for the \(\theta\) prior, in which each \(\theta\) has a very large prior variance 2, reflecting the lack of prior information on \(\theta\).

Similarly, $OMEGAP and $OMEGAPD, and $SIGMAP and $SIGMAPD, specify the scale matrices and degrees of freedom for the OMEGA and SIGMA priors, respectively. The priors for both matrices follow inverse Wishart distributions.
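The role of the degrees of freedom can be sketched numerically (Python/NumPy; dimensions and values are assumed for illustration, and NONMEM's exact parameterization of the inverse-Wishart prior may differ). Drawing from an inverse Wishart by inverting Wishart draws shows how the scale matrix and degrees of freedom jointly set the prior's center:

```python
import numpy as np

# If W ~ Wishart(nu, S) then W^-1 ~ inverse-Wishart(nu, Psi = S^-1), and for
# nu > p + 1 the inverse-Wishart mean is Psi / (nu - p - 1).
rng = np.random.default_rng(4)
p, nu, n = 2, 50, 20000                     # assumed dimension, df, sample count
Z = rng.normal(size=(n, nu, p))             # rows ~ N(0, I_p)
W = Z.transpose(0, 2, 1) @ Z                # batch of p x p Wishart(nu, I) draws
IW = np.linalg.inv(W)                       # inverse-Wishart(nu, I) draws
mean_iw = IW.mean(axis=0)                   # should be close to I / (nu - p - 1)
# Small nu (e.g. nu = p, as with $OMEGAPD (4 FIX) for a 4x4 block) makes the
# prior far more dispersed, i.e. conveys little information about the matrix.
```

This is why setting the degrees of freedom to the block dimension, as in the example, is described as a low-information prior on OMEGA.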

Note that before using NUTS, we first do a short (NITER=0) Iterative Two Stage (ITS) MLE run. This positions the sampler so that sampling starts in reasonable proximity to the bulk of the posterior distribution (recall that with concatenated $EST records, each estimation uses the previous one's results as its starting point).

For the $EST METHOD=NUTS record, we see that there are 1000 samples (NITER); that is, each unknown parameter will be sampled 1000 times. These samples are stored in the ".ext" file. BAYES_PHI_STORE=1 means that in addition to \((\theta, \Omega, \Sigma)\) we also request \(\phi\) samples, which NONMEM outputs in ".iph" files. These files can be large because we are requesting 1000 samples for each subject's \(\phi_i\) in the population. See postprocessing MCMC draws on how to examine and interpret the sampling results.
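The kind of summary one computes from stored draws can be sketched as follows (Python/NumPy; the draws array is simulated stand-in data, not an actual ".ext" file, which also carries a header and an iteration column):

```python
import numpy as np

# Stand-in for 1000 stored draws of 4 THETA parameters (values assumed).
rng = np.random.default_rng(5)
draws = rng.normal(2.0, 0.1, size=(1000, 4))

post_mean = draws.mean(axis=0)                 # posterior mean per parameter
post_median = np.median(draws, axis=0)         # posterior median per parameter
# 95% equal-tailed credible interval from the empirical quantiles.
ci_lo, ci_hi = np.percentile(draws, [2.5, 97.5], axis=0)
```

Every statistic here comes from the same set of draws; no separate standard-error approximation is needed, which is one practical payoff of the sampling approach.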

Reference

[1] M. D. Hoffman and A. Gelman, "The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo," J. Machine Learning Research 15 (2014), pp. 1593–1623.


1

The normalization factor is a constant with respect to the parameters that makes the posterior distribution integrate to 1. In Bayes' theorem it is in fact \(p(y)\), the marginal distribution of the data, aka the "evidence".

2

In practice one may want to consider different variances for different \(\theta\) elements, as they correspond to different PK parameters, taking the transformation into account. For example, based on the given prior mean and variance, CL has an unrealistically large prior mean of \(e^{2.0+10.0/2}\).