There are two forms of the Bayesian Information Criterion (BIC) a reader will encounter. One is of the form ln(N)k-2LL, the other of the form LL-0.5ln(N)k (where LL is the natural log of the model likelihood, N is the sample size, and k is the number of parameters in the model). Schwarz (1978) originally defined the BIC in the LL-0.5ln(N)k form, but the ln(N)k-2LL form is currently the most commonly used definition of the BIC.
One prominent author and data analysis context that continues to use the original Schwarz form (LL-0.5ln(N)k) is Daniel S. Nagin and his work and software in group-based trajectory modeling (SAS Proc Traj, the Stata module traj). The two forms differ by a scaling factor of -2 and are therefore interpreted differently. The data analyst must be attentive to which version of the BIC the software at hand calculates, and must interpret the BIC (and compare it to other models' BICs) using the transformations and rules of thumb appropriate to that form.
When the ln(N)k-2LL form is used, a lower BIC is preferred when comparing two or more models: a BIC difference of less than 2 is not worth more than a bare mention, and a BIC difference of more than 10 is very strong evidence in favor of the model with the lower BIC. This applies to data analysts working with Mplus, with the core estimation routines in SAS and Stata, and with many R packages. If the analyst is uncertain, a good (but fallible, so read the documentation) indicator that the ln(N)k-2LL form is being used is that the BIC is a positive number.
When the LL-0.5ln(N)k form is used, a higher BIC is preferred when comparing two or more models: a BIC difference of about 1 is not worth more than a bare mention, and a BIC difference of more than 2.3 is very strong evidence in favor of the model with the higher BIC. This applies to data analysts working with SAS Proc Traj, the Stata traj module, or any other program using the LL-0.5ln(N)k form of the BIC. If the analyst is uncertain, a good (but fallible, so read the documentation) indicator that the LL-0.5ln(N)k form is being used is that the BIC is a negative number.
BIC flavor | LL-0.5ln(N)k | ln(N)k-2LL |
---|---|---|
Typical values | Negative | Positive |
Better models | Higher BIC | Lower BIC |
Bayes factor approximation[^1] | \(\text{exp}(BIC_i-BIC_j)\) | \(\text{exp}([BIC_i-BIC_j]/-2)\) |

[^1]: The Bayes factor approximation for the ln(N)k-2LL form is provided by Neath and Cavanaugh (2012), page 202.
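As a quick check of which flavor a given program is reporting, you can recompute both forms from the model's log-likelihood and see which one matches the printed BIC. A minimal sketch in R; the log-likelihood, sample size, and parameter count below are made-up illustrative values, not from any real analysis:

```r
# Hypothetical values read off a fitted model's output
LL <- -1234.5   # log-likelihood of the fitted model
N  <- 500       # sample size
k  <- 6         # number of estimated parameters

bic_modern  <- log(N) * k - 2 * LL    # ln(N)k - 2LL form (positive here)
bic_schwarz <- LL - 0.5 * log(N) * k  # LL - 0.5ln(N)k form (negative here)

c(modern = bic_modern, schwarz = bic_schwarz, ratio = bic_modern / bic_schwarz)
# The ratio is -2; whichever value matches the software's reported BIC
# tells you which form (and which rules of thumb) to use.
```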
ln(N)k-2LL
The Bayesian Information Criterion (BIC) is a measure used to compare the quality of different statistical models (Schwarz, 1978). The BIC balances goodness of fit (via -2 times the log-likelihood, where the likelihood is the maximized probability of the data under the estimated parameters) against model complexity (the number of parameters, k). The BIC is commonly defined as
\[BIC = -2 \times \text{ln(likelihood)} + \text{ln(N)} \times k\]
where \(k\) is the number of parameters and \(N\) is the sample size (number of observations). Preference is given to the smallest BIC. Log-likelihoods (\(\text{ln(likelihood)}\)) have no intrinsic meaning on their own, but because they are logs of probabilities they are negative, so \(-2\text{ln(likelihood)}\) is positive; with \(\text{ln(N)} \times k\) also positive, the BIC in this form is typically a positive number.
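For instance, base R's `stats::BIC()` follows this ln(N)k-2LL definition. A small check on a built-in dataset (my own illustration; note that for `lm` fits R counts the residual variance as one of the \(k\) parameters, so \(k\) is taken from the `df` attribute of the log-likelihood):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

LL <- as.numeric(logLik(fit))   # maximized log-likelihood
k  <- attr(logLik(fit), "df")   # parameters as counted by R (coefficients + sigma)
N  <- nobs(fit)                 # sample size

by_hand <- log(N) * k - 2 * LL  # ln(N)k - 2LL form, computed by hand
all.equal(by_hand, BIC(fit))    # TRUE: stats::BIC() uses this form
```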
I will call this the ln(N)k-2LL form of the BIC. Used to compare two (or more) models, the lower the BIC the better: a lower BIC value means the model fits better despite its complexity. This version of the definition of the BIC can be found on Wikipedia, in the Stata users guide, in at least two R packages (bayesbr::BIC_bayesbr, BayHap::BIC), in Mplus, and in at least some contexts in SAS (cf. Jones, 2010). Cavanaugh & Neath (1999) rederive the BIC in this ln(N)k-2LL form.
LL-0.5ln(N)k
Schwarz (1978) expressed the BIC in his original 1978 publication following a different convention, which I will call the LL-0.5ln(N)k form (written here in the same notation as above):
\[BIC = \text{ln(likelihood)} - 0.5 \times \text{ln(N)} \times k\]
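For a concrete sense of this form, the same by-hand calculation as above can be repeated in R (again my own illustration on a built-in dataset); the result is negative, and it equals base R's `BIC()` divided by \(-2\):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

LL <- as.numeric(logLik(fit))   # maximized log-likelihood
k  <- attr(logLik(fit), "df")   # parameters as counted by R (coefficients + sigma)
N  <- nobs(fit)

LL - 0.5 * log(N) * k   # Schwarz / Nagin form of the BIC: a negative number
BIC(fit) / -2           # the same value, obtained by rescaling base R's BIC
```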
A good deal of the literature on the BIC refers to this LL-0.5ln(N)k form, while other papers refer to the ln(N)k-2LL form, and it is not always clear which form is being referred to. Readers must be attentive to this issue, as the practical interpretation of the BIC in comparing models differs between the two forms.
One of the authors who uses the LL-0.5ln(N)k form is Daniel Nagin, in his highly influential book Group-Based Modeling of Development (2005).
While popular statistical analysis programs use ln(N)k-2LL as their “main” form of the BIC, some procedures, at least those related to trajectory modeling and authored by Nagin and his colleague Jones, use the LL-0.5ln(N)k form. For example, SAS Proc Traj reports the BIC in the LL-0.5ln(N)k form, where other SAS procedures report the BIC in the ln(N)k-2LL form. This discrepancy is surely driven by Nagin’s preference for the LL-0.5ln(N)k form and his association with Proc Traj. Similarly, in Stata, the Jones and Nagin module for trajectory modeling reports the BIC in the LL-0.5ln(N)k form (Jones & Nagin, 2013).
The two forms of the BIC differ by a scaling factor of \(-2\). If we let \(BIC'\) represent the BIC defined above (ln(N)k-2LL; the “modern” or “common” form of the BIC), and \(BIC\) represent the original Schwarz BIC (popularized by Nagin; LL-0.5ln(N)k), then:
\[BIC' = -2 BIC = -2(\text{ln}(L) - 0.5 \times \text{ln}(N) \times k) = -2\text{LL} + \text{ln}(N) \times k\]
\[BIC = \frac{BIC'}{-2} = \frac{-2\text{ln}(L) + \text{ln}(N)k}{-2} = LL - 0.5\text{ln}(N)k\]
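A small numeric illustration of what the rescaling does to model comparisons (the BIC values are hypothetical): a difference of 10 in the modern form becomes a difference of 5, with the sign flipped, in the Schwarz form.

```r
# Hypothetical BICs for two competing models in the modern ln(N)k - 2LL form
bic_modern <- c(model_A = 2506.3, model_B = 2516.3)   # model_A is lower, so preferred

# The same models in the Schwarz / Nagin LL - 0.5ln(N)k form
bic_schwarz <- bic_modern / -2                        # model_A is now the HIGHER value

diff(bic_modern)    #  10: very strong evidence under the ln(N)k - 2LL rules of thumb
diff(bic_schwarz)   #  -5: differences shrink by a factor of 2 and flip sign
```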
Why are there two forms, and why did the ln(N)k-2LL form become the common one? I’m not sure, and neither is r/askstatistics. The best answer is that the ln(N)k-2LL form looks more like the Akaike information criterion (AIC) (u/DatYungChebyshev420). Akaike (1974) defined the AIC as:
\[AIC = -2\text{ln}(L) + 2k\]
where it can be seen that the complexity of a model is represented by \(2k\) in the AIC and by \(\text{ln(N)}k\) in the BIC.
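A quick side-by-side in R (my own illustration on a built-in dataset) makes the difference in the complexity penalty concrete: the AIC charges 2 per parameter, while the BIC charges \(\text{ln}(N)\) per parameter, which exceeds 2 whenever \(N > e^2 \approx 7.4\).

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
N <- nobs(fit)

c(aic_penalty_per_parameter = 2,
  bic_penalty_per_parameter = log(N))   # log(32) is about 3.47, larger than 2

c(AIC = AIC(fit), BIC = BIC(fit))       # both reported in the -2LL + penalty form
```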
Nagin (2005) describes the Bayes factor. Note that Nagin defines the BIC using the LL-0.5ln(N)k form (Nagin, 2005, pages 68-70):
> The Bayes factor (\(B_{ij}\)) measures the posterior odds of i being the correct model given the data. Subscripts i and j refer to competing models. \(B_{ij}\) is computed as the ratio of the probability of i being the correct model to j being the correct model. A Bayes factor of 1, therefore, implies that the models are equally likely, whereas a Bayes factor of 10 implies that model i is 10 times more likely than j. Computation of the Bayes factor is generally very difficult and indeed commonly impossible. Schwartz (1978) and Kass and Wasserman (1995) show that \(\text{exp}(BIC_i-BIC_j)\) is a good approximation if there is no a priori reason to believe model i or j is superior. By extension, if \(BIC_{max}\) is the maximum BIC score of the J different models under consideration, then the probability that a model j is the correct model given the data (\(p_j\)) is:
\[p_{j}= \frac{e^{BIC_j-BIC_{max}}}{\sum_je^{BIC_j-BIC_{max}}}\]
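Here is a sketch of that calculation in R, using hypothetical BIC values on Nagin's LL-0.5ln(N)k scale. Subtracting the maximum BIC before exponentiating, exactly as in the formula, also keeps the computation numerically stable:

```r
# Hypothetical BICs for three candidate models, LL - 0.5ln(N)k form (higher is better)
bic <- c(two_groups = -1260.4, three_groups = -1253.1, four_groups = -1255.8)

p_correct <- exp(bic - max(bic)) / sum(exp(bic - max(bic)))
round(p_correct, 3)
# three_groups gets most of the posterior probability because it has the highest BIC
```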
Neath & Cavanaugh (2012) rework the model posterior probability calculation for the BIC in the ln(N)k-2LL form (using Neath & Cavanaugh’s notation):
\[P(k|y) = \frac{\text{exp}(\Delta_k/-2)}{\sum\limits_{l=1}^L\text{exp}(\Delta_l/-2)}\]
where we have a set of \(L\) models \(M_{k_1}, M_{k_2}, \ldots, M_{k_L}\), \(k\) represents a parameter space and \(y\) the observed data, \(\Delta_{k}= BIC(k)-BIC(k_*)\), and \(BIC(k_*)\) is the minimum BIC value across the candidate models, with the BIC computed using the ln(N)k-2LL form.
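Taking the same hypothetical models as in the sketch above, but with their BICs multiplied by \(-2\) to put them on the ln(N)k-2LL scale, Neath & Cavanaugh's version of the formula returns identical posterior probabilities:

```r
# The same hypothetical models as above, now on the ln(N)k - 2LL scale (lower is better)
bic2 <- c(two_groups = 2520.8, three_groups = 2506.2, four_groups = 2511.6)

delta <- bic2 - min(bic2)                       # Delta_k relative to the minimum BIC
p_correct2 <- exp(delta / -2) / sum(exp(delta / -2))
round(p_correct2, 3)
# Matches the probabilities computed from the LL - 0.5ln(N)k form above
```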
Consider the following table, which is Table 4.2 from Nagin (2005, page 69), the Jeffreys scale of evidence for Bayes factors:
Bayes factor | Interpretation |
---|---|
\(B_{ij} < 0.1\) | Strong evidence for model j |
\(0.1 < B_{ij} < 0.3\) | Moderate evidence for model j |
\(0.3 < B_{ij} < 1.0\) | Weak evidence for model j |
\(1.0 < B_{ij} < 3.0\) | Weak evidence for model i |
\(3.0 < B_{ij} < 10\) | Moderate evidence for model i |
\(10 < B_{ij}\) | Strong evidence for model i |
In that table the Bayes factor is approximated with \(B_{ij} = \text{exp}(BIC_{i}- BIC_j)\), since Nagin is working with the LL-0.5ln(N)k form of the BIC. I can convert those thresholds to the scale of \(\text{ln}(B_{ij}) = BIC_i-BIC_j\) (a small worked example follows the table below):
\(\text{ln}(B_{ij}) = BIC_i-BIC_j\) | Interpretation |
---|---|
\(\text{ln}(B_{ij}) < -2.3\) | Strong evidence for model j |
\(-2.3 < \text{ln}(B_{ij}) < -1.2\) | Moderate evidence for model j |
\(-1.2 < \text{ln}(B_{ij}) < 0\) | Weak evidence for model j |
\(0 < \text{ln}(B_{ij}) < 1.1\) | Weak evidence for model i |
\(1.1 < \text{ln}(B_{ij}) < 2.3\) | Moderate evidence for model i |
\(2.3 < \text{ln}(B_{ij})\) | Strong evidence for model i |
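As a small worked example (with hypothetical BIC values on the LL-0.5ln(N)k scale):

```r
# Hypothetical BICs (LL - 0.5ln(N)k form) for two competing models
bic_i <- -1253.1   # e.g., a three-group model
bic_j <- -1255.8   # e.g., a four-group model

log_Bij <- bic_i - bic_j   # ln(Bayes factor) = 2.7
exp(log_Bij)               # Bayes factor of about 14.9: strong evidence for model i
```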
This set of criteria lines up with the first table in section 3.2 of Kass and Raftery (1995, p. 777). However, Kass and Raftery (1995) also give a second set of thresholds for the Bayes factor, stated in terms of \(2\text{ln}(B_{ij})\): on that scale, a value between 0 and 2 is “not worth more than a bare mention” and a value greater than 10 is “very strong” evidence, which is where the rules of thumb at the top of this post come from. That \(2\text{ln}(B_{ij})\) column corresponds with Neath & Cavanaugh (2012)’s Table 1.
And here is why: Neath & Cavanaugh (2012) are thinking of the BIC in the ln(N)k-2LL form. In that form, the Bayes factor (\(B_{ij}\)) is approximated with
\[ \begin{align*} B_{ij} &\approx \text{exp}([BIC_i-BIC_j]/-2)\\ \text{ln}(B_{ij}) &\approx [BIC_i-BIC_j]/-2\\ -2\text{ln}(B_{ij}) & \approx BIC_{i}- BIC_j \end{align*} \]
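The two approximations agree, as a quick numeric check with the same hypothetical pair of models shows:

```r
# The same hypothetical pair of models as above, on both BIC scales
bic_schwarz <- c(i = -1253.1, j = -1255.8)   # LL - 0.5ln(N)k form
bic_modern  <- -2 * bic_schwarz              # ln(N)k - 2LL form

exp(bic_schwarz["i"] - bic_schwarz["j"])        # Nagin's approximation of B_ij
exp((bic_modern["i"] - bic_modern["j"]) / -2)   # Neath & Cavanaugh's approximation
# Both print the same Bayes factor (about 14.9), evidence in favor of model i
```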
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716-723.
Cavanaugh, J. E., & Neath, A. A. (1999). Generalizing the derivation of the Schwarz information criterion. Communications in Statistics-Theory and Methods, 28(1), 49-66.
Jones, B. L., & Nagin, D. S. (2007). Advances in group-based trajectory modeling and an SAS procedure for estimating them. Sociological Methods & Research, 35, 542-571.
Jones, B. L., & Nagin, D. S. (2013). A Note on a Stata Plugin for Estimating Group-based Trajectory Models. Sociological Methods & Research, 42(4), 608-613. https://doi.org/10.1177/0049124113503141
Kass, R. E., & Raftery, A. E. (1995). Bayes Factors. Journal of the American Statistical Association, 90(430), 773-795.
Nagin, D. (2005). Group-based modeling of development. Harvard University Press.
Neath, A. A., & Cavanaugh, J. E. (2012). The Bayesian information criterion: background, derivation, and applications. Wiley Interdisciplinary Reviews: Computational Statistics, 4(2), 199-203.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461-464.