https://imgflip.com/i/1moa3j

Abstract

A reader will encounter two forms of the Bayesian Information Criterion (BIC). One is of the form ln(N)k-2LL, the other of the form LL-0.5ln(N)k (where LL is the natural log of the maximized model likelihood, N is the sample size, and k is the number of parameters in the model). Schwarz (1978) originally defined the BIC in the LL-0.5ln(N)k form, but ln(N)k-2LL is currently the most commonly used definition. One prominent author who continues to use the original Schwarz form (LL-0.5ln(N)k) is Daniel S. Nagin, in his work and software for group-based trajectory modeling (SAS Proc Traj, Stata module traj). The two forms differ by a constant factor (-2) and are therefore interpreted differently. The data analyst must pay close attention to which version of the BIC the software calculates, and must interpret the BIC (and compare it to the BICs of other models) using the transformations and rules of thumb appropriate to that form.

When the ln(N)k-2LL form is used to compare two or more models, a lower BIC is preferred: a BIC difference of less than 2 is not worth more than a bare mention, and a BIC difference of more than 10 is very strong evidence in favor of the model with the lower BIC. This applies to data analysts working with Mplus and with the core estimation routines in SAS, Stata, and many R packages. If the analyst is uncertain, a good (but fallible, so read the documentation) indicator that the ln(N)k-2LL form is being used is that the BIC is a positive number.

When the LL-0.5ln(N)k form is used to compare two or more models, a higher BIC is preferred: a BIC difference of less than about 1 is not worth more than a bare mention, and a BIC difference of more than 2.3 is strong evidence in favor of the model with the higher BIC (a Bayes factor above 10 on the Jeffreys scale; the thresholds of 2 and 10 given for the ln(N)k-2LL form correspond to differences of 1 and 5 in this form). This applies to data analysts working with SAS Proc Traj or the Stata traj module, or any other program using the LL-0.5ln(N)k form of the BIC. If the analyst is uncertain, a good (but fallible, so read the documentation) indicator that the LL-0.5ln(N)k form is being used is that the BIC is a negative number.

| BIC flavor | LL-0.5ln(N)k | ln(N)k-2LL |
|---|---|---|
| Typical values | Negative | Positive |
| Better models | Higher BIC | Lower BIC |
| Bayes factor approximation[^1] | \(\text{exp}(BIC_i-BIC_j)\) | \(\text{exp}([BIC_i-BIC_j]/-2)\) |

[^1] Bayes factor approximation for ln(N)k-2LL form is provided by Neath and Cavanaugh (2012) page 202.

A commonly used definition of the BIC: ln(N)k-2LL

The Bayesian Information Criterion (BIC) is a measure used to compare the quality of different statistical models (Schwarz, 1978). The BIC balances goodness of fit (with -2 times the log-likelihood, the maximum of a probability function of the data given a set of estimated parameters) and model complexity (number of parameters, k). The BIC is commonly defined as

\[BIC = -2 \times \text{ln(likelihood)} + \text{ln(N)} \times k\]

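where \(k\) is the number of parameters and \(N\) is the sample size (number of observations). Preference is given to the smallest BIC. Log-likelihoods (\(\text{ln(likelihood)}\)) don’t themselves have intrinsic meaning, but when the likelihood is a probability (as with discrete outcomes) its log is negative, so \(-2\text{ln(likelihood)}\) is positive and, because \(\text{ln(N)}k\) is also positive for \(N > 1\), so is the BIC. With continuous outcomes the likelihood is a density that can exceed 1, so the log-likelihood can be positive and the BIC negative whenever \(2\text{ln(likelihood)}\) exceeds \(\text{ln(N)}k\).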
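As a minimal sketch of the formula above, the following Python function computes the BIC in this form. The function name `bic_common` and the two log-likelihood values are illustrative assumptions, not output from any particular package; the second value is chosen to show that a positive log-likelihood (possible with continuous densities) yields a negative BIC, which is why the sign is only a fallible indicator of the form in use.

```python
import math

def bic_common(log_lik: float, n_obs: int, n_params: int) -> float:
    """BIC in the ln(N)k - 2LL form (lower is better)."""
    return math.log(n_obs) * n_params - 2.0 * log_lik

# Hypothetical fits, both with N = 500 observations and k = 6 parameters.
bic_typical = bic_common(-1234.5, 500, 6)  # negative LL -> positive BIC
bic_density = bic_common(150.0, 500, 6)    # positive LL -> negative BIC

print(bic_typical > 0, bic_density < 0)
```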

I will call this the ln(N)k-2LL form of the BIC. When it is used to compare two (or more) models, the lower the BIC the better: the model with the lower BIC offers the better trade-off between fit and complexity. This definition of the BIC can be found on Wikipedia, in the Stata User’s Guide, in at least two R packages (bayesbr::BIC_bayesbr, BayHap::BIC), in Mplus, and in at least some contexts in SAS (cf. Jones, 2010). Cavanaugh & Neath (1999) rederive the BIC in this ln(N)k-2LL form.

Schwarz’s BIC: LL-0.5ln(N)k

Schwarz (1978) expressed the BIC in his original publication following a different convention, which I will call the LL-0.5ln(N)k form:

A capture from Schwarz’s 1978 paper

A good deal of the literature on the BIC refers to this LL-0.5ln(N)k form, other papers refer to the ln(N)k-2LL form, and it is not always clear which form is being referred to. Readers must be attentive to this issue, as the practical interpretation of the BIC in comparing models differs between the two forms.

Daniel Nagin prefers LL-0.5ln(N)k

One of the authors who uses the LL-0.5ln(N)k form is Daniel Nagin, in his highly influential book Group-Based Modeling of Development (2005).

From Nagin’s book

While popular statistical analysis programs use ln(N)k-2LL as their “main” form of the BIC, some procedures, at least those related to trajectory modeling and authored by Nagin and his colleague Jones, use the LL-0.5ln(N)k form. For example, SAS Proc Traj gives the BIC in the LL-0.5ln(N)k form, whereas other SAS procedures provide the BIC in the ln(N)k-2LL form. This discrepancy is surely driven by Nagin’s preference for the LL-0.5ln(N)k form and his association with Proc Traj. Similarly, in Stata the Jones and Nagin module for trajectory modeling reports the BIC in the LL-0.5ln(N)k form (Jones & Nagin, 2013).

The relationship between the two forms of the BIC

The two forms of the BIC differ by a scaling factor of \(-2\). Let \(BIC'\) represent the BIC defined above (ln(N)k-2LL; the “modern” or “common” form of the BIC), and let \(BIC\) represent the original Schwarz BIC (LL-0.5ln(N)k; the form popularized by Nagin). Then:

\[BIC' = -2 BIC = -2(\text{ln}(L) - 0.5 \times \text{ln}(N) \times k) = -2\text{LL} + \text{ln}(N) \times k\]

\[BIC = \frac{BIC'}{-2} = \frac{-2\text{ln}(L) + \text{ln}(N)k}{-2} = LL - 0.5\text{ln}(N)k\]
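The two identities above can be checked numerically. The sketch below assumes hypothetical helper names (`bic_schwarz`, `bic_common`, `schwarz_to_common`, `common_to_schwarz`) and a made-up model fit; it is an illustration of the conversion, not code from any of the packages mentioned.

```python
import math

def bic_schwarz(log_lik, n_obs, n_params):
    """Original Schwarz form, LL - 0.5 ln(N) k (higher is better)."""
    return log_lik - 0.5 * math.log(n_obs) * n_params

def bic_common(log_lik, n_obs, n_params):
    """Common form, ln(N) k - 2 LL (lower is better)."""
    return math.log(n_obs) * n_params - 2.0 * log_lik

def schwarz_to_common(bic):
    """BIC' = -2 * BIC."""
    return -2.0 * bic

def common_to_schwarz(bic_prime):
    """BIC = BIC' / -2."""
    return bic_prime / -2.0

# Hypothetical fit: LL = -1234.5, N = 500, k = 6.
ll, n, k = -1234.5, 500, 6
b = bic_schwarz(ll, n, k)
b_prime = bic_common(ll, n, k)

# Converting in either direction recovers the other form.
print(math.isclose(schwarz_to_common(b), b_prime))
print(math.isclose(common_to_schwarz(b_prime), b))
```

A practical consequence is that a difference of \(d\) between two models in the Schwarz form appears as a difference of \(-2d\) in the common form, so both the sign and the size of a reported BIC difference depend on the form.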

Why did this change happen?

I’m not sure and neither is r/askstatistics. The best answer is that the ln(N)k-2LL form looks more like the Akaike information criterion (AIC) (u/DatYungChebyshev420). Akaike (1974) defined the AIC as:

\[AIC = -2\text{ln}(L) + 2k\]

A capture from Akaike’s 1974 paper

The complexity of a model is penalized with \(2k\) in the AIC and with \(\text{ln}(N)k\) in the BIC.
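To make the comparison of the two penalty terms concrete, here is a small sketch with hypothetical helper names (`aic_penalty`, `bic_penalty`). Because \(\text{ln}(N) > 2\) once \(N\) exceeds \(e^2 \approx 7.39\), the BIC penalizes each parameter more heavily than the AIC for any sample of 8 or more observations.

```python
import math

def aic_penalty(n_params):
    """Complexity term of the AIC: 2k."""
    return 2.0 * n_params

def bic_penalty(n_obs, n_params):
    """Complexity term of the BIC (ln(N)k - 2LL form): ln(N) k."""
    return math.log(n_obs) * n_params

# Per-parameter penalty (k = 1) at a few illustrative sample sizes.
for n in (5, 8, 100, 10_000):
    print(n, round(bic_penalty(n, 1), 3), aic_penalty(1))
```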

Bayes factor

Nagin (2005, pages 68-70) describes the Bayes factor, defining the BIC in the LL-0.5ln(N)k form:

The Bayes factor (\(B_{ij}\)) measures the posterior odds of i being the correct model given the data. Subscripts i and j refer to competing models. \(B_{ij}\) is computed as the ratio of the probability of i being the correct model to j being the correct model. A Bayes factor of 1, therefore, implies that the models are equally likely, whereas a Bayes factor of 10 implies that model i is 10 times more likely than j. Computation of the Bayes factor is generally very difficult and indeed commonly impossible. Schwarz (1978) and Kass and Wasserman (1995) show that \(\text{exp}(BIC_i-BIC_j)\) is a good approximation if there is no a priori reason to believe model i or j is superior. By extension, if \(BIC_{max}\) is the maximum BIC score of the J different models under consideration, then the probability that a model j is the correct model given the data (\(p_j\)) is:

\[p_{j}= \frac{e^{BIC_j-BIC_{max}}}{\sum_je^{BIC_j-BIC_{max}}}\]
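The formula above can be sketched in a few lines of Python. The function name `model_probs_schwarz` and the three BIC values are hypothetical; the BICs are assumed to be in the LL-0.5ln(N)k form. Subtracting \(BIC_{max}\) before exponentiating leaves the probabilities unchanged and keeps the exponentials from underflowing when the BICs are large negative numbers.

```python
import math

def model_probs_schwarz(bics):
    """Approximate posterior model probabilities from Schwarz-form BICs."""
    b_max = max(bics)
    weights = [math.exp(b - b_max) for b in bics]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical BICs (LL - 0.5 ln(N) k form) for three competing models.
bics = [-1260.3, -1255.1, -1256.0]
probs = model_probs_schwarz(bics)
print([round(p, 3) for p in probs])
```

The model with the highest BIC (the second one here) receives the largest probability, and the ratio of any two probabilities equals the approximate Bayes factor \(\text{exp}(BIC_i-BIC_j)\).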

Neath & Cavanaugh (2012) rework the model posterior probability calculation for the BIC in ln(N)k-2LL form (Neath & Cavanaugh’s notation):

\[P(k|y) = \frac{\text{exp}(\Delta_k/-2)}{\sum\limits_{l=1}^L\text{exp}(\Delta_l/-2)}\]

where we have a set of \(L\) models \(M_{k_1},M_{k_2},\ldots,M_{k_L}\), \(k\) represents a parameter space and \(y\) the observed data, and \(\Delta_{k}= BIC(k)-BIC(k_*)\), where \(BIC(k_*)\) is the minimum BIC value across the candidate models and the BIC is computed using the ln(N)k-2LL form.
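As a rough check that the two posterior probability formulas agree, the sketch below implements both and applies them to the same hypothetical set of models expressed in each form. The function names and BIC values are assumptions made for illustration.

```python
import math

def probs_from_common_bic(bics):
    """Model probabilities from BICs in the ln(N)k - 2LL form."""
    b_min = min(bics)
    weights = [math.exp((b - b_min) / -2.0) for b in bics]
    total = sum(weights)
    return [w / total for w in weights]

def probs_from_schwarz_bic(bics):
    """Model probabilities from BICs in the LL - 0.5 ln(N) k form."""
    b_max = max(bics)
    weights = [math.exp(b - b_max) for b in bics]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical common-form BICs, and the same models in the Schwarz form.
common = [2520.6, 2510.2, 2512.0]
schwarz = [b / -2.0 for b in common]

p_common = probs_from_common_bic(common)
p_schwarz = probs_from_schwarz_bic(schwarz)
print(all(math.isclose(a, b) for a, b in zip(p_common, p_schwarz)))
```

Dividing the differences by \(-2\) in the common form plays the same role as taking raw differences in the Schwarz form, so the resulting probabilities are identical whichever form the software reports.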

Two sets of criteria for interpreting Bayes factors and differences in competing models

Consider the following table, which is Table 4.2 from Nagin (2005) (page 69):

Jeffreys scale of evidence for Bayes Factors

| Bayes factor | Interpretation |
|---|---|
| \(B_{ij} < 0.1\) | Strong evidence for model j |
| \(0.1 < B_{ij} < 0.3\) | Moderate evidence for model j |
| \(0.3 < B_{ij} < 1.0\) | Weak evidence for model j |
| \(1.0 < B_{ij} < 3.0\) | Weak evidence for model i |
| \(3.0 < B_{ij} < 10\) | Moderate evidence for model i |
| \(10 < B_{ij}\) | Strong evidence for model i |

Here the Bayes factor is approximated with \(B_{ij} = \text{exp}(BIC_{i}- BIC_j)\), since Nagin is working with the LL-0.5ln(N)k form of the BIC. Taking logs converts those thresholds to the scale of the BIC difference, \(\text{ln}(B_{ij}) = BIC_i-BIC_j\):

| \(\text{ln}(B_{ij}) = BIC_i-BIC_j\) | Interpretation |
|---|---|
| \(\text{ln}(B_{ij}) < -2.3\) | Strong evidence for model j |
| \(-2.3 < \text{ln}(B_{ij}) < -1.2\) | Moderate evidence for model j |
| \(-1.2 < \text{ln}(B_{ij}) < 0\) | Weak evidence for model j |
| \(0 < \text{ln}(B_{ij}) < 1.1\) | Weak evidence for model i |
| \(1.1 < \text{ln}(B_{ij}) < 2.3\) | Moderate evidence for model i |
| \(2.3 < \text{ln}(B_{ij})\) | Strong evidence for model i |

This set of criteria lines up with the first table in section 3.2 of Kass and Raftery (1995) (p. 777).
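The Jeffreys scale as given by Nagin can be applied directly to a difference of two BICs in the LL-0.5ln(N)k form. The following is a minimal sketch; the function name `jeffreys_label` is hypothetical, and the handling of values that fall exactly on a threshold (assigned to the upper category) and of a zero difference is my assumption, since the table uses strict inequalities. Comparisons are made on the log scale so that large BIC differences do not overflow.

```python
import math

def jeffreys_label(bic_i, bic_j):
    """Label the evidence for model i versus j from Schwarz-form BICs."""
    log_bf = bic_i - bic_j  # ln(B_ij) when BIC = LL - 0.5 ln(N) k
    if log_bf == 0:
        return "No preference"  # assumption: not covered by the table
    if log_bf < math.log(0.1):
        return "Strong evidence for model j"
    if log_bf < math.log(0.3):
        return "Moderate evidence for model j"
    if log_bf < 0:
        return "Weak evidence for model j"
    if log_bf < math.log(3.0):
        return "Weak evidence for model i"
    if log_bf < math.log(10.0):
        return "Moderate evidence for model i"
    return "Strong evidence for model i"

# Hypothetical BICs: a difference of 5.2 favours model i strongly.
print(jeffreys_label(-1255.1, -1260.3))
```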

However, Kass and Raftery (1995) give another set of thresholds for the Bayes factor, expressed on the \(2\text{ln}(B_{ij})\) scale.

The \(2\text{ln}(B_{ij})\) column of that table corresponds with Table 1 of Neath & Cavanaugh (2012).

And here is why. Neath & Cavanaugh (2012) are thinking of the BIC in the ln(N)k-2LL form. In that form, the Bayes factor (\(B_{ij}\)) is approximated with

\[ \begin{aligned} B_{ij} &\approx \text{exp}([BIC_i-BIC_j]/-2)\\ \text{ln}(B_{ij}) &\approx [BIC_i-BIC_j]/-2\\ -2\text{ln}(B_{ij}) &\approx BIC_{i}- BIC_j\\ 2\text{ln}(B_{ij}) &\approx BIC_{j}- BIC_i \end{aligned} \]
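Since \(2\text{ln}(B_{ij}) \approx BIC_j - BIC_i\) in the ln(N)k-2LL form, a raw BIC difference can be read against the Kass and Raftery thresholds without any rescaling. The sketch below assumes a hypothetical function name `kass_raftery_label`. The cut-points of 2 and 10 match the rules of thumb given in the abstract; the intermediate cut-point of 6 and the labels "positive" and "strong" follow the Kass and Raftery (1995) table as it is usually reproduced and should be verified against the paper. Assigning boundary values to the upper category is my assumption.

```python
def kass_raftery_label(bic_i, bic_j):
    """Evidence for model i versus j from common-form BICs (lower is better).

    Returns (favoured model, strength of evidence).
    """
    two_log_bf = bic_j - bic_i  # approximately 2 ln(B_ij)
    if two_log_bf == 0:
        return ("neither", "no difference")
    favoured = "i" if two_log_bf > 0 else "j"
    size = abs(two_log_bf)
    if size < 2:
        strength = "not worth more than a bare mention"
    elif size < 6:
        strength = "positive"
    elif size < 10:
        strength = "strong"
    else:
        strength = "very strong"
    return (favoured, strength)

# Hypothetical BICs in the ln(N)k - 2LL form.
print(kass_raftery_label(2510.2, 2520.6))
```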

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716-723.

Cavanaugh, J. E., & Neath, A. A. (1999). Generalizing the derivation of the Schwarz information criterion. Communications in Statistics-Theory and Methods, 28(1), 49-66.

Jones, B. L., & Nagin, D. S. (2007). Advances in group-based trajectory modeling and an SAS procedure for estimating them. Sociological Methods & Research, 35(4), 542-571.

Jones, B. L., & Nagin, D. S. (2013). A Note on a Stata Plugin for Estimating Group-based Trajectory Models. Sociological Methods & Research, 42(4), 608-613. https://doi.org/10.1177/0049124113503141

Kass, R. E., & Raftery, A. E. (1995). Bayes Factors. Journal of the American Statistical Association, 90(430), 773-795.

Nagin, D. (2005). Group-based modeling of development. Harvard University Press.

Neath, A. A., & Cavanaugh, J. E. (2012). The Bayesian information criterion: background, derivation, and applications. Wiley Interdisciplinary Reviews: Computational Statistics, 4(2), 199-203.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461-464.