last update: November 11 2021

Abstract and summary

If you’re using a categorical dependent variable confirmatory factor analysis (CFA) model with covariates to identify differential item functioning (e.g., mplusmimic), estimating with one of three common-scale options (WLSMV with theta parameterization, MLR with probit link, or Bayes), and constraining the residual variance of the underlying latent trait so that its total variance is 1.0, then you can use thresholds of 0.1 and 0.375 on the absolute value of the direct effect to classify DIF as negligible, slight to moderate, or moderate to large, and be reasonably consistent with the ETS DIF categories.

ETS DIF Categories

The ETS had a system for classifying detected differential item functioning (DIF) according to magnitude and direction (Zwick, 2012):

DIF Category Description
A Negligible or nonsignificant DIF
B+, B- Slight to moderate DIF
C+, C- Moderate to large DIF

The + and - qualifiers indicate the direction of DIF: B- denotes an item with slight to moderate DIF that is more difficult for the focal group; C+ denotes an item with moderate to large DIF that is more difficult for the reference group. Zwick (2012) reports that (pages 10-11)

A 1988 [ETS] memorandum states that, in general, “Items from category A should be selected in preference to items from categories B or C … For items in category B, when there is a choice among otherwise equally appropriate items, then items with smaller absolute MH D-DIF values should be selected … Items from Category C will NOT be used unless they are judged to be essential to meet test specifications.”

ETS DIF Category Criteria are based on the MH D-DIF statistic

The criteria for assigning detected DIF to the A, B, and C categories were based on the MH D-DIF statistic, which is

\[MH \ D\text{-}DIF = -2.35 \times \ln(OR_{MH})\]

where \(OR_{MH}\) is the summary odds ratio from the Mantel-Haenszel stratified 2×2 table approach to DIF detection (typically stratified on a coarsened total score). Zwick et al. (1994) also provide that \(SE(MH \ D\text{-}DIF)\) is \(2.35\left\{\text{VAR}\left[\ln(\hat{OR}_{\text{MH}})\right]\right\}^{1/2}\). The scaling constant 2.35 is an attempt to place \(\ln(OR_{MH})\) on the delta scale of item difficulty, which has a mean of 13 and a standard deviation of 4. Item difficulties (in the classical test theory sense) are placed on the delta scale with

\[\Delta_i = 13 - 4\Phi^{-1}(\bar{p}_i)\] where \(\bar{p}_i\) is the difficulty or p-value for item \(i\) (the proportion correct) and \(\Phi^{-1}(\cdot)\) is the inverse of the standard normal cumulative distribution function. The 2.35 in MH D-DIF is \(4/1.7\), where \(1.7\) is the constant commonly used in item response theory applications to make estimates on a logit scale comparable to a normal probability or probit scale (see Camilli, 1994).
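To make these conversions concrete, here is a minimal R sketch of the two formulas above (the function names are mine, just for illustration):

delta_scale <- function(p) 13 - 4 * qnorm(p)       # classical difficulty to delta scale
mh_d_dif    <- function(or_mh) -2.35 * log(or_mh)  # summary odds ratio to MH D-DIF
mh_d_dif(1.53)  # about -1: the boundary of the A category defined below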

ETS DIF Category Definitions

DIF Category Criteria
A \(MH \ D\text{-}DIF\) is not significantly different from 0 at the 5% level (two-tailed), or the MH D-DIF statistic is smaller than 1 in absolute value (i.e., \(OR_{MH}\) is between 0.653 and 1.530)
B+, B- Neither A nor C DIF: \(MH \ D\text{-}DIF\) is significantly different from 0 (two-tailed 5% level) and \(|MH \ D\text{-}DIF| \ge 1\), but either \(|MH \ D\text{-}DIF|\) is \(< 1.5\) (i.e., \(0.528 < OR_{MH} < 1.893\)) or \(|MH \ D\text{-}DIF|\) is not significantly greater than 1 (at the one-tailed 5% level)
C+, C- The \(|MH \ D\text{-}DIF|\) must be significantly (at the 5% level, one-tailed) greater than 1 in absolute value [NB: this is not significantly greater than the null of the \(MH \ D\text{-}DIF\) statistic, which would be 0] and must have an absolute value of 1.5 or more (i.e., an \(OR_{MH}\) less than 0.528 or greater than 1.893)
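Here is a minimal R sketch of these flagging rules (the function and argument names are mine; this is my reading of the table above, not ETS code):

# classify an item given its MH D-DIF estimate and standard error
ets_dif_category <- function(d, se) {
  sig0 <- abs(d) / se >= qnorm(0.975)       # D-DIF different from 0, two-tailed 5%
  sig1 <- (abs(d) - 1) / se >= qnorm(0.95)  # |D-DIF| greater than 1, one-tailed 5%
  if (sig1 && abs(d) >= 1.5) return("C")    # large and significantly > 1
  if (!sig0 || abs(d) < 1)   return("A")    # nonsignificant or small
  "B"                                       # everything else
}
ets_dif_category(d = -1.6, se = 0.25)  # "C" under these made-up values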

The relationship between MH D-DIF and other methods of DIF detection, with an emphasis on the Mplus categorical data CFA approach

The term \(\ln(OR_{MH})\) approximates other, more commonly encountered DIF indicators, including the group effect in the logistic regression approach to DIF (e.g., lordif) when the group-by-ability interaction is 0, the direct effect in a so-called MIMIC (multiple indicators, multiple causes) model using MLR/logit, and the difference in thresholds in a multiple group categorical CFA model when the estimator is MLR/probit or WLSMV/probit (delta). Here is how we can compute an MH D-DIF effect size statistic using various Mplus estimating options in a categorical dependent variable CFA model with a grouping variable as a predictor, where the grouping variable has a direct effect \(\kappa\) on a given item with measurement slope \(\lambda\):

\[\begin{align} \Delta_i^{\text{reference}}-\Delta_i^{\text{focal}} &= 4\left[\Phi^{-1}(\bar{p}_i^{\text{reference}}) - \Phi^{-1}(\bar{p}_i^{\text{focal}})\right] \\ &\approx 2.35 \times \ln(OR_{MH}) \\ &\approx 2.35 \times \kappa^{\text{MLR,logit}} \\ &\approx 4 \times \kappa^{\text{MLR,probit}} \\ &\approx 4 \times \kappa^{\text{WLSMV,theta}} \\ &\approx 4 \times \kappa^{\text{Bayes}} \end{align}\]

And if we are using WLSMV/delta, the MH D-DIF statistic is, I think, pretty close to

\[ \frac{4\kappa}{\sqrt{1-\lambda^2}} \]

assuming the underlying latent trait has a total variance of 1 (and not a residual variance of 1; constrain the residual variance to \(1-R^2\)).
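As an R helper, the WLSMV/delta conversion would look like the following (a sketch of my reading of the formula above, used again in the example at the end):

# direct effect and loading from a WLSMV/delta model to the MH D-DIF scale
ddif_from_delta <- function(kappa, lambda) 4 * kappa / sqrt(1 - lambda^2)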

The math

Here is an explanation of how a direct effect in a MIMIC model of 0.25 corresponds to an MH D-DIF value of 1.0 when the estimator is one of the three using a common scaling (WLSMV/theta, MLR/probit, and Bayes) in Mplus, and when the latent variable variance is 1.0 (see Macintosh & Hashim, 2003). First, we know the MH D-DIF statistic is the difference in item response probabilities (here we are talking about dichotomous items) expressed on the delta scale, which has a standard deviation of 4. When we use \(2.35\ln(OR_{MH})\) we are taking a difference on the log odds scale (\(\ln(OR_{MH})\)), putting it on a normal probability scale by dividing by 1.7, and then placing it on the delta scale by multiplying by 4. Or more simply, multiplying by \(4/1.7 \approx 2.35\). In the three target MIMIC model estimator/parameterization or link function approaches mentioned above, the direct effect of the grouping variable on the item (symbolized \(\kappa\)) is already scaled on a normal probability metric. So all we need to do is multiply by 4 to get the \(\kappa\) statistic on a delta scale. \[MH \ D\text{-}DIF \approx 4\kappa\] Or, we can give up the delta scale and just translate the ETS DIF category thresholds to the normal probability metric. In this case, we divide the MH D-DIF thresholds by 4.
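Worked out, the two ETS effect-size cutoffs land here on the \(\kappa\) scale (trivial arithmetic, shown for concreteness):

c(A_cutoff = 1, C_cutoff = 1.5) / 4
## A_cutoff C_cutoff 
##    0.250    0.375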

ETS DIF using Mplus categorical variable CFA with covariates

ETS category Criteria (\(\kappa\) from MLR/probit, WLSMV/theta, or Bayes)
A i) \(|\kappa|/\text{SE}(\kappa) < 1.96\), OR
ii) \(|\kappa|<0.10\)
C i) \(|\kappa|/\text{SE}(\kappa) \ge 1.96\), AND
ii) \((|\kappa|-0.25)/\text{SE}(\kappa) \ge 1.645\) (i.e., the lower 90% confidence limit on \(\kappa\) is greater than 0.25, or the upper 90% confidence limit is less than -0.25), AND
iii) \(|\kappa| > 0.375\)
B Criteria for A and C not met
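Put together, here is a minimal R sketch of the table above (the function name is mine and the example values are made up; it assumes \(\kappa\) comes from one of the three common-scale estimators with the latent trait variance fixed at 1):

# classify DIF from a direct effect kappa and its standard error
kappa_dif_category <- function(kappa, se) {
  sig0 <- abs(kappa) / se >= 1.96            # kappa different from 0?
  sig1 <- (abs(kappa) - 0.25) / se >= 1.645  # |kappa| significantly > 0.25?
  if (sig0 && sig1 && abs(kappa) > 0.375) return("C")
  if (!sig0 || abs(kappa) < 0.10)         return("A")
  "B"
}
kappa_dif_category(kappa = -0.68, se = 0.15)  # "C" under these made-up values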

Critique of ETS DIF criteria

The ETS DIF categories and criteria are complicated. They mix together effect size and statistical significance. I don’t like that very much. Here’s a graphical illustration of DIF categorization based on direct effect (\(\kappa\)) values and the standard error of that estimate:

Figure 1. DIF category as a function of the direct effect (\(\kappa\)) and its standard error.

You can have an item with a large direct effect (\(\kappa\) of 0.5) that is classified as “B” or “A” DIF because the effect is estimated imprecisely or can’t be ruled out as a chance finding. Classifying an item as having “A” (negligible) DIF when the effect size is large but so is the standard error is akin to making the “absence of evidence is evidence of absence” error (Altman & Bland, 1995). If we find large DIF, we find large DIF. But if the standard error is large relative to the parameter estimate, all we can say is that we need more data to more firmly infer the presence and size of DIF. I have therefore shown a set of rules that can be used to label direct effect magnitudes, assuming asymptotically large samples, using the ETS DIF categories as a guide.

Other tidbits

Notes on sample size from ETS, as reported by Zwick (2012, page 10)

“…at least 200 members in the smaller group and at least 500 in total are needed for DIF analyses performed at the test assembly phase. For DIF analyses performed at the preliminary item analysis phase (after a test has been administered but before scores are reported), the minimum sample size requirements are 300 members in the smaller group and 700 in total. The rationale for the sample size requirements is that analysis results are likely to be unstable with smaller samples”

Example

Here is an example using the verbal data set from the difR package. I will use four items indicating whether the respondent reports they would curse in four situations:

Variable Description
S1DoCurse A bus fails to stop for me
S2DoCurse I miss a train because a clerk gave me faulty information
S3DoCurse The grocery store closes just as I am about to enter
S4DoCurse The operator disconnects me when I had used up my last 10 cents for a call
# Packages
# install.packages("difR")
# install.packages("MplusAutomation")
library(difR)
library(MplusAutomation)
## Version:  1.0.0
## We work hard to write this free software. Please help us get credit by citing: 
## 
## Hallquist, M. N. & Wiley, J. F. (2018). MplusAutomation: An R Package for Facilitating Large-Scale Latent Variable Analyses in Mplus. Structural Equation Modeling, 25, 621-638. doi: 10.1080/10705511.2017.1402334.
## 
## -- see citation("MplusAutomation").
# Process data
data(verbal)
keep <- c("S1DoCurse","S2DoCurse","S3DoCurse","S4DoCurse","Gender")
df <- verbal[keep]
names(df)[names(df)=="S1DoCurse"] <- "u1"
names(df)[names(df)=="S2DoCurse"] <- "u2"
names(df)[names(df)=="S3DoCurse"] <- "u3"
names(df)[names(df)=="S4DoCurse"] <- "u4"
names(df)[names(df)=="Gender"] <- "male"
# Preliminary Mplus program to get R^2 for latent variable
# Mplus object for the population model
m0 <- mplusObject(
  MODEL ="f by u1-u4*; f@1; f on male;",
  usevariables = colnames(df),
  rdata = df,
  VARIABLE = "CATEGORICAL = u1-u4",
  ANALYSIS = "ESTIMATOR = WLSMV; PARAMETERIZATION=DELTA;",
  OUTPUT = "STDY;"
)
# Run the population model
mplusModeler(m0, modelout = "m0.inp", writeData = "always", run = TRUE)
## The file(s)
##  'm0_d216d92412863d9cf242c1be6f09e0d1.dat' 
## currently exist(s) and will be overwritten
## Estimated using WLSMV 
## Number of obs: 316, number of (free) parameters: 9 
## 
## Model: Chi2(df = 5) = 5.868, p = 0.3193 
## Baseline model: Chi2(df = 10) = 543.08, p = 0 
## 
## Fit Indices: 
## 
## CFI = 0.998, TLI = 0.997, SRMR = 0.03 
## RMSEA = 0.023, 90% CI [0, 0.084], p < .05 = 0.691 
## AIC = NA, BIC = NA 
## NULL
m0_results <- readModels("m0.out")
# extract the r-squared value for the latent variable
fr2 <- m0_results$parameters$r2$est[m0_results$parameters$r2$param=="F"]
# residual variance = 1 - R^2, used below to fix the total latent variance at 1
f.rv <- 1-fr2
# now five estimators
wd <- "ESTIMATOR = WLSMV; PARAMETERIZATION=DELTA;"
wt <- "ESTIMATOR = WLSMV; PARAMETERIZATION=THETA;"
ml <- "ESTIMATOR = MLR; LINK=LOGIT;"
mp <- "ESTIMATOR = MLR; LINK=PROBIT;"
by <- "ESTIMATOR = BAYES;"
# model statement with DIF for item 1; f@(1-R^2) fixes the factor's
# residual variance so that its total variance is 1
model.is <- paste0("f by u1-u4*; f@",f.rv,"; f on male; u1 on male;")
# other common stuff
var.is <- "CATEGORICAL = u1-u4"
m.wd <- mplusObject(ANALYSIS=wd, 
                    usevariables = colnames(df),  rdata = df, 
                    VARIABLE = var.is, MODEL = model.is )
m.wt <- mplusObject(ANALYSIS=wt, 
                    usevariables = colnames(df),  rdata = df, 
                    VARIABLE = var.is, MODEL = model.is )
m.ml <- mplusObject(ANALYSIS=ml, 
                    usevariables = colnames(df),  rdata = df, 
                    VARIABLE = var.is, MODEL = model.is )
m.mp <- mplusObject(ANALYSIS=mp, 
                    usevariables = colnames(df),  rdata = df, 
                    VARIABLE = var.is, MODEL = model.is )
m.by <- mplusObject(ANALYSIS=by, 
                    usevariables = colnames(df),  rdata = df, 
                    VARIABLE = var.is, MODEL = model.is)
# run 5 models
mplusModeler(m.wd, modelout = "m.wd.inp", writeData = "always", run = TRUE)
## The file(s)
##  'm.wd_d216d92412863d9cf242c1be6f09e0d1.dat' 
## currently exist(s) and will be overwritten
## Estimated using WLSMV 
## Number of obs: 316, number of (free) parameters: 10 
## 
## Model: Chi2(df = 4) = 2.827, p = 0.5872 
## Baseline model: Chi2(df = 10) = 543.08, p = 0 
## 
## Fit Indices: 
## 
## CFI = 1, TLI = 1, SRMR = 0.019 
## RMSEA = 0, 90% CI [0, 0.073], p < .05 = 0.843 
## AIC = NA, BIC = NA 
## NULL
mplusModeler(m.wt, modelout = "m.wt.inp", writeData = "always", run = TRUE)
## The file(s)
##  'm.wt_d216d92412863d9cf242c1be6f09e0d1.dat' 
## currently exist(s) and will be overwritten
## Estimated using WLSMV 
## Number of obs: 316, number of (free) parameters: 10 
## 
## Model: Chi2(df = 4) = 2.827, p = 0.5872 
## Baseline model: Chi2(df = 10) = 543.08, p = 0 
## 
## Fit Indices: 
## 
## CFI = 1, TLI = 1, SRMR = 0.019 
## RMSEA = 0, 90% CI [0, 0.073], p < .05 = 0.843 
## AIC = NA, BIC = NA 
## NULL
mplusModeler(m.ml, modelout = "m.ml.inp", writeData = "always", run = TRUE)
## The file(s)
##  'm.ml_d216d92412863d9cf242c1be6f09e0d1.dat' 
## currently exist(s) and will be overwritten
## No PROPORTION OF DATA PRESENT sections found within COVARIANCE COVERAGE OF DATA output.
## Estimated using MLR 
## Number of obs: 316, number of (free) parameters: 10 
## 
## Fit Indices: 
## 
## CFI = NA, TLI = NA, SRMR = NA 
## RMSEA = NA, 90% CI [NA, NA], p < .05 = NA 
## AIC = 1397.25, BIC = 1434.807 
## NULL
mplusModeler(m.mp, modelout = "m.mp.inp", writeData = "always", run = TRUE)
## The file(s)
##  'm.mp_d216d92412863d9cf242c1be6f09e0d1.dat' 
## currently exist(s) and will be overwritten
## No PROPORTION OF DATA PRESENT sections found within COVARIANCE COVERAGE OF DATA output.
## Estimated using MLR 
## Number of obs: 316, number of (free) parameters: 10 
## 
## Fit Indices: 
## 
## CFI = NA, TLI = NA, SRMR = NA 
## RMSEA = NA, 90% CI [NA, NA], p < .05 = NA 
## AIC = 1396.63, BIC = 1434.187 
## NULL
mplusModeler(m.by, modelout = "m.by.inp", writeData = "always", run = TRUE)
## The file(s)
##  'm.by_d216d92412863d9cf242c1be6f09e0d1.dat' 
## currently exist(s) and will be overwritten
## Estimated using BAYES 
## Number of obs: 316, number of (free) parameters: 10 
## 
## Fit Indices: 
## 
## CFI = NA, TLI = NA, SRMR = NA 
## RMSEA = NA, 90% CI [NA, NA], p < .05 = NA 
## AIC = NA, BIC = NA 
## NULL
# 5 sets of results
wd_results <- readModels("m.wd.out")
wt_results <- readModels("m.wt.out")
ml_results <- readModels("m.ml.out")
## No PROPORTION OF DATA PRESENT sections found within COVARIANCE COVERAGE OF DATA output.
mp_results <- readModels("m.mp.out")
## No PROPORTION OF DATA PRESENT sections found within COVARIANCE COVERAGE OF DATA output.
by_results <- readModels("m.by.out")
# helper: pull an unstandardized estimate by paramHeader and param
get_est <- function(res, header, param) {
  p <- res$parameters$unstandardized
  p$est[p$paramHeader == header & p$param == param]
}
# 5 direct effects
k.wd <- get_est(wd_results, "U1.ON", "MALE")
k.wt <- get_est(wt_results, "U1.ON", "MALE")
k.ml <- get_est(ml_results, "U1.ON", "MALE")
k.mp <- get_est(mp_results, "U1.ON", "MALE")
k.by <- get_est(by_results, "U1.ON", "MALE")
# Will also need the measurement slope for WLSMV/delta
l.wd <- get_est(wd_results, "F.BY", "U1")
# 5 estimates of MH D-DIF
ddif.wd <- 4*((k.wd)/sqrt(1-l.wd^2))
ddif.wt <- 4*k.wt
ddif.ml <- 2.35*k.ml
ddif.mp <- 4*k.mp
ddif.by <- 4*k.by
# Now get the MH DIF analysis using difR
difMH_results <- difMH(df, group="male", focal.name="0", anchor=c("u2","u3","u4"), save.output=TRUE)
ddif.difR <- 2.35*log(difMH_results$alphaMH)[1]
results.ddif <- rbind(ddif.wd,ddif.wt,ddif.ml,ddif.mp,ddif.by,ddif.difR)
results.kstar <- results.ddif/4
results.k <- rbind(k.wd,k.wt,k.ml,k.mp,k.by,NA)
results <- cbind(results.ddif, results.kstar, results.k)
colnames(results) <- c("DDif","kstar","k")
rownames(results) <- c("WLSMV/delta","WLSMV/theta","MLR/logit","MLR/probit","Bayes","difR::difMH")
results
##                  DDif      kstar      k
## WLSMV/delta -3.080378 -0.7700945 -0.300
## WLSMV/theta -2.720000 -0.6800000 -0.680
## MLR/logit   -2.641400 -0.6603500 -1.124
## MLR/probit  -2.676000 -0.6690000 -0.669
## Bayes       -2.568000 -0.6420000 -0.642
## difR::difMH -2.043606 -0.5109014     NA

I take the agreement of these DDif and kstar statistics as a win. (Note: as can be seen from the code, kstar is just DDif divided by 4. It’s an attempt to express DDif on the scale of standardized direct effects, similar to what would be returned by the three common-scale Mplus estimator options [WLSMV/theta, MLR/probit, and Bayes].) The difR MH D-DIF estimate is expected to be biased towards the null because it is an observed score conditioning method and there are not very many items. The apparently biased-high WLSMV/delta estimate makes me think my conversion math might be wrong.

References

Altman, D. G., & Bland, J. M. (1995). Statistics notes: Absence of evidence is not evidence of absence. BMJ, 311(7003), 485. https://doi.org/10.1136/bmj.311.7003.485

Camilli, G. (1994). Teacher’s corner: Origin of the scaling constant D=1.7 in item response theory. Journal of Educational and Behavioral Statistics, 19(3), 293.

Macintosh, R., & Hashim, S. (2003). Variance estimation for converting MIMIC model parameters to IRT parameters in DIF analysis. Applied Psychological Measurement, 27(5), 372-379.

Zwick, R. (2012). A review of ETS differential item functioning assessment procedures: Flagging rules, minimum sample size requirements, and criterion refinement (ETS Research Report RR-12-08). ETS. https://www.ets.org/Media/Research/pdf/RR-12-08.pdf

Zwick, R., Thayer, D. T., & Wingersky, M. (1994). A simulation study of methods for assessing differential item functioning in computerized adaptive tests. Applied Psychological Measurement, 18(2), 121-140.