This tutorial provides an overview of coding schemes for binary predictors in logistic regression, with a focus on dummy (or indicator) coding, effect coding, dummy centering, and weighted dummy centering. We explain how these methods impact the interpretation of model parameters—especially the intercept and slope—and clarify key concepts such as the “grand mean” and symmetric contrasts. Using an applied example involving sickness outcomes as a function of egg salad and tuna salad consumption, we illustrate how each coding scheme is implemented and discuss their implications in both balanced and unbalanced samples. We also extend the discussion to include interactions (e.g., egg × tuna) and provide numerical examples. Finally, we highlight situations in which each coding approach may be most appropriate or potentially misleading.
Logistic regression is a widely used statistical technique for modeling binary outcomes. When a predictor is binary (e.g., indicating whether an individual ate egg salad), the way the predictor is coded influences the interpretation of the regression coefficients. Commonly used coding schemes include:

- Dummy (indicator) coding: the two groups are coded 0 and 1.
- Effect coding: the two groups are coded \(-1\) and \(+1\).
- Dummy centering: the 0/1 indicator minus its sample mean.
- Weighted dummy centering: twice the centered indicator, \(2(x - \bar{x})\).
This paper reviews these methods in the context of logistic regression, explains what is meant by the “grand mean” and symmetric contrasts, and then applies these coding schemes to a real-world dataset. We also explore how interactions between binary predictors (e.g., egg and tuna) can be modeled and interpreted.
Consider the logistic regression model
\[ \text{logit}\{P(u=1)\} = \beta_{0} + \beta_{1} x_{1}, \]
where \(u\) is the binary outcome (e.g., sickness: 1 = sick, 0 = healthy) and \(x_1\) is a binary predictor (e.g., egg salad consumption).
While our focus is on the four main schemes above, other parameterizations exist; one worth noting is \(-0.5/+0.5\) effect coding, which we return to when discussing odds ratios below.
Exponentiating the coefficients to obtain odds ratios is valid under every coding scheme, but the interpretation of the resulting odds ratio depends on how the predictor is coded. With dummy coding, a one-unit change corresponds exactly to moving from the reference group (coded 0) to the comparison group (coded 1), so the exponentiated coefficient is the odds ratio between the two groups. Dummy centering subtracts a constant from the indicator, so the slope, and hence its exponential, is unchanged: it is still the full between-group odds ratio; only the intercept is affected.
With \(-1/+1\) effect coding, a one-unit increase covers only half the distance between the groups, so the exponentiated coefficient reflects the odds ratio for half the group contrast. To obtain the odds ratio for the full contrast, exponentiate twice the coefficient (equivalently, square the exponentiated coefficient). An alternative is \(-0.5/+0.5\) effect coding, under which a one-unit change spans the full difference between the two groups: exponentiating the coefficient then yields the odds ratio for the entire contrast directly, avoiding the doubling step. For this reason, \(-0.5/+0.5\) coding is often preferred when the goal is to read full odds ratios straight off the output. In short, exponentiation is applicable under every coding scheme, but the resulting odds ratio must be interpreted according to the parameterization.
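As a small numerical sketch of the doubling rule (the value of `b1` is set by hand here to match the effect-coded egg slope estimated later in this tutorial):

# Converting an effect-coded (-1/+1) slope to odds ratios
b1 <- -0.078
exp(b1)      # ~0.925: odds ratio for a one-unit (half-contrast) change
exp(2 * b1)  # ~0.855: odds ratio for the full egg = 1 vs egg = 0 contrast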
Consider the following dataset on sickness (\(u\)) and food consumption:
\(u\) | egg | tuna | \(N\) |
---|---|---|---|
0 | 0 | 0 | 14 |
0 | 0 | 1 | 12 |
0 | 1 | 0 | 14 |
0 | 1 | 1 | 10 |
1 | 0 | 0 | 4 |
1 | 0 | 1 | 15 |
1 | 1 | 0 | 3 |
1 | 1 | 1 | 12 |
# Create a data frame with the supplied data
data <- data.frame(
u = c(0, 0, 0, 0, 1, 1, 1, 1),
egg = c(0, 0, 1, 1, 0, 0, 1, 1),
tuna = c(0, 1, 0, 1, 0, 1, 0, 1),
N = c(14, 12, 14, 10, 4, 15, 3, 12)
)
# Display the data
print(data)
## u egg tuna N
## 1 0 0 0 14
## 2 0 0 1 12
## 3 0 1 0 14
## 4 0 1 1 10
## 5 1 0 0 4
## 6 1 0 1 15
## 7 1 1 0 3
## 8 1 1 1 12
# Expand the data by replicating each row N times
df <- data[rep(1:nrow(data), data$N), ] |> subset(select = -N)
# Display the first few rows of the expanded data
head(df)
## u egg tuna
## 1 0 0 0
## 1.1 0 0 0
## 1.2 0 0 0
## 1.3 0 0 0
## 1.4 0 0 0
## 1.5 0 0 0
We will focus on the predictor “egg” and aggregate across “tuna” for part of the analysis.
Aggregating over tuna, 19 of the 45 individuals with egg = 0 were sick (log odds \(= \log(19/26) \approx -0.314\)), and 15 of the 39 individuals with egg = 1 were sick (log odds \(= \log(15/24) \approx -0.470\)). The difference in log odds between egg = 1 and egg = 0 is therefore approximately \(-0.156\).
The overall (weighted) proportion with egg = 1 is
\[
p(\text{egg}=1) = \frac{39}{84} \approx 0.4643.
\]

- Weighted mean log odds:
\[
\frac{45}{84}(-0.314) + \frac{39}{84}(-0.470) \approx -0.3865.
\]
- Unweighted (grand) mean log odds:
\[
\frac{-0.314 + (-0.470)}{2} \approx -0.392.
\]
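These quantities can be verified directly in R:

# Hand-compute the group log odds and their weighted and unweighted means
lo0 <- log(19/26)                   # egg = 0: ~ -0.314
lo1 <- log(15/24)                   # egg = 1: ~ -0.470
lo1 - lo0                           # difference: ~ -0.156
c(weighted = (45*lo0 + 39*lo1)/84,  # ~ -0.386
  grand    = (lo0 + lo1)/2)         # ~ -0.392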
# Fit the logistic regression model
model <- glm(u ~ egg, data = df, family = binomial)
# Display a summary of the model
summary(model)
##
## Call:
## glm(formula = u ~ egg, family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3137 0.3018 -1.039 0.299
## egg -0.1563 0.4466 -0.350 0.726
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.38 on 83 degrees of freedom
## Residual deviance: 113.26 on 82 degrees of freedom
## AIC: 117.26
##
## Number of Fisher Scoring iterations: 4
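On the odds scale, exponentiating the dummy-coded estimates gives the baseline odds and the between-group odds ratio:

# Odds-scale versions of the dummy-coded estimates
exp(coef(model))  # intercept ~0.731 (= 19/26, the egg = 0 odds of sickness);
                  # egg ~0.855, the odds ratio for egg = 1 vs egg = 0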
# create centered covariates
df <- df |>
transform(
egg_centered = egg - mean(egg),
tuna_centered = tuna - mean(tuna)
)
# display
df |> QSPtools::checkvar(egg, egg_centered)
## # A tibble: 2 × 3
## egg egg_centered n
## <dbl> <dbl> <int>
## 1 0 -0.464 45
## 2 1 0.536 39
df |> QSPtools::checkvar(tuna, tuna_centered)
## # A tibble: 2 × 3
## tuna tuna_centered n
## <dbl> <dbl> <int>
## 1 0 -0.583 35
## 2 1 0.417 49
# Fit the logistic regression model
modelc <- glm(u ~ egg_centered, data = df, family = binomial)
# Display a summary of the model
summary(modelc)
##
## Call:
## glm(formula = u ~ egg_centered, family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3862 0.2225 -1.736 0.0825 .
## egg_centered -0.1563 0.4466 -0.350 0.7263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.38 on 83 degrees of freedom
## Residual deviance: 113.26 on 82 degrees of freedom
## AIC: 117.26
##
## Number of Fisher Scoring iterations: 4
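Centering leaves the slope and the model fit untouched; only the intercept changes, from the reference-group log odds to the weighted mean log odds:

# The slope and deviance are identical across the two parameterizations
coef(model)["egg"]; coef(modelc)["egg_centered"]  # both ~ -0.156
c(deviance(model), deviance(modelc))              # both ~ 113.26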
# create effect coded covariates
df <- df |>
transform(
egg_effect = ifelse(egg == 0, -1, 1),
tuna_effect = ifelse(tuna == 0, -1, 1)
)
# display
df |> QSPtools::checkvar(egg, egg_effect)
## # A tibble: 2 × 3
## egg egg_effect n
## <dbl> <dbl> <int>
## 1 0 -1 45
## 2 1 1 39
df |> QSPtools::checkvar(tuna, tuna_effect)
## # A tibble: 2 × 3
## tuna tuna_effect n
## <dbl> <dbl> <int>
## 1 0 -1 35
## 2 1 1 49
# Fit the logistic regression model
modele <- glm(u ~ egg_effect, data = df, family = binomial)
# Display a summary of the model
summary(modele)
##
## Call:
## glm(formula = u ~ egg_effect, family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.39183 0.22329 -1.755 0.0793 .
## egg_effect -0.07817 0.22329 -0.350 0.7263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.38 on 83 degrees of freedom
## Residual deviance: 113.26 on 82 degrees of freedom
## AIC: 117.26
##
## Number of Fisher Scoring iterations: 4
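Consistent with the discussion of odds ratios above, doubling the effect-coded slope recovers the full dummy-coded contrast:

# The -1/+1 slope is half the dummy-coded slope; double it for the full contrast
2 * coef(modele)["egg_effect"]       # ~ -0.156, the dummy-coded slope
exp(2 * coef(modele)["egg_effect"])  # ~ 0.855, the full egg odds ratio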
# create weighted-centered covariates (twice the centered values)
df <- df |>
transform(
egg_wc = 2*(egg - mean(egg)),
tuna_wc = 2*(tuna - mean(tuna))
)
# display
df |> QSPtools::checkvar(egg, egg_wc)
## # A tibble: 2 × 3
## egg egg_wc n
## <dbl> <dbl> <int>
## 1 0 -0.929 45
## 2 1 1.07 39
df |> QSPtools::checkvar(tuna, tuna_wc)
## # A tibble: 2 × 3
## tuna tuna_wc n
## <dbl> <dbl> <int>
## 1 0 -1.17 35
## 2 1 0.833 49
# Fit the logistic regression model
modelwc <- glm(u ~ egg_wc, data = df, family = binomial)
# Display a summary of the model
summary(modelwc)
##
## Call:
## glm(formula = u ~ egg_wc, family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.38625 0.22248 -1.736 0.0825 .
## egg_wc -0.07817 0.22329 -0.350 0.7263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.38 on 83 degrees of freedom
## Residual deviance: 113.26 on 82 degrees of freedom
## AIC: 117.26
##
## Number of Fisher Scoring iterations: 4
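Because the weighted-centered code is exactly twice the centered code, the slope halves while the intercept matches the centered model:

# Relationship between the weighted-centered and earlier parameterizations
2 * coef(modelwc)["egg_wc"]   # ~ -0.156, the dummy-coded slope
coef(modelwc)["(Intercept)"]  # ~ -0.386, same as the dummy-centered intercept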
The main-effect models are summarized below (entries are coefficient estimates with standard errors in parentheses):

Term | Dummy | Dummy_Centered | Effect | Weighted_Centered |
---|---|---|---|---|
Intercept | -0.314 (0.302) | -0.386 (0.222) | -0.392 (0.223) | -0.386 (0.222) |
Egg | -0.156 (0.447) | -0.156 (0.447) | -0.078 (0.223) | -0.078 (0.223) |
When modeling interactions between two binary predictors—such as egg and tuna—the logistic regression model becomes
\[ \text{logit}\{P(u=1)\}=\beta_{0}+\beta_{1}\,\text{egg}+\beta_{2}\,\text{tuna}+\beta_{3}\,(\text{egg}\times\text{tuna}). \]
For our dataset, the cell counts for each combination of egg and tuna are those shown in the table at the start of the example.
# Fit the logistic regression model
modelint <- glm(u ~ egg*tuna, data = df, family = binomial)
# Display a summary of the model
summary(modelint)
##
## Call:
## glm(formula = u ~ egg * tuna, family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2528 0.5669 -2.210 0.0271 *
## egg -0.2877 0.8522 -0.338 0.7357
## tuna 1.4759 0.6866 2.150 0.0316 *
## egg:tuna 0.2469 1.0293 0.240 0.8105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.38 on 83 degrees of freedom
## Residual deviance: 102.33 on 80 degrees of freedom
## AIC: 110.33
##
## Number of Fisher Scoring iterations: 4
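With dummy coding, this saturated model reproduces the four cell log odds exactly, which can be checked by summing the relevant coefficients:

# Recover each cell's log odds from the dummy-coded coefficients
b <- coef(modelint)
b[["(Intercept)"]]               # log(4/14):  egg = 0, tuna = 0
b[["(Intercept)"]] + b[["egg"]]  # log(3/14):  egg = 1, tuna = 0
b[["(Intercept)"]] + b[["tuna"]] # log(15/12): egg = 0, tuna = 1
sum(b)                           # log(12/10): egg = 1, tuna = 1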
Coding: each indicator is centered at its sample mean, \(\text{egg}_{c} = \text{egg} - 0.4643\) and \(\text{tuna}_{c} = \text{tuna} - 0.5833\); the interaction is the product of the two centered codes.

Resulting values:
Egg | Tuna | egg\(_{c}\) | tuna\(_{c}\) | Interaction = egg\(_{c}\)×tuna\(_{c}\) |
---|---|---|---|---|
0 | 0 | -0.4643 | -0.5833 | 0.2708 |
0 | 1 | -0.4643 | 0.4167 | -0.1935 |
1 | 0 | 0.5357 | -0.5833 | -0.3125 |
1 | 1 | 0.5357 | 0.4167 | 0.2232 |
Interpretation: the intercept is the predicted log odds at the sample means of both predictors; each main effect is that predictor's contrast averaged over the observed distribution of the other predictor; and the interaction coefficient is identical to its dummy-coded counterpart, since centering shifts the location of the codes but not the scale of their product.
# Fit the logistic regression model
modelcint <- glm(u ~ egg_centered*tuna_centered, data = df, family = binomial)
# Display a summary of the model
summary(modelcint)
##
## Call:
## glm(formula = u ~ egg_centered * tuna_centered, family = binomial,
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4585 0.2434 -1.884 0.05962 .
## egg_centered -0.1437 0.4894 -0.294 0.76907
## tuna_centered 1.5905 0.5119 3.107 0.00189 **
## egg_centered:tuna_centered 0.2469 1.0293 0.240 0.81047
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.38 on 83 degrees of freedom
## Residual deviance: 102.33 on 80 degrees of freedom
## AIC: 110.33
##
## Number of Fisher Scoring iterations: 4
Note on comparability of hand-calculated and estimated parameters: The difference between the hand-calculated intercept and the maximum likelihood estimate from the R model is not an error. Because the logit link is nonlinear, an aggregated summary such as a weighted mean of cell-level log odds is only an approximation to the model intercept. When interactions are included, the intercept is the log odds at the point where all predictors equal zero; with centered codes, that point lies between the observed cells, and maximum likelihood estimates the intercept and slopes jointly so as to fit all four cells. The hand-calculated value therefore serves as a rough benchmark, while the estimates returned by R's glm() are the ones to report.
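A quick check makes this concrete; the cell counts and log odds come from the table above:

# Weighted mean of the four cell log odds vs. the centered-model intercept
n  <- c(18, 27, 17, 22)                 # cell sizes: (egg, tuna) = (0,0), (0,1), (1,0), (1,1)
lo <- log(c(4/14, 15/12, 3/14, 12/10))  # cell log odds of sickness
sum(n * lo) / sum(n)                    # ~ -0.461: the hand calculation
coef(modelcint)[["(Intercept)"]]        # ~ -0.459: the ML estimate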
Note on the significance level for the tuna effect in dummy coding versus the other coding schemes: The reader will notice that the z-value for tuna is about 2.2 with dummy coding but about 3.1 under the centered scheme and the other schemes reported below. The difference arises from what each parameterization estimates. With dummy coding, the tuna coefficient is a simple effect: the tuna contrast specifically at egg = 0, estimated from only part of the data and therefore with a larger standard error; the estimate is also strongly (negatively) correlated with the interaction term. The centered and effect-coded schemes instead measure the tuna main effect as a deviation from an overall (weighted or grand) mean, averaging the contrast across both egg groups. This reparameterization sharply reduces the correlation between the main effect and the interaction, yielding a smaller standard error and hence a larger z-value. The smaller z-value for tuna under dummy coding thus reflects how the baseline is defined and how variability is allocated, not a substantive difference in the effect of tuna.
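The correlation between the tuna term and the interaction term can be read off the fitted covariance matrices, and it drops sharply after centering (values are approximate for this example):

# Estimated correlation between the tuna main effect and the interaction
cov2cor(vcov(modelint))["tuna", "egg:tuna"]                              # ~ -0.67
cov2cor(vcov(modelcint))["tuna_centered", "egg_centered:tuna_centered"]  # ~  0.04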
Coding: \(\text{egg}_{eff} = -1\) when egg = 0 and \(+1\) when egg = 1, and likewise for tuna; the interaction is the product of the two codes.

Cell codes:
Egg | Tuna | egg\(_{eff}\) | tuna\(_{eff}\) | Interaction (egg\(_{eff}\)×tuna\(_{eff}\)) |
---|---|---|---|---|
0 | 0 | -1 | -1 | +1 |
0 | 1 | -1 | +1 | -1 |
1 | 0 | +1 | -1 | -1 |
1 | 1 | +1 | +1 | +1 |
Interpretation: the intercept is the unweighted grand mean of the four cell log odds; each main effect is half of that predictor's contrast averaged (unweighted) over the levels of the other predictor; and the interaction is one quarter of the difference-in-differences of cell log odds, i.e., the dummy-coded interaction divided by four.
# Fit the logistic regression model
modeleint <- glm(u ~ egg_effect*tuna_effect, data = df, family = binomial)
# Display a summary of the model
summary(modeleint)
##
## Call:
## glm(formula = u ~ egg_effect * tuna_effect, family = binomial,
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.59694 0.25733 -2.320 0.02036 *
## egg_effect -0.08213 0.25733 -0.319 0.74962
## tuna_effect 0.79967 0.25733 3.108 0.00189 **
## egg_effect:tuna_effect 0.06172 0.25733 0.240 0.81047
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.38 on 83 degrees of freedom
## Residual deviance: 102.33 on 80 degrees of freedom
## AIC: 110.33
##
## Number of Fisher Scoring iterations: 4
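The effect-coded intercept can be verified as the unweighted grand mean of the four cell log odds:

# Unweighted grand mean of the cell log odds
lo <- log(c(4/14, 15/12, 3/14, 12/10))  # (egg, tuna) = (0,0), (0,1), (1,0), (1,1)
mean(lo)                                # ~ -0.597, matching the intercept above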
Note on standard errors: All four coefficients in the effect-coded interaction model share the same standard error (0.257), even though this sample is unbalanced. With \(-1/+1\) coding in a saturated \(2 \times 2\) model, each coefficient is a contrast of the form \(\tfrac{1}{4}(\pm \ell_{00} \pm \ell_{01} \pm \ell_{10} \pm \ell_{11})\), where \(\ell_{jk}\) is the log odds in cell \((j,k)\). The four cell log odds are estimated independently, and the signs do not affect the variance of the sum, so every coefficient has variance \(\tfrac{1}{16}\sum_{j,k}\operatorname{Var}(\ell_{jk})\) and hence the same standard error. In a balanced design this symmetry is also visible in the codes themselves, but the equality of standard errors here does not require balance.
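The shared standard error follows directly from that variance formula:

# Each effect-coded SE is sqrt(sum of cell log odds variances) / 4
v <- (1/4 + 1/14) + (1/15 + 1/12) + (1/3 + 1/14) + (1/12 + 1/10)  # sum of 1/count over all 8 cells
sqrt(v) / 4                  # ~ 0.2573
sqrt(diag(vcov(modeleint)))  # all four ~ 0.2573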
Coding: \(\text{egg}_{w} = 2(\text{egg} - \overline{\text{egg}})\) and \(\text{tuna}_{w} = 2(\text{tuna} - \overline{\text{tuna}})\). The codes are mean-zero in the sample, but unlike \(-1/+1\) effect codes they are not symmetric about zero when the groups are unbalanced.

Cell codes for the interaction:
Egg | Tuna | egg\(_{w}\) | tuna\(_{w}\) | Interaction = egg\(_{w}\)×tuna\(_{w}\) |
---|---|---|---|---|
0 | 0 | -0.9286 | -1.1667 | 1.0833 |
0 | 1 | -0.9286 | 0.8333 | -0.7738 |
1 | 0 | 1.0714 | -1.1667 | -1.2500 |
1 | 1 | 1.0714 | 0.8333 | 0.8929 |
Interpretation: the intercept matches the dummy-centered model (the log odds at the sample means); because the weighted-centered codes are twice the centered codes, each slope is half of its dummy-centered counterpart, and the interaction is one quarter of the dummy-centered (equivalently, dummy-coded) interaction.
# Fit the logistic regression model
modelwcint <- glm(u ~ egg_wc*tuna_wc, data = df, family = binomial)
# Display a summary of the model
summary(modelwcint)
##
## Call:
## glm(formula = u ~ egg_wc * tuna_wc, family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.45853 0.24343 -1.884 0.05962 .
## egg_wc -0.07184 0.24469 -0.294 0.76907
## tuna_wc 0.79526 0.25596 3.107 0.00189 **
## egg_wc:tuna_wc 0.06172 0.25733 0.240 0.81047
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.38 on 83 degrees of freedom
## Residual deviance: 102.33 on 80 degrees of freedom
## AIC: 110.33
##
## Number of Fisher Scoring iterations: 4
The interaction models are summarized below (entries are coefficient estimates with standard errors in parentheses):

Term | Dummy | Dummy_Centered | Effect | Weighted_Centered |
---|---|---|---|---|
Intercept | -1.253 (0.567) | -0.459 (0.243) | -0.597 (0.257) | -0.459 (0.243) |
Egg | -0.288 (0.852) | -0.144 (0.489) | -0.082 (0.257) | -0.072 (0.245) |
Tuna | 1.476 (0.687) | 1.591 (0.512) | 0.800 (0.257) | 0.795 (0.256) |
Interaction | 0.247 (1.029) | 0.247 (1.029) | 0.062 (0.257) | 0.062 (0.257) |
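The rescaling relationships in the table can be confirmed directly from the fitted models:

# The same interaction contrast under the different codings
coef(modelint)[["egg:tuna"]]                     # ~ 0.247
coef(modelcint)[["egg_centered:tuna_centered"]]  # ~ 0.247 (unchanged by centering)
4 * coef(modeleint)[["egg_effect:tuna_effect"]]  # ~ 0.247 (quarter-scaled by ±1 codes)
4 * coef(modelwcint)[["egg_wc:tuna_wc"]]         # ~ 0.247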
Dummy Coding:
Best when a natural reference group exists and you want to interpret
changes relative to that group. The intercept represents the baseline
cell (egg = 0, tuna = 0). However, it does not provide an intercept
reflecting the overall mean outcome.
Dummy Centering:
Useful when the research question calls for an intercept that reflects
the overall (weighted) mean outcome. It is simple to implement and helps
reduce collinearity in interaction models. However, the contrast between
groups may be asymmetric in unbalanced samples.
Effect Coding:
Provides a symmetric contrast with fixed values (\(-1\) and \(+1\)), so the intercept is the unweighted
grand mean and slopes are interpreted as half differences. This is ideal
in balanced samples but can be misleading in unbalanced designs where
the unweighted average does not reflect sample proportions.
Weighted Dummy Centering:
Combines centering (so that the intercept reflects the overall weighted
mean) with a contrast that is symmetric in balanced designs. In
unbalanced samples, the codes still have mean zero, but the two values
are not equally spaced from zero, which requires care when interpreting
the slopes and interactions.
When including interactions in logistic regression with binary predictors, the choice of coding affects both the main effects and the interaction term:

- Under dummy coding, each main effect is a simple effect at the reference level of the other predictor, and the intercept is the reference cell's log odds.
- Under centering (weighted or not), the main effects become averaged effects and the intercept moves to the mean covariate point, while the interaction coefficient itself is unchanged from dummy coding.
- Under \(-1/+1\) effect coding, the interaction coefficient is rescaled to one quarter of the dummy-coded value, with a correspondingly smaller standard error.
The choice of coding scheme affects whether the intercept reflects a specific reference group, the overall (weighted) outcome, or an unweighted grand mean. In balanced samples, effect coding and weighted dummy centering yield symmetric contrasts that simplify interpretation; in unbalanced samples, weighted centering maintains a zero mean but the contrasts are not perfectly symmetric.
Analysts should select the coding scheme that best aligns with their research question. For instance, if the primary interest is comparing treatment effects relative to a natural control group, dummy coding may be most informative. Alternatively, if the focus is on deviations from the overall outcome, centered approaches may be preferable. When modeling interactions, centering can reduce multicollinearity and simplify the interpretation of main effects, although special care is needed in unbalanced samples.