This tutorial provides an overview of coding schemes for binary predictors in logistic regression, with a focus on dummy (or indicator) coding, effect coding, dummy centering, and weighted dummy centering. We explain how these methods impact the interpretation of model parameters—especially the intercept and slope—and clarify key concepts such as the “grand mean” and symmetric contrasts. Using an applied example involving sickness outcomes as a function of egg salad and tuna salad consumption, we illustrate how each coding scheme is implemented and discuss their implications in both balanced and unbalanced samples. We also extend the discussion to include interactions (e.g., egg × tuna) and provide numerical examples. Finally, we highlight situations in which each coding approach may be most appropriate or potentially misleading.
Logistic regression is a widely used statistical technique for modeling binary outcomes. When a predictor is binary (e.g., indicating whether an individual ate egg salad), the way the predictor is coded influences the interpretation of the regression coefficients. Commonly used coding schemes include:

- Dummy (indicator) coding: the two groups are coded 0 and 1.
- Effect coding: the two groups are coded \(-1\) and \(+1\).
- Dummy centering: the 0/1 indicator minus its sample mean.
- Weighted dummy centering: twice the centered indicator, \(2(x - \bar{x})\).
This paper reviews these methods in the context of logistic regression, explains what is meant by the “grand mean” and symmetric contrasts, and then applies these coding schemes to a real-world dataset. We also explore how interactions between binary predictors (e.g., egg and tuna) can be modeled and interpreted.
Consider the logistic regression model
\[ \text{logit}\{P(u=1)\} = \beta_{0} + \beta_{1} x_{1}, \]
where \(u\) is the binary outcome (e.g., sickness: 1 = sick, 0 = healthy) and \(x_1\) is a binary predictor (e.g., egg salad consumption).
While our focus is on the four main schemes above, other parameterizations exist; one worth noting is \(-0.5/+0.5\) effect coding, which we return to when discussing odds ratios below.
Exponentiating the coefficients to obtain odds ratios is valid under every coding scheme, but the interpretation of the resulting odds ratio depends on how the predictor is coded. With dummy coding, a one-unit change corresponds exactly to moving from the reference group (coded 0) to the comparison group (coded 1), so the exponentiated coefficient is the odds ratio between the two groups. Dummy centering subtracts a constant from the indicator, so the slope, and hence its exponential, is unchanged: it is still the full between-group odds ratio; only the intercept is affected.
With \(-1/+1\) effect coding, a one-unit increase covers only half the distance between the groups, so the exponentiated coefficient reflects the odds ratio for half the group contrast. To obtain the odds ratio for the full contrast, exponentiate twice the coefficient (equivalently, square the exponentiated coefficient). An alternative is \(-0.5/+0.5\) effect coding, under which a one-unit change spans the full difference between the two groups: exponentiating the coefficient then yields the odds ratio for the entire contrast directly, avoiding the doubling step. For this reason, \(-0.5/+0.5\) coding is often preferred when the goal is to read full odds ratios straight off the output. In short, exponentiation is applicable under every coding scheme, but the resulting odds ratio must be interpreted according to the parameterization.
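As a small numerical sketch of the doubling rule (the value of `b1` is set by hand here to match the effect-coded egg slope estimated later in this tutorial):

# Converting an effect-coded (-1/+1) slope to odds ratios
b1 <- -0.078
exp(b1)      # ~0.925: odds ratio for a one-unit (half-contrast) change
exp(2 * b1)  # ~0.855: odds ratio for the full egg = 1 vs egg = 0 contrast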
Consider the following dataset on sickness (\(u\)) and food consumption:
\(u\) | egg | tuna | \(N\) |
---|---|---|---|
0 | 0 | 0 | 14 |
0 | 0 | 1 | 12 |
0 | 1 | 0 | 14 |
0 | 1 | 1 | 10 |
1 | 0 | 0 | 4 |
1 | 0 | 1 | 15 |
1 | 1 | 0 | 3 |
1 | 1 | 1 | 12 |
# Create a data frame with the supplied data
data <- data.frame(
u = c(0, 0, 0, 0, 1, 1, 1, 1),
egg = c(0, 0, 1, 1, 0, 0, 1, 1),
tuna = c(0, 1, 0, 1, 0, 1, 0, 1),
N = c(14, 12, 14, 10, 4, 15, 3, 12)
)
# Display the data
print(data)
## u egg tuna N
## 1 0 0 0 14
## 2 0 0 1 12
## 3 0 1 0 14
## 4 0 1 1 10
## 5 1 0 0 4
## 6 1 0 1 15
## 7 1 1 0 3
## 8 1 1 1 12
# Expand the data by replicating each row N times
df <- data[rep(1:nrow(data), data$N), ] |> subset(select = -N)
# Display the first few rows of the expanded data
head(df)
## u egg tuna
## 1 0 0 0
## 1.1 0 0 0
## 1.2 0 0 0
## 1.3 0 0 0
## 1.4 0 0 0
## 1.5 0 0 0
We will focus on the predictor “egg” and aggregate across “tuna” for part of the analysis.
Aggregating over tuna, 19 of the 45 individuals with egg = 0 were sick (log odds \(= \log(19/26) \approx -0.314\)), and 15 of the 39 individuals with egg = 1 were sick (log odds \(= \log(15/24) \approx -0.470\)). The difference in log odds between egg = 1 and egg = 0 is therefore approximately \(-0.156\).
The overall (weighted) proportion with egg = 1 is
\[
p(\text{egg}=1) = \frac{39}{84} \approx 0.4643.
\]

- Weighted mean log odds:
\[
\frac{45}{84}(-0.314) + \frac{39}{84}(-0.470) \approx -0.3865.
\]
- Unweighted (grand) mean log odds:
\[
\frac{-0.314 + (-0.470)}{2} \approx -0.392.
\]
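These quantities can be verified directly in R:

# Hand-compute the group log odds and their weighted and unweighted means
lo0 <- log(19/26)                   # egg = 0: ~ -0.314
lo1 <- log(15/24)                   # egg = 1: ~ -0.470
lo1 - lo0                           # difference: ~ -0.156
c(weighted = (45*lo0 + 39*lo1)/84,  # ~ -0.386
  grand    = (lo0 + lo1)/2)         # ~ -0.392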
# Fit the logistic regression model
model <- glm(u ~ egg, data = df, family = binomial)
# Display a summary of the model
summary(model)
##
## Call:
## glm(formula = u ~ egg, family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3137 0.3018 -1.039 0.299
## egg -0.1563 0.4466 -0.350 0.726
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.38 on 83 degrees of freedom
## Residual deviance: 113.26 on 82 degrees of freedom
## AIC: 117.26
##
## Number of Fisher Scoring iterations: 4
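On the odds scale, exponentiating the dummy-coded estimates gives the baseline odds and the between-group odds ratio:

# Odds-scale versions of the dummy-coded estimates
exp(coef(model))  # intercept ~0.731 (= 19/26, the egg = 0 odds of sickness);
                  # egg ~0.855, the odds ratio for egg = 1 vs egg = 0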
# create centered covariates
df <- df |>
transform(
egg_centered = egg - mean(egg),
tuna_centered = tuna - mean(tuna)
)
# display
df |> QSPtools::checkvar(egg, egg_centered)
## # A tibble: 2 × 3
## egg egg_centered n
## <dbl> <dbl> <int>
## 1 0 -0.464 45
## 2 1 0.536 39
df |> QSPtools::checkvar(tuna, tuna_centered)
## # A tibble: 2 × 3
## tuna tuna_centered n
## <dbl> <dbl> <int>
## 1 0 -0.583 35
## 2 1 0.417 49
# Fit the logistic regression model
modelc <- glm(u ~ egg_centered, data = df, family = binomial)
# Display a summary of the model
summary(modelc)
##
## Call:
## glm(formula = u ~ egg_centered, family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3862 0.2225 -1.736 0.0825 .
## egg_centered -0.1563 0.4466 -0.350 0.7263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.38 on 83 degrees of freedom
## Residual deviance: 113.26 on 82 degrees of freedom
## AIC: 117.26
##
## Number of Fisher Scoring iterations: 4
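Centering leaves the slope and the model fit untouched; only the intercept changes, from the reference-group log odds to the weighted mean log odds:

# The slope and deviance are identical across the two parameterizations
coef(model)["egg"]; coef(modelc)["egg_centered"]  # both ~ -0.156
c(deviance(model), deviance(modelc))              # both ~ 113.26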
# create effect coded covariates
df <- df |>
transform(
egg_effect = ifelse(egg == 0, -1, 1),
tuna_effect = ifelse(tuna == 0, -1, 1)
)
# display
df |> QSPtools::checkvar(egg, egg_effect)
## # A tibble: 2 × 3
## egg egg_effect n
## <dbl> <dbl> <int>
## 1 0 -1 45
## 2 1 1 39
df |> QSPtools::checkvar(tuna, tuna_effect)
## # A tibble: 2 × 3
## tuna tuna_effect n
## <dbl> <dbl> <int>
## 1 0 -1 35
## 2 1 1 49
# Fit the logistic regression model
modele <- glm(u ~ egg_effect, data = df, family = binomial)
# Display a summary of the model
summary(modele)
##
## Call:
## glm(formula = u ~ egg_effect, family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.39183 0.22329 -1.755 0.0793 .
## egg_effect -0.07817 0.22329 -0.350 0.7263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.38 on 83 degrees of freedom
## Residual deviance: 113.26 on 82 degrees of freedom
## AIC: 117.26
##
## Number of Fisher Scoring iterations: 4
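Consistent with the discussion of odds ratios above, doubling the effect-coded slope recovers the full dummy-coded contrast:

# The -1/+1 slope is half the dummy-coded slope; double it for the full contrast
2 * coef(modele)["egg_effect"]       # ~ -0.156, the dummy-coded slope
exp(2 * coef(modele)["egg_effect"])  # ~ 0.855, the full egg odds ratio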
# create weighted-centered covariates (twice the centered values)
df <- df |>
transform(
egg_wc = 2*(egg - mean(egg)),
tuna_wc = 2*(tuna - mean(tuna))
)
# display
df |> QSPtools::checkvar(egg, egg_wc)
## # A tibble: 2 × 3
## egg egg_wc n
## <dbl> <dbl> <int>
## 1 0 -0.929 45
## 2 1 1.07 39
df |> QSPtools::checkvar(tuna, tuna_wc)
## # A tibble: 2 × 3
## tuna tuna_wc n
## <dbl> <dbl> <int>
## 1 0 -1.17 35
## 2 1 0.833 49
# Fit the logistic regression model
modelwc <- glm(u ~ egg_wc, data = df, family = binomial)
# Display a summary of the model
summary(modelwc)
##
## Call:
## glm(formula = u ~ egg_wc, family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.38625 0.22248 -1.736 0.0825 .
## egg_wc -0.07817 0.22329 -0.350 0.7263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.38 on 83 degrees of freedom
## Residual deviance: 113.26 on 82 degrees of freedom
## AIC: 117.26
##
## Number of Fisher Scoring iterations: 4
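Because the weighted-centered code is exactly twice the centered code, the slope halves while the intercept matches the centered model:

# Relationship between the weighted-centered and earlier parameterizations
2 * coef(modelwc)["egg_wc"]   # ~ -0.156, the dummy-coded slope
coef(modelwc)["(Intercept)"]  # ~ -0.386, same as the dummy-centered intercept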
The main-effect models are summarized below (entries are coefficient estimates with standard errors in parentheses):

Term | Dummy | Dummy_Centered | Effect | Weighted_Centered |
---|---|---|---|---|
Intercept | -0.314 (0.302) | -0.386 (0.222) | -0.392 (0.223) | -0.386 (0.222) |
Egg | -0.156 (0.447) | -0.156 (0.447) | -0.078 (0.223) | -0.078 (0.223) |
When modeling interactions between two binary predictors—such as egg and tuna—the logistic regression model becomes
\[ \text{logit}\{P(u=1)\}=\beta_{0}+\beta_{1}\,\text{egg}+\beta_{2}\,\text{tuna}+\beta_{3}\,(\text{egg}\times\text{tuna}). \]
For our dataset, the cell counts for each combination of egg and tuna are those shown in the table at the start of the example.
# Fit the logistic regression model
modelint <- glm(u ~ egg*tuna, data = df, family = binomial)
# Display a summary of the model
summary(modelint)
##
## Call:
## glm(formula = u ~ egg * tuna, family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2528 0.5669 -2.210 0.0271 *
## egg -0.2877 0.8522 -0.338 0.7357
## tuna 1.4759 0.6866 2.150 0.0316 *
## egg:tuna 0.2469 1.0293 0.240 0.8105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.38 on 83 degrees of freedom
## Residual deviance: 102.33 on 80 degrees of freedom
## AIC: 110.33
##
## Number of Fisher Scoring iterations: 4
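With dummy coding, this saturated model reproduces the four cell log odds exactly, which can be checked by summing the relevant coefficients:

# Recover each cell's log odds from the dummy-coded coefficients
b <- coef(modelint)
b[["(Intercept)"]]               # log(4/14):  egg = 0, tuna = 0
b[["(Intercept)"]] + b[["egg"]]  # log(3/14):  egg = 1, tuna = 0
b[["(Intercept)"]] + b[["tuna"]] # log(15/12): egg = 0, tuna = 1
sum(b)                           # log(12/10): egg = 1, tuna = 1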
Coding: each indicator is centered at its sample mean, \(\text{egg}_{c} = \text{egg} - 0.4643\) and \(\text{tuna}_{c} = \text{tuna} - 0.5833\); the interaction is the product of the two centered codes.

Resulting values:
Egg | Tuna | egg\(_{c}\) | tuna\(_{c}\) | Interaction = egg\(_{c}\)×tuna\(_{c}\) |
---|---|---|---|---|
0 | 0 | -0.4643 | -0.5833 | 0.2708 |
0 | 1 | -0.4643 | 0.4167 | -0.1935 |
1 | 0 | 0.5357 | -0.5833 | -0.3125 |
1 | 1 | 0.5357 | 0.4167 | 0.2232 |
Interpretation: the intercept is the predicted log odds at the sample means of both predictors; each main effect is that predictor's contrast averaged over the observed distribution of the other predictor; and the interaction coefficient is identical to its dummy-coded counterpart, since centering shifts the location of the codes but not the scale of their product.
# Fit the logistic regression model
modelcint <- glm(u ~ egg_centered*tuna_centered, data = df, family = binomial)
# Display a summary of the model
summary(modelcint)
##
## Call:
## glm(formula = u ~ egg_centered * tuna_centered, family = binomial,
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4585 0.2434 -1.884 0.05962 .
## egg_centered -0.1437 0.4894 -0.294 0.76907
## tuna_centered 1.5905 0.5119 3.107 0.00189 **
## egg_centered:tuna_centered 0.2469 1.0293 0.240 0.81047
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.38 on 83 degrees of freedom
## Residual deviance: 102.33 on 80 degrees of freedom
## AIC: 110.33
##
## Number of Fisher Scoring iterations: 4
Note on comparability of hand-calculated and estimated parameters: The difference between the hand-calculated intercept and the maximum likelihood estimate from the R model is not an error. Because the logit link is nonlinear, an aggregated summary such as a weighted mean of cell-level log odds is only an approximation to the model intercept. When interactions are included, the intercept is the log odds at the point where all predictors equal zero; with centered codes, that point lies between the observed cells, and maximum likelihood estimates the intercept and slopes jointly so as to fit all four cells. The hand-calculated value therefore serves as a rough benchmark, while the estimates returned by R's glm() are the ones to report.
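A quick check makes this concrete; the cell counts and log odds come from the table above:

# Weighted mean of the four cell log odds vs. the centered-model intercept
n  <- c(18, 27, 17, 22)                 # cell sizes: (egg, tuna) = (0,0), (0,1), (1,0), (1,1)
lo <- log(c(4/14, 15/12, 3/14, 12/10))  # cell log odds of sickness
sum(n * lo) / sum(n)                    # ~ -0.461: the hand calculation
coef(modelcint)[["(Intercept)"]]        # ~ -0.459: the ML estimate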
Note on the significance level for the tuna effect in dummy coding versus the other coding schemes: The reader will notice that the z-value for tuna is about 2.2 with dummy coding but about 3.1 under the centered scheme and the other schemes reported below. The difference arises from what each parameterization estimates. With dummy coding, the tuna coefficient is a simple effect: the tuna contrast specifically at egg = 0, estimated from only part of the data and therefore with a larger standard error; the estimate is also strongly (negatively) correlated with the interaction term. The centered and effect-coded schemes instead measure the tuna main effect as a deviation from an overall (weighted or grand) mean, averaging the contrast across both egg groups. This reparameterization sharply reduces the correlation between the main effect and the interaction, yielding a smaller standard error and hence a larger z-value. The smaller z-value for tuna under dummy coding thus reflects how the baseline is defined and how variability is allocated, not a substantive difference in the effect of tuna.
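The correlation between the tuna term and the interaction term can be read off the fitted covariance matrices, and it drops sharply after centering (values are approximate for this example):

# Estimated correlation between the tuna main effect and the interaction
cov2cor(vcov(modelint))["tuna", "egg:tuna"]                              # ~ -0.67
cov2cor(vcov(modelcint))["tuna_centered", "egg_centered:tuna_centered"]  # ~  0.04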
Coding: \(\text{egg}_{eff} = -1\) when egg = 0 and \(+1\) when egg = 1, and likewise for tuna; the interaction is the product of the two codes.

Cell codes:
Egg | Tuna | egg\(_{eff}\) | tuna\(_{eff}\) | Interaction (egg\(_{eff}\)×tuna\(_{eff}\)) |
---|---|---|---|---|
0 | 0 | -1 | -1 | +1 |
0 | 1 | -1 | +1 | -1 |
1 | 0 | +1 | -1 | -1 |
1 | 1 | +1 | +1 | +1 |
Interpretation: the intercept is the unweighted grand mean of the four cell log odds; each main effect is half of that predictor's contrast averaged (unweighted) over the levels of the other predictor; and the interaction is one quarter of the difference-in-differences of cell log odds, i.e., the dummy-coded interaction divided by four.
# Fit the logistic regression model
modeleint <- glm(u ~ egg_effect*tuna_effect, data = df, family = binomial)
# Display a summary of the model
summary(modeleint)
##
## Call:
## glm(formula = u ~ egg_effect * tuna_effect, family = binomial,
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.59694 0.25733 -2.320 0.02036 *
## egg_effect -0.08213 0.25733 -0.319 0.74962
## tuna_effect 0.79967 0.25733 3.108 0.00189 **
## egg_effect:tuna_effect 0.06172 0.25733 0.240 0.81047
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.38 on 83 degrees of freedom
## Residual deviance: 102.33 on 80 degrees of freedom
## AIC: 110.33
##
## Number of Fisher Scoring iterations: 4
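The effect-coded intercept can be verified as the unweighted grand mean of the four cell log odds:

# Unweighted grand mean of the cell log odds
lo <- log(c(4/14, 15/12, 3/14, 12/10))  # (egg, tuna) = (0,0), (0,1), (1,0), (1,1)
mean(lo)                                # ~ -0.597, matching the intercept above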
Note on standard errors: All four coefficients in the effect-coded interaction model share the same standard error (0.257), even though this sample is unbalanced. With \(-1/+1\) coding in a saturated \(2 \times 2\) model, each coefficient is a contrast of the form \(\tfrac{1}{4}(\pm \ell_{00} \pm \ell_{01} \pm \ell_{10} \pm \ell_{11})\), where \(\ell_{jk}\) is the log odds in cell \((j,k)\). The four cell log odds are estimated independently, and the signs do not affect the variance of the sum, so every coefficient has variance \(\tfrac{1}{16}\sum_{j,k}\operatorname{Var}(\ell_{jk})\) and hence the same standard error. In a balanced design this symmetry is also visible in the codes themselves, but the equality of standard errors here does not require balance.
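The shared standard error follows directly from that variance formula:

# Each effect-coded SE is sqrt(sum of cell log odds variances) / 4
v <- (1/4 + 1/14) + (1/15 + 1/12) + (1/3 + 1/14) + (1/12 + 1/10)  # sum of 1/count over all 8 cells
sqrt(v) / 4                  # ~ 0.2573
sqrt(diag(vcov(modeleint)))  # all four ~ 0.2573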
Coding: \(\text{egg}_{w} = 2(\text{egg} - \overline{\text{egg}})\) and \(\text{tuna}_{w} = 2(\text{tuna} - \overline{\text{tuna}})\). The codes are mean-zero in the sample, but unlike \(-1/+1\) effect codes they are not symmetric about zero when the groups are unbalanced.

Cell codes for the interaction:
Egg | Tuna | egg\(_{w}\) | tuna\(_{w}\) | Interaction = egg\(_{w}\)×tuna\(_{w}\) |
---|---|---|---|---|
0 | 0 | -0.9286 | -1.1667 | 1.0833 |
0 | 1 | -0.9286 | 0.8333 | -0.7738 |
1 | 0 | 1.0714 | -1.1667 | -1.2500 |
1 | 1 | 1.0714 | 0.8333 | 0.8929 |
Interpretation: the intercept matches the dummy-centered model (the log odds at the sample means); because the weighted-centered codes are twice the centered codes, each slope is half of its dummy-centered counterpart, and the interaction is one quarter of the dummy-centered (equivalently, dummy-coded) interaction.
# Fit the logistic regression model
modelwcint <- glm(u ~ egg_wc*tuna_wc, data = df, family = binomial)
# Display a summary of the model
summary(modelwcint)
##
## Call:
## glm(formula = u ~ egg_wc * tuna_wc, family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.45853 0.24343 -1.884 0.05962 .
## egg_wc -0.07184 0.24469 -0.294 0.76907
## tuna_wc 0.79526 0.25596 3.107 0.00189 **
## egg_wc:tuna_wc 0.06172 0.25733 0.240 0.81047
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.38 on 83 degrees of freedom
## Residual deviance: 102.33 on 80 degrees of freedom
## AIC: 110.33
##
## Number of Fisher Scoring iterations: 4
The interaction models are summarized below (entries are coefficient estimates with standard errors in parentheses):

Term | Dummy | Dummy_Centered | Effect | Weighted_Centered |
---|---|---|---|---|
Intercept | -1.253 (0.567) | -0.459 (0.243) | -0.597 (0.257) | -0.459 (0.243) |
Egg | -0.288 (0.852) | -0.144 (0.489) | -0.082 (0.257) | -0.072 (0.245) |
Tuna | 1.476 (0.687) | 1.591 (0.512) | 0.800 (0.257) | 0.795 (0.256) |
Interaction | 0.247 (1.029) | 0.247 (1.029) | 0.062 (0.257) | 0.062 (0.257) |
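The rescaling relationships in the table can be confirmed directly from the fitted models:

# The same interaction contrast under the different codings
coef(modelint)[["egg:tuna"]]                     # ~ 0.247
coef(modelcint)[["egg_centered:tuna_centered"]]  # ~ 0.247 (unchanged by centering)
4 * coef(modeleint)[["egg_effect:tuna_effect"]]  # ~ 0.247 (quarter-scaled by ±1 codes)
4 * coef(modelwcint)[["egg_wc:tuna_wc"]]         # ~ 0.247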
Dummy Coding:
Best when a natural reference group exists and you want to interpret
changes relative to that group. The intercept represents the baseline
cell (egg = 0, tuna = 0). However, it does not provide an intercept
reflecting the overall mean outcome.
Dummy Centering:
Useful when the research question calls for an intercept that reflects
the overall (weighted) mean outcome. It is simple to implement and helps
reduce collinearity in interaction models. However, the contrast between
groups may be asymmetric in unbalanced samples.
Effect Coding:
Provides a symmetric contrast with fixed values (\(-1\) and \(+1\)), so the intercept is the unweighted
grand mean and slopes are interpreted as half differences. This is ideal
in balanced samples but can be misleading in unbalanced designs where
the unweighted average does not reflect sample proportions.
Weighted Dummy Centering:
Combines centering (so that the intercept reflects the overall weighted
mean) with a contrast that is symmetric in balanced designs. In
unbalanced samples, the codes still have mean zero, but the two values
are not equally spaced from zero, which requires care when interpreting
the slopes and interactions.
When including interactions in logistic regression with binary predictors, the choice of coding affects both the main effects and the interaction term:

- Under dummy coding, each main effect is a simple effect at the reference level of the other predictor, and the intercept is the reference cell's log odds.
- Under centering (weighted or not), the main effects become averaged effects and the intercept moves to the mean covariate point, while the interaction coefficient itself is unchanged from dummy coding.
- Under \(-1/+1\) effect coding, the interaction coefficient is rescaled to one quarter of the dummy-coded value, with a correspondingly smaller standard error.
The choice of coding scheme affects whether the intercept reflects a specific reference group, the overall (weighted) outcome, or an unweighted grand mean. In balanced samples, effect coding and weighted dummy centering yield symmetric contrasts that simplify interpretation; in unbalanced samples, weighted centering maintains a zero mean but the contrasts are not perfectly symmetric.
Analysts should select the coding scheme that best aligns with their research question. For instance, if the primary interest is comparing treatment effects relative to a natural control group, dummy coding may be most informative. Alternatively, if the focus is on deviations from the overall outcome, centered approaches may be preferable. When modeling interactions, centering can reduce multicollinearity and simplify the interpretation of main effects, although special care is needed in unbalanced samples.