A dangerous side effect of categorical variables as factors when exporting data

In 2019, Aboumatar and Wise requested, and received, a retraction for their article published in the Journal of the American Medical Association (JAMA). The authors had discovered a coding error that reversed their original findings. They write:

The identified programming error was in a file used for preparation of the analytic data sets for statistical analysis and occurred while the variable referring to the study “arm” (ie, group) assignment was recoded. The purpose of the recoding was to change the randomization assignment variable format of “1, 2” to a binary format of “0, 1.” However, the assignment was made incorrectly and resulted in a reversed coding of the study groups.

This can easily happen. In this essay I will show a recipe for how this kind of error can occur with the following ingredients:

R for data management
labelled data
Factor variables
Converting data from R to Stata with haven::write_dta

REPEX 1 (the “right” way)

REPEX 1 behaves as expected. We create a data set with a binary variable (coded 0/1), apply labels to the levels of the binary variable (no/yes), and save to a Stata data file. The Stata data file retains the original variable values (0/1) and the labels.

# libraries and programs
# install.packages("pacman")
library(pacman)
pacman::p_load(dplyr, haven, labelled)

# Toy data
id <- c(10003, 10009, 10020)
vdr <- c(NA,1,0)
df <- as.data.frame(cbind(id,vdr))

# apply labels
df <- df %>% 
   labelled::set_variable_labels(vdr  = "rearrested") %>% 
   labelled::set_value_labels(vdr  = c("No" = 0 , "Yes" = 1)) 

# Write data to Stata format
haven::write_dta(df, "vdr.dta")

# Check on what was written
data_check_dta <- haven::read_dta("vdr.dta")
data_check_dta

## # A tibble: 3 x 2
##      id       vdr
##   <dbl> <dbl+lbl>
## 1 10003  NA      
## 2 10009   1 [Yes]
## 3 10020   0 [No]

The key we will see is that the variable df$vdr is stored as a numeric (double) with labels.

str(df$vdr)

##  dbl+lbl [1:3] NA,  1,  0
##  @ labels: Named num [1:2] 0 1
##   ..- attr(*, "names")= chr [1:2] "No" "Yes"
##  @ label : chr "rearrested"

REPEX 2 (the “wrong” way)

REPEX 2 goes wrong by converting df$vdr to a factor variable before writing the Stata data file. This is just the addition of one line of code (df$vdr <- as.factor(df$vdr)).

# Toy data
id <- c(10003, 10009, 10020)
vdr <- c(NA,1,0)
df <- as.data.frame(cbind(id,vdr))

# apply labels
df <- df %>% 
   labelled::set_variable_labels(vdr  = "rearrested") %>% 
   labelled::set_value_labels(vdr  = c("No" = 0 , "Yes" = 1)) 

# convert vdr to factor
df$vdr <- as.factor(df$vdr)

# Write data to Stata format
haven::write_dta(df, "vdr.dta")

# Check on what was written
data_check_dta <- haven::read_dta("vdr.dta")
data_check_dta

## # A tibble: 3 x 2
##      id       vdr
##   <dbl> <dbl+lbl>
## 1 10003    NA    
## 2 10009     2 [1]
## 3 10020     1 [0]

str(df$vdr)

##  Factor w/ 2 levels "0","1": NA 2 1

The problem is, in the Stata file the rows where, in the original R data frame the variable df$vdr was coded 0 are now coded 1 in the Stata file. If the Stata user was assuming that a 1 on the variable vdr indicated the presence of something, and a 0 indicated the absence of the thing (as was originally intended in the R data management), the Stata user will have it backwards (reversed). For example

use vdr.dta, clear
gen r=(vdr==1)

The Stata user will end up coding people who were not rearrested as having been rearrested, and vice versa. Big problem.

[Note: This Stata use could have protected him/her self by being more explicit in the coding of new variable r. For example, gen r = (vdr==1) if inlist(vdr,0,1) would have returned a variable that was either 1 of missing. It would still be not what the coder wanted, but it would highlight that there was some kind of a data problem.]

Variable labels seem to be lost in Stata, but they are there just different

Also note that the the Stata data file did not retain the value labels for the categorical variable vdr, when the variable as converted to a factor before writing to Stata format.

As a Stata user, your first clue to their being a problem is if value lables should be in your data file for categorical variables, but are not.

. use vdr.dta , clear

. table vdr

----------------------
      vdr |      Freq.
----------+-----------
        0 |          1
        1 |          1
----------------------

It seems like the variable vdr in the Stata data file is coded correctly, i.e., it has values 0/1. But Stata is using value lables in the table command. The value labels can be revealed with the codebook command:

. codebook vdr

----------------------------------------------------------------------------------------
vdr                                                                          (unlabeled)
----------------------------------------------------------------------------------------

                  type:  numeric (long)
                 label:  vdr

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  1/3

            tabulation:  Freq.   Numeric  Label
                             1         1  0
                             1         2  1
                             1         .

This kind of weirdness can be very difficult to catch.

Generalizing the issue: more than 2 categories

This issue generalizes to a situation where the categorical variable has more than 2 values. Here we demonstrate that the factor variable is represented as successively increasing integers in the saved Stata data file regardless of the original values in the R data frame.

id <- c(10003, 10009, 10020, 10031)
x <- c(0,1,2,5)
df <- as.data.frame(cbind(id,x))
df$x <- as.factor(df$x)
str(df$x)

##  Factor w/ 4 levels "0","1","2","5": 1 2 3 4

haven::write_dta(df,"x.dta")
data_check_dta <- haven::read_dta("x.dta")
data_check_dta

## # A tibble: 4 x 2
##      id         x
##   <dbl> <dbl+lbl>
## 1 10003     1 [0]
## 2 10009     2 [1]
## 3 10020     3 [2]
## 4 10031     4 [5]

This can end up being a particularly challenging problem when dealing with categorical data. I am imaging a situation where there is a multiple item rating scale with 5 response categories (1,2,3,4,5). Different subsets of the data or different study sites might have some “empty cells” or particular response categories that are not represented in the data frame. For example, one data frame might have no respondents with a response category of “3”. Another data frame might have no response categories that are “4”. When the data are written to Stata, both example data frames will have four responses but the meaning of the 3, 4 and 5 are different in the two studies. If the data are converted to factors and then written to Stata, it may be hard to find the right values to use.

Return to QSP blog post index

A dangerous side effect of categorical variables as factors when exporting data

Rich Jones

9/10/2021

REPEX 1 (the “right” way)

REPEX 2 (the “wrong” way)

Variable labels seem to be lost in Stata, but they are there just different

Generalizing the issue: more than 2 categories