Secondary data are data that already exist and that were collected for some purpose other than the one to which they are now being put. This guideline is intended to be generic rather than specific to a particular field or kind of data analysis project. Careful thought and planning should go into any data analysis project well before data are analyzed.
Primary data analysis projects – meaning a data analysis directed at answering specific questions using a source of data specifically collected to address those questions – generally benefit from the careful thought and planning that went into designing the study, especially if the project is funded through a mechanism that involves peer scientific review.
Secondary analysis projects do not always benefit from the same level of built-in planning. When data are readily available, it is tempting to initiate analysis in search of interesting relationships that might be worth pursuing and publishing. But there are important dangers in jumping into analysis without careful thought and planning. These include a glut of false-positive research findings in the literature; missed true-positive findings (i.e., false negatives) due to ill-formed, cursory analysis decisions; and a never-ending analysis merry-go-round (an inescapable Garden of Forking Paths) that eats up valuable time and does not lead to publishable research findings.
There are three phases to a secondary data analysis research project. Especially for team-based research projects, phases 1 and 2 can be pursued in parallel. Phase 3 should always be initiated after phases 1 and 2 are complete.
Phase 1: Conceptual and organizational activities
Phase 2: Initial analysis
Phase 3: Formal analysis
Think of an interesting question or problem. Write it down. If there are hypotheses to test, write those down too. Be very specific.
Review the research question and hypotheses with co-investigators and/or co-authors. Refine. Iterate steps 1-2 until consensus is achieved.
Discuss and document authorship with co-authors. This includes criteria for authorship, tentative authorship order, and roles and responsibilities for the proposed project among your co-authors and other collaborators. Writing this down facilitates the discussion and keeps it transparent. Do not skip or shortchange this task because it feels “premature” or awkward. Conversations about authorship are much more awkward when all the work is done but not everyone agrees on author order. Revise as needed throughout the development of the project and paper. The written documentation provides evidence of who was thought to be involved with the project and when, and makes authorship order clear from the outset so there are fewer surprises later. Identify multiple target journals and consider details of manuscript preparation (length, number of tables and figures, etc.).
Find a reporting guideline that fits the kind of analysis you would like to do (e.g., via the EQUATOR Network) and use it as a guide when considering the analytic design of the project.
Adopt and adhere to a workflow system that facilitates reproducibility and communication with co-investigators (including future you). A well-described workflow system for Stata users – full of good generic advice – is JS Long’s Workflow for Data Analysis Using Stata. Numerous workflow systems have been described for R users (cf. Jenny Bryan’s slide deck and the guide from the British Ecological Society). The important thing is to have a set of rules and procedures that guide the organization of work files, and to apply them consistently. The result is a replicable analysis and clear documentation of analysis decisions and procedures.
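To make this concrete, here is a minimal sketch of one possible R-based layout and driver script; the directory and file names are illustrative placeholders, not part of any particular published workflow system.

    # Illustrative project layout (names are placeholders):
    #   project/
    #     data_source/    source data, never modified
    #     data_derived/   derived analysis files written by scripts
    #     scripts/        numbered analysis scripts
    #     output/         tables, figures, logs
    # A driver script runs the numbered scripts in order, so the full
    # analysis can be reproduced from the source data in one pass.
    source("scripts/01_derive_variables.R")
    source("scripts/02_descriptives.R")
    source("scripts/03_models.R")
    source("scripts/04_sensitivity.R")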
Identify a data source. Obtain data documentation. Understand the study design. If available, read a paper or two already published using the data source. If the study is an RCT, find out and record the registry information. Don’t forget about sampling weights if appropriate (e.g., for complex multistage samples). Consider speaking with someone who is familiar with the data source, such as an investigator from the parent study. This step can help identify whether others are already working on the same topic, whether other data sources are better suited to specific aspects of your questions, or whether there are analytic considerations that are important for that particular data set.
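If the data come from a complex multistage sample, design information must be carried into the analysis. The sketch below uses R’s survey package with placeholder variable names (psu, stratum, samp_wt, outcome) and assumes an analytic data frame dat has already been loaded.

    library(survey)
    # Declare the complex design so weights, strata, and clustering are honored
    des <- svydesign(ids = ~psu, strata = ~stratum, weights = ~samp_wt,
                     data = dat, nest = TRUE)
    svymean(~outcome, design = des, na.rm = TRUE)  # design-based mean and SE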
Think about the threats to the validity of conclusions from your research project, and develop a strategy to deal with those threats. The most common types of threats to the validity of your conclusions include bias, confounding, and chance.
Bias refers to systematic aspects of the design of the study that push conclusions one way or the other (in favor of your question or hypothesis [anti-conservative], or in opposition to your question or hypothesis [conservative]). Sometimes it is hard to know whether a potential bias is conservative or anti-conservative. Regardless, you should at least identify potential sources of bias, and ideally develop a plan for dealing with them or describing their likely range of influence.
Confounding is a type of bias that arises from characteristics of the people (or other units of study) that are associated with both the exposure and the outcome and can distort results. It is imperative that potential confounding factors be considered and identified, and that a plan be put in place to control for them, prior to data analysis. Consider crafting a directed acyclic graph (DAG) if you know how to do that. Do not forget to consider conceptual collinearity or endogeneity among covariates, and between covariates, exposure variables, and outcomes. Read more about confounding here.
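As a hedged illustration, the R dagitty package can encode assumed causal relations and report which covariate sets are sufficient for adjustment; the exposure, outcome, and covariate names below are placeholders.

    library(dagitty)
    # Assumed causal structure (placeholder names); edit to match your setting
    dag <- dagitty("dag {
      age -> exposure
      age -> outcome
      ses -> exposure
      ses -> outcome
      exposure -> outcome
    }")
    adjustmentSets(dag, exposure = "exposure", outcome = "outcome")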
Chance refers to the role of sampling variability in the obtained results.
Consider how practical or clinical importance will be defined or conveyed. Having a clear idea of how practical importance will be defined helps convey the impact of the results when they arrive. It is also essential for framing related questions about statistical power, and for dealing with secondary data that may have more than adequate power to address the research question. Consider whether research questions or aims can incorporate a metric of practical or clinical importance, and if so, reframe them.
Consider whether the available sample is large enough to test the desired hypothesis, obtain reasonably precise estimates, or generate reliable inferences. Even descriptive studies have hidden hypothesis tests, although these might not become apparent until after the manuscript has been drafted (put another way, descriptive studies often end up being described as if they were hypothesis-testing studies; this is not appropriate, but it is also not a reason to avoid considering power and sample size from the beginning). Reconsider pursuing research questions that have inadequate power to detect the hypothesized effect (of practical and clinical importance).
There may be good reasons to pursue an under-powered study. For instance, if the question is important and the data are the best that are available, sometimes the best thing for the field would be to do the analyses and present the result with appropriate uncertainty.
On the other hand, there may be reasons to refrain from pursuing an underpowered study. Suppose the results of your research project do not support the hypotheses. If the sample size does not provide appropriate statistical power, the null findings are essentially uninformative. If the premise is strong, a null finding in an under-powered study might lead you or other investigators to give up on an otherwise promising line of research (by inappropriately concluding the null is true).
Consider whether the sample size is so large that small effects might achieve “statistical significance” despite being of little practical or clinical importance. Remember that small clinical effects may have enormous public health impact.
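Because the sample size in a secondary analysis is fixed, a useful planning exercise is to compute the smallest effect detectable with the available n and compare it with the threshold of practical or clinical importance defined above. A minimal base-R sketch, with invented numbers, follows.

    # Detectable difference in means with 400 per group, SD = 1, 80% power
    # (all values are placeholders; substitute your own design assumptions)
    power.t.test(n = 400, sd = 1, power = 0.80, sig.level = 0.05)$delta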
Consider if and how multiple comparisons will be handled. Be specific about how many hypotheses are being tested, and in how many ways. Label primary, confirmatory, secondary, and sensitivity analyses before you run them, and specify this in the written analysis protocol. This step is important, and is iterative in some ways with Phase 1 steps 1 and 2. It forces you to place your bets in terms of statistical power: do you want to put all of your power into a single comparison, or spread it out by testing several things at once? This is an important conceptual question to reach agreement on with your colleagues before you begin data analyses.
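If the plan calls for adjusting a small set of pre-specified comparisons, one simple option is a p-value adjustment; the sketch below uses invented p-values purely to illustrate the mechanics in R.

    p_raw <- c(0.012, 0.030, 0.041, 0.200)   # placeholder p-values
    p.adjust(p_raw, method = "bonferroni")   # conservative adjustment
    p.adjust(p_raw, method = "holm")         # uniformly more powerful alternative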
Sketch out or outline an analytic plan. Specify outcomes, exposures, confounders, and, if appropriate, effect modifiers (aka moderators) and mediators. Specify analytic procedures and the tests of hypotheses or statistical summaries to generate. Specify procedures to address bias and chance. If there are covariates, consider specifying in the analysis plan that both unadjusted and adjusted results will be examined. If the project is NIH-funded, it may be necessary to specify handling of sex or other relevant biological variables. Often a variable could plausibly be either a confounder or a mediator, but your data lack the temporal resolution to decide which is more appropriate; in that case, you may wish to show analyses with and without adjustment for the ambiguous covariate. Specify how missing data will be handled, both in the outcome and in the predictor variable sets. Sketch out draft/mock tables, figures, and captions. Check the analytic plan against publication guidelines, if available. Put all of this in a Statistical Analysis Plan document and use it to log deviations from the original plan.
Review and revise the pre-statistical analysis plan with co-investigators or co-authors until consensus is achieved and all have “signed off” on the plan. At this point, you and your team could consider submitting the paper proposal as a registered report if your target journal supports that kind of submission (see https://osf.io/8mpji/wiki/home/). Note, however, that these guidelines may not conform to the requirements of a registered report, and modification and expansion of these procedures will be needed to meet the requirements of the target journal.
Obtain data and store securely. This can be a lengthy process if the data are not publicly available or not already available to you within your local research setting. Obtaining data from collaborators can involve a data use agreement (DUA) between your institution and the institution providing the data. This process can involve institutional signatory authorities, legal review, and information technology staff. You may need to secure institutional review board (IRB) approval or notification of exemption. When data are available, try to obtain data as close to source data as possible. Source data are data that are minimally processed relative to how observations were originally recorded. Make a backup copy of the source data file as resources allow (i.e., copy it and put it someplace safe and don’t ever touch it except for an emergency).
Define and code relevant variables (outcomes, exposures, confounders). Make sure you understand any missing value codes and skip patterns that may be present. Always use syntax (if you must point-and-click, paste the resulting commands into a command file). Never overwrite the source data file; instead, generate new working derived data files. Never recode an original variable in place or create a new variable with the same name as the original: always create a new variable with a new name.
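A minimal R sketch of this rule, with placeholder file and variable names: the source file is read but never written to, recodes create new variables, and results are saved to a separate derived file.

    src <- read.csv("data_source/source_data.csv")                 # read-only source data
    drv <- src                                                     # working copy
    drv$age_group <- cut(drv$age, breaks = c(0, 45, 65, Inf),
                         labels = c("<45", "45-64", "65+"))        # new derived variable
    drv$bmi_clean <- ifelse(drv$bmi %in% c(-9, 999), NA, drv$bmi)  # recode missing codes into a new name
    saveRDS(drv, "data_derived/analysis_file.rds")                 # separate derived file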
Examine descriptives by creating a “Table 1.” Include all variables to be used in the analysis. Do not encode the main hypothesis or question in the construction of this table. For instance, this table may have columns for variable names, variable labels, value labels, and distributional statistics. This step is (a) a check of the data distributions, and (b) an opportunity for all co-authors, collaborators, and colleagues to make sure that the concepts of interest are represented in the available data. You may choose a different format for “Table 1” in your manuscript. Also report on missing data. Generate plots that show the distribution of variables to be used in the analysis. Investigate empirical collinearity, out-of-range values, extreme skew, and other scaling issues with outcomes and predictors, and handle them in a reasonable way.
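One of many ways to produce such a table in R is the tableone package; in this hedged sketch the variable names are placeholders and drv is assumed to be the derived analysis file from the previous step.

    library(tableone)
    vars <- c("age", "sex", "education", "exposure", "outcome")  # placeholder names
    tab1 <- CreateTableOne(vars = vars, data = drv)
    print(tab1, missing = TRUE)   # include the proportion missing for each variable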
Revise analytic plan to accommodate findings relevant to “Table 1.” For example, revise the strategy used to address missing data, or drop or reconsider confounders on the basis of their distribution. Review with co-investigators and co-authors or other collaborators.
From this point on, you must not modify the analytic plan; otherwise you embark on a journey through the Garden of Forking Paths, and that is bad. Modifications to the analysis plan are not the same thing as secondary or sensitivity analyses (see the sensitivity analyses step later in Phase 3). However, sometimes you simply must make changes to the analytic plan after formal data analysis begins: the strategy for handling missing data does not work, the software won’t estimate the kind of model you had hoped to use, or you discover collinearity at the time of analysis that you missed in the conceptual review. It happens. But recognize that even when essential or justified, changing the analysis plan after starting the analysis remains a trip into the Garden of Forking Paths, and the work is vulnerable to being perceived as, or actually representing, P-hacking. Responsible reporting in a manuscript will document these post-analysis modifications as limitations.
Prepare a crude figure that illustrates the main outcome or hypothesis to be evaluated: for example, a boxplot comparing the distribution of a continuous outcome across two groups, or a scatterplot showing how two continuous variables are related to one another. Don’t worry about publication quality at this point; focus on understanding the data. Sometimes tables will have to suffice where figures are too challenging.
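A hedged base-R sketch of such a quick look, again with placeholder variable names and the assumed analysis data frame drv:

    # Crude comparison of a continuous outcome across two groups
    boxplot(outcome ~ group, data = drv, main = "Outcome by group (crude)")
    # Crude look at the relation between two continuous variables
    plot(outcome ~ exposure, data = drv)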
Estimate a model that tests the aim or hypothesis, or otherwise accomplishes the goal of the analysis. Generate a near-publication-quality table summarizing the model or results. Generate a figure that summarizes the modeled result, as appropriate. Remember to express results in the context of the clinical and practical significance defined in Phase 1: for example, if your primary results are expressed as ratio measures (relative risks, odds ratios), consider also presenting absolute effect estimates (e.g., risk differences) to facilitate interpretation. If there are covariates, examine adjusted and unadjusted models. Do not forget to assess how well the models fit and the degree to which model assumptions are satisfied.
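For a binary outcome, a minimal sketch of unadjusted and adjusted models might look like the following; the variable names are placeholders, drv is the assumed analysis data frame, and the intervals shown are simple Wald intervals.

    m_unadj <- glm(outcome ~ exposure, family = binomial, data = drv)
    m_adj   <- glm(outcome ~ exposure + age + sex + ses, family = binomial, data = drv)
    summary(m_adj)                                        # coefficients and model summary
    exp(cbind(OR = coef(m_adj), confint.default(m_adj)))  # odds ratios with Wald 95% CIs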
Review results with co-authors, co-investigators, and other collaborators. The goal of the review is to assess the clarity of the presentation of model results and conformity to the a priori specified analysis plan. Revise analyses as indicated following the review, to improve conformity to the a priori analytic plan or otherwise improve the clarity of the presentation of model results.
Conduct sensitivity analyses or other additional analyses that help to determine the robustness of the inferences to model and modeling assumptions or otherwise help make the main result more understandable. But do not replace the main result with a result of a secondary or sensitivity analysis. Consider the problem of empirical collinearity. Test robustness of model inferences to analytic decisions (e.g., missing data handling, subject selection, covariate inclusion/exclusion).
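A hedged sketch of one simple robustness check: re-fit the primary model under alternative analytic decisions (an added covariate, an alternative subject selection) and compare the exposure coefficient. All names are placeholders.

    m_main     <- glm(outcome ~ exposure + age + sex, family = binomial, data = drv)
    m_add_ses  <- glm(outcome ~ exposure + age + sex + ses, family = binomial, data = drv)
    m_restrict <- glm(outcome ~ exposure + age + sex, family = binomial,
                      data = subset(drv, age >= 50))
    sapply(list(main = m_main, add_ses = m_add_ses, age_50_plus = m_restrict),
           function(m) coef(m)["exposure"])   # how stable is the exposure estimate?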
Have someone double-check the analysis code. This is a lot of work, so that person would normally be a co-author or be invited to become one. Ideally, you’ll be sharing the code for the analysis with the entire world for all time, so make sure it is neat, tidy, and well documented. Write the abstract and identify the main point of the article.
Generate publication-quality tables and figures. Review the abstract and proposed final tables and figures, along with the sensitivity and other additional analyses, with co-authors and other collaborators. Revise as necessary.
Write an outline of your paper. It may be helpful to discuss the outline with co-authors before you dive in to writing the paper.
Write the paper. This is a helpful resource. Identify a manuscript in the target journal and use it to guide formatting and presentation.
Sometimes investigators will:
skip from Phase 1 step 1 (question) to Phase 2 step 1 (obtain data) and then skip to Phase 3 step 1 (crude test); or
proceed from Phase 2 step 1 (have data) to Phase 1 step 1 (question) to Phase 3 step 1 (crude test); or even
fish directly from Phase 2 step 1 (obtain data) to Phase 3 step 1 (crude test).
These shortcut strategies are pursued to determine whether the question is worth the effort given the available data: it is hard to justify doing all the pre-statistical and initial analysis work if the desired answer isn’t apparent from a quick pass at Phase 3 step 1 (crude test).
This is bad practice because it leads to the file-drawer problem, and because lax data management and injudicious analysis decisions can lead to type-II errors (concluding an effect is not there when it really is). The better way is a thoughtful and systematic progression through the conceptualization, management, and analysis tasks. If the question is important, well-reasoned, and well-suited to high-quality available data, the findings will be interesting regardless of whether the results support the motivating hypothesis.
Last Edited Date: November 10, 2023
Contributors:
Rich Jones, Brown University (rnjones@brown.edu)
Maria Glymour, UCSF (now Boston University)
Paul Crane, University of Washington (pcrane@uw.edu)
Laura Gibbons, University of Washington (gibbonsl@uw.edu)
Jennifer Manly, Columbia University (jjm71@cumc.columbia.edu)