Linking or Equating Scores: Linear Equating

Author

Richard N. Jones

Published

December 21, 2025

1 Bottom Line Up Front: Place scores on test X on the scale of test Y when scores come from randomly equivalent groups

The general formula for linear equating is:

\[ \hat{y}(x) = \mu_Y + \frac{\sigma_Y}{\sigma_X}(x - \mu_X). \]

To place a score from test X on the scale of test Y, calculate how far the original score falls from its own mean (\(x - \mu_X\)), stretch or shrink that distance to match the spread of test Y (multiply by \(\frac{\sigma_Y}{\sigma_X}\)), and then add the result to test Y’s mean (\(\mu_Y\)).

Linear equating is defensible when:

  • Scores on X and Y come from randomly equivalent groups. If equivalence is induced using measured covariates (e.g., weighting), the result is typically interpretable for the target population defined by that balancing step rather than for everyone.

  • The forms were designed as alternate forms (“parallel” forms in the classical test theory sense): they target the same construct, sample the same content domain in similar proportions, and would be expected to have similar measurement properties if administered to the same population. This is what makes an equating claim meaningful; without it, a conversion is usually better described as linking.1

  • A linear mapping is a good approximation over the score range you care about (shape mismatch from ceiling/floor effects is a common failure mode).

When these conditions hold, linear equating is transparent, symmetric (unlike regression prediction), and easy to explain.

2 Why this topic shows up in longitudinal aging studies

Longitudinal aging studies run into version changes. A site switches from one screening test to another, a cohort changes a memory battery, or a consortium wants to pool data where different forms were used. Raw scores from different forms are not interchangeable. If one form is easier, the same raw score represents a different standing in the study population.

Equating and linking methods form a large family of techniques for handling scores that come from different measurement instruments. Equating has a specific target in mind: after equating, scores from two forms can be used interchangeably for the intended group of test takers.

This post focuses on linear equating. I mention other approaches only when linear equating might not be the best choice. I draw heavily on Livingston (2004) and Bandalos (2018) for source material (see References).

3 What equating is, and what it is not

Bandalos (2018) defines equating as a statistical process that adjusts scores from different forms so they can be used interchangeably, and emphasizes that this only makes sense when forms were built to measure the same content and were intended to be parallel.

Livingston (2004) offers a useful general definition of equated scores: a score on a new form and a score on a reference form are equivalent for a specific group of test takers if they represent the same relative position in that group.

Two implications matter in practice:

  1. Equating is group-defined. Equating targets a population (or defined subpopulation), not an individual-specific correction. Bandalos notes that group invariance implies equating adjusts scores for groups rather than individual examinees. The equating function is estimated from group score distributions and can be applied to an individual score, but it depends only on the observed score (and the target group used to define the mapping). It may not transport to other populations, especially those with very different score distributions.

  2. Equating is symmetric and is not regression prediction. If score x on Form A equates to score y on Form B, then y on B must equate back to x on A. Livingston uses this symmetry requirement to contrast equating with statistical prediction, which is not symmetric because of regression to the mean.
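To make the symmetry point concrete, here is a minimal R sketch. The summary values and the correlation `r` are made up for illustration. Regression prediction needs a correlation between forms (and therefore paired scores), which linear equating does not use.

```r
# Illustrative (made-up) summaries for Forms A and B in one target group
mu_a <- 20; sd_a <- 5
mu_b <- 18; sd_b <- 6
r <- 0.7  # hypothetical correlation between forms; used only by regression

# Linear equating in each direction
equate_a_to_b <- function(a) mu_b + (sd_b / sd_a) * (a - mu_a)
equate_b_to_a <- function(b) mu_a + (sd_a / sd_b) * (b - mu_b)

# Regression prediction in each direction (slope is shrunk by r)
predict_b_from_a <- function(a) mu_b + r * (sd_b / sd_a) * (a - mu_a)
predict_a_from_b <- function(b) mu_a + r * (sd_a / sd_b) * (b - mu_b)

a <- 28
round_trip_equate  <- equate_b_to_a(equate_a_to_b(a))
round_trip_predict <- predict_a_from_b(predict_b_from_a(a))

round_trip_equate   # recovers the starting score
round_trip_predict  # pulled back toward the Form A mean
```

The equating round trip recovers 28. The prediction round trip shrinks the distance from the mean by a factor of \(r^2\), landing at \(20 + 0.49 \times 8 = 23.92\), which is the regression-to-the-mean asymmetry Livingston describes.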

4 How do we do equating?

Linear equating is one of the simplest equating methods. It uses only means and standard deviations from the two forms in the target group to define a linear mapping. It is transparent and easy to explain. But is it applicable to your data?

4.1 The design question comes first: are the groups equivalent?

Before you pick an equating method (linear, equipercentile, etc.), you need an equating design (how the data were collected).

Linear equating is cleanest under a randomly equivalent groups design, where group differences reflect only random variation. Ideally, a single sample is randomly split into two subsamples, one taking Form X and the other Form Y. If groups are not equivalent by design, you have three options:

  • Induce equivalence using measured covariates (for example, weighting).
  • Use an anchor test (NEAT designs) and shift to methods like Tucker, Levine, or chained equating.
  • Admit that the data cannot support a defensible link and revisit the measurement plan.

The flowchart in the accompanying figure is a practical way to walk those choices.

5 Linear equating as a z-score identity

Linear equating is a mean and variance alignment. Under the assumption that the two forms are parallel in the sense required for equating, linear equating maps a raw score on Form X onto the Form Y scale by matching standardized scores.

Let:

  • \(X\) be the raw score on Form X, with mean \(\mu_X\) and standard deviation \(\sigma_X\) in the target group.

  • \(Y\) be the raw score on Form Y, with mean \(\mu_Y\) and standard deviation \(\sigma_Y\) in the same target group.

The expectation is that the standardized scores (z-scores) are equal for tests X and Y because the groups are assumed to be randomly equivalent, and the tests measure the same construct.

The core identity is:

\[ \frac{x - \mu_X}{\sigma_X} = \frac{y - \mu_Y}{\sigma_Y}. \]

Solving for \(y\) gives the linear equating function:

\[ \hat{y}(x) = \mu_Y + \frac{\sigma_Y}{\sigma_X}(x - \mu_X). \]

Interpretation:

  • The slope \(\sigma_Y/\sigma_X\) rescales the spread.

  • The intercept \(\mu_Y - (\sigma_Y/\sigma_X)\mu_X\) aligns the means.

This is often called mean/sigma equating. The table that accompanies this post summarizes the typical use case and key limitations (for example, shape mismatch and impossible scores outside the valid range).
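In practice the means and standard deviations are estimated from data. The sketch below shows one way to compute the slope and intercept from two score vectors. The function name `linear_equate_fit` and the score values are hypothetical, and the code assumes the two vectors come from randomly equivalent groups.

```r
# Hypothetical helper: estimate the slope and intercept of the linear
# equating function from scores collected in randomly equivalent groups
linear_equate_fit <- function(x_scores, y_scores) {
  slope <- sd(y_scores) / sd(x_scores)
  intercept <- mean(y_scores) - slope * mean(x_scores)
  list(slope = slope, intercept = intercept)
}

# Made-up raw scores, for illustration only
x_scores <- c(14, 17, 19, 20, 22, 23, 25)
y_scores <- c(10, 14, 17, 18, 20, 22, 25)

fit <- linear_equate_fit(x_scores, y_scores)

# Apply the fitted line to new Form X scores
x_new <- c(15, 20, 24)
y_equated <- fit$intercept + fit$slope * x_new
data.frame(x = x_new, y_equated = y_equated)
```

A quick sanity check on any fit of this kind: the Form X mean should map to the Form Y mean, and the equated Form X scores should have the same standard deviation as the Form Y scores.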

6 A numerical example

Assume you spiraled Forms X and Y within the same cohort and estimated these target-group summaries:

  • \(\mu_X = 20\), \(\sigma_X = 5\)

  • \(\mu_Y = 18\), \(\sigma_Y = 6\)

A participant scored \(x = 24\) on Form X. The linear equated score on the Y scale is:

\[ \hat{y}(24) = 18 + \frac{6}{5}(24 - 20) = 18 + 1.2 \cdot 4 = 22.8. \]

This illustrates two routine issues:

  • Non-integer results. If Y is discrete, you must decide how to report 22.8. Livingston discusses this as a practical problem of discreteness and rounding.

  • Out-of-range values. The equating line is not bounded by Form Y’s score range, so linear equating can produce values below Y’s minimum or above its maximum, for example when X has a wider range than Y.

6.1 Minimal R code

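# Target-group means and standard deviations for Forms X and Y
mu_x <- 20; sd_x <- 5
mu_y <- 18; sd_y <- 6

# Raw Form X scores to place on the Form Y scale
x <- c(12, 20, 24, 30)

# Linear (mean/sigma) equating: match z-scores across forms
y_hat <- mu_y + (sd_y / sd_x) * (x - mu_x)
data.frame(x = x, y_equated = y_hat)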
   x y_equated
1 12       8.4
2 20      18.0
3 24      22.8
4 30      30.0

If your reporting scale is discrete, you can add a rounding rule after you decide what “one point” means on that scale.
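One possible reporting rule is sketched below. It assumes, purely for illustration, that Form Y reports integers from 0 to 30. Truncating to the valid range and then rounding half up is one choice among several, not a recommendation drawn from the sources.

```r
mu_x <- 20; sd_x <- 5
mu_y <- 18; sd_y <- 6

# Assumed (hypothetical) reporting range for Form Y: integers 0 to 30
y_min <- 0; y_max <- 30

x <- c(2, 12, 24, 33)
y_hat <- mu_y + (sd_y / sd_x) * (x - mu_x)

# Truncate to the valid range, then round half up to the nearest integer
y_clamped <- pmin(pmax(y_hat, y_min), y_max)
y_reported <- floor(y_clamped + 0.5)

data.frame(x = x, y_equated = y_hat, y_reported = y_reported)
```

The sketch uses `floor(y + 0.5)` rather than `round()` because R's `round()` follows the IEC 60559 convention of rounding halves to the even digit, which may not be the rule you want to document. Note that truncation makes the reported mapping non-invertible at the boundaries, so the symmetry property holds only for the unrounded, untruncated function.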

7 The critical assumption: similar distribution shape

Linear equating matches means and standard deviations. It does not match higher moments. If the score distributions differ in shape, linear equating will be biased in ways that show up at the tails.

This is common in cognitive testing. Ceiling effects can create strong skewness in one form but not the other. In that setting, an equipercentile mapping is often the better first step.
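A rough diagnostic is to compare the skewness of the two score distributions before committing to a linear mapping. The sketch below is an assumption-laden illustration: the helper `skewness` is defined here because base R does not provide one, the scores are made up, and no threshold for "too different" comes from the sources.

```r
# Hypothetical helper: moment-based sample skewness (not built into base R)
skewness <- function(s) {
  d <- s - mean(s)
  mean(d^3) / mean(d^2)^1.5
}

# Made-up scores: Form X piles up at its ceiling of 30, Form Y does not
form_x <- c(18, 22, 25, 27, 28, 29, 30, 30, 30, 30)
form_y <- c(10, 13, 15, 17, 18, 19, 21, 23, 25, 27)

c(skew_x = skewness(form_x), skew_y = skewness(form_y))
```

Here Form X is strongly left-skewed while Form Y is close to symmetric, which is the kind of shape mismatch that a mean and standard deviation alignment cannot repair.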

8 Equipercentile equating when shapes differ

Equipercentile equating defines equivalence by matching percentile ranks. It aligns the entire distribution, not just its first two moments. The methods table summarizes this use case and highlights its tradeoffs, including sample size demands and the fact that you cannot equate beyond the observed range. A simple way to say it:

  • Find the percentile rank of score \(x\) in the Form X distribution.

  • Find the Form Y score \(y\) at that same percentile.

  • Map \(x \mapsto y\).

This definition fits Livingston’s general idea of “same relative position,” with “relative position” instantiated as a percentile rank.
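The three steps above can be sketched in a few lines of R. This is a crude illustration with made-up scores and a hypothetical helper name, `equipercentile`. It uses mid-percentile ranks and linear interpolation between observed Form Y scores, and it omits the presmoothing that operational equipercentile equating typically involves.

```r
# Made-up scores, assumed to come from randomly equivalent groups
form_x <- c(18, 22, 25, 27, 28, 29, 30, 30, 30, 30)
form_y <- c(10, 13, 15, 17, 18, 19, 21, 23, 25, 27)

# Hypothetical helper: crude equipercentile mapping, no smoothing
equipercentile <- function(x, x_scores, y_scores) {
  # Step 1: mid-percentile rank of x, P(X < x) + 0.5 * P(X = x)
  p <- vapply(
    x,
    function(v) mean(x_scores < v) + 0.5 * mean(x_scores == v),
    numeric(1)
  )
  # Step 2: Form Y score at the same percentile (type 7 interpolation)
  as.numeric(quantile(y_scores, probs = p, type = 7))
}

# Step 3: map x to y
x_new <- c(22, 27, 30)
y_equated <- equipercentile(x_new, form_x, form_y)
data.frame(x = x_new, y_equated = y_equated)
```

Because the mapping is built from observed quantiles, the equated values can never fall outside the observed Form Y range, which is the "cannot equate beyond the observed range" tradeoff in practice.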

I will discuss equipercentile equating in more detail in a later post.

9 Inducing equivalence in observational data

Many cohort comparisons are not randomized. If Form X was used in an older, less educated cohort and Form Y in a younger cohort, then the observed score differences reflect both form difficulty and population differences. Linear equating does not fix that by itself.

One approach is to induce covariate balance so that the groups look like they came from the same target population, at least with respect to measured variables.

Overlap weighting is one such method. Thomas and Pencina (2020) describe overlap weighting as assigning weights proportional to the probability of belonging to the opposite treatment group. Treated units get weight \(1 - \text{PS}\) and untreated units get weight \(\text{PS}\), which downweights observations with extreme propensity scores and emphasizes the region of overlap.

This matters for linking because it makes the “equating group” explicit. It is no longer the full sample. It is the overlap population where both forms plausibly could have been observed.

Two cautions:

  • Like any propensity score method, overlap weighting only adjusts for measured covariates included in the propensity model.

  • The interpretation is tied to a population “at equipoise,” which is precisely the overlap region.
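The sketch below shows how overlap weights could feed a linear equating, under stated assumptions. The data are simulated, the covariates (`age`, `educ`) and the helper `w_mean_sd` are hypothetical, and the propensity model is a plain logistic regression. Here "treatment" simply means having received Form X.

```r
set.seed(1)
n <- 400

# Simulated (made-up) cohort: older, less educated people more often got Form X
age    <- rnorm(n, mean = 75, sd = 6)
educ   <- rnorm(n, mean = 13, sd = 3)
form_x <- rbinom(n, 1, plogis(0.08 * (age - 75) - 0.15 * (educ - 13)))
score  <- round(ifelse(form_x == 1, 20, 18) - 0.2 * (age - 75) +
                  0.5 * (educ - 13) + rnorm(n, sd = 4))

# Propensity of receiving Form X given the measured covariates
ps <- fitted(glm(form_x ~ age + educ, family = binomial))

# Overlap weights: each person is weighted by the probability of the other form
w <- ifelse(form_x == 1, 1 - ps, ps)

# Hypothetical helper: weighted mean and standard deviation
w_mean_sd <- function(s, w) {
  m <- sum(w * s) / sum(w)
  c(mean = m, sd = sqrt(sum(w * (s - m)^2) / sum(w)))
}

sx <- w_mean_sd(score[form_x == 1], w[form_x == 1])
sy <- w_mean_sd(score[form_x == 0], w[form_x == 0])

# Linear equating defined in the overlap population
equate_x_to_y <- function(x) {
  sy[["mean"]] + (sy[["sd"]] / sx[["sd"]]) * (x - sx[["mean"]])
}
equate_x_to_y(c(15, 20, 25))
```

When the propensity model is a logistic regression, the overlap-weighted means of the covariates in the model should match across the two form groups, which gives a simple check that the weights were built correctly. The resulting conversion applies to the overlap population, not to the full sample.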

10 Where linear equating fits in the broader toolbox

When the groups are equivalent (by design, or after a defensible balance step) and the distributions are similarly shaped, linear equating is often the right starting point. It is transparent, easy to audit, and easy to explain.

When those conditions fail, you usually need either:

  • Equipercentile equating for shape mismatch, or

  • Anchor-test designs (NEAT) and methods like Tucker, Levine, or chained equating when groups are not equivalent and you need to use common-item performance rather than covariates alone.

I will cover Tucker, Levine, and chained equating, along with equipercentile equating, in later posts. The key point for this one is that linear equating is not a default. It is a method with a clear set of assumptions, and it works well when you can defend those assumptions.

11 References

Bandalos, D. L. (2018). Test equating. In Measurement theory and applications for the social sciences (pp. 547–584). Guilford Press.

Livingston, S. A. (2004). Equating test scores (without IRT). Educational Testing Service.

Thomas, L. E., & Pencina, M. J. (2020). Overlap weighting: A propensity score method that mimics attributes of a randomized clinical trial. JAMA, 323(23), 2417–2418.

Footnotes

  1. Bandalos (2018) discusses the difference between equating and linking in more detail. Equating is the stronger claim: scores are adjusted so the forms can be used interchangeably. Bandalos defines equating exactly that way and notes it is only appropriate when forms were built to the same content specifications and intended to be parallel; if content differs substantially, “there is no statistical adjustment” that can make the scores interchangeable. Linking is the weaker claim: a relationship between scores that is useful as an approximation, typically when tests aim at the same basic construct but differ in content and/or difficulty. Bandalos explicitly presents linking as an “equating-like” process meant to provide a rough approximation, and warns that the meaning depends on construct similarity.