Test-retest reliability

Mike Taylor May 31, 2026 Reading time ≈ 9 min

You measured employee engagement — 72 points. A week later you ran the same survey on the same people — 65.

What happened? Maybe engagement really dropped. Or maybe the instrument is simply unstable and produces a random result every time. Test-retest reliability answers exactly this question: when the same people take the same survey twice, how similar are the results? It is the basic check of whether your instrument measures anything stable at all.

Definition

Test-retest reliability is the ability of a measurement instrument to produce similar results when the same trait is measured again in the same people after some interval of time, provided the trait itself has not changed. It is assessed through the correlation between the first and second measurement. A high value means the instrument measures a stable characteristic rather than random noise.

Why a test-retest check matters

Reliability is a necessary condition for validity. If an instrument produces unstable results, it cannot be valid — even if, in theory, it measures the right characteristic. There are at least three reasons to run the check:

Evaluating the instrument itself. A new questionnaire, index or scale must be checked for stability before mass use. If the results are unstable, you cannot base decisions on them.

Choosing between instruments. If you have several alternative scales for measuring the same construct, test-retest is one of the selection criteria. A stable questionnaire is more dependable than a temperamental one.

Interpreting changes over time. When you compare two survey waves and see "NPS down 5 points", you need to understand whether this might be a real change or simply falls within the instrument's instability.

The test-retest procedure

Steps to follow:

1. Select a sample. A minimum of 30-50 people is recommended, ideally 100+. These should be real representatives of the survey's target audience, not abstract "volunteers".

2. Run the first measurement. A standard survey procedure — respondents complete the questionnaire under normal conditions.

3. Wait out the interval. The optimal one is 2 to 4 weeks. Too short (a day or two) — respondents remember their answers and reproduce them from memory rather than truly answering again. Too long (several months) — the measured characteristic may genuinely change.

4. Run the second measurement. The same respondents, the same questionnaire, the same distribution conditions. The conditions must be identical: you cannot run an online survey first and a phone survey second.

5. Compute the correlation between the paired answers (each respondent's first and second measurement). For continuous scales, use the Pearson coefficient; for ordinal ones, the Spearman coefficient; for nominal ones, Cohen's kappa. When the scores themselves must agree, not just move together, use the intraclass correlation coefficient (ICC), which also reflects a systematic shift between the two measurements.
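Step 5 can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a prescribed implementation: the function names `pearson_r` and `icc_agreement` and the scores are invented for the demonstration, and the ICC variant shown (two-way, absolute agreement, single measurement) is one common choice assumed here.

```python
import math

def pearson_r(first, second):
    """Pearson correlation between paired scores."""
    n = len(first)
    mean_1 = sum(first) / n
    mean_2 = sum(second) / n
    cov = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    var_1 = sum((a - mean_1) ** 2 for a in first)
    var_2 = sum((b - mean_2) ** 2 for b in second)
    return cov / math.sqrt(var_1 * var_2)

def icc_agreement(first, second):
    """Two-way, absolute-agreement, single-measurement ICC for two waves."""
    n, k = len(first), 2
    rows = list(zip(first, second))
    grand = (sum(first) + sum(second)) / (n * k)
    row_means = [sum(row) / k for row in rows]
    col_means = [sum(first) / n, sum(second) / n]
    # Sums of squares of a two-way layout: respondents x measurement waves
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in rows for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Invented scores: every respondent scores exactly 7 points higher in wave 2.
wave_1 = [60, 65, 70, 75, 80]
wave_2 = [67, 72, 77, 82, 87]

r = pearson_r(wave_1, wave_2)
icc = icc_agreement(wave_1, wave_2)
print(f"Pearson r = {r:.2f}, ICC = {icc:.2f}")
```

On this toy data Pearson stays at 1.00 because the ordering and spacing of respondents are preserved, while the ICC drops to about 0.72 because every score shifted. That difference is the reason to look at ICC when the absolute level of the scores matters.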

Interpreting the coefficients

Thresholds for test-retest reliability:

r ≥ 0.9 — excellent reliability. The standard for clinical instruments and high-stakes decisions.
0.8 ≤ r < 0.9 — good. Suitable for most applied surveys.
0.7 ≤ r < 0.8 — acceptable. The minimum for serious use.
r < 0.7 — low. The instrument needs refining or should not be used for decision-making.

The thresholds are guidelines. For long validated questionnaires (MBI, Big Five) 0.8+ is expected. For a short 3-question pulse survey a reliability of 0.7 may be acceptable.
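If the check is automated, the bands above can be encoded as a small helper. This is only a sketch: the function name `interpret_retest` is hypothetical, and the cutoffs are the guideline values from this article, not fixed rules.

```python
def interpret_retest(r):
    """Map a test-retest coefficient to the guideline bands listed above."""
    if r >= 0.9:
        return "excellent"
    if r >= 0.8:
        return "good"
    if r >= 0.7:
        return "acceptable"
    return "low"

# Illustrative values only
for value in (0.93, 0.84, 0.72, 0.52):
    print(value, interpret_retest(value))
```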

Example: checking a satisfaction scale

An HR team developed an 8-question job satisfaction scale. Before rolling it into the quarterly survey, they decided to check its stability.

Sample: 60 employees. First measurement on Monday. Repeat — 3 weeks later. Results:

Pearson correlation between total scores: r = 0.84
By individual questions: from 0.52 to 0.91
The question "my office is a comfortable place to work": r = 0.52 — unstable

Conclusion: the overall scale is reliable (0.84 — good), but one question is unstable. The decision: reword the problematic question or replace it. After refinement — another check on a new sample.
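An item-level check like the one in this example might look as follows. The data below is invented for illustration and is not the HR team's data; the names `unstable_items`, `q1` and `q2` are hypothetical, and the 0.7 cutoff is the guideline threshold from the previous section.

```python
import math

def pearson_r(first, second):
    """Pearson correlation between paired scores."""
    n = len(first)
    mean_1 = sum(first) / n
    mean_2 = sum(second) / n
    cov = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    var_1 = sum((a - mean_1) ** 2 for a in first)
    var_2 = sum((b - mean_2) ** 2 for b in second)
    return cov / math.sqrt(var_1 * var_2)

def unstable_items(answers, threshold=0.7):
    """Return (item, r) pairs whose test-retest r falls below the threshold."""
    flagged = []
    for item, (wave_1, wave_2) in answers.items():
        r = pearson_r(wave_1, wave_2)
        if r < threshold:
            flagged.append((item, round(r, 2)))
    return flagged

# Invented answers of six respondents on a 1-5 scale: (wave 1, wave 2)
answers = {
    "q1": ([4, 5, 3, 2, 4, 5], [4, 5, 3, 2, 5, 5]),
    "q2": ([3, 4, 2, 5, 4, 3], [4, 2, 3, 4, 5, 2]),
}

flagged = unstable_items(answers)
print(flagged)
```

With this toy data only `q2` is flagged. A real check would use the full sample, since six respondents are far too few for a trustworthy estimate.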

What can distort test-retest reliability

Real changes in the measured characteristic. In the 3 weeks between measurements something may have happened: a reorganization, a new project, a change of manager. In this case a low correlation does not point to a bad instrument — it reflects real dynamics. Take into account the context between measurements.

A learning or memory effect. Respondents remember their previous answers and reproduce them automatically. This artificially inflates reliability. The opposite extreme: respondents try to answer "differently" so as not to repeat themselves — this deflates the correlation. Both effects are softened by a 2-4 week interval.

Inattention or fatigue. If a respondent took the first survey thoughtfully and the second one "just to be done with it", the results will diverge. The control: assess the completion time, exclude speeders and inattentive respondents.

Unstable measurement conditions. The first measurement in the morning, the second in the evening on a Friday. The first in a calm setting, the second on the run. Conditions should be comparable.

Wording that is too general or abstract. Questions like "how satisfied are you with life overall?" yield less stable answers than concrete behavioral indicators. General self-assessments fluctuate more easily with mood.
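The completion-time control mentioned above (excluding speeders) can be sketched like this. The cutoff of one third of the median time is an assumption made for the example, not a standard; the function name `drop_speeders` and the respondent IDs are hypothetical.

```python
import statistics

def drop_speeders(completion_seconds, min_share_of_median=0.33):
    """Keep respondents whose completion time is at least a share of the median.

    The default share (0.33) is an illustrative assumption, not a standard.
    """
    median = statistics.median(completion_seconds.values())
    cutoff = median * min_share_of_median
    return {rid: t for rid, t in completion_seconds.items() if t >= cutoff}

# Invented completion times in seconds for the second measurement
times = {"r1": 310, "r2": 295, "r3": 60, "r4": 340, "r5": 280}

kept = drop_speeders(times)
print(sorted(kept))
```

Respondents removed from one wave should be removed from the other as well, so that the correlation is still computed on complete pairs.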

Test-retest vs other types of reliability

Test-retest is one of several types of reliability. The full picture includes:

Test-retest reliability — stability over time
Internal consistency (Cronbach's alpha) — consistency between the items of a single scale
Inter-rater reliability — consistency of ratings from different raters
Parallel forms reliability — consistency between two equivalent versions of a questionnaire

These types of reliability check different aspects. A high alpha does not guarantee a high test-retest (a scale can be consistent yet unstable over time) and vice versa. To validate an instrument it is advisable to check several types.
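For comparison with test-retest, here is a minimal sketch of internal consistency. It uses the usual Cronbach's alpha formula, k / (k - 1) * (1 - sum of item variances / variance of total scores), on invented answers; the function name `cronbach_alpha` is hypothetical. Note that it needs only one wave of data, which is why it says nothing about stability over time.

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha.

    items: one list of scores per question, respondents in the same order.
    """
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(statistics.variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / statistics.variance(totals))

# Invented answers of six respondents to three questions (1-5 scale)
items = [
    [4, 5, 3, 2, 4, 5],
    [4, 4, 3, 2, 5, 5],
    [5, 5, 2, 2, 4, 4],
]

alpha = cronbach_alpha(items)
print(round(alpha, 2))
```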

Common mistakes during the check

Too short an interval. Running the second measurement after 2 days yields a correlation artificially inflated by memory. For most scales, wait at least 2 weeks.

Different measurement conditions. The first time as part of the general company survey, the second only as a "test". Different context, motivation, attention. Conditions should be as identical as possible.

Too small a sample. A correlation on 15 people has a wide confidence interval — the figure could be 0.5 as easily as 0.9. For an accurate estimate — a minimum of 30-50, better 100+.

Confusing it with real changes. If something significant happened between measurements (a change in the company, external events), a low correlation may reflect real dynamics rather than a problem with the instrument. Document the context.
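The sample-size point above can be made concrete with an approximate confidence interval for a correlation. The sketch below uses the Fisher z-transformation, a standard approximation that assumes roughly bivariate normal data; the function name `pearson_ci` and the example value r = 0.75 are chosen for illustration.

```python
import math

def pearson_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson r (Fisher z method)."""
    z = math.atanh(r)            # transform r to the z scale
    se = 1 / math.sqrt(n - 3)    # standard error on the z scale
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper

# The same observed correlation at different sample sizes
for n in (15, 50, 100):
    lower, upper = pearson_ci(0.75, n)
    print(f"n = {n}: {lower:.2f} to {upper:.2f}")
```

For an observed r of 0.75 on 15 people the interval runs from roughly 0.39 to 0.91, which spans everything from "low" to "excellent". With 100 people it narrows to roughly 0.65 to 0.82.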

Test-retest in survey practice

For applied tasks a full test-retest check is run once, during the piloting of a new questionnaire. After that the instrument is used without re-checking — its reliability is assumed to have been established.

The exceptions are a substantial change of population (a new country, a new industry) or the translation of a questionnaire into another language. In these cases reliability needs to be checked anew: what worked on American students may not work on workers in another country.

When planning a questionnaire in SurveyNinja: if you are developing a new scale, be sure to include a reliability check on a small pilot sample. A pilot study lets you check test-retest, internal consistency and the clarity of the wording at the same time. This is especially important for Likert scales and indices, which most often turn out to be unstable without a check.

Test-retest reliability is the check of whether your instrument measures something stable rather than random noise. The procedure: repeat the survey on the same people after 2-4 weeks, compute the correlation. Above 0.7 — acceptable, above 0.8 — good. Without this check, any comparisons of survey waves are risky: the changes may turn out to be an artifact of the instrument.

Frequently asked questions

What interval between measurements should I choose?

2-4 weeks is optimal. Shorter — the memory effect inflates the correlation. Longer — the measured characteristic may genuinely change. For dynamic characteristics (mood, fatigue) the interval should be shorter; for stable ones (personality traits) it can be longer, up to 2-3 months.

Can I run the second measurement on a different sample?

No. That is no longer test-retest but a check of consistency between different samples (parallel samples). Classic test-retest requires the same people in both measurements; otherwise there are no paired answers to correlate.

What should I do if test-retest reliability is low?

Analyze it: is it a problem of specific questions or of the whole scale? If of individual questions — reword or replace them. If of the whole scale — perhaps the instrument measures too variable a characteristic (the mood of the day rather than a stable trait). Also check whether something in the context changed between measurements.

Which coefficient should I use for different data types?

For continuous numeric scales (points, ratings), use the Pearson coefficient or ICC. For ordinal ones (ranks, ordered categories), use Spearman. For nominal ones (categories without order, e.g. "the chosen answer option"), use Cohen's kappa or percent agreement. For numeric scores ICC is the most versatile choice, because unlike Pearson it also penalizes a systematic shift between the two measurements.
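For the ordinal and nominal cases, a minimal standard-library sketch could look like this. All function names and data are invented for illustration; in practice an established statistics library would normally be used instead.

```python
import math
from collections import Counter

def pearson_r(x, y):
    """Pearson correlation between paired values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    )

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def percent_agreement(first, second):
    """Share of respondents who gave the same answer in both waves."""
    return sum(a == b for a, b in zip(first, second)) / len(first)

def cohen_kappa(first, second):
    """Agreement between two waves, corrected for chance agreement."""
    n = len(first)
    observed = percent_agreement(first, second)
    counts_1, counts_2 = Counter(first), Counter(second)
    expected = sum(counts_1[c] * counts_2[c] for c in counts_1) / (n * n)
    if expected == 1:  # both waves consist of one identical category
        return 1.0
    return (observed - expected) / (1 - expected)

# Invented ordinal answers (1-5) and nominal answers from two waves
rho = spearman_rho([1, 2, 3, 4, 5], [1, 3, 2, 4, 5])
nominal_1 = ["email", "phone", "email", "phone"]
nominal_2 = ["email", "phone", "phone", "phone"]
agreement = percent_agreement(nominal_1, nominal_2)
kappa = cohen_kappa(nominal_1, nominal_2)
print(rho, agreement, kappa)
```

In the nominal example, 75% of answers match, but kappa is only 0.5 because half of that agreement would be expected by chance given how often each option was chosen.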

Do I need to check test-retest for known validated scales?

For standard validated scales (NPS, CSAT, MBI) in their original form and on a comparable audience — usually not. But with substantial adaptation (translation, change of wording, a new cultural environment) the check is worth running, even if the base scale is well known. Validation does not transfer automatically between contexts.

Published: May 31, 2026
