Effect size

Mike Taylor May 31, 2026 Reading time ≈ 10 min

You tested two wordings of a question on a sample of 2,000 people. The p-value = 0.001 — highly significant. But the difference in mean scores is just 0.1 points out of 10.

Should you change the wording? The p-value says "yes, this is not random". The effect size says "but it is negligibly small". Without effect size, statistics describe whether a difference exists. With it, you learn how much it matters.

Definition

Effect size is a standardized measure of the magnitude of a difference or relationship between variables, independent of sample size. It shows the practical significance of a result: how strongly one group differs from another, or how pronounced the relationship between variables is. Unlike the p-value, which depends on sample size, effect size characterizes only the magnitude of the phenomenon itself. The most common measures are: Cohen's d (for the difference between means), Pearson's r (for correlations), and eta-squared (for ANOVA).

Why you need effect size if you already have p-value

The p-value and effect size answer different questions.

The p-value answers: "Could a difference this large have arisen by chance if there were none in reality?" With a large enough sample, the test will flag any difference as significant, however small, simply because there is enough data to register it. With tens of thousands of respondents in an A/B test, even a difference of 0.05 points on a 0-10 recommendation question can come out significant.

Effect size answers: "How large is this difference?" It does not grow with sample size: the same real difference yields, on average, the same effect size at n=50 and n=5,000. A larger sample only makes the estimate more precise. This makes effect size comparable across studies.

Four possible combinations:

Significant p + large effect → the difference is real and important
Significant p + small effect → the difference is real but practically insignificant
Non-significant p + large effect → the sample may be too small; worth repeating with a larger n
Non-significant p + small effect → there is most likely no difference

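The second row is the trap you fall into without effect size: huge samples turn negligible differences into statistically significant results. The third row is the opposite trap: a potentially real effect is dismissed because the sample was too small to register it.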
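To see how the p-value reacts to sample size while the effect size stays fixed, here is a minimal Python sketch. It assumes two equal-sized groups and uses a large-sample normal approximation instead of an exact t-test; the function name and the chosen values of n are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(d: float, n_per_group: int) -> float:
    """Approximate two-sided p-value for an observed standardized
    difference d between two equal-sized groups (z approximation)."""
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


d = 0.1  # the same small standardized difference in every scenario
p_values = {n: two_sided_p(d, n) for n in (50, 500, 5000)}

for n, p in p_values.items():
    print(f"n per group = {n:>5}: d = {d}, p = {p:.6f}")
```

The effect size is identical in all three runs; only the p-value changes, moving from clearly non-significant to highly significant as n grows.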

Cohen's d: effect size for comparing two means

Cohen's d is the most common effect size measure when comparing two groups. It is calculated as the difference between means divided by the pooled standard deviation:

d = (M1 - M2) / SD_pooled

The interpretation by Cohen (1988), which has become the standard:

d = 0.2 — small effect. The groups overlap by about 85%. In practice, almost imperceptible.
d = 0.5 — medium effect. Overlap ~67%. Noticeable on careful observation.
d = 0.8 — large effect. Overlap ~53%. Obvious to the naked eye.

Important: Cohen's thresholds are guidelines, not rigid rules. In medicine, an effect of d = 0.2 may be clinically significant. In a marketing A/B test, d = 0.5 may not justify the cost of a change. The context of the task matters more than abstract thresholds.
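The overlap figures quoted for d = 0.2, 0.5, and 0.8 can be reproduced with a short Python sketch. It assumes they are derived from Cohen's U1 non-overlap measure for two normal populations with equal variance; the function names are illustrative, and the labels simply restate Cohen's benchmarks.

```python
from statistics import NormalDist


def cohen_overlap(d: float) -> float:
    """Overlap of two equal-variance normal distributions, computed as
    1 - U1, where U1 is Cohen's non-overlap measure."""
    u2 = NormalDist().cdf(abs(d) / 2)
    u1 = (2 * u2 - 1) / u2
    return 1 - u1


def cohen_label(d: float) -> str:
    """Cohen's (1988) benchmarks: guidelines, not rigid rules."""
    size = abs(d)
    if size < 0.2:
        return "below small"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: {cohen_label(d)}, overlap = {cohen_overlap(d):.0%}")
```

Other definitions of overlap exist and give different percentages, so the numbers are only comparable when the same definition is used.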

Other effect size measures

Pearson's r — for correlation analysis and some non-parametric tests. Range from -1 to +1. Guidelines: |r| = 0.1 — small, 0.3 — medium, 0.5 — large effect.

Eta-squared (η²) — for ANOVA. The proportion of variability in the dependent variable explained by the factor. Guidelines: 0.01 — small, 0.06 — medium, 0.14 — large. Omega-squared (ω²) is a more precise version, less biased on small samples.
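As a rough illustration of how the two ANOVA measures relate, the following Python sketch computes eta-squared and omega-squared for a one-way design. The three segments and their scores are made-up data, and the function name is an assumption for this example.

```python
from statistics import mean


def anova_effect_sizes(groups):
    """Return (eta_squared, omega_squared) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    ss_within = ss_total - ss_between
    k, n = len(groups), len(all_values)
    ms_within = ss_within / (n - k)
    eta_sq = ss_between / ss_total
    omega_sq = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)
    return eta_sq, omega_sq


# Hypothetical scores for three audience segments
segments = [
    [6.0, 7.0, 8.0, 7.0],
    [7.0, 8.0, 9.0, 8.0],
    [8.0, 9.0, 10.0, 9.0],
]
eta_sq, omega_sq = anova_effect_sizes(segments)
print(f"eta-squared = {eta_sq:.3f}, omega-squared = {omega_sq:.3f}")
```

On a sample this small, omega-squared comes out noticeably lower than eta-squared, which reflects its correction for the upward bias of eta-squared.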

Odds Ratio and Risk Ratio — for categorical data and binary outcomes. Often used in medical and sociological research.
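For binary outcomes, both ratios come straight from a 2x2 table. The sketch below uses hypothetical counts (for example, respondents who completed a survey under two variants); the function and parameter names are illustrative.

```python
def risk_and_odds_ratio(events_a, n_a, events_b, n_b):
    """Return (risk_ratio, odds_ratio) of group B relative to group A."""
    risk_a = events_a / n_a
    risk_b = events_b / n_b
    odds_a = events_a / (n_a - events_a)
    odds_b = events_b / (n_b - events_b)
    return risk_b / risk_a, odds_b / odds_a


# Hypothetical counts: 40 of 200 in group A, 60 of 200 in group B
risk_ratio, odds_ratio = risk_and_odds_ratio(40, 200, 60, 200)
print(f"risk ratio = {risk_ratio:.2f}, odds ratio = {odds_ratio:.2f}")
```

The two measures agree closely only when the outcome is rare; with common outcomes the odds ratio is further from 1 than the risk ratio.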

Glass's delta — a variant of d for when group variances differ substantially: it is normalized only by the standard deviation of the control group, not the pooled one.
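A minimal sketch of the difference between the two normalizations, using made-up numbers in which the treatment group is far more variable than the control group. All names and values here are assumptions for illustration.

```python
from math import sqrt


def glass_delta(mean_treatment, mean_control, sd_control):
    """Mean difference scaled by the control group's SD only."""
    return (mean_treatment - mean_control) / sd_control


def cohen_d_equal_n(mean_treatment, mean_control, sd_treatment, sd_control):
    """Mean difference scaled by the pooled SD (equal group sizes)."""
    sd_pooled = sqrt((sd_treatment ** 2 + sd_control ** 2) / 2)
    return (mean_treatment - mean_control) / sd_pooled


# Hypothetical data: control SD = 1.0, treatment SD = 3.0
delta = glass_delta(7.5, 7.2, sd_control=1.0)
d = cohen_d_equal_n(7.5, 7.2, sd_treatment=3.0, sd_control=1.0)
print(f"Glass's delta = {delta:.3f}, Cohen's d = {d:.3f}")
```

When the variances are similar, the two values are close; they diverge as the variances drift apart.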

Example: effect size in an A/B test of a CTA wording

A company tests two variants of a call to action in a survey. It measures willingness to recommend (an NPS question, 0-10 scale).

Variant A (n=500): mean 7.2, SD 2.1
Variant B (n=500): mean 7.5, SD 2.0

T-test: t = 2.31, p = 0.021, statistically significant.

Cohen's d: (7.5 - 7.2) / 2.05 = 0.146, where 2.05 is the pooled SD. This is below even Cohen's benchmark for a small effect (0.2).

Conclusion: the difference is real (not random) but very small. A difference of 0.3 points on a ten-point scale is unlikely to change real business metrics. The decision to switch to variant B requires assessing the cost of the change: if the change is free, it can be implemented. If it requires significant resources, it is most likely not justified.
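The arithmetic of this example can be recomputed from the summary statistics alone. The Python sketch below is one way to do it, assuming an unpooled standard error and a normal approximation for the p-value (reasonable at n = 500 per group); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics from the example
n_a, mean_a, sd_a = 500, 7.2, 2.1
n_b, mean_b, sd_b = 500, 7.5, 2.0

diff = mean_b - mean_a

# Test statistic: difference divided by its standard error
se = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
t_stat = diff / se

# Two-sided p-value, normal approximation to the t distribution
p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))

# Cohen's d with the pooled SD for equal group sizes
sd_pooled = sqrt((sd_a ** 2 + sd_b ** 2) / 2)
d = diff / sd_pooled

print(f"t = {t_stat:.2f}, p = {p_approx:.3f}, d = {d:.3f}")
```

An exact t-test on the raw responses would give a marginally different p-value, but the conclusion about the size of the effect would not change.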

Effect size and sample size calculation

Effect size is a key input parameter when planning sample size. Before launching a study, you need to answer: what is the minimum effect that is practically important to you?

If you are interested only in a large effect (d ≥ 0.8), a small sample is enough. If you want to detect a small effect (d = 0.2), you need a sample roughly 16 times larger at the same statistical power, because the required sample size grows in proportion to 1/d².

Formally, this is tied to the concept of MDE (minimum detectable effect): you set a threshold of practical significance, and the sample size calculation determines how many people are needed to detect an effect of that magnitude at a given power (usually 80%).

The reverse situation — when data has already been collected, the test is non-significant, but the effect is moderate — indicates an insufficient sample. This is not "there is no result", it is "we did not have enough data to register it".
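The planning step can be sketched in a few lines of Python. This version assumes a two-sided test comparing two equal-sized groups and uses the common normal-approximation formula n = 2(z_alpha + z_power)² / d²; an exact t-based calculation gives slightly larger numbers. The function name and defaults are illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per group needed to detect a
    standardized effect d with a two-sided two-sample test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: about {n_per_group(d)} respondents per group")
```

Because d enters the formula squared, halving the effect you want to detect roughly quadruples the sample you need.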

Common mistakes when working with effect size

Ignoring effect size when the p-value is significant. This is one of the most common omissions in practice. A significant test + an uncalculated d = an incomplete analysis. Adding one line to the report ("Cohen's d = 0.18, small effect") is a small effort with great analytical value.

Applying Cohen's thresholds mechanically. "d = 0.2 is small, therefore unimportant" is an oversimplification. Context determines the interpretation. A small improvement in conversion across an audience of millions = millions of dollars. A small reduction in patients' pain = clinically significant. A small effect is not a synonym for an unimportant one.

Comparing effect sizes from studies that use different measures. d = 0.5 and r = 0.5 are not the same thing. There is a conversion formula between them, but you cannot compare them directly. In a meta-analysis, all effects are brought to a single metric.
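One widely used conversion between d and r, valid under the assumption of two equal-sized groups, is r = d / √(d² + 4) and its inverse d = 2r / √(1 - r²). A minimal Python sketch with illustrative function names:

```python
from math import sqrt


def d_to_r(d: float) -> float:
    """Convert Cohen's d to r, assuming equal group sizes."""
    return d / sqrt(d ** 2 + 4)


def r_to_d(r: float) -> float:
    """Convert r to Cohen's d, assuming equal group sizes."""
    return 2 * r / sqrt(1 - r ** 2)


print(f"d = 0.5 corresponds to r = {d_to_r(0.5):.3f}")
print(f"r = 0.5 corresponds to d = {r_to_d(0.5):.3f}")
```

Under this conversion, d = 0.5 maps to an r of about 0.24, while r = 0.5 maps to a d above 1, which shows how misleading a direct comparison of the raw numbers would be.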

Not reporting a confidence interval for the effect size. Like any sample-based estimate, d has a margin of error. A confidence interval of d = [0.12; 0.68] is far more informative than a point estimate of d = 0.4. With a small sample, the intervals are very wide — this is important information about the precision of the estimate.
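An approximate confidence interval for d can be computed from d and the group sizes alone. The sketch below uses a common large-sample approximation for the standard error of d, SE = √[(n1 + n2) / (n1·n2) + d² / (2(n1 + n2))], together with a normal critical value; exact intervals based on the noncentral t distribution differ slightly. The function name and the sample sizes are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def d_confidence_interval(d, n1, n2, level=0.95):
    """Approximate confidence interval for Cohen's d."""
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se


# The same point estimate at two hypothetical sample sizes
small = d_confidence_interval(0.4, 30, 30)
large = d_confidence_interval(0.4, 500, 500)
print(f"n = 30 per group:  [{small[0]:.2f}; {small[1]:.2f}]")
print(f"n = 500 per group: [{large[0]:.2f}; {large[1]:.2f}]")
```

With 30 respondents per group the interval includes zero; with 500 per group it is narrow enough to rule out both a zero effect and a large one.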

Effect size in survey research

In survey research, effect size is especially important when comparing scores between audience segments, analyzing changes in metrics over time, and A/B testing questions or formats. Statistical significance without effect size is an incomplete picture, especially when the sample is large.

For a quick significance check and calculation of basic effect parameters, use the A/B test significance calculator from SurveyNinja. It calculates the p-value and helps assess whether the sample is large enough to detect the desired effect.

Effect size translates statistics into practical meaning. The p-value answers the question "is this random?". Effect size answers "does this matter?". A full analysis requires both: significance without effect size is like knowing that a difference exists but not knowing how large it is.

Frequently asked questions

How does Cohen's d differ from the difference between means?

The difference between means depends on the measurement scale. A 2-point difference on a 1-10 scale and a 2-point difference on a 1-100 scale are completely different in magnitude. Cohen's d standardizes the difference by the standard deviation, making it comparable across different scales and studies.

What effect size is considered sufficient for making a decision?

It depends on the context: the cost of the change, the potential benefit, and the baseline level of the metric. There is no universal threshold. For quick, free changes, even a small effect (d = 0.2) may justify implementation. For costly ones, a moderate or large effect is required. Define a practically significant threshold before launching the study, not after.

Do you need to calculate effect size when the p-value is non-significant?

Yes, and it is especially important. A non-significant p + a large effect size is a signal that the sample is small. A non-significant p + a small effect size is evidence that there really is no difference, or that it is negligible. Without effect size, you cannot distinguish these two fundamentally different cases.

How do you calculate Cohen's d manually?

Subtract one mean from the other and divide by the pooled standard deviation. SD_pooled = √[(SD1² + SD2²) / 2] for equal samples. For unequal samples, a weighted average of the variances: √[((n1-1)·SD1² + (n2-1)·SD2²) / (n1+n2-2)]. The sign of d shows the direction of the effect; its absolute value shows the magnitude.
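The two pooling formulas above translate directly into Python. This is a minimal sketch with illustrative function names and hypothetical group sizes for the unequal case.

```python
from math import sqrt


def pooled_sd_equal(sd1, sd2):
    """Pooled SD for equal group sizes."""
    return sqrt((sd1 ** 2 + sd2 ** 2) / 2)


def pooled_sd_weighted(sd1, n1, sd2, n2):
    """Pooled SD for unequal group sizes (weighted by n - 1)."""
    return sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def cohen_d(m1, m2, sd_pooled):
    """Sign gives the direction, absolute value gives the magnitude."""
    return (m1 - m2) / sd_pooled


# Hypothetical unequal groups: 300 vs 100 respondents
sd_w = pooled_sd_weighted(2.0, 300, 2.1, 100)
print(f"weighted pooled SD = {sd_w:.3f}, d = {cohen_d(7.5, 7.2, sd_w):.3f}")
```

With equal group sizes the weighted formula reduces to the simple one, so the weighted version can be used in every case.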

Can you compare effect sizes across different studies?

Yes, this is one of the main advantages of standardized measures. That is exactly why effect size is used in meta-analyses: the results of dozens of studies are brought to a single scale and aggregated. The key condition is to use the same measure or to convert correctly between them.

Published: May 31, 2026
