Statistical significance

Picture this situation: last month customer satisfaction was 78%, this month it is 81%. The manager is delighted: "We grew by 3 percentage points!" The analyst cautiously adds: "The difference is within the margin of error, so we cannot be sure this is a real improvement rather than random fluctuation." The question arises: where is the line between noise and genuine change?

Statistical significance answers that question. The concept helps you tell apart changes that could have arisen simply because of sampling randomness from those that, with high probability, reflect real shifts in your audience. Without an understanding of significance it is easy to mistake statistical noise for success.

What is statistical significance in plain language

Statistical significance is an indication that an observed difference or relationship in the data would be unlikely to arise by chance alone under the given conditions of an experiment or survey. Formally it is assessed through the p-value: the smaller it is, the less compatible the data are with the assumption that there is no real effect.

Intuitively you can think of it this way: if you repeated a study a thousand times under the same conditions and there were in fact no real effect, a result as strong as the one you observed would turn up only in a few lucky runs. Significance does not tell you that you are "definitely right", but it shows how hard it is to explain your result by chance alone.

What p-value is and how to read it

In reports on A/B tests, marketing research and analytical articles you often see notation like "p < 0.05". The idea behind it: if there really were no difference between the variants, a result at least as extreme as the observed one would occur less than 5% of the time.

Low p-value (for example, 0.01). The observed effect is poorly compatible with the hypothesis that "there is no difference at all": if that hypothesis were true, a result this extreme would turn up only about 1% of the time.

High p-value (for example, 0.3). The data fit the hypothesis that "nothing has changed" quite well, and there are no statistical grounds to claim a significant difference. This does not mean the effect is definitely absent — only that it did not show up convincingly in the current sample.
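The reading of the p-value described above can be checked with a small simulation. The sketch below assumes two groups of 400 respondents who share the same true satisfaction rate of 80% (all of these numbers are hypothetical) and counts how often pure sampling randomness still produces a gap of 6 points or more between them.

```python
import random

def simulate_null_gaps(n_per_group=400, true_rate=0.80, min_gap_count=24,
                       runs=2000, seed=1):
    """Share of simulated surveys in which two groups with the SAME true rate
    differ by at least `min_gap_count` satisfied respondents."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    extreme = 0
    for _ in range(runs):
        # Draw each group independently from the same underlying rate.
        a = sum(rng.random() < true_rate for _ in range(n_per_group))
        b = sum(rng.random() < true_rate for _ in range(n_per_group))
        if abs(a - b) >= min_gap_count:  # 24 of 400 respondents = 6 points
            extreme += 1
    return extreme / runs

share = simulate_null_gaps()
print(f"Share of no-difference runs with a gap of 6+ points: {share:.3f}")
```

The printed share plays the role of an approximate p-value for a 6-point gap: it is the frequency of results at least that extreme in a world where nothing has actually changed.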

For more on how to interpret the results of comparative analyses and avoid mistaking real differences for noise, you can read about the sources of random and systematic distortions and how they show up in survey data.

Significance and effect size are not the same thing

It is important to distinguish "is it significant" from "how large is the difference". With a very large sample size, even a tiny difference (for example, 0.5 of a percentage point) can turn out to be statistically significant — simply because you have a lot of data. But from a practical standpoint such a shift may be inconsequential.

Conversely, in small samples you can observe noticeable differences (for example, a 10-point gap in satisfaction) that do not pass the significance test — there is not enough data to confidently distinguish the effect from random fluctuation.
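To see how sample size drives significance independently of effect size, the following sketch estimates roughly how many respondents per group are needed before an observed gap of a given size crosses the p < 0.05 line. It assumes a pooled two-proportion z-test with equal group sizes; the function name and the example rates are illustrative, and this is not a full power calculation.

```python
import math
from statistics import NormalDist

def n_for_significance(gap, pooled_rate, alpha=0.05):
    """Approximate per-group sample size at which an observed gap between two
    proportions reaches the two-sided `alpha` threshold (pooled z-test)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha=0.05
    variance_term = 2 * pooled_rate * (1 - pooled_rate)
    return math.ceil(z_crit ** 2 * variance_term / gap ** 2)

# A half-point gap around an 80% satisfaction rate (hypothetical numbers).
tiny_gap_n = n_for_significance(gap=0.005, pooled_rate=0.80)
# A 10-point gap around a 75% satisfaction rate (hypothetical numbers).
large_gap_n = n_for_significance(gap=0.10, pooled_rate=0.75)

print(f"0.5-point gap needs about {tiny_gap_n} respondents per group")
print(f"10-point gap needs about {large_gap_n} respondents per group")
```

Under these assumptions a half-point gap becomes "significant" only with tens of thousands of respondents per group, while a 10-point gap needs on the order of 150 per group, which is why it can fail the test in a segment of a few dozen people.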

That is why, when working with survey results, it is useful to look at both the p-value and the effect size: how large the difference is in percentage terms, how important it is for the business, and how it relates to the margin of error and the goals of the study. The relationship between the scale of an effect and its practical importance is discussed in more detail in the article "Benchmarking".

How to test significance in surveys

Comparing proportions. For questions with answer options ("yes/no", "satisfied/not satisfied") you usually apply tests for comparing proportions — the z-test or chi-square. The SurveyNinja glossary has the term Z-Test, which describes one of these approaches in detail.

Comparing means. For scale questions (a rating from 1 to 10, a satisfaction index) you use tests for means. In practice this means you do not simply compare an average score of 7.2 against 7.8, but check how compatible such a difference is with the idea that the metrics are "equal".
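As an illustration of what "checking compatibility" means for averages, here is a minimal sketch of a large-sample z-test for the difference between two means. The standard deviations and group sizes are assumptions made up for the example; only the scores 7.2 and 7.8 come from the text. For small groups a t-test from a statistical package is the usual choice, since the normal approximation used here becomes unreliable.

```python
import math
from statistics import NormalDist

def mean_difference_z_test(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Large-sample z-test for the difference between two independent means.
    Returns the z statistic and the two-sided p-value."""
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)  # standard error of the gap
    z = (mean_b - mean_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Assumed: standard deviation of 2.0 points in both groups (hypothetical).
z_large, p_large = mean_difference_z_test(7.2, 2.0, 300, 7.8, 2.0, 300)
z_small, p_small = mean_difference_z_test(7.2, 2.0, 50, 7.8, 2.0, 50)

print(f"300 per group: z = {z_large:.2f}, p = {p_large:.4f}")
print(f" 50 per group: z = {z_small:.2f}, p = {p_small:.4f}")
```

With the same 0.6-point gap, the result clears the 0.05 threshold at 300 respondents per group but not at 50, given the assumed spread of answers.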

Multifactor comparisons. When you account for several factors at once (region, customer segment, product type), more sophisticated methods come to the rescue: regression analysis, factor analysis, multivariate models. An introduction to such approaches can be found in the articles "Quantitative research" and "Cluster analysis", which show how to work with multivariate data and group respondents.

Example: significance of the difference between two pricing plans

Suppose you are comparing customer satisfaction across two pricing plans: basic and premium. The survey was taken by 400 customers in each group. In the basic plan 76% of respondents are satisfied, in the premium plan — 82%. On a chart this looks like a tangible difference, and you are tempted to declare a winner.

To understand how robust this difference is, you compute confidence intervals for each proportion and run a test for comparing proportions (for example, the Z-Test). If the calculations show that the p-value is below 0.05, you can say that a difference this large would be unlikely to appear "purely by chance" if the plans did not really differ, and the effect is statistically significant. If, however, the p-value is high, it is wiser to treat the result with caution and view it as a hypothesis for further testing rather than as a final conclusion.
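The calculation for this example can be sketched in a few lines. The code below uses simple normal-approximation (Wald) confidence intervals and a pooled two-proportion z-test; the function names are illustrative, and the counts 304 and 328 are just 76% and 82% of 400.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a single proportion."""
    p = successes / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled z-test for the difference between two proportions.
    Returns the z statistic and the two-sided p-value."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

basic_ci = proportion_ci(304, 400)    # basic plan: 76% satisfied
premium_ci = proportion_ci(328, 400)  # premium plan: 82% satisfied
z, p_value = two_proportion_z_test(304, 400, 328, 400)

print(f"Basic 95% CI:   {basic_ci[0]:.3f} to {basic_ci[1]:.3f}")
print(f"Premium 95% CI: {premium_ci[0]:.3f} to {premium_ci[1]:.3f}")
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Under these assumptions the test gives z of about 2.08 and a p-value of about 0.037, so the six-point gap passes the 0.05 threshold, though not by a wide margin. Note that the two 95% intervals overlap slightly; overlapping intervals on their own do not show that a difference is non-significant, which is why the direct test is worth running.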

Detailed examples of how to compare the results of different groups and periods without confusing statistical and practical differences are given in the article "Benchmarking".

Where significance matters most

Comparing periods in tracking studies. In regular research (for example, a quarterly NPS or eNPS) it is easy to see small swings in the metrics and mistake them for a trend. Testing significance helps you separate real changes from random ones, especially when you are looking at many segments and metrics at once.

A/B testing of surveys and communications. When comparing different question wordings, invitations to participate or survey scenarios, it is important to distinguish "this variant looks slightly better on the chart" from "this variant statistically and consistently wins". Without a significance test you may choose a strategy that is not actually better than the alternatives.

Customer and HR decisions with a high cost of error. When survey results influence changes to a product, a service or a company policy, testing significance adds confidence: you are relying not only on individual cases and intuition, but also on a formal assessment of how robust the effect is.

How it looks in reports and in SurveyNinja

Even if you do not run full statistical tests, it is useful to build into your reports elements that help readers avoid overrating small differences. The article "Open-ended and closed-ended questions" shows examples of how to present comparisons and discuss results in a way that does not mislead and does not give random fluctuations the status of a "fact".

In SurveyNinja, summary reports and cross-tabulations let you quickly see the differences between groups and, if necessary, export the data for subsequent testing in statistical packages. The help section on working with reports ("Viewing the survey report") describes how to build basic analytics and comparisons without delving into formulas.

Common mistakes when working with significance

Hunting for a "pretty" p-value. If you repeatedly cycle through segments, wordings and slices until you find a "significant" difference, you run a high risk of stumbling onto a false positive. The more tests you run, the higher the probability of getting p < 0.05 in at least one of them by chance alone; this is known as the multiple comparisons problem.
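How quickly the risk of a false positive grows with the number of tests can be sketched as follows. The calculation assumes the tests are independent and that no real effects exist, which is a simplification; the Bonferroni correction shown alongside is one common, conservative remedy rather than the only one.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability of at least one p < alpha among `num_tests` independent
    tests when there is no real effect in any of them."""
    return 1 - (1 - alpha) ** num_tests

def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test threshold that keeps the overall false-positive risk near alpha."""
    return alpha / num_tests

for k in (1, 5, 20):
    risk = chance_of_false_positive(k)
    print(f"{k:>2} tests: risk {risk:.2f}, "
          f"Bonferroni threshold {bonferroni_threshold(k):.4f}")
```

With 20 slices of the same survey, the chance of at least one spurious "significant" result is already about 64% under these assumptions.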

Substituting correlation for causation. A statistically significant relationship between two metrics does not yet mean that one causes the other. For example, a rise in satisfaction may coincide with the launch of a new feature and with a change in external conditions — and the test by itself will not tell you which of the causes matters more.

Ignoring the sizes of subsamples. In small segments even a significant difference can be unstable. If there are only a few dozen respondents inside a group, one or two atypical answers can shift the result substantially — and a formal significance test does not always reflect the real reliability of the conclusion here.
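A quick way to gauge how fragile a small segment is: compare its margin of error, and the shift caused by a single respondent, with those of a larger group. The sketch below uses the simple normal-approximation margin of error, which is itself rough for very small groups; the 80% satisfaction rate and the group sizes are hypothetical.

```python
import math
from statistics import NormalDist

def margin_of_error(rate, n, confidence=0.95):
    """Normal-approximation margin of error for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(rate * (1 - rate) / n)

for n in (30, 100, 400):
    moe = margin_of_error(0.80, n)
    one_answer_shift = 1 / n  # how far one respondent moves the percentage
    print(f"n = {n:>3}: margin of error {moe * 100:.1f} points, "
          f"one answer shifts the result by {one_answer_shift * 100:.1f} points")
```

At 30 respondents the margin of error is around 14 points and each answer moves the result by more than 3 points, compared with roughly 4 points and a quarter of a point at 400 respondents.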

Practical recommendations

Do not declare every shift an "improvement". Before celebrating a rise in a metric, check how large it is relative to the margin of error and how stable it is over time. Sometimes it is better to wait for another wave of data than to make decisions based on a random spike.

Look at the whole picture. When assessing significance, take into account not only the p-value but also the effect size, the context and the practical importance. A small but stable shift in a business-critical metric can matter more than a large but one-off spike in a secondary indicator.

Be honest in your wording. Instead of a categorical "this variant is better", use more careful phrasing: "we observe a statistically significant improvement" or "the differences are within the statistical margin of error". This helps the team develop a more mature attitude to data.

Statistical significance is not a magic stamp of "true" or "false", but a way to assess how robust an observed effect is to random fluctuation. The better you understand its limitations and combine it with practical common sense, the more reliable the decisions you make based on surveys and analytics.
