Hypothesis testing
May 31, 2026 Reading time ≈ 10 min
You changed the wording of a survey question - and the response rate climbed from 23% to 27%. Is that a real improvement, or just luck of the sample?
You rephrased a button on a landing page - and conversion rose by 2 percentage points. Signal or noise? Hypothesis testing is the statistical tool that lets you answer this question not with a vague "seems like it," but with a concrete level of confidence.
Definition
Hypothesis testing is a statistical procedure for deciding, based on sample data, whether to reject an assumption about a population. The procedure formalizes the question "is this random or not?" through a null hypothesis (H0), an alternative hypothesis (H1), a significance level, and a p-value. If the p-value falls below the chosen significance threshold, the null hypothesis is rejected; otherwise you fail to reject it, which is not the same as proving it true.
Null and alternative hypotheses
Every test begins by stating two hypotheses. The null hypothesis (H0) is the assumption that there is no effect or difference. "The new question wording does not affect the response rate," "There is no difference in satisfaction between the groups." H0 is what the statistics try to disprove.
The alternative hypothesis (H1) is the assumption that an effect exists. "The new wording increases the response rate," "Group A is more satisfied than group B." H1 is accepted if the data provide enough evidence against H0.
An important nuance: statistics never "prove" H1. They only show how unlikely the observed data are, assuming H0 is true. The smaller that probability, the stronger the grounds to reject H0.
P-value and significance level
The p-value is the probability of obtaining a result at least as extreme as the one observed, if the null hypothesis is true. If p = 0.03, it means: if H0 were true (no effect at all), such a result or a more extreme one would occur in only 3% of cases.
The significance level (α) is a threshold chosen in advance, below which the p-value is treated as sufficient grounds to reject H0. The standard in most research is α = 0.05 (5%). Fields where a false positive is costly, such as medicine, often use stricter thresholds of 0.01 or 0.001. In business analytics, 0.1 is sometimes used for quick decisions.
The threshold is chosen before data collection, not after. Tuning α to fit a result you have already obtained is p-hacking, a form of data manipulation that produces false-positive conclusions.
Type I and Type II errors
Two kinds of errors are unavoidable in any statistical test:
A Type I error (false positive) is rejecting H0 when it is actually true. "Finding an effect where there is none." The probability of this error equals α. At α = 0.05, when there is truly no effect, on average one test in 20 will still come out significant by chance.
A Type II error (false negative) is failing to reject H0 when it is false. "Missing a real effect." The probability of this error is β, and 1 - β is called the statistical power of the test. The larger the sample, the lower the β and the higher the chance of detecting a real effect. For more on the link with sample size, see the article on sample size.
In survey research practice, a Type II error is often the more dangerous one: with a small sample, a real effect is lost in the noise, and the company concludes "there is no improvement" when in fact there is.
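The link between α and the Type I error rate can be checked with a small simulation. The sketch below is illustrative only: the group size, the number of simulated studies, and the use of a z-test with a known standard deviation are assumptions made to keep the example short, not part of any particular survey workflow. Both groups are drawn from the same distribution, so every "significant" result is a false positive.

```python
import random
from statistics import NormalDist, fmean

ALPHA = 0.05
N_PER_GROUP = 50       # hypothetical group size
N_EXPERIMENTS = 2000   # number of simulated studies
SIGMA = 1.0            # standard deviation of the simulated scores (treated as known)


def two_sided_p(sample_a, sample_b, sigma=SIGMA):
    """Two-sided z-test p-value for a difference in means with known sigma."""
    n_a, n_b = len(sample_a), len(sample_b)
    se = sigma * (1 / n_a + 1 / n_b) ** 0.5
    z = (fmean(sample_a) - fmean(sample_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(seed=42):
    """Share of experiments with no real effect that still reach p < ALPHA."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(N_EXPERIMENTS):
        # Both groups come from the same distribution, so H0 is true.
        a = [rng.gauss(0, SIGMA) for _ in range(N_PER_GROUP)]
        b = [rng.gauss(0, SIGMA) for _ in range(N_PER_GROUP)]
        if two_sided_p(a, b) < ALPHA:
            rejections += 1
    return rejections / N_EXPERIMENTS


rate = false_positive_rate()
print(f"False-positive rate under H0: {rate:.3f}")
```

The printed rate should land close to α, with some simulation noise.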
How to frame hypotheses in survey research
A good hypothesis is specific, testable, and stated before data collection. A few examples from survey practice:
- Comparing groups: "Satisfaction among customers who use live chat is higher than among those who contact support by phone" - tested by comparing the mean scores of the two groups.
- Relationship between variables: "Frequency of product use is positively correlated with NPS" - tested through correlation analysis.
- Change over time: "After the interface update, the completion rate rose" - tested by comparing the metrics before and after.
A poor hypothesis: "Users are generally happy with the product." That is not a hypothesis but an assumption with no precise criterion. To turn it into a testable one: "The average satisfaction score exceeds 4 out of 5 in the target segment."
Which statistical test to choose
The test is chosen depending on the type of data and the structure of the comparison. For survey research, three scenarios are most common:
Comparing two groups by their means. For example, the average satisfaction score among customers A vs B. A t-test for independent samples is used. Condition: the data are approximately normally distributed, or the sample is large enough (a common rule of thumb is n > 30 per group). If you are comparing answers from the same people before and after, use a paired t-test instead.
Comparing three or more groups. Three regions, four segments, five products. Here a series of pairwise t-tests is not suitable: each extra comparison adds another chance of a false positive, so the overall Type I error rate climbs above α. ANOVA (analysis of variance) is used. If ANOVA shows a significant result, additional post-hoc tests (Tukey, for example) determine exactly which pairs differ.
Comparing proportions. "The percentage of satisfied customers in group A vs group B" or "The share of people who completed the survey across two versions of the form." A z-test for proportions or a chi-square test is used. Chi-square also works for testing the independence of two categorical variables - for example, whether a respondent's job title is related to their level of engagement.
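As an illustration of the z-test for proportions, here is a minimal sketch using only the Python standard library. It reuses the 23% vs 27% response rates from the opening example; the sample sizes (1,000 and 200 per group) are hypothetical and chosen only to show how the same difference can be significant in one case and not in the other. The function name is also an assumption for this example.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one true proportion."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Under H0 the best estimate of the common proportion is the pooled one.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 23% vs 27% response rate at two different sample sizes.
z_large, p_large = two_proportion_z_test(230, 1000, 270, 1000)
z_small, p_small = two_proportion_z_test(46, 200, 54, 200)

print(f"n = 1000 per group: z = {z_large:.2f}, p = {p_large:.3f}")
print(f"n = 200 per group:  z = {z_small:.2f}, p = {p_small:.3f}")
```

With 1,000 respondents per group the difference clears α = 0.05; with 200 per group the same four-point gap does not.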
Nonparametric tests. When the data are not normally distributed and the sample is small, you use nonparametric counterparts: the Mann-Whitney test instead of a t-test, the Kruskal-Wallis test instead of ANOVA. They work with ranks rather than values and are less sensitive to outliers - which matters for scale questions with 5-7 gradations.
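To show what "working with ranks" means, here is a sketch of the Mann-Whitney test written with the standard library. It is an assumption-laden illustration rather than a reference implementation: the p-value uses the normal approximation with a tie correction and no continuity correction, which is reasonable for roughly 20 or more answers per group but not for very small samples, where exact tables or a statistics package are a better choice. The two sets of 5-point answers are made up for the example.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value) via the normal approximation.

    Tied values receive the average of the ranks they span, and the variance
    includes the standard tie correction. No continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    counts = Counter(list(x) + list(y))

    # Assign each distinct value the average rank of the positions it occupies.
    rank_of = {}
    ranks_used = 0
    for value in sorted(counts):
        ties = counts[value]
        rank_of[value] = ranks_used + (ties + 1) / 2
        ranks_used += ties

    rank_sum_x = sum(rank_of[v] for v in x)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    mean_u = n1 * n2 / 2
    tie_term = sum(t ** 3 - t for t in counts.values()) / (n * (n - 1))
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term)
    if var_u == 0:
        return u_x, 1.0  # every answer is identical, so there is nothing to compare
    z = (u_x - mean_u) / sqrt(var_u)
    return u_x, 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical 5-point satisfaction answers for two versions of a form.
new_form = [5, 4, 5, 4, 4, 5, 3, 5, 4, 5, 4, 5, 5, 3, 4, 5, 4, 5, 5, 4]
old_form = [3, 4, 3, 2, 4, 3, 5, 3, 4, 2, 3, 4, 3, 3, 4, 2, 5, 3, 4, 3]

u_stat, p_value = mann_whitney_u(new_form, old_form)
print(f"U = {u_stat}, p = {p_value:.4f}")
```

Because only the ordering of answers matters, a single extreme response shifts the result far less than it would shift a mean.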
Power analysis: planning before data collection
The power of a test (1 - β) is the probability of detecting an effect if it really exists. The standard target level is 80%. This means: when an effect genuinely exists, the test will miss it in 20% of cases.
Power depends on four parameters: sample size, the significance level α, the expected effect size, and the variance of the data. A power analysis lets you calculate the required sample size before the study begins - so that, at the chosen α and the expected effect, the test has enough power.
A practical example: you expect a new onboarding flow to raise NPS by an average of 5 points. The standard deviation of NPS in your base is around 20 points. At α = 0.05 and 80% power you need about 250 people in each group. If you recruit 50 each, power drops to roughly 24%, and a real effect goes unnoticed in about three out of four cases. A sample size calculator is available in the SurveyNinja tools.
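The arithmetic behind this kind of example can be sketched as follows. The code assumes a two-sided test comparing two group means with equal group sizes and uses the normal approximation, so its results may differ slightly from a calculator based on the t distribution. The function names are hypothetical; the inputs (5-point effect, standard deviation of 20, α = 0.05, 80% power) come from the example above.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def sample_size_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per group needed to detect a mean difference `delta`."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)


def power_two_sample(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test at a given group size."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic when the true difference is `delta`.
    shift = delta / (sigma * sqrt(2 / n_per_group))
    return _STD_NORMAL.cdf(shift - z_alpha) + _STD_NORMAL.cdf(-shift - z_alpha)


n_needed = sample_size_per_group(delta=5, sigma=20)
power_at_50 = power_two_sample(delta=5, sigma=20, n_per_group=50)

print(f"Required sample size per group: {n_needed}")
print(f"Power with 50 per group: {power_at_50:.2f}")
```

Halving the expected effect roughly quadruples the required sample, which is why the effect size estimate deserves the most scrutiny at the planning stage.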
Hypothesis testing in A/B testing
A/B testing is one of the most common scenarios for applying hypothesis testing. Two versions (control and experimental) are shown to random groups, then the target metric is compared. H0: "There is no difference between the versions." H1: "Version B is better than version A."
The critical conditions for a correct A/B test: random assignment to groups, a sufficient size for each group, a test period defined in advance, and a single changed variable. Stopping the test the moment the p-value first drops below 0.05 is a common mistake: it increases the probability of a Type I error. The test should run until the pre-calculated sample size is reached.
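The cost of early stopping can be demonstrated with a simulation. The sketch below runs A/A tests, where both versions are identical, and compares two rules: declare a winner the first time an interim check shows p < 0.05, or test once at the planned sample size. The planned size of 500 per group, the check every 50 observations, and the z-test with a known standard deviation are all assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

ALPHA = 0.05
MAX_N = 500          # planned sample size per group (hypothetical)
LOOK_EVERY = 50      # interim check after every 50 observations per group
N_SIMULATIONS = 1000


def simulate_peeking(seed=7):
    """Run A/A tests and return (rate_with_peeking, rate_with_fixed_n).

    Each rate is the share of simulations declared significant even though
    there is no true difference between the versions.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(N_SIMULATIONS):
        sum_a = sum_b = 0.0
        ever_significant = False
        z = 0.0
        for n in range(1, MAX_N + 1):
            sum_a += rng.gauss(0, 1)
            sum_b += rng.gauss(0, 1)
            if n % LOOK_EVERY == 0:
                # z statistic for the difference in means; sigma is known to be 1.
                z = (sum_a / n - sum_b / n) / sqrt(2 / n)
                if abs(z) > z_crit:
                    ever_significant = True
        peeking_hits += ever_significant
        fixed_hits += abs(z) > z_crit  # z now holds the value from the final look
    return peeking_hits / N_SIMULATIONS, fixed_hits / N_SIMULATIONS


peeking_rate, fixed_rate = simulate_peeking()
print(f"False positives when stopping at the first p < {ALPHA}: {peeking_rate:.3f}")
print(f"False positives when testing once at n = {MAX_N}: {fixed_rate:.3f}")
```

The fixed-sample rate stays near α, while the peeking rate comes out several times higher, because every extra look is another chance for noise to cross the threshold.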
Example: testing a hypothesis in an NPS survey
A company switched to a new onboarding flow. Hypothesis: "The NPS of users who went through the new onboarding is higher than that of users who went through the old one." Before the change, NPS was measured on 300 users - an average score of 32. After the change - 400 users, an average score of 38. The 6-point difference looks substantial. But is it chance or a real effect?
They run a t-test for independent samples. P-value = 0.04, significance level α = 0.05. The p-value is below the threshold, so H0 is rejected. Conclusion: the difference is statistically significant, and the new onboarding is associated with a higher NPS. The 95% confidence interval for the difference runs from about +0.3 to +11.7 points, which is the range consistent with a two-tailed p-value of 0.04 for a 6-point difference. The interval excludes zero, in line with the test result.
If the sample had been 50 people per group, the p-value for the same difference could have been 0.3, and the conclusion would have been "no effect detected." Not because there is none, but because the small sample did not provide enough power.
One-tailed and two-tailed tests
When framing H1, it is important to define the direction of the test. A two-tailed test checks for any difference from H0: "the groups differ" (no matter in which direction). A one-tailed test checks a specific direction: "group A is better than group B." A one-tailed test is more powerful when the direction of the hypothesis is correct, but if the effect turns out to be in the other direction, the test will not catch it. By default, a two-tailed test is used - it is more conservative and more honest.
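A short sketch makes the trade-off concrete. It assumes a test statistic that follows the standard normal distribution under H0 and a one-tailed H1 of "the effect is positive"; the value z = 1.8 is a hypothetical result picked for illustration.

```python
from statistics import NormalDist


def p_values_from_z(z):
    """Return (one-tailed, two-tailed) p-values for a z statistic.

    The one-tailed value tests H1: effect > 0, so a negative z gives a large p.
    """
    one_tailed = 1 - NormalDist().cdf(z)
    two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return one_tailed, two_tailed


one_pos, two_pos = p_values_from_z(1.8)    # effect in the predicted direction
one_neg, two_neg = p_values_from_z(-1.8)   # same size, opposite direction

print(f"z = +1.8: one-tailed p = {one_pos:.3f}, two-tailed p = {two_pos:.3f}")
print(f"z = -1.8: one-tailed p = {one_neg:.3f}, two-tailed p = {two_neg:.3f}")
```

At α = 0.05 the same positive result is significant one-tailed but not two-tailed, while a negative result of the same size is invisible to the one-tailed test.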
Common mistakes in hypothesis testing
Framing hypotheses after looking at the data. HARKing (Hypothesizing After the Results are Known) - when the hypothesis is fitted to a pattern already found. The result looks significant, but in fact it has not been tested.
Confusing statistical and practical significance. With a large enough sample, even a tiny effect can be statistically significant. A 0.3-point difference in NPS with p = 0.001 is statistically significant but practically meaningless. Always look at the effect size, not just the p-value.
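One common way to report effect size for a difference in means is Cohen's d, the difference divided by the standard deviation. The sketch below pairs it with a p-value for a 0.3-point difference; the standard deviation of 20 and the 100,000 respondents per group are hypothetical values, and the p-value uses a normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def summarize_difference(mean_diff, sd, n_per_group):
    """Return (two-sided p-value, Cohen's d) for two equal-sized groups."""
    se = sd * sqrt(2 / n_per_group)
    z = mean_diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    cohens_d = mean_diff / sd  # standardized effect size
    return p_value, cohens_d


p_value, d = summarize_difference(mean_diff=0.3, sd=20, n_per_group=100_000)
print(f"p = {p_value:.4f}, Cohen's d = {d:.3f}")
```

The p-value falls below 0.001, yet d is about 0.015, far under the 0.2 that is conventionally described as a small effect.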
Multiple comparisons without correction. If you test 20 hypotheses at α = 0.05 and none of the effects is real, on average one will still turn out "significant" by chance, and for independent tests the probability of at least one false positive is about 64%. With mass testing you need a correction (Bonferroni or FDR) - otherwise the number of false-positive results grows in proportion to the number of tests.
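Both corrections are simple enough to sketch in a few lines. The code below assumes independent tests for the family-wise error rate and uses the Benjamini-Hochberg procedure as the FDR method; the list of p-values is made up for illustration, and the function names are hypothetical.

```python
def family_wise_error_rate(alpha, m):
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m


def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 only where p is at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at or below (k / m) * q.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_k = rank
    # Reject every hypothesis ranked at or below k.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_k:
            reject[idx] = True
    return reject


p_values = [0.001, 0.012, 0.02, 0.045, 0.3]  # hypothetical results of five tests

print(f"FWER for 20 tests at alpha = 0.05: {family_wise_error_rate(0.05, 20):.2f}")
print("Bonferroni:", bonferroni_reject(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg_reject(p_values))
```

Bonferroni is the stricter of the two: on this list it keeps one result, while Benjamini-Hochberg keeps three at the cost of tolerating a controlled share of false discoveries.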
Tools and SurveyNinja
For calculations, SurveyNinja provides a set of statistical calculators: a p-value calculator, an A/B significance calculator, and a sample size calculator. Before launching a survey, it is useful to calculate the required sample size - so that the test has enough power to detect the expected effect.
The AI-powered hypothesis generator helps you frame a hypothesis at the start of a study. The data for testing are collected through surveys with clearly defined metrics - and even at the research design stage it is important to decide exactly which variable will be tested.
Hypothesis testing is a formal way to tell signal from noise. H0 is stated before data collection, the p-value is compared with a pre-chosen α, and the effect size is assessed separately from statistical significance. Without this, "significant" results often turn out to be chance coincidences.
Published: May 31, 2026
Mike Taylor