Cluster analysis

After a survey you're left with a table: respondents, dozens of questions, scales and demographics. You'd like to see whether the answers fall into recognizable types — "detractors", "loyalists", "neutrals" or something of your own. But the boundaries of such groups are unknown in advance, and going through hundreds of rows by hand isn't realistic.

This is where cluster analysis helps: a family of methods that group objects by "similarity" so that within a group things are as close as possible, while between groups the difference is clear. The result is a cluster assignment that you can then describe, name and use for segmentation.

It's important to understand: cluster analysis explains nothing and tests no hypotheses — it only splits the data. The outcome depends heavily on which variables you chose and which algorithm you applied. That's why clusters should be checked for stability and interpreted meaningfully; otherwise you'll get a neat picture that is useless for decisions.

What cluster analysis is in plain terms

Cluster analysis is a group of multivariate statistical methods that split a set of objects into subsets (clusters) so that objects within one cluster are similar to each other on the chosen features, while objects from different clusters differ. The number of clusters can be set in advance or selected by criteria. The result is an assignment of "who is in which cluster", which is then used for segmentation, profiling or further analysis.

Put simply: you feed in a table (for example, respondents × scale answers), specify "what similarity is measured on", and the algorithm returns groups. Clusters don't come with names "out of the box" — you give the names and meaning yourself after reviewing the means and distributions of the variables in each cluster.

When cluster analysis is appropriate

  • Segmentation without rigid rules. You need to identify types of customers, users or respondents across many features (behavior, attitudes, demographics), but you don't know in advance how many segments there are or where the boundaries lie. Clustering hints at a possible structure.
  • Data exploration. After a survey there are many variables; you want to see whether the answers "fall into" natural groups. Clusters give a draft of segments that you then refine or validate on new data.
  • Grouping more than just people. You can also cluster objects of another kind: products, questionnaire items, free-text feedback — by numeric or transformed features.

Cluster analysis does not replace a quantitative design with hypotheses: it is descriptive. If you already have clear segmentation criteria (for example, "age and income"), it's easier to split the sample by them or use cross-tabulations. Clustering is useful when there are many features and you're looking for a hidden grouping.

When clustering isn't needed. If the segments are defined explicitly (region, customer type by contract), split by them. If the goal is to test the relationship between two variables, use correlation or regression. Cluster analysis does not answer the question "does X affect Y" — only "how do objects come together into groups".

Main approaches

Hierarchical clustering. A "tree" is built: first every object is its own cluster, then at each step the two nearest clusters are merged. Cutting the tree at a chosen height gives the desired number of clusters. The plus is the intuitive dendrogram; the minus is that it becomes computationally expensive with a large number of objects and, depending on the linkage method, can be sensitive to outliers.
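The merge procedure can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the toy points and the choice of average linkage are ours, and a real analysis would normally rely on a statistics package rather than this brute-force loop.

```python
from itertools import combinations
from math import dist

def average_linkage(a, b, points):
    """Mean pairwise Euclidean distance between two clusters (lists of point indices)."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

def agglomerate(points, n_clusters):
    """Merge the two nearest clusters until only n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every object starts as its own cluster
    merge_heights = []  # linkage distance at each merge, i.e. the dendrogram heights
    while len(clusters) > n_clusters:
        # Find the closest pair of clusters; ties are broken by index.
        d, a, b = min(
            (average_linkage(clusters[i], clusters[j], points), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        merge_heights.append(d)
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b keeps a valid
        del clusters[b]
    return clusters, merge_heights

# Toy data (made up for illustration): two tight groups of three points each.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
clusters, heights = agglomerate(points, n_clusters=2)
print(clusters)  # lists of row indices, one list per cluster
print(heights)   # distances at which the merges happened
```

The repeated search over all cluster pairs is what makes the method slow on large samples.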

K-means. You set the number of clusters K; the algorithm iteratively assigns each object to the nearest of K centroids and then recomputes the centroids, so as to minimize the sum of squared distances to the centers. It's fast and scales well, but K has to be chosen in advance (by an empirical criterion such as the "elbow" method, or on substantive grounds).
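A minimal sketch of the assign-and-update loop is shown here. It is an illustration, not a reference implementation: the function names and data are made up, and the starting centroids are picked with a deterministic farthest-first rule (an assumption chosen to keep the example reproducible), whereas statistical packages usually start from random or k-means++ initializations and repeat the run several times.

```python
from math import dist

def farthest_first(points, k):
    """Pick k starting centroids: the first point, then repeatedly the point farthest from those chosen."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, n_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = farthest_first(points, k)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (made up for illustration): two tight groups of three points each.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
labels, centroids = kmeans(points, k=2)
print(labels)     # cluster number for each row
print(centroids)  # one mean vector per cluster
```

The cluster numbers themselves carry no meaning; only the grouping does.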

The link with factor analysis. Before clustering, the dimensionality of the data is often reduced using factor analysis or PCA: clusters are then built on the factor scores rather than on dozens of original variables. This reduces noise and simplifies interpretation.

The choice of distance metric (Euclidean, Manhattan, correlation-based, etc.) and of the linkage method (for hierarchical clustering) affects the result. It is better to standardize the variables first; otherwise features with greater variance will dominate the distance.
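To see why standardization matters, here is a small sketch that converts each column to z-scores and compares distances before and after. The column meanings (a 1–5 rating and a monthly spend) and the numbers are hypothetical, as are the helper names.

```python
from math import dist  # Euclidean distance
from statistics import mean, pstdev

def standardize(rows):
    """Convert every column to z-scores: subtract the column mean, divide by its standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) for c in cols]
    return [
        tuple((x - m) / s if s else 0.0 for x, m, s in zip(row, mus, sigmas))
        for row in rows
    ]

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical respondents: (rating on a 1-5 scale, monthly spend).
rows = [(1, 1000), (5, 1100), (2, 1500)]
z_rows = standardize(rows)

# On raw data the spend column swamps the rating: respondents 0 and 1 look close
# even though their ratings are at opposite ends of the scale.
print(dist(rows[0], rows[1]), dist(rows[0], rows[2]))
# After standardization both columns contribute on a comparable scale.
print(dist(z_rows[0], z_rows[1]), dist(z_rows[0], z_rows[2]))
print(manhattan(z_rows[0], z_rows[1]))
```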

How to choose the number of clusters. For K-means, K is set in advance. People often plot the within-cluster sum of squares against K (the "elbow" method): after a certain K the gain from adding another cluster becomes small. Another option is substantive: "we need 3–4 segments for the product". For hierarchical clustering the number of clusters is chosen by where to cut the dendrogram: just below the point where the distance between merged clusters rises sharply. Whatever the choice, validate it: with a neighboring value of K the cluster profiles should not fall apart completely.
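The "elbow" idea can be illustrated on a single made-up variable. To keep the sketch short and deterministic, it does not run K-means; instead it finds the exact minimum within-cluster sum of squares for one-dimensional data by trying every way of cutting the sorted values into K contiguous groups (a shortcut that only works in one dimension). The function names and scores are assumptions for illustration.

```python
from itertools import combinations

def wcss(group):
    """Within-cluster sum of squares: squared deviations from the group mean."""
    center = sum(group) / len(group)
    return sum((x - center) ** 2 for x in group)

def best_wcss_1d(values, k):
    """Smallest total WCSS over all splits of the sorted values into k contiguous groups."""
    xs = sorted(values)
    best = float("inf")
    for cuts in combinations(range(1, len(xs)), k - 1):
        bounds = (0, *cuts, len(xs))
        total = sum(wcss(xs[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, total)
    return best

# Made-up scores that visibly fall into three groups.
scores = [1.0, 1.2, 0.8, 3.0, 3.1, 2.9, 5.0, 5.2, 4.8]
curve = {k: best_wcss_1d(scores, k) for k in range(1, 6)}
for k in range(2, 6):
    # The drop is large up to K=3 and small afterwards: that bend is the "elbow".
    print(k, round(curve[k], 3), "drop:", round(curve[k - 1] - curve[k], 3))
```

On real survey data the bend is often less sharp than in this toy case, which is one reason the substantive argument for K still matters.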

Example in the context of surveys

A satisfaction survey: 20 items on a 1–5 scale, plus gender, age, frequency of use. Respondents are rows, variables are columns. After standardization we run K-means with K=3 or K=4. We get three or four clusters. Then we look at the means for each item and the demographics within the clusters: one cluster may turn out to be "detractors" (low ratings, less frequent users), another "loyalists" (high ratings), a third "neutrals". You give these names yourself; the cluster analysis only assigned the labels. The size of the clusters and their stability can be checked on a subsample or with a different algorithm.

Another example: grouping questionnaire items. The objects are not respondents but questions (for example, 30 Likert-scale statements). The features are the average answers to each question in subsamples or the correlations between items. Clustering can show which items "go together" — a draft of scales or thematic blocks. For a finer check of the structure people more often use factor analysis; clustering gives a quick overview.

Interpreting and using clusters

Once you have the cluster assignment, you need to describe and name the clusters. Look at the means (and, where needed, the proportions) for all variables in each cluster: what makes this cluster stand out? Compare the sizes of the clusters: if there is one "huge" cluster and several "tiny" ones, the split may be unstable. It's convenient to build profile charts or heatmaps of "cluster × variable". The cluster names ("detractors", "loyalists", "neutrals") are assigned by the researcher based on these profiles; after that the clusters can be used as a grouping variable in cross-tabulations, regressions or segment reports.
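A profile table of this kind takes only a few lines to compute once the labels exist. The sketch below assumes the labels come from an earlier clustering run; the variable names, values and labels are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

def cluster_profiles(rows, labels, names):
    """Return the size and per-variable mean of every cluster."""
    groups = defaultdict(list)
    for row, lab in zip(rows, labels):
        groups[lab].append(row)
    return {
        lab: {"size": len(members),
              **{name: mean(col) for name, col in zip(names, zip(*members))}}
        for lab, members in sorted(groups.items())
    }

# Hypothetical variables and cluster labels from an earlier clustering step.
names = ["overall_satisfaction", "uses_per_month"]
rows = [(1, 2), (2, 1), (5, 10), (4, 12), (3, 5)]
labels = [0, 0, 1, 1, 2]

profiles = cluster_profiles(rows, labels, names)
for lab, profile in profiles.items():
    print(lab, profile)
```

Here cluster 0 (low ratings, rare use) would be a candidate for the "detractors" name and cluster 1 for "loyalists"; the naming remains the researcher's decision.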

Limitations and common mistakes

Clusters don't have to be "real". The algorithm will always produce a split, even if there is no clear grouping in the data. A check is needed: vary K, the method, the subsample — if the structure jumps around a lot, be cautious with conclusions.
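One simple way to quantify "how much the structure jumps around" is to compare two labelings of the same respondents with the Rand index: the share of respondent pairs on which the two solutions agree (both put the pair together, or both put it apart). The sketch below is illustrative; the function name and label vectors are made up, and in practice the chance-corrected adjusted Rand index is often preferred.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Share of object pairs that both labelings treat the same way (together or apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Two runs that differ only in how the clusters are numbered: perfect agreement.
same_grouping = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Two runs that group the respondents in unrelated ways: low agreement.
different_grouping = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_grouping, different_grouping)
```

Because the index looks at pairs rather than label values, it is unaffected by the arbitrary numbering of clusters between runs.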

Dependence on the set of variables. Add or remove features and the clusters may change. In the report, state which variables and which settings the clustering was run with.

Confusing it with regression and correlation. Correlation and regression analysis answer questions about relationships and prediction. Cluster analysis only groups objects; it does not estimate the "effect" of features and does not predict an outcome.

Ignoring size and representativeness. Clusters are built on the sample you have. If the sample is not representative or the size is small, the segments cannot be transferred to the population without additional assumptions.

Too many variables without selection. Including dozens of features "just in case" inflates the noise and may yield spurious clusters. It makes sense to select variables according to the task or to reduce dimensionality beforehand (PCA, factor analysis), then cluster on a smaller number of components.

How it looks in SurveyNinja

SurveyNinja has no built-in cluster analysis. A typical scenario: export the answers via survey reports to CSV/XLSX, then run the clustering in an external tool (Excel with add-ins, R, Python, SPSS, JAMOVI). It makes sense to first filter the respondents and the variables on which you'll compute similarity; if needed, use coding of open-ended fields and build the clusters on the codes or numeric scales.

Practical recommendations

Define the objects and features clearly. Respondents or something else? Which variables go into the distance calculation? Categorical variables need to be transformed (binary, dummy) or you should use algorithms that allow mixed types.
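Dummy coding a categorical answer can be sketched as follows. The question ("frequency of use") and its answer options are hypothetical, and the helper name is ours; whether 0/1 columns should enter a Euclidean distance alongside standardized scales is a judgment call that depends on the data.

```python
def one_hot(values):
    """Turn a list of category labels into 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

# Hypothetical answers to a "frequency of use" question.
frequency = ["monthly", "weekly", "daily", "weekly"]
categories, encoded = one_hot(frequency)
print(categories)  # column order of the indicator variables
for answer, row in zip(frequency, encoded):
    print(answer, row)
```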

Standardize the variables. Otherwise features with a large spread will dominate. The exception is when different scales are meaningful by design.

Check stability. Vary K, the method, the random subsample; see whether the clusters remain substantively similar. If not, don't overcomplicate the interpretation.

Describe the methodology in the report. State: the method (K-means, hierarchical, etc.), the number of clusters and how it was chosen, the list of variables, the software tool. Then the reader will be able to assess and reproduce the analysis.

Cluster analysis is a tool for exploration and segmentation without rigid rules: it groups objects by proximity on the chosen features. The result needs to be checked for stability and interpreted meaningfully; for the calculations you use external programs after exporting the data from SurveyNinja.
