Data coding
May 31, 2026 Reading time ≈ 9 min
You ran a survey with an open-ended question, "What didn't you like about our service?" and collected 800 responses. One person wrote "slow delivery," another "I waited 10 days for my parcel, that's unacceptable," and another "the order arrived late even though they promised 3 days."
All three are talking about the same thing. But to a computer, these are three different text strings that cannot be automatically combined, counted, or compared. Turning this chaos of free-form wording into structured data suitable for analysis is the job of a process called coding.
What is data coding
Data coding is the process of assigning numeric or alphabetic codes to text responses, categories, and variables in a study. Coding turns unstructured information (free-form answers, open comments) into a system of labels that can be counted, filtered, cross-referenced, and visualized.
Put simply, coding is translation from human language into the language of tables. When a respondent writes "the manager was rude and didn't help me figure things out," that's human speech. When an analyst assigns this response the codes "Politeness — negative" and "Competence — negative," it becomes data you can work with: counting frequency, building charts, comparing segments.
Coding isn't limited to open-ended questions. Closed questions are coded too; it just happens automatically at the questionnaire-design stage. When you set the options "Male / Female" and the system records them as 1 and 2, that's already coding. But the real challenge begins where the answers are free-form.
Why code: what structuring gives you
The ability to count. As long as responses exist as text, all you can do is read them one by one. Coding turns reading into counting: instead of "I read 800 comments and it seemed to me that many people complain about delivery," you say "47% of negative comments relate to delivery, 23% to product quality, 18% to support." That's a completely different level of argument.
The ability to compare. After coding, you can compare the answers of different groups: do city residents or those in the regions complain about delivery more often? New customers or returning ones? Detractors who gave an NPS score of 2, or promoters who gave a 9? Without coding, such cross-tabulations are impossible; you stay at the level of impressions.
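A cross-tabulation of this kind can be sketched in a few lines of plain Python. The segment names, codes, and rows below are made up for illustration; the only assumption is that each response already carries one segment label and one code.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (customer segment, assigned code).
rows = [
    ("new", "delivery"), ("new", "delivery"), ("new", "support"),
    ("returning", "quality"), ("returning", "delivery"),
    ("returning", "quality"), ("returning", "support"),
]

def crosstab_shares(rows):
    """For each segment, return the share of its responses carrying each code."""
    by_segment = defaultdict(Counter)
    for segment, code in rows:
        by_segment[segment][code] += 1
    return {
        segment: {code: n / sum(counts.values()) for code, n in counts.items()}
        for segment, counts in by_segment.items()
    }

table = crosstab_shares(rows)
for segment, shares in table.items():
    print(segment, {code: f"{share:.0%}" for code, share in shares.items()})
```

Shares are computed within each segment, so groups of different sizes can be compared directly.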
The ability to track change over time. If you run surveys regularly and code open responses using the same scheme, you can see trends: the share of delivery complaints fell from 47% to 31% over two quarters, which suggests the logistics changes are working. Without coding, you read comments once a quarter and "feel like things have gotten better." With coding, you have numbers to back that feeling up or refute it.
The ability to scale. 50 responses can be read by hand. 500 is already hard. 5,000 is impossible without a system. Coding is the system that lets you process thousands of open responses and extract structured conclusions from them.
Coding is the bridge between qualitative data (words, stories, emotions) and quantitative analysis (percentages, charts, tables). Without this bridge, open-ended questions remain a gold reserve you have no access to.
Types of coding
Depending on the task and the type of data, different approaches to coding are used.
Deductive coding
The categories are defined in advance, before the analysis begins — based on hypotheses, prior experience, or a business objective. The analyst reads each response and assigns it one or more predefined labels.
Example. A delivery service knows that the main complaint themes are "speed," "cargo damage," "courier communication," and "wrong address." Before coding, a coding frame is built from these four categories plus "other." Every open response passes through this filter.
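A closed coding frame like this one can be represented so that a label outside the frame is rejected rather than silently counted. The sketch below is one possible way to do it; the `Theme` and `tally` names are illustrative, not part of any tool mentioned here.

```python
from enum import Enum

class Theme(Enum):
    """The predefined frame: four known themes plus a catch-all."""
    SPEED = "speed"
    DAMAGE = "cargo damage"
    COMMUNICATION = "courier communication"
    WRONG_ADDRESS = "wrong address"
    OTHER = "other"

def tally(labels):
    """Count labels per theme; themes that never occur stay at zero."""
    counts = {theme: 0 for theme in Theme}
    for label in labels:
        # Theme(label) raises ValueError for a label outside the frame.
        counts[Theme(label)] += 1
    return counts

counts = tally(["speed", "other", "speed", "wrong address"])
for theme, n in counts.items():
    print(f"{theme.value}: {n}")
```

Starting every theme at zero keeps absent themes visible in the report, which matters in repeated studies where a category dropping to zero is itself a finding.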
When it fits: you already know the problem space well and want to measure the frequency of familiar themes. This is typical of repeated studies where the categories have become established.
Inductive coding
The categories are not set in advance — they emerge from the data itself. The analyst reads the responses, identifies recurring themes, and formulates categories as the work progresses. This approach is closer to qualitative research and is used when you don't yet know exactly what respondents will say.
Example. A company runs an employee survey for the first time with the open-ended question "What stops you from working productively?" The list of categories is unknown in advance. The analyst reads the first 100 responses and discovers unexpected themes coming up: "constant video calls," "an uncomfortable office chair," "unclear project priorities." These themes become the categories that are then applied to the entire data set.
When it fits: exploratory studies, first surveys on a new topic, situations where you deliberately don't want to constrain the analysis with preconceived frameworks.
Mixed coding
In practice, a combination is used most often: a starter set of categories is defined in advance (deductively), and as the analyst works with the data, new categories that weren't anticipated are added (inductively). This is a pragmatic approach: you don't start from a blank slate, but you also don't lock yourself into a predefined framework.
Numeric coding of closed questions
A separate, more technical task is assigning numeric codes to the options of closed questions for subsequent statistical analysis. For example: "Strongly disagree" = 1, "Disagree" = 2, "Neutral" = 3, "Agree" = 4, "Strongly agree" = 5. Or: "Male" = 1, "Female" = 2. This coding is usually automated at the survey platform level.
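As a minimal illustration, the scale above can be written as a lookup table. The `encode` helper and the sample answers are made up; averaging the scores additionally assumes the distances between options are equal, which is a common convention rather than a given.

```python
# The five-point scale from the text as a lookup table.
LIKERT = {
    "Strongly disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly agree": 5,
}

def encode(answers, scheme=LIKERT):
    """Replace each answer text with its numeric code (KeyError if unknown)."""
    return [scheme[answer] for answer in answers]

answers = ["Agree", "Neutral", "Strongly agree", "Agree"]
scores = encode(answers)
mean_score = sum(scores) / len(scores)
print(scores, mean_score)
```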
How to code open responses: a step-by-step process
Step 1. Read a sample of responses
Don't rush into coding right away. First read 50–100 responses in a row to grasp the overall picture: which themes come up, what tone prevails, whether there are unexpected directions. This is "reconnaissance" before the systematic work.
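A quick word count can support this first read, though it does not replace it. The sketch below is a crude aid: the stopword list is a tiny hand-picked assumption, and the sample responses are invented.

```python
import re
from collections import Counter

# A deliberately small, hand-picked stopword list (an assumption, not a standard).
STOPWORDS = {"the", "a", "an", "and", "was", "is", "for", "my", "i", "to", "of", "it", "too"}

def frequent_terms(responses, top=5):
    """Return the most common non-stopword terms across all responses."""
    words = []
    for text in responses:
        words += [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    return Counter(words).most_common(top)

sample = ["slow delivery", "delivery took too long", "rude support and slow delivery"]
print(frequent_terms(sample, top=2))
```

Frequent terms only hint at candidate themes; tone and unexpected directions still have to come from reading.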
Step 2. Build a coding frame
Formulate a list of categories (codes). Each category should be:
- Unambiguous — there's no doubt which category a particular response belongs to
- Exhaustive — every response falls into at least one category (for this you need an "Other" category)
- Mutually exclusive — but only if you've decided that each response gets exactly one code. If a response can contain several themes, allow several codes per response instead
Example coding frame for the question "What didn't you like?":
- 01 — Delivery speed
- 02 — Packaging quality
- 03 — Product not matching the description
- 04 — Support performance
- 05 — Price / value for money
- 06 — Navigation on the website / app
- 07 — Payment methods
- 08 — Other
- 09 — No complaints / liked everything
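If the frame is kept in machine-readable form, assigned codes can be checked against it automatically. This is a sketch under the assumption that each response's codes are stored as a list of strings; `validate` is a hypothetical helper.

```python
# The example frame from above as a code -> label mapping.
FRAME = {
    "01": "Delivery speed",
    "02": "Packaging quality",
    "03": "Product not matching the description",
    "04": "Support performance",
    "05": "Price / value for money",
    "06": "Navigation on the website / app",
    "07": "Payment methods",
    "08": "Other",
    "09": "No complaints / liked everything",
}

def validate(assigned, frame=FRAME, single_code=False):
    """Return a list of problems found in one response's codes (empty if none)."""
    problems = []
    if not assigned:
        problems.append("no code assigned")  # violates exhaustiveness
    unknown = [code for code in assigned if code not in frame]
    if unknown:
        problems.append(f"unknown codes: {unknown}")
    if single_code and len(assigned) > 1:
        problems.append("more than one code in single-code mode")
    return problems

print(validate(["01", "04"]))                    # multiple coding allowed
print(validate(["01", "04"], single_code=True))  # one code per response
```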
Step 3. Code all the responses
Go through each response and assign a code (or several). If a response doesn't fit any category, place it in "Other." If "Other" accumulates more than 10–15% of responses, reconsider your categories: you've most likely missed an important theme, and a separate code needs to be split out of "Other."
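The "Other" check is easy to automate. The sketch below assumes one code per response and uses the upper end of the 10–15% range as the default threshold; both the function names and the default are illustrative choices.

```python
def other_share(codes, other="08"):
    """Share of responses that ended up in the catch-all category."""
    return sum(1 for code in codes if code == other) / len(codes)

def needs_review(codes, threshold=0.15, other="08"):
    """True when the catch-all category is large enough to hide a missed theme."""
    return other_share(codes, other) > threshold

codes = ["01"] * 8 + ["08"] * 2
print(other_share(codes), needs_review(codes))
```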
Step 4. Check consistency
If one person is coding, reread the first 50 coded responses after you finish everything. By this point your understanding of the categories may have sharpened, and the early codes may need adjusting.
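If several analysts are coding, calculate intercoder reliability: give the same sample of 30–50 responses to two coders independently and measure the share of responses on which their codes match. Agreement below 80% is a signal that the categories are not formulated clearly enough and need refining. Keep in mind that simple percent agreement does not correct for matches that happen by chance, so a chance-corrected measure such as Cohen's kappa is often reported alongside it.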
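Both measures can be computed by hand. The sketch below shows simple percent agreement and Cohen's kappa, a standard chance-corrected agreement statistic, for the single-code case; the two coders' codes are invented sample data.

```python
from collections import Counter

def percent_agreement(a, b):
    """Share of responses to which both coders gave the same code."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Expected agreement if each coder assigned codes independently
    # at their own observed rates.
    p_e = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)  # undefined when p_e == 1

coder_a = ["01", "01", "04", "08", "01", "04", "01", "08", "04", "01"]
coder_b = ["01", "01", "04", "01", "01", "04", "05", "08", "04", "01"]
print(percent_agreement(coder_a, coder_b), cohens_kappa(coder_a, coder_b))
```

In this sample the coders agree on 8 of 10 responses, yet kappa is noticeably lower, because code 01 is so frequent that some matches would be expected by chance alone.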
Step 5. Analyze
Now that each response has a code, you can do what the whole effort was for: count frequencies, build charts, compare groups, track change over time. Did the code "01 — Delivery speed" appear in 47% of negative responses? That's the main problem. The code "07 — Payment methods" in 3%? That's not a priority.
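When multiple coding is allowed, frequencies are best counted per response rather than per code. A minimal sketch, assuming each response is stored as a list of codes (the data is invented):

```python
from collections import Counter

# Hypothetical coded responses; a response may carry several codes.
coded = [["01"], ["01", "04"], ["03"], ["01"], ["04", "07"]]

def code_frequencies(coded_responses):
    """Share of responses mentioning each code, most frequent first."""
    n = len(coded_responses)
    # set() ensures a code repeated within one response is counted once.
    counts = Counter(code for codes in coded_responses for code in set(codes))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(code, count / n) for code, count in ranked]

for code, share in code_frequencies(coded):
    print(code, f"{share:.0%}")
```

With multiple coding the shares can add up to more than 100%, since one response can contribute to several codes.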
Automated coding and AI
Manual coding is accurate but slow. At large volumes (thousands of responses) it becomes the bottleneck of the entire study. Automated text-processing technologies come to the rescue.
Dictionary-based coding. The simplest approach: a dictionary of keywords is built for each category. If a response contains "deliver*," "courier," or "brought*," the code is "Delivery." Fast but crude: a dictionary matches words, not meaning, so sarcasm, negations, and complex constructions are beyond it. "The delivery was wonderful" and "the delivery is a nightmare" will get the same code, with nothing to tell praise from complaint.
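A dictionary coder of this kind can be sketched with regular expressions. The "Delivery" keywords come from the example above; the "Support" entry and all function names are assumptions added for illustration. A trailing `*` is treated as "any word starting with this stem."

```python
import re

DICTIONARY = {
    "Delivery": ["deliver*", "courier", "brought*"],
    "Support": ["support", "manager*", "operator*"],  # hypothetical second category
}

def compile_patterns(dictionary):
    """Turn each keyword list into one case-insensitive whole-word regex."""
    compiled = {}
    for code, keywords in dictionary.items():
        parts = []
        for keyword in keywords:
            if keyword.endswith("*"):
                parts.append(re.escape(keyword[:-1]) + r"\w*")
            else:
                parts.append(re.escape(keyword))
        compiled[code] = re.compile(r"\b(?:" + "|".join(parts) + r")\b", re.IGNORECASE)
    return compiled

PATTERNS = compile_patterns(DICTIONARY)

def dictionary_code(text):
    """Return every code whose keywords occur in the text, else 'Other'."""
    codes = [code for code, pattern in PATTERNS.items() if pattern.search(text)]
    return codes or ["Other"]

print(dictionary_code("The delivery was wonderful"))
print(dictionary_code("the delivery is a nightmare"))
```

Both sample sentences come back as "Delivery," which is exactly the limitation described: the match says what a response is about, not whether it is praise or a complaint.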
AI-assisted coding. Modern language models can classify text responses taking into account context, tone, and hidden meaning. They can distinguish "the delivery is great" from "the delivery is terrible" and process thousands of responses in minutes, and their results can be improved from one iteration to the next by refining the instructions and examples they are given. They still make mistakes, so the optimal strategy is hybrid: AI performs the initial classification, and a human checks the disputed cases and corrects errors.
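The hybrid workflow can be sketched as a triage step. Everything here is an assumption made for illustration: `classify` stands for whatever model you use and is expected to return a code plus a confidence between 0 and 1, and `fake_classifier` is a toy stand-in, not a real model or API.

```python
def triage(responses, classify, min_confidence=0.8):
    """Split model output into auto-accepted codes and cases for human review."""
    accepted, review = [], []
    for text in responses:
        code, confidence = classify(text)
        target = accepted if confidence >= min_confidence else review
        target.append((text, code, confidence))
    return accepted, review

def fake_classifier(text):
    """Toy stand-in for a model call; replace with a real classifier."""
    if "delivery" in text.lower():
        return "Delivery", 0.95
    return "Other", 0.40

accepted, review = triage(["Delivery took 10 days", "meh"], fake_classifier)
print(len(accepted), "accepted;", len(review), "sent to a human")
```

The threshold is a dial between speed and accuracy: raising it sends more cases to a person, lowering it trusts the model more.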
Sentiment analysis is a special case of automated coding in which each response is assigned an emotional tone: positive, negative, or neutral. This is the most basic level of coding, but for many tasks it's enough — for example, to track the share of negative mentions over time.
Data coding and SurveyNinja
In SurveyNinja, open-ended text responses are collected and stored in the analytics section, from which they can be exported for coding.
Export to CSV/Excel. All responses, including the text from open fields, can be exported in table format. The text of the response is in a separate column, with answers to closed questions and demographics alongside it. This is a ready-made basis for coding work in Excel, Google Sheets, or specialized tools.
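Once exported, such a table can be read and extended with a code column using Python's standard `csv` module. The column names (`respondent_id`, `nps`, `comment`) and the sample rows are assumptions; a real export will have its own headers, and the in-memory text stands in for a file opened with `open(path, newline="", encoding="utf-8")`.

```python
import csv
import io

# Stand-in for an exported file; column names are hypothetical.
EXPORT = """respondent_id,nps,comment
1,3,slow delivery
2,9,liked everything
3,5,"support never replied, very rude"
"""

def load_comments(file):
    """Read rows and keep only those with a non-empty open-text answer."""
    return [row for row in csv.DictReader(file) if row["comment"].strip()]

def write_coded(rows, codes, file):
    """Write the rows back with an extra 'codes' column (semicolon-separated)."""
    writer = csv.DictWriter(file, fieldnames=list(rows[0]) + ["codes"])
    writer.writeheader()
    for row, row_codes in zip(rows, codes):
        writer.writerow({**row, "codes": ";".join(row_codes)})

rows = load_comments(io.StringIO(EXPORT))
out = io.StringIO()
write_coded(rows, [["01"], ["09"], ["04"]], out)
print(out.getvalue())
```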
Built-in AI assistant. SurveyNinja includes AI functionality that can be used for the initial processing of text data — identifying key themes and grouping similar responses.
Filtering in analytics. Using the filters in the interface, you can view text responses by segment: separately from detractors (NPS 0–6) and separately from promoters (9–10). This speeds up manual coding: instead of reading 800 responses in a row, you work with target groups.
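The same segmentation can be reproduced on exported data. The sketch uses the standard NPS bands (0–6 detractor, 7–8 passive, 9–10 promoter); the rows and function names are illustrative.

```python
def nps_segment(score):
    """Map a 0-10 NPS answer to its standard segment."""
    if not 0 <= score <= 10:
        raise ValueError(f"NPS score out of range: {score}")
    if score <= 6:
        return "detractor"
    if score <= 8:
        return "passive"
    return "promoter"

def comments_for(rows, segment):
    """Return the comments of respondents in the given segment."""
    return [comment for score, comment in rows if nps_segment(score) == segment]

# Hypothetical (score, comment) pairs.
rows = [(3, "slow delivery"), (9, "fast and polite"), (6, "damaged box"), (8, "fine overall")]
detractor_comments = comments_for(rows, "detractor")
print(detractor_comments)
```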
Coding is the work everyone wants to skip but can't. Without it, open-ended responses remain a collection of stories that everyone interprets in their own way. With coding, they turn into arguments backed by numbers.
Mike Taylor