IRR (Inter-Rater Reliability)

You collect 500 open-ended responses in a survey and hand them to two analysts to code into the categories "complaint", "praise", "question".

The first analyst finds 120 complaints, the second finds 180. Whom do you trust? If their codes diverge on 30% of the responses, any conclusions based on this classification are questionable. Inter-rater reliability is the metric that formally evaluates whether different experts read the same data the same way.

Definition

Inter-Rater Reliability (IRR) is the degree of agreement among the ratings or classifications assigned by different independent experts (raters) to the same objects or responses. It is used in the coding of open-ended responses, the assessment of qualitative data, 360-degree feedback, and expert evaluations. High IRR indicates that the categorization or rating is reproducible, rather than dependent on the individual perception of a particular expert.

Why measure IRR

Any data that passes through a person's subjective judgment needs to be checked: do different experts produce the same result? Without IRR you do not know whether the codes reflect the real structure of the data or the individual preferences of a particular analyst.

Three typical scenarios where IRR is critical:

Coding open-ended responses. After a survey with open-ended questions, the answers are classified by topic. If the codes are subjective, the aggregated statistics ("40% mentioned the quality of support") become unreliable.

Qualitative analysis of interviews, focus groups, and feedback. Identifying themes, patterns, and insights is a subjective process, and IRR provides evidence that the results are reproducible.

Evaluation procedures. Assessments, performance reviews, expert evaluations of work quality. If different raters give different scores to the same person, the process is unfair and uninformative.

How IRR is measured

The choice of coefficient depends on the type of data:

Percent Agreement. The simplest one: the share of cases where the raters agreed. It is intuitive, but it overstates agreement, because matches that occur purely by chance also count toward the percentage. It is not recommended as the sole metric.
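A minimal sketch of percent agreement, assuming each rater's codes are stored as a list in the same item order. The function name and the five sample codes are illustrative, not taken from a real dataset.

```python
def percent_agreement(codes_a, codes_b):
    """Share of items to which both raters assigned the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both raters must code the same non-empty set of items")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


# Hypothetical codes for five responses.
rater_1 = ["complaint", "praise", "question", "complaint", "praise"]
rater_2 = ["complaint", "praise", "complaint", "complaint", "question"]

print(percent_agreement(rater_1, rater_2))  # 3 of 5 items match -> 0.6
```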

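Cohen's kappa (κ). For two raters and nominal categories. It measures agreement after removing the share expected by chance: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed share of matching codes and p_e is the share of matches expected if the raters coded independently at their own category rates. Values range from -1 to 1: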

  • κ < 0: agreement worse than chance (rare)
  • 0 ≤ κ < 0.4: slight to fair
  • 0.4 ≤ κ < 0.6: moderate
  • 0.6 ≤ κ < 0.8: substantial
  • κ ≥ 0.8: almost perfect agreement
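Cohen's kappa can be computed by hand in a few lines. The sketch below assumes two equal-length lists of nominal codes; the function name and the ten sample codes are hypothetical. It returns NaN in the degenerate case where both raters use a single identical category, because kappa is undefined there.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two raters who coded the same items."""
    n = len(codes_a)
    if n == 0 or n != len(codes_b):
        raise ValueError("both raters must code the same non-empty set of items")
    # Observed agreement: share of items with identical codes.
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement: product of each rater's own category rates, summed.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    p_e = sum(freq_a[cat] * freq_b[cat] for cat in freq_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # only one category in use, kappa is undefined
    return (p_o - p_e) / (1 - p_e)


# Hypothetical codes: c = complaint, p = praise, q = question.
rater_1 = ["c", "c", "c", "c", "p", "p", "p", "q", "q", "q"]
rater_2 = ["c", "c", "c", "p", "p", "p", "q", "q", "q", "c"]

# 7 of 10 codes match (p_o = 0.70), chance agreement is 0.34,
# so kappa = 0.36 / 0.66, roughly 0.545.
print(round(cohens_kappa(rater_1, rater_2), 3))
```

Here 70% raw agreement corresponds to only moderate agreement once chance matches are removed.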

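Fleiss's kappa. An extension of the kappa approach to three or more raters. Strictly speaking, it generalizes Scott's pi rather than Cohen's kappa, so with two raters it can differ slightly from Cohen's value.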
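A minimal sketch of Fleiss's kappa, assuming every item was coded by the same number of raters and the data is stored as one list of codes per item. The function name and the four sample items are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss's kappa; `ratings` holds one list of codes per item."""
    n_items = len(ratings)
    n_raters = len(ratings[0]) if ratings else 0
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    category_totals = Counter()
    item_agreement = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))
    p_bar = sum(item_agreement) / n_items
    total = n_items * n_raters
    # Chance agreement from the pooled category proportions.
    p_e = sum((t / total) ** 2 for t in category_totals.values())
    if p_e == 1.0:
        return float("nan")  # only one category in use, kappa is undefined
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical data: four responses, three raters each.
items = [
    ["c", "c", "c"],
    ["p", "p", "c"],
    ["q", "q", "q"],
    ["c", "p", "q"],
]
print(round(fleiss_kappa(items), 3))  # mean agreement 7/12, chance 50/144
```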

Krippendorff's alpha (α). A general-purpose coefficient: it works for any number of raters, for nominal, ordinal, interval, and ratio data, and it tolerates missing ratings. It is widely recommended as a default, particularly in content analysis.
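To show how alpha handles missing ratings, here is a sketch for the nominal case only, built on the usual coincidence-matrix formulation. It assumes one list of codes per item, with None marking a rating that was not given; the function name and sample data are hypothetical. Ordinal and interval data need a different distance function, which this sketch does not cover.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal codes; None marks a missing rating."""
    coincidences = Counter()
    for unit in units:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # an item with fewer than two ratings cannot be paired
        # Every ordered pair of ratings from different raters, weighted 1/(m-1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # no variation in the codes, alpha is undefined
    return 1 - observed / expected


# Hypothetical data: two raters, the second skipped the fifth response.
units = [
    ["c", "c"], ["c", "c"], ["p", "p"],
    ["p", "c"], ["q", None], ["q", "q"],
]
print(round(krippendorff_alpha_nominal(units), 3))  # 1 - 18/62, about 0.71
```

The fifth item is dropped from the calculation instead of forcing the whole dataset to be discarded, which is the practical advantage over Cohen's kappa when coverage is incomplete.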

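Intraclass Correlation Coefficient (ICC). For quantitative ratings (numeric scores). It is used in psychometrics and assessments. It is read in the same direction as kappa (closer to 1 means better agreement), but the value depends on which ICC form is computed, so state the form alongside the number and do not assume that the kappa cutoffs transfer unchanged.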
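ICC comes in several forms. The sketch below assumes one common choice, ICC(2,1): two-way random effects, absolute agreement, single rater. That choice is an assumption made for illustration, and the score table is a small illustrative example, not survey data.

```python
def icc_2_1(scores):
    """ICC(2,1); `scores` has one row per rated subject, one column per rater."""
    n = len(scores)       # number of subjects
    k = len(scores[0])    # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative scores: six subjects rated by four raters on a 1-10 scale.
scores = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(scores), 2))  # 0.29
```

The raters here rank the subjects similarly but use very different score levels, which is why an absolute-agreement ICC comes out low.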

Example: coding open-ended responses about a service experience

A survey collected 300 open-ended responses to the question "Tell us about your most recent interaction with support". Two analysts independently code each response into one of 5 categories: "positive experience", "neutral", "product issue", "staff issue", "not about support".

Results of the first pass:

  • Percent agreement: 78% (234 out of 300 responses classified the same way)
  • Cohen's kappa: 0.64 — substantial agreement

78% sounds decent, but a kappa of 0.64 is on the edge of acceptable. The team reviewed the 66 disputed cases and found that the analysts diverged on the categories "product issue" vs "staff issue", where the boundaries are fuzzy. They updated the coding instructions with concrete examples for each category and ran a short calibration workshop. A re-measurement on a new portion of the data then gave κ = 0.82, almost perfect agreement. Now the data can be analyzed and decisions can be made based on it.
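The gap between 78% and 0.64 can be unpacked by solving the kappa formula for the chance agreement it implies. This is a back-of-the-envelope sketch: it assumes the reported kappa is Cohen's and treats 0.64 as exact, although it is a rounded figure. The function name is hypothetical.

```python
def implied_chance_agreement(observed, kappa):
    """Solve kappa = (p_o - p_e) / (1 - p_e) for the chance agreement p_e."""
    if kappa >= 1:
        raise ValueError("kappa must be below 1 to solve for chance agreement")
    return (observed - kappa) / (1 - kappa)


p_o = 234 / 300                        # observed agreement from the example
p_e = implied_chance_agreement(p_o, 0.64)
print(round(p_o, 2), round(p_e, 2))    # 0.78 0.39
```

Roughly 39 of the 78 percentage points would be expected even if the two analysts coded independently, which is why the raw percentage looks better than the kappa.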

Procedure for checking and improving IRR

1. Developing the coding scheme. Clear, mutually exclusive categories with definitions and examples. The clearer the scheme, the higher the IRR. Fuzzy boundaries between categories are the main cause of low agreement.

2. Training the raters. Joint calibration on a small sample: working through difficult cases, discussing the principles of categorization, aligning interpretations.

3. Pilot coding. Both raters independently code 30-50 cases. IRR is calculated. If it is low — review the discrepancies, refine the scheme, recalibrate.

4. Main coding. Once the pilot IRR is acceptable (κ ≥ 0.7), the raters code the entire dataset. Both raters code a shared portion of the data (10-20%) so that agreement can be monitored as the work proceeds.

5. Periodic checks. For long-running work — reassess IRR every 100-200 units to catch "drift" in interpretations.
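Steps 4 and 5 can be combined by choosing the double-coded items batch by batch, so that every block of 100 units contains some overlap for a drift check. The sketch below is one possible way to plan this; the function name, the 15% share, and the fixed seed are assumptions chosen for illustration.

```python
import random


def plan_overlap(item_ids, batch_size=100, overlap_share=0.15, seed=42):
    """For each consecutive batch, pick the item ids both raters should code."""
    rng = random.Random(seed)  # fixed seed keeps the plan reproducible
    plan = []
    for start in range(0, len(item_ids), batch_size):
        batch = item_ids[start:start + batch_size]
        k = max(1, round(len(batch) * overlap_share))
        plan.append(sorted(rng.sample(batch, k)))
    return plan


# Hypothetical dataset of 300 responses identified by index.
plan = plan_overlap(list(range(300)))
print([len(batch) for batch in plan])  # [15, 15, 15]
```

Agreement computed on only 15 items per batch is noisy, so it may be more informative to pool two or three consecutive batches before comparing the result with the threshold.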

What affects IRR

Clarity of categories. Fuzzy or overlapping categories are the main cause of low agreement. "Negative experience" and "dissatisfaction" can mean almost the same thing — this creates constant discrepancies.

Number of categories. More categories generally mean lower agreement: 3-5 categories usually yield a higher IRR than 15. If you need a detailed classification, make it two-level: first the base categories (high IRR), then subcategories within them (where agreement is harder to reach).

Experience of the raters. New analysts produce more variable codes. Preliminary calibration and joint discussion of the first cases increase IRR.

Complexity of the material. Long, multi-topic responses are coded with less agreement than short, unambiguous ones. For complex data, multi-label coding (several tags per response) may be required instead of a single category.

IRR vs other types of reliability

IRR complements other reliability metrics, such as internal consistency and test-retest (temporal) reliability.

For subjective ratings, IRR is a critical metric. For standardized scales (choosing an option from a list), IRR matters less; there the key forms of reliability are internal consistency and stability over time.

Common mistakes when working with IRR

Using only percent agreement. For data with an uneven distribution of categories (90% of responses are "neutral"), even raters who assign codes at random in those proportions will show a high percent agreement. Kappa corrects for this bias and gives a fairer estimate.
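A short calculation makes the bias concrete. It assumes both raters label 90% of responses "neutral" and lump the remaining 10% into a single "other" category; the two-category split and the 85% observed agreement are illustrative assumptions.

```python
def chance_agreement(shares_a, shares_b):
    """Agreement expected if two raters code independently at these rates."""
    return sum(p * shares_b.get(cat, 0.0) for cat, p in shares_a.items())


def kappa_from(observed, expected):
    """Cohen's kappa from observed and chance agreement."""
    return (observed - expected) / (1 - expected)


shares = {"neutral": 0.9, "other": 0.1}
p_e = chance_agreement(shares, shares)

print(round(p_e, 2))                    # 0.82
print(round(kappa_from(0.85, p_e), 2))  # 0.17
```

Under these assumptions, 82% agreement is expected from chance alone, so an observed 85% corresponds to a kappa of only about 0.17.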

Raters discuss cases during the coding process. If the analysts consult each other as they work, they artificially align their codes, and IRR measures not independent agreement but collective discussion. Independence of the ratings is critical.

Not fixing the coding scheme. The scheme must be fixed before main coding begins and must not change during it. If categories are added or refined along the way, the previously coded cases need to be re-coded under the new scheme.

Not documenting the discrepancies. Analyzing cases of disagreement is valuable diagnostics: it shows where the scheme is ambiguous, which types of responses are systematically confused, and where refinements are needed. Without this analysis, IRR turns into a number with no practical conclusions.
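One simple way to document discrepancies is to count which pairs of categories the raters confuse most often. The sketch below assumes two equal-length lists of codes; the function name and the six sample codes are hypothetical.

```python
from collections import Counter


def disagreement_pairs(codes_a, codes_b):
    """Count confused category pairs, most frequent first."""
    pairs = Counter()
    for a, b in zip(codes_a, codes_b):
        if a != b:
            # Sort so that (x, y) and (y, x) count as the same confusion.
            pairs[tuple(sorted((a, b)))] += 1
    return pairs.most_common()


# Hypothetical codes for six support responses.
rater_1 = ["product issue", "staff issue", "neutral",
           "product issue", "staff issue", "positive experience"]
rater_2 = ["staff issue", "staff issue", "neutral",
           "staff issue", "product issue", "neutral"]

for pair, count in disagreement_pairs(rater_1, rater_2):
    print(count, pair)
```

A pair that dominates the list points to the category boundary that most needs a sharper definition or more examples.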

IRR in survey-based research

When working with survey data, IRR is applied mainly in the analysis of open-ended responses and qualitative research. The standard practice: two independent analysts code a sample of responses, Cohen's kappa is calculated, the scheme is refined if necessary, and then the entire dataset is coded by one analyst with periodic re-checks on a sample.

IRR is also part of triangulation in mixed-method research designs: when quantitative data is supplemented with a qualitative analysis of open-ended responses or interviews, the reliability of coding the latter must be documented. Without this, the conclusions of the qualitative part cannot support the quantitative results.

Inter-rater reliability is an objective measure of how consistent the ratings of different experts are. Without IRR, any qualitative analysis stays in the zone of "that's how the analyst saw it". With it, it becomes a reproducible procedure. A Cohen's kappa of 0.7 is an acceptable threshold; above 0.8 is the standard for serious research. Low IRR is a signal to refine the coding scheme, not to settle for the disagreements.

Frequently asked questions

How many raters are needed to check IRR?

At least two, which is the classic case for Cohen's kappa. With three or more, Fleiss's kappa or Krippendorff's alpha is used. In practice, two are enough for most tasks: the gain in the reliability of the estimate from adding a third rater is small, while the coding effort grows by half.

What threshold of IRR should be considered acceptable?

Cohen's kappa ≥ 0.6 — the minimum for research purposes. ≥ 0.7 — the standard threshold for applied use. ≥ 0.8 — high agreement, suitable for making important decisions. Below 0.6 — the scheme or the raters need refinement.

What should you do if IRR is low?

Do not "average" the ratings, but examine the discrepancies. Analyze the cases of disagreement — you will find where the scheme is ambiguous or where the raters interpret it differently. Refine the category definitions, add examples, run a calibration session. After that — a re-measurement of IRR on a new sample.

Can IRR be used for a single rater?

No — by definition IRR requires at least two independent raters. For a single rater you can check intra-rater reliability: the same person codes the same data again after some time. This is a check of the stability of individual work, not of agreement between experts.

Do you need to check IRR for simple categories with obvious labeling?

If the categories really are obvious (for example, automatically extracted from structured data) — IRR is not needed. But if human interpretation is involved in the process (text classification, sentiment rating, intent recognition) — IRR is mandatory, even if the task seems simple. "Obviousness" often turns out to be subjective.
