IRT: Analysis of Test Items
June 4, 2025 · Reading time ≈ 5 min
What is IRT
Item Response Theory (IRT) is a family of statistical models for analyzing data from tests, surveys, and other forms of assessment, where attention is paid to each individual item or question. Unlike classical test theory, which focuses on overall test scores, IRT models the probability of a correct or desired response to a specific item as a function of the respondent's abilities or characteristics.
The main components of IRT models include:
- Item difficulty (b). This indicates how difficult the item is; items with a higher b value are more difficult.
- Item discrimination (a). This measures how well the item distinguishes between respondents with different levels of ability or the trait being measured by the test.
- Guessing probability (c). In some models, this parameter accounts for the probability that a respondent answers a multiple-choice item correctly by guessing.
- Respondent ability (θ). This is a characteristic of the respondent that influences the probability of answering an item correctly. It can be general ability, knowledge, or another measurable trait.
IRT models are used in various fields, including education, psychology, and medicine, for the creation and analysis of tests, surveys, and rating scales. They enable more precise and differentiated testing, the creation of adaptive tests where subsequent items are chosen based on previous answers, and the assessment of changes in abilities or traits over time.
What is the purpose of IRT assessment?
Assessment based on Item Response Theory (IRT) is used in various fields and for multiple purposes, including the following key applications:
- Test development and analysis. IRT allows the creation of tests and the evaluation of their quality by analyzing each item individually. This helps in selecting items that best measure the desired traits or abilities, thus improving the reliability and validity of the test.
- Adaptive testing. One of the best-known applications of IRT is computerized adaptive testing (CAT), where the next questions adapt to the respondent's ability level based on their previous responses (see the selection sketch after this list). This allows for more accurate measurement of the respondent's abilities with fewer questions, reducing overall test time.
- Evaluation and comparison of items and tests. IRT provides tools for estimating item parameters such as difficulty and discrimination. This makes it possible to compare items and tests over time or across different groups without administering every item to every respondent.
- Research and development of educational programs. IRT is used to analyze learning outcomes and the effectiveness of educational programs. Understanding how students respond to specific items can help in developing more effective teaching materials and strategies.
- Cross-cultural studies. IRT can be used to adapt and compare tests for use in different cultural and linguistic contexts, ensuring fairness in testing and comparability of results.
- Diagnostics and clinical assessment. In medicine and psychology, IRT is used to create and analyze diagnostic tests and questionnaires, such as those for assessing depression or anxiety levels. This ensures more accurate diagnoses and allows for tracking changes in a patient's condition over time.
- Social and psychological research evaluation. IRT is used for analyzing survey data in sociology and psychology, enabling the study of attitudes, opinions, and behaviors with greater precision and reliability.
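To make the adaptive-testing idea concrete, here is a minimal sketch of one common selection rule under the 2PL model: pick the not-yet-administered item with the greatest Fisher information at the current provisional ability estimate. The item bank, parameter values, and function names below are hypothetical, chosen only for illustration.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Information a 2PL item provides about ability theta."""
    p = p_correct(theta, a, b)
    return a**2 * p * (1.0 - p)

# Hypothetical item bank: one (discrimination a, difficulty b) row per item.
item_bank = np.array([
    (1.2, -1.0),
    (0.8,  0.0),
    (1.5,  0.5),
    (1.0,  1.2),
])

def next_item(theta_hat, administered):
    """Return the index of the most informative unused item
    at the current ability estimate theta_hat."""
    best, best_info = None, -np.inf
    for i, (a, b) in enumerate(item_bank):
        if i in administered:
            continue
        info = fisher_information(theta_hat, a, b)
        if info > best_info:
            best, best_info = i, info
    return best

# After two items, with a provisional estimate of 0.3,
# this rule would administer item 3 next.
print(next_item(0.3, administered={0, 2}))
```

Production CAT systems typically layer exposure control, content balancing, and stopping rules on top of this basic selection step.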
How is the IRT metric calculated?
The calculation of metrics within the framework of Item Response Theory (IRT) depends on the chosen IRT model, as there are several types of models, each of which may be better suited for certain types of data or research goals. The main steps in the calculation include estimating item parameters and respondent ability parameters.
Choosing an IRT model:
- 1PL (Rasch model) assumes that all items have the same discrimination and estimates only item difficulty and respondent ability.
- 2PL estimates both item difficulty and item discrimination.
- 3PL additionally includes a guessing parameter, accounting for the chance that a respondent answers an item correctly by guessing.
Estimating item parameters:
For each item, its parameters (difficulty, discrimination, guessing probability) are estimated based on respondents' answers. This is typically done using maximum likelihood methods or Bayesian estimation methods.
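As a rough illustration of maximum likelihood estimation for a single item, the sketch below simulates 2PL responses and recovers the item's a and b parameters while treating the respondents' abilities as known. This is a simplification: real IRT software estimates item and ability parameters jointly (for example, by marginal maximum likelihood), and all values here are simulated.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulate responses to one item with true parameters a = 1.5, b = -0.5.
# Abilities are treated as known here, which real software does not assume.
true_a, true_b = 1.5, -0.5
thetas = rng.normal(size=500)
p_true = 1.0 / (1.0 + np.exp(-true_a * (thetas - true_b)))
responses = rng.binomial(1, p_true)

def neg_log_likelihood(params):
    a, b = params
    p = 1.0 / (1.0 + np.exp(-a * (thetas - b)))
    p = np.clip(p, 1e-9, 1 - 1e-9)  # guard against log(0)
    return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

# Minimizing the negative log-likelihood maximizes the likelihood.
result = minimize(neg_log_likelihood, x0=[1.0, 0.0])
print(result.x)  # estimates close to (1.5, -0.5)
```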
Estimating respondent abilities:
After the item parameters have been estimated, the respondents' ability parameters are estimated, usually by maximum likelihood: for each respondent, the ability value is sought that maximizes the likelihood of their observed answers.
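A minimal sketch of this step under the 2PL model (defined in the next subsection): given item parameters estimated earlier, find the θ that maximizes the likelihood of one respondent's answer pattern. The item parameters and answers below are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Previously estimated 2PL item parameters: (a, b) per item (hypothetical).
item_params = np.array([(1.2, -1.0), (0.8, 0.0), (1.5, 0.5), (1.0, 1.2)])
# One respondent's answers to those four items (1 = correct, 0 = incorrect).
answers = np.array([1, 1, 0, 0])

def neg_log_likelihood(theta):
    a, b = item_params[:, 0], item_params[:, 1]
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -np.sum(answers * np.log(p) + (1 - answers) * np.log(1 - p))

# Search a bounded ability range for the maximum-likelihood estimate.
result = minimize_scalar(neg_log_likelihood, bounds=(-4, 4), method="bounded")
print(result.x)  # the theta that best explains this answer pattern
```

Note that all-correct or all-incorrect answer patterns have no finite maximum likelihood estimate, which is one reason Bayesian estimators are often used instead.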
Model formula:
For the 2PL model, for example, the probability of a correct response (P) from a respondent with ability parameter θ on an item with difficulty parameter b and discrimination parameter a can be expressed as:
P(θ) = 1 / (1 + e^(−a(θ − b)))
For the 3PL model, the guessing parameter c is added, and the formula becomes:
P(θ) = c + (1 − c) / (1 + e^(−a(θ − b)))
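Translated directly into code, the two formulas look like this (a plain transcription, not tied to any particular IRT package):

```python
import math

def p_2pl(theta, a, b):
    """2PL: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def p_3pl(theta, a, b, c):
    """3PL: the same curve with a lower asymptote c for guessing."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```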
Using IRT metrics:
After estimating the item parameters and respondent abilities, the data can be used for analyzing the quality of items, comparing respondents, adaptive testing, and other purposes.
Example:
Let's consider a simple worked example using the two-parameter logistic model (2PL). In this model, the probability of a correct response to an item depends on two item parameters, its difficulty (b) and discrimination (a), as well as the respondent's ability (θ). The formula for the probability of a correct response (P) in the 2PL model looks like this:
P(θ) = 1 / (1 + e^(−a(θ − b)))
Suppose we have an item with difficulty parameter b = -1 and discrimination parameter a = 1.5. We want to calculate the probability of a correct answer for a respondent with an ability of θ = 0.5.
Substituting the values into the formula, we get:
P(0.5) = 1 / (1 + e^(−1.5(0.5 − (−1)))) = 1 / (1 + e^(−2.25)) ≈ 0.905
The probability that a respondent with an ability of 0.5 will answer an item with difficulty -1 and discrimination 1.5 correctly is approximately 0.905 or 90.5%. This means that the item is relatively easy for a respondent with this level of ability.
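The arithmetic is easy to verify in a few lines of Python:

```python
import math

a, b, theta = 1.5, -1.0, 0.5
exponent = -a * (theta - b)         # -1.5 * 1.5 = -2.25
p = 1.0 / (1.0 + math.exp(exponent))
print(round(p, 3))                  # 0.905
```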
Estimating parameters in IRT requires specialized statistical methods and software capable of handling large volumes of data and performing the necessary computations.