### Overview of IRT

While the RCIs described above have represented a significant advance in establishing reliable change, they remain subject to some important limitations. First, these RCIs were developed and have most commonly been used with ‘observed’ scores in the numerator, i.e. (typically unweighted) composites (averages or sums) of the set of items in a particular assessment instrument. Unweighted sum scores fail to fully exploit the information available from the items in assessments. In particular, they fail to take into account the fact that not all items in an assessment are equally strongly related to the attribute they seek to measure and, similarly, that items are not all equally likely to be endorsed (in symptom inventories) or answered correctly (in cognitive tests). As a result, estimates of symptoms and cognitive performance tend not to be as accurate as they could be if information regarding these item properties were taken into account. This leads to situations where, for example, two individuals differing in symptom severity could be assigned the same overall score if they endorsed the same number of symptoms, even if the symptoms endorsed by one individual tended to be more impairing or indicative of a later disease stage than those endorsed by the other. Similarly, the use of unweighted composite scores assumes that the measurement properties of the scores are identical across time (‘longitudinal invariance’; e.g. see [36]), when, in fact, they may change as a result of developmental processes such as ageing or previous test administration (e.g. [37]).

A second major shortcoming of the traditional RCIs reviewed above is that the measurement errors of the scores used in the denominator are estimated assuming that scores are equally reliable irrespective of a person’s level on the attribute measured by the assessment. This is important because assessments cannot be expected to be equally reliable across the full range of attribute levels that they measure, owing to the fact that the items within an assessment discriminate best within a particular range of attribute levels. Assessments designed for screening or diagnostic purposes are likely, for example, to have peak reliability around diagnostic cut-off points [38]. Thus, scores for individuals far from this cut-point will be estimated with less precision than scores for those just above or below it. Where score reliability differs depending on symptom severity (e.g. [12]), traditional RCIs will not provide the individual-level calibration of change thresholds required to accurately capture reliable change.

The above issues are in principle addressed by using an IRT approach to reliable change [11, 39]. IRT models (see [40] for an introduction) are latent variable models that link observed item responses to latent unobserved attributes (e.g. cognitive ability, depression, quality of life). IRT models come in various forms; a commonly used one is the 2-parameter logistic (2PL) model for items with a binary response format. The 2PL links the probability of endorsing an item (or answering it correctly) to underlying attribute levels using a logistic model:

$$ P\left(Y=1|\theta \right)=\frac{\exp \left[\alpha \left(\theta -\beta \right)\right]}{1+\exp \left[\alpha \left(\theta -\beta \right)\right]} $$

(4)

*P*(*Y* = 1| *θ*) is the probability of endorsing an item given a person’s latent attribute level, *θ*. In addition, exp(·) refers to the exponential function, *α* is an item discrimination parameter, and *β* is an item location (also referred to as difficulty) parameter.
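As an illustrative sketch, the 2PL response probability in Eq. 4 can be computed directly; the discrimination and location values below are invented for illustration:

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of endorsing/correctly answering an item (Eq. 4).

    theta : latent attribute level
    a     : item discrimination (alpha)
    b     : item location/difficulty (beta)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta equal to the item's location, the probability is exactly 0.5.
print(round(p_2pl(0.0, a=1.5, b=0.0), 3))   # 0.5
# Above the location, endorsement becomes more likely.
print(round(p_2pl(1.0, a=1.5, b=0.0), 3))   # 0.818
```

Note that the curve is symmetric around the location parameter: probabilities equally far above and below *β* sum to one.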

Item discrimination captures the strength of relation between an item and an underlying attribute measured by a test. For items with high discrimination, item scores rise more sharply with increases in attribute levels than for items with low discrimination. Higher discrimination items are thus more informative about attribute levels. Item location captures the position on the latent attribute scale at which an individual has a 50% probability of endorsing or correctly answering the item.

These two item properties can be illuminated by examining item characteristic curves (ICC) from the 2PL. When plotted, the ICCs show the probability of endorsing an item at different attribute levels. Figure 1 shows the ICCs for three items differing in their discrimination and location parameters.

The *x*-axis shows the latent attribute (*θ*) scale. It is simplest to think of these as items of a cognitive test. Zero on the *θ* scale represents average cognitive ability. Negative numbers are below average *θ* levels and positive numbers are above average *θ* levels. The position of the curves on the *x*-axis is determined by the location parameter. Here, the item shown with the solid line would be the easiest item; that is, individuals with lower levels of *θ* have a greater-than-chance probability (above 0.5 on the *y*-axis) of getting the question correct. As we move to the right, items become harder. As such, the dashed line represents the item with the highest location (difficulty) parameter. The steepness of the lines is determined by the discrimination parameters. Here, items depicted with the solid and dotted lines have the same discrimination, but different location parameters. These items have a steeper curve, indicating that they discriminate better between individuals with similar *θ* values when compared to the third item (dashed line), which has a shallower curve.

Item characteristic curves can be summed to obtain test characteristic curves. Test characteristic curves show the relationship between the underlying attribute levels and the expected total scores on a given test and thus are useful for placing underlying levels of an attribute on the scale of the original assessment.
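To make this concrete, the following sketch sums the ICCs of three hypothetical items (parameters invented for illustration, not taken from Figure 1) to obtain the expected total score at a given attribute level:

```python
import math

def p_2pl(theta, a, b):
    """2PL item characteristic curve (probability of endorsement)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item parameters (discrimination a, location b).
items = [(2.0, -1.0),   # easy item, high discrimination
         (2.0,  0.0),   # moderate location, same discrimination
         (0.8,  1.0)]   # hard item, low discrimination (shallower curve)

def tcc(theta):
    """Test characteristic curve: sum of ICCs = expected total score at theta."""
    return sum(p_2pl(theta, a, b) for a, b in items)

# Expected total scores rise monotonically with the underlying attribute level.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(tcc(theta), 2))
```

Because the TCC maps latent *θ* values onto the 0-to-3 raw-score range of this hypothetical three-item test, it provides the link between the latent scale and the scale of the original assessment.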

When extending the 2PL model to longitudinal data across two time points, it is possible to use IRT models to examine intra-individual change, analogous to approaches that have been suggested using CTT scores [41]. A first important step is to evaluate longitudinal measurement invariance. Tests of measurement invariance assess whether the measurement properties of a test are the same across time. In the context of the 2PL, these properties are the item difficulty and discrimination parameters. If measurement invariance can be established, then it can be assumed that the latent construct is measured equivalently at both points in time. If invariance does not hold, it is not clear that the test measures the construct in the same way at both time points. This is an important concept when studying change. If it is not clear that measurement is equivalent over time, then it is impossible to establish whether any observed difference in scores reflects genuine change rather than changes in measurement. It is important to note that whenever a test of change is conducted on a simple (or weighted) sum score, measurement invariance is assumed, but not tested.

In fact, there are reasons to believe that measurement invariance over time is likely to be violated. Violations of longitudinal invariance are reasonably common in other domains (e.g. mental health) due to developmental changes in social contexts and brain development [37] and it is quite conceivable that this would also be true in ageing. For example, minor memory problems could become more noticeable to older adults if they become more attuned to signs of cognitive decline compared to their younger self, leading to differences in reporting of symptoms even in the absence of true change. Further, in the context of cognitive tests, the same problems can sometimes be solved via different strategies and older adults may shift strategies to compensate for declines in particular domains. If different abilities are drawn on to different extents to solve cognitive tests, this could also lead to violations of invariance.

Longitudinal invariance can be evaluated by comparing a set of nested models, where constraints to item location and discrimination parameters are added in sequence. In these models, correlated residuals or specific factors should be included to account for the fact that repeated measures of items will correlate with one another over and above their correlation due to their common relation with the underlying attribute being measured. In order to assess whether invariance holds, model fit comparisons are made between models with and without equivalence constraints. If model fit decreases significantly with the addition of invariance constraints, it would be concluded that invariance does not hold [42].
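The model-comparison step can be sketched as a likelihood-ratio (chi-square difference) test, assuming the fitting software reports each model's deviance (−2 log-likelihood). The deviances and parameter counts below are invented for illustration; critical values are standard table values at α = .05:

```python
# Chi-square critical values at alpha = .05 for small df (standard table values).
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}

def invariance_lr_test(dev_free, dev_constrained, df_diff):
    """Likelihood-ratio test comparing nested IRT models.

    dev_free        : deviance (-2 log-likelihood) of the less constrained model
    dev_constrained : deviance of the model with invariance constraints added
    df_diff         : number of parameters fixed equal by the constraints
    """
    delta = dev_constrained - dev_free   # constraints can only increase deviance
    reject = delta > CHI2_CRIT_05[df_diff]
    return delta, reject

# Hypothetical deviances: constraining 4 item parameters to equality over time.
delta, reject = invariance_lr_test(dev_free=10234.6, dev_constrained=10241.1,
                                   df_diff=4)
print(round(delta, 1), "invariance rejected" if reject else "invariance retained")
```

A significant deterioration in fit (here, Δ deviance exceeding the critical value) would indicate that the equality constraints, and hence invariance, do not hold.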

It is not necessary for all items to have invariant discrimination and location parameters across time provided that a subset of items (at least two but ideally more) are invariant and that the lack of invariance is modelled [43, 44]. In fact, from this point of view, baseline and retest scores need not be based on exactly the same set of items as long as a small core of items can be used to provide an ‘anchor’ that links the items onto the same scale. Thus, IRT provides a framework for testing longitudinal invariance and accommodating violations of this assumption.

### IRT-based RCIs

Individual-level scores can be obtained from IRT models by treating the parameters of the model (the discrimination and difficulty parameters and the latent trait correlations) as fixed and estimating scores based on these in a manner conceptually similar to deriving predicted scores from a regression model. There are several ways to estimate individual-level scores from an IRT model (see [45] for a discussion). First, they can be obtained using maximum likelihood estimation, which involves an iterative search for the set of *θ* scores that maximises the likelihood function (the product of the probabilities of all item responses). An issue with this method is that latent trait scores are not defined for some patterns of scores, for example, when an individual correctly answers all test items. In practice, such a scenario would be common when cognitively healthy individuals complete assessments such as the MMSE.

An alternative approach that resolves this issue is Bayesian estimation, where prior information about the latent trait distribution in the population is incorporated into the estimation and a posterior distribution is formed as the product of this prior distribution and the likelihood function. The multivariate standard normal distribution is often used as the prior distribution. Methods for computing scores within this Bayesian approach include expected a posteriori (EAP) and maximum a posteriori (MAP) estimation. In these approaches, individual scores are the mean (EAP) or mode (MAP) of the posterior distribution [45].
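The EAP approach can be sketched with simple numerical quadrature over a grid of *θ* values. The item parameters and response patterns below are hypothetical; the prior is the standard normal, as described above:

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct/endorsed response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap_score(responses, items, grid_lo=-4.0, grid_hi=4.0, n=161):
    """EAP estimate of theta: mean of the posterior over a quadrature grid.

    responses : list of 0/1 item responses
    items     : list of (discrimination, location) pairs
    Prior: standard normal (unnormalised density; normalisation cancels out).
    """
    grid = [grid_lo + i * (grid_hi - grid_lo) / (n - 1) for i in range(n)]
    post = []
    for theta in grid:
        prior = math.exp(-0.5 * theta * theta)
        lik = 1.0
        for y, (a, b) in zip(responses, items):
            p = p_2pl(theta, a, b)
            lik *= p if y == 1 else (1.0 - p)
        post.append(prior * lik)
    total = sum(post)
    w = [x / total for x in post]                       # normalised posterior
    eap = sum(t * wi for t, wi in zip(grid, w))
    sd = math.sqrt(sum((t - eap) ** 2 * wi for t, wi in zip(grid, w)))
    return eap, sd                                      # sd serves as the EAP SE

items = [(1.8, -1.0), (1.5, 0.0), (1.2, 1.0)]           # hypothetical parameters
print(eap_score([1, 1, 0], items))
```

Note that even an all-correct response pattern yields a finite EAP estimate, because the prior pulls the posterior back towards the population mean; this is precisely how Bayesian estimation resolves the undefined-score problem of maximum likelihood.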

The appropriate method of computing standard errors of measurement for individual-level scores depends on the method chosen for estimating the scores. For EAP, the standard deviation of the posterior distribution is used. For ML and MAP, standard errors are computed as one over the square root of the ‘information’ at the estimated attribute level. Information is the IRT equivalent of reliability/precision. A distinctive and crucial feature of information, and thus of standard errors, from an IRT perspective is that it can vary with attribute level, allowing the standard error of measurement to be calibrated to an individual’s specific level.

From the IRT score estimates and their standard errors computed as described above, an IRT-based RCI can be formed. For example, Jabrayilov et al. (2016) suggest the following RCI:

$$ RCI=\frac{{\hat{\theta}}_2-{\hat{\theta}}_1}{\sqrt{SE{\left({\hat{\theta}}_1\right)}^2+ SE{\left({\hat{\theta}}_2\right)}^2}} $$

(5)

where \( {\hat{\theta}}_1 \) and \( {\hat{\theta}}_2 \) represent estimates of latent scores at baseline and retest respectively and \( SE\left({\hat{\theta}}_1\right) \) and \( SE\left({\hat{\theta}}_2\right) \) are the associated standard errors of the scores. Reise and Haviland [39] suggested a similar method whereby a 95% confidence interval around the baseline score is calculated and reliable change is defined as occurring when the follow-up score falls outside this interval.
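A minimal sketch of Eq. 5, with hypothetical score estimates and standard errors:

```python
import math

def irt_rci(theta1, theta2, se1, se2):
    """IRT-based RCI (Eq. 5): standardised change in latent score estimates."""
    return (theta2 - theta1) / math.sqrt(se1 ** 2 + se2 ** 2)

# Hypothetical baseline and retest estimates; |RCI| > 1.96 flags reliable
# change at the conventional 5% level.
rci = irt_rci(theta1=0.10, theta2=-0.85, se1=0.30, se2=0.35)
print(round(rci, 2), abs(rci) > 1.96)
```

Because the standard errors here can differ between occasions and across attribute levels, the change threshold is calibrated to each individual, unlike the constant denominator of traditional RCIs.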

### Practice effects within an IRT framework

Practice effects in an IRT framework will manifest as violations of measurement invariance in the location parameters over time where items become easier to answer correctly on a second administration. Two primary solutions have been proposed [43]. First, if there is a subset of items that are resistant to practice effects over time and thus demonstrate longitudinal invariance, these items can have their parameters fixed equal over time. The remaining item parameters can vary over time to capture the effects of practice. Second, different sets of anchor items could be administered over time but their parameters fixed to known values estimated from previous studies. An appropriate reference sample can be used to estimate the item parameters. If neither approach is feasible, we suggest applying a correction directly to the attribute scores where some previous information is available on practice effects. Known practice effects on the raw scale score can be converted to the IRT score scale through the test characteristic curve. The test characteristic curve is the sum of item characteristic curves and links latent attribute levels to total scores on the test.
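The suggested raw-to-IRT conversion can be sketched by inverting the monotone test characteristic curve numerically; the item parameters and practice-effect value below are invented for illustration:

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item parameters for a short four-item test.
ITEMS = [(1.5, -1.5), (1.5, -0.5), (1.5, 0.5), (1.5, 1.5)]

def tcc(theta):
    """Test characteristic curve: expected raw score at theta."""
    return sum(p_2pl(theta, a, b) for a, b in ITEMS)

def theta_at_raw(raw, lo=-6.0, hi=6.0, tol=1e-8):
    """Invert the (monotone) TCC by bisection: theta whose expected score is raw."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tcc(mid) < raw:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Suppose prior research suggested a practice effect of 0.5 raw-score points
# for examinees scoring around 2.0 (both values invented). The corresponding
# theta-scale correction is the difference between the inverted TCC values:
correction = theta_at_raw(2.5) - theta_at_raw(2.0)
print(round(correction, 2))
```

The resulting correction could then be subtracted from the retest *θ* estimate before computing the RCI in Eq. 5. Note that because the TCC is nonlinear, the size of the correction on the *θ* scale depends on where on the raw-score scale the practice effect was observed.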