Predicting Alzheimer's disease development: a comparison of cognitive criteria and associated neuroimaging biomarkers

Introduction The definition of “objective cognitive impairment” in current criteria for mild cognitive impairment (MCI) varies considerably between research groups and clinics. This study aims to compare different methods of defining memory impairment to improve prediction models for the development of Alzheimer’s disease (AD) from baseline to 24 months.
Methods The sensitivity and specificity of six methods of defining episodic memory impairment (< −1, −1.5 or −2 standard deviations [SD] on one or two memory tests) were compared in 494 non-demented seniors from the Alzheimer’s Disease Neuroimaging Initiative using the area under the curve (AUC) for receiver operating characteristic analysis. The added value of non-memory measures (language and executive function) and biomarkers (hippocampal and white-matter hyperintensity volume, brain parenchymal fraction [BPF], and APOE ε4 status) was investigated using logistic regression.
Results Baseline scores < −1 SD on two memory tests predicted AD with 75.91 % accuracy (AUC = 0.80). Only APOE ε4 status further improved prediction (B = 1.10, SE = 0.45, p = .016). A < −1.5 SD cut-off on one test had 66.60 % accuracy (AUC = 0.77). Prediction was further improved using Trails B/A ratio (B = 0.27, SE = 0.13, p = .033), BPF (B = −15.97, SE = 7.58, p = .035), and APOE ε4 status (B = 1.08, SE = 0.45, p = .017). A cut-off of < −2 SD on one memory test (AUC = 0.77, SE = 0.03, 95 % CI 0.72-0.82) had 76.52 % accuracy in predicting AD. Trails B/A ratio (B = 0.31, SE = 0.13, p = .017) and APOE ε4 status (B = 1.07, SE = 0.46, p = .019) improved predictive accuracy.
Conclusions Episodic memory impairment in MCI should be defined as scores < −1 SD below normative references on at least two measures. Clinicians or researchers who administer a single test should opt for a more stringent cut-off and collect and analyze whole-brain volume. When feasible, ascertaining APOE ε4 status can further improve prediction.


Introduction
Patients with mild cognitive impairment due to Alzheimer's disease (MCI) [1], also known as mild neurocognitive disorder [2], are considered to be at an early stage of dementia. There are now multiple published criteria sets for identifying these individuals at high risk of progression [1][2][3], all of which include at least: 1) subjective concern; 2) an objective cognitive impairment on formal neuropsychological testing in one or more cognitive domains, typically including memory; 3) preservation of functional independence; and 4) no dementia.
Although these criteria have been a major step forward in the conceptualization of MCI, they leave room for considerable ambiguity, particularly regarding the operational definition of objective cognitive impairment. A number of cognitive tests have been proposed that may be useful for identifying objective episodic memory impairment in MCI, specifically measures that assess both immediate and delayed recall, such as word-list learning or paragraph recall [1,4]. These suggestions are very useful in providing common ground for clinicians and researchers working with MCI cohorts. However, three critical issues remain.
First, it is unclear which cutoff scores should be used to define impairment. Studies examining MCI patients typically report test performance in the range of one to two standard deviations (SD) below age-adjusted and/or education-adjusted norms. However, using a −1 SD cutoff may be overly inclusive, as cognitive performance in healthy older adults often falls below this limit [5] for a variety of non-pathological reasons (e.g., fatigue, anxiety). Conversely, using a −2 SD cutoff may underestimate the number of individuals who are in the earliest phases of the disease process.
Second, it is unclear how many measures should be used in assessing cognition. In memory clinics, diagnosis is typically based on results of a battery of neuropsychological tests including more than one test probing the same cognitive domain. Longitudinal evidence confirms that using at least two tests to establish impairment greatly increases diagnostic accuracy [6]. In research settings, however, MCI diagnosis is often based on a single test. This is potentially problematic, as research has shown that more than one quarter of healthy elderly adults who are tested using a single memory measure obtain scores in impaired ranges (< −1.5 SD), while this number is reduced to 14.1 % when a second test is added [5]. As mentioned above, impaired performance on a single test in otherwise healthy normal adults may be explained by numerous factors such as anxiety, depression, fatigue, or inattention. Thus, this single-test procedure may not be adequate for identifying individuals who are at highest risk of dementia.
Third, it is unclear which cognitive domain(s) should be assessed, if any, in addition to episodic memory. Originally, Petersen's [3] diagnostic criteria recommended that a distinction be made between single-domain and multiple-domain MCI, with the assumption that this classification would be of heuristic value in determining the probable etiology of the disorder. This recommendation is echoed in Albert and colleagues' [1] revised criteria as well. Indeed, some longitudinal evidence suggests that these subtypes evolve differently over time [7], suggesting distinct etiological processes. However, the most recent DSM-5 criteria for mild neurocognitive disorder [2] do not discriminate between single-domain and multiple-domain cognitive impairment. Many research studies also do not make this distinction.
In addition, recent guidelines for diagnosing MCI have emphasized the importance of using genetic and imaging biomarkers in addition to neuropsychological testing. The presence of one or two copies of the epsilon 4 allele (ε4) in the apolipoprotein E (APOE) gene is one commonly accepted genetic characteristic believed to increase the risk of development of dementia due to Alzheimer's disease (AD) [8]. Additionally, metrics obtained from structural magnetic resonance imaging (MRI) that assess neuronal injury, such as total brain atrophy [9,10], ventricular enlargement [11][12][13], hippocampal (HP) volume loss [14,15], medial temporal lobe atrophy [16], and possibly the presence of small vessel disease [17], may be informative predictors for the development of AD dementia.
Using data obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI), the purpose of this study is to determine whether prediction of development of clinical dementia among non-demented participants is improved by: 1) using cutoff scores of −1.0, −1.5 or −2.0 SD to define cognitive impairment; 2) assessing episodic memory using one or two tests; 3) assessing additional non-memory domains; and 4) accounting for commonly used neuroimaging and genetic biomarker data. It was hypothesized that the identification of individuals at risk for the development of dementia would best be predicted by defining objective impairment as performance < −1 SD on two episodic memory tests. Furthermore, it was anticipated that the ability to predict the development of AD would be further optimized by considering performance in at least one other, non-memory domain. Finally, it was expected that the inclusion of imaging and genetic biomarkers known to be associated with AD would further improve prediction.

Materials and methods
Data used in the preparation of this article were obtained from the ADNI database (adni.loni.usc.edu) on 3 February 2015. The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial MRI, positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of MCI and early AD.

Participants
Of the 819 participants enrolled in ADNI-1, those who had neuropsychological and genetic data available at baseline and 24-month follow-up were selected for this study (n = 630). A 24-month follow-up period was selected to maximize statistical power and to ensure that harmonized imaging outcome measures were available for the majority of the sample. Of these 630 participants, those with a diagnosis of probable AD at baseline were excluded (n = 136). Individuals with a history of neurological or psychiatric illness or substance abuse, or without a study partner able to corroborate reports of functioning, were not eligible for ADNI; complete eligibility criteria for the ADNI study as a whole are described at http://adni.loni.usc.edu/wp-content/uploads/2010/09/ADNI_GeneralProceduresManual.pdf. The final sample consisted of the remaining 494 non-demented participants. According to the assigned diagnoses in the ADNI database, 294 of these participants were classified as having MCI, and the remaining 200 were classified as cognitively normal. All participants (201 women, 293 men) were 55-89 years old at baseline (mean = 75.3 ± 6.4) and had 6-20 years of education (mean = 15.9 ± 2.8).

Cognitive measures
A neuropsychological battery was administered to all participants upon admission to ADNI, and raw scores were downloaded from the ADNI Neuropsychological Battery table. Of interest in the present study are tests that measure general cognition (Mini-mental state exam (MMSE)), episodic memory (Logical memory story A delayed recall (LM-II), Rey auditory verbal learning test (AVLT)), language (Category fluency, Boston naming test (BNT)) and executive functioning (Trails A and B). A derived Trails B/Trails A ratio was calculated to obtain a relatively independent measure of executive control, as has been suggested by other authors [18]. Raw scores were transformed to standardized scores (z scores or scaled scores (SS)) based on published age-adjusted norms for the AVLT [19], Category fluency [20], BNT [21] and Trails A & B [18]. Education-adjusted z scores for LM-II story A were obtained using a web-based calculator [22] based on data from a large published report [23]. Higher z scores or SS represent better performance, with the exception of Trails A and B in which higher scores represent poorer performance (i.e., longer time to complete the test).
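The standardization step described above can be illustrated with a minimal sketch. The function names and the normative values in the usage note are hypothetical; the published norms cited in the text [18-23] are not reproduced here, and the conversion to scaled scores assumes the conventional metric with a mean of 10 and an SD of 3.

```python
def z_score(raw, norm_mean, norm_sd):
    """Standardize a raw score against a normative mean and SD."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    return (raw - norm_mean) / norm_sd


def z_to_scaled(z):
    """Map a z score (mean 0, SD 1) onto the scaled-score metric (mean 10, SD 3)."""
    return 10 + 3 * z


def trails_ratio(trails_b_seconds, trails_a_seconds):
    """Trails B/A ratio; larger values suggest poorer executive control."""
    if trails_a_seconds <= 0:
        raise ValueError("Trails A time must be positive")
    return trails_b_seconds / trails_a_seconds
```

For example, with a hypothetical normative mean of 9 and SD of 4, a raw delayed-recall score of 5 gives `z_score(5, 9, 4)`, that is, one SD below the norm, which corresponds to a scaled score of 7.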

Imaging and genetic biomarkers
Neuroimaging-based biomarkers were obtained from downloaded ADNI database tables (hierarchical parcellation of MRI using multi-atlas labeling methods (UPENN); white matter hyperintensity volumes (UCD)). Whole brain atrophy was assessed using the brain parenchymal fraction (BPF), which was calculated as the ratio of total parenchymal volume (gray matter (GM) and white matter (WM)) to total cranial vault (TCV) volume as follows:
\[ \mathrm{BPF} = \frac{\mathrm{GM} + \mathrm{WM}}{\mathrm{TCV}} \]
To assess medial and focal atrophy, head-size-corrected ventricular cerebrospinal fluid (vCSF) and HP volume were automatically segmented using previously published and validated methods [11,14]. Small vessel disease burden was assessed using whole brain white matter hyperintensity (WMH) volumes [25]. Full segmentation methodological details can be obtained from ADNI (see ADNI1_Methods_UCD_WMH_Volumes_Methods.pdf and ADNI_Total_Cranial_Vault_Segmentation_Method_20121108.pdf). In addition, the presence of one or two copies of the APOE ε4 allele was determined for all participants as per standard ADNI protocol.
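As a small illustration of the BPF definition given above (parenchymal volume divided by total cranial vault volume), the following sketch computes the ratio from three volumes. The function name and the example volumes are hypothetical and are not ADNI values; all volumes are assumed to be in the same unit.

```python
def brain_parenchymal_fraction(gm_volume, wm_volume, tcv_volume):
    """Ratio of parenchymal volume (GM + WM) to total cranial vault volume."""
    if tcv_volume <= 0:
        raise ValueError("tcv_volume must be positive")
    return (gm_volume + wm_volume) / tcv_volume


# Hypothetical volumes in cubic centimetres
bpf = brain_parenchymal_fraction(600.0, 450.0, 1400.0)
```

Because the ratio is normalized by cranial vault volume, lower values indicate less parenchyma relative to head size, which is why BPF serves as an index of whole-brain atrophy.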

Statistical analyses
Six binary variables were created based on scores < −1.0, −1.5 or −2.0 SD on one (LM-II or AVLT delayed recall) or two (LM-II and AVLT delayed recall) memory tests, and participants were classified as above or below each cutoff. The predictive accuracy of these six cutoffs was tested using the area under the curve (AUC) for receiver operating characteristic (ROC) analysis. The minimum value for an AUC to be considered clinically significant was >0.75 [26]. Hanley and McNeil's [27] method was used to test for statistical differences between AUC values. Cutoff scores with AUC values >0.75 were then entered into separate binary logistic regression analyses with hierarchical designs, with probable AD at 24 months as the binary (yes/no) dependent variable. In all models, age, sex, education, MMSE and the selected cutoff score were entered in a first block. A second block included performance on non-memory cognitive measures, specifically standardized Category fluency, BNT, and Trails B/A-derived scores. A third block assessed the potential added predictive value of biomarkers that are known to be associated with probable AD: BPF, vCSF volume, total HP volume, WMH volume, and APOE ε4 status. We verified that all variables met multicollinearity and linearity assumptions.
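The classification and AUC steps described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: "one test" is read here as either score falling below the cutoff and "two tests" as both; the AUC of a single binary predictor is taken as the trapezoidal value (sensitivity + specificity) / 2; and the standard error uses the approximation published by Hanley and McNeil in 1982, which may or may not be the exact formula applied via [27]. All function names are hypothetical.

```python
import math

CUTOFFS = (-1.0, -1.5, -2.0)  # SD cutoffs named in the text


def impairment_flags(lm_z, avlt_z):
    """Six binary impairment variables for one participant.

    Assumption: "one" means either z score is below the cutoff,
    "two" means both are.
    """
    flags = {}
    for cutoff in CUTOFFS:
        below = (lm_z < cutoff, avlt_z < cutoff)
        flags[(cutoff, "one")] = any(below)
        flags[(cutoff, "two")] = all(below)
    return flags


def binary_classifier_summary(flags, converted):
    """Sensitivity, specificity, accuracy and AUC for one binary predictor."""
    pairs = list(zip(flags, converted))
    tp = sum(1 for f, c in pairs if f and c)
    fn = sum(1 for f, c in pairs if not f and c)
    tn = sum(1 for f, c in pairs if not f and not c)
    fp = sum(1 for f, c in pairs if f and not c)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one converter and one non-converter")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / len(pairs),
        # A binary predictor has a single ROC operating point, so the
        # trapezoidal AUC reduces to the mean of sensitivity and specificity.
        "auc": (sensitivity + specificity) / 2,
    }


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of an AUC (Hanley and McNeil 1982 approximation)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


def auc_difference_z(auc1, se1, auc2, se2, r=0.0):
    """z statistic for the difference between two AUCs.

    r is the correlation between the two AUC estimates; r = 0 treats
    them as independent, which is an assumption of this sketch.
    """
    return (auc1 - auc2) / math.sqrt(se1 ** 2 + se2 ** 2 - 2 * r * se1 * se2)
```

When two cutoffs are evaluated in the same participants, the AUC estimates are correlated, so a non-zero `r` would be needed for a faithful comparison; the default of zero is used here only to keep the sketch self-contained.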
Last, in order to see whether participants whose performance fell above and below the best selected cutoff scores were phenotypically different, multivariate analysis of covariance (MANCOVA) was used to compare cognitive and neuroimaging characteristics between these two groups, with age, sex and education entered as covariates. Highly skewed variables exhibiting non-normal distributions were log-transformed (WMH, vCSF) or inverse-transformed (Trails B/A ratio) prior to analysis. Category fluency scores did not meet the equal variance assumption and were therefore log-transformed.
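The transformations mentioned above can be sketched in a few lines. The source does not state the logarithm base or whether an offset was added for zero values, so this sketch assumes a natural logarithm applied to strictly positive values; the function names and the example data are hypothetical.

```python
import math


def log_transform(values):
    """Natural-log transform for right-skewed, strictly positive values."""
    return [math.log(v) for v in values]


def inverse_transform(values):
    """Reciprocal transform (1/x) for strictly positive values."""
    return [1.0 / v for v in values]


def skewness(values):
    """Population skewness: third central moment over the 1.5 power of variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed volumes
raw = [1.0, 2.0, 4.0, 8.0, 64.0]
logged = log_transform(raw)
```

Comparing `skewness(raw)` with `skewness(logged)` shows how the log transform pulls in a long right tail, which is the motivation for transforming variables such as WMH and vCSF volume before parametric analysis.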
Dichotomous variables were compared using the chi-square test. Seven participants were excluded from subsequent analyses because they had missing data (two had missing WMH data, two had missing Trails B data, one had missing BNT data, and two had missing Trails B and BNT data). First, on logistic regression model to test the added value of non-memory measures and biomarkers, in addition to a cutoff of <  (Table 3).

Results
Participants who scored above (n = 291) and below (n = 196) a cutoff score of < −1 SD on two memory tests were compared using MANCOVA. Levene's test indicated that the two groups had equal variances (all variables p >0.05). As summarized in Table 4, it was found that those with episodic memory scores below the cutoff had poorer performance on Category fluency
Participants who scored above (n = 223) and below (n = 264) a cutoff score of < −1.5 SD on one memory test were compared in a second MANCOVA. Two variables violated Levene's test (Trails B/A ratio and left HP volume), likely due to the large sample sizes. Inspection of the data showed that the variances of the two groups were highly similar (in the above-cutoff and below-cutoff groups, the respective variances were 0.010 and 0.016 for Trails B/A ratio, and 0.001 and 0.001 for left HP volume), and therefore parametric analyses were retained. Results revealed that individuals with episodic memory scores below the cutoff had poorer performance on
Participants who scored above (n = 313) and below (n = 174) a cutoff score of < −2 SD on one memory test were compared in a third MANCOVA. Trails B/A ratio violated Levene's test of equality of error variances, but again inspection of the data showed highly similar variances between the above-cutoff (0.010) and below-cutoff (0.016) groups. Parametric analyses were thus retained. Individuals with episodic memory scores below the cutoff had poorer performance on Category fluency (F (4,482) = 11.61, p <0.

Discussion
This study aimed to assess how various cognitive, neuroimaging and genetic measures collected at baseline can be used to predict the development of probable AD dementia at 24 months in a sample of elderly participants obtained from ADNI. By assessing a series of normative cutoff scores from cognitive test results, the number of episodic memory and non-memory tests used to assess cognitive performance, and other commonly used neuroimaging and genetic biomarkers, a set of recommended criteria was established which may be used in future investigations to improve prediction for the development of probable AD in the elderly. Consistent with our initial hypotheses, performance < −1 SD on two memory tests (LM-II and AVLT delay) had the best trade-off between sensitivity and specificity for predicting probable AD, followed by performance < −1.5 SD and < −2 SD on one memory test (LM-II). These results suggest that to maximize diagnostic certainty, a minimum of two measures should ideally be used to assess episodic memory performance and impairment should be defined as scores at least 1 SD below appropriate normative references on both measures. Jak and colleagues [28] were among the first to recommend establishing impairment on at least two measures within a cognitive domain as the best way to increase sensitivity while maintaining reliability, and other authors have since corroborated the value of this approach [6,[29][30][31]. Our results further indicate that clinicians or researchers with limited resources who administer only a single memory test should opt for a much more stringent cutoff (i.e., −2 SD below normative reference data) to determine episodic memory impairment with comparable accuracy to two measures. Applying a −1.5 SD cutoff to a single test should be avoided when possible, as it remains highly prone to false positive diagnostic errors (cf. [30,31]), which reached nearly one-third of the sample (32.6 %) in the present study.
The only variable that improved prediction above and beyond episodic memory testing using two measures was APOE ε4 status, consistent with previous research recognizing APOE ε4-positive status as a major risk factor for subsequent AD (see [32] for a review). When only one test was used to assess episodic memory, prediction of dementia was improved using a non-memory test, specifically the ratio of Trails B/A, considered to be a measure of executive control [18]. Predictive accuracy was further increased using APOE ε4 status and whole-brain atrophy (as indexed by brain parenchymal fraction). These results suggest that thorough episodic memory testing using several measures is successful in predicting subsequent dementia with at least as much accuracy as using one memory test plus additional non-memory tests and biomarkers. It has previously been reported that the use of sensitive neuropsychological instruments is at least as effective in predicting AD as imaging biomarkers [33][34][35][36]. Other authors have also reported that the use of a single memory test is not optimal in predicting AD, and that adding information on brain atrophy and/or cerebrospinal fluid biomarkers is necessary to improve predictive accuracy in regression models [35,37,38]. We corroborate these findings, and extend them to specify that "impairment" should be defined as performance more than 1 SD below normative data.
Certain limitations must be considered in interpreting these data. First, the ADNI study specifically set out to recruit patients who represented relatively pure cases of MCI and dementia of the Alzheimer's type, who are appropriate for clinical trials; this is evident in patients' relatively low burden of WMH [39] (thought to reflect underlying vascular disease [40]). As such, the sample primarily includes individuals whose suspected etiology is AD, and whose primary (and often only) cognitive deficit involves memory. While ADNI provides a large and rich database to study individuals who are at high risk of developing AD, findings generated from these data have limited generalizability to real-world patient populations [39]. Other, more inclusive cohorts of individuals with MCI are needed. In addition, the standardized scores used in this study were derived from published age-adjusted norms for each test. It is possible that the use of local norms may produce different results (e.g., see [41]).
We have shown that diagnostic accuracy can be improved by approximately 10 % by administering an extra memory test to evaluate memory capacities in persons suspected of MCI. This improved accuracy is mostly the result of reducing false positive results, which other authors have shown are inflated when using a single test [31]. Although adding a test to the diagnostic battery meant that some patients who went on to develop AD at 24 months were missed at baseline, our findings suggest that this trade-off is acceptable. An incorrect diagnosis of AD has serious implications for research and clinical practice. For research, studies that employ only LM-II to test for memory impairment in participants are effectively pooling true MCI cases with those who are likely cognitively normal, thus potentially weakening the robustness of the research findings and limiting their generalizability. Clinically, the consequences of an incorrect diagnosis include needless testing, pharmacotherapy, and anxiety incurred by the patient and family. Also, inaccurate diagnosis implies that alternative (potentially reversible) causes of cognitive changes are not being investigated.
In closing, we must acknowledge that expanding cognitive batteries to include an extra memory test has some disadvantages. Namely, more clinician time and additional test materials are required, and research protocols will be slightly lengthened. However, we believe that these caveats are greatly outweighed by the benefit of improved accuracy, and that an additional memory measure should be added to clinical and research cognitive batteries to the extent that it is feasible.

Conclusions
The findings of our study in the ADNI cohort suggest that neuropsychological testing can predict decline with high accuracy regardless of biomarkers, when memory is assessed using delayed recall of a short story and a word list, using a cutoff of < −1 SD below normative references. This criterion provides the optimal trade-off between specificity and sensitivity for predicting conversion to AD at two years. The increased accuracy that this criterion provides decreases the probability of misdiagnosing a patient and avoids needless testing, pharmacotherapy and anxiety, and provides a high-accuracy, low-cost strategy for identifying individuals at highest risk of dementia. In situations where it is only feasible to administer a single memory test, collecting information on non-memory performance and imaging or genetic biomarkers is necessary to optimize diagnostic accuracy.