Detection of dementia on voice recordings using deep learning: a Framingham Heart Study
Alzheimer's Research & Therapy volume 13, Article number: 146 (2021)
Identification of reliable, affordable, and easy-to-use strategies for detection of dementia is sorely needed. Digital technologies, such as individual voice recordings, offer an attractive modality to assess cognition but methods that could automatically analyze such data are not readily available.
Methods and findings
We used 1264 voice recordings of neuropsychological examinations administered to participants from the Framingham Heart Study (FHS), a community-based longitudinal observational study. The recordings were 73 min in duration, on average, and contained at least two speakers (participant and examiner). Of the total voice recordings, 483 were of participants with normal cognition (NC), 451 were of participants with mild cognitive impairment (MCI), and 330 were of participants with dementia (DE). We developed two deep learning models, a two-level long short-term memory (LSTM) network and a convolutional neural network (CNN), which used the audio recordings to classify recordings of participants with NC versus those with DE, and to differentiate recordings of participants with DE from those who were not demented (NDE, i.e., NC + MCI). Based on 5-fold cross-validation, the LSTM model achieved a mean (± std) area under the receiver operating characteristic curve (AUC) of 0.740 ± 0.017, mean balanced accuracy of 0.647 ± 0.027, and mean weighted F1 score of 0.596 ± 0.047 in classifying cases with DE from those with NC. The CNN model achieved a mean AUC of 0.805 ± 0.027, mean balanced accuracy of 0.743 ± 0.015, and mean weighted F1 score of 0.742 ± 0.033 on the same task. For the classification of participants with DE versus NDE, the LSTM model achieved a mean AUC of 0.734 ± 0.014, mean balanced accuracy of 0.675 ± 0.013, and mean weighted F1 score of 0.671 ± 0.015, while the CNN model achieved a mean AUC of 0.746 ± 0.021, mean balanced accuracy of 0.652 ± 0.020, and mean weighted F1 score of 0.635 ± 0.031.
This proof-of-concept study demonstrates that automated deep learning-driven processing of audio recordings of neuropsychological testing performed on individuals recruited within a community cohort setting can facilitate dementia screening.
Impairment in cognition is a common manifestation among individuals suffering from dementia, of which Alzheimer’s disease (AD) is the most common cause. Despite the rising dementia epidemic accompanying a rapidly aging population, there remains a paucity of cognitive assessment tools that are applicable regardless of age, education, and language/culture. Thus, there is an urgent need to identify reliable, affordable, and easy-to-use strategies for the detection of signs of dementia. Starting in 2005, the Framingham Heart Study (FHS) began digitally recording all spoken responses to neuropsychological tests. These digital voice recordings allow for precise capture of all responses for verbal tests. Accurate documentation of these studies allowed for novel application of the Boston Process Approach (BPA), a scoring method that emphasizes how a participant makes errors to differentiate between participants with similar test scores. With the emergence of speech recognition and analysis tools, there was a realization that these recordings, originally used for quality control, were now data in and of themselves because speaking is a complex cognitive skill. Virtually all voice recognition software requires speech-to-text transcription from which to extract linguistic measures of speech, and manual transcription is often required to reach high levels of accuracy. This manual work is not only tedious, but to ensure high accuracy it often requires some level of training in speech-to-text transcription as well as quality control to document transcription accuracy. Such expertise is not readily available at all locations around the globe. Developing a computational tool that can simply take a voice recording as an input and automatically assess the dementia status of the individual has broad implications for dementia screening tools that are scalable across diverse populations.
Promising solutions to tackle such datasets may be found in the field of deep learning, an approach to data analysis that has increasingly been used over the past few years to address an array of formerly intractable questions in medicine. Deep learning, a subfield of machine learning, is based upon specific models known as neural networks, which decompose the complexities of observed datasets into hierarchical interactions among low-level input features. Once these interactions are learned, refined, and formalized by exposure to a training dataset, fully trained deep learning models leverage their “experience” of prior data to make predictions about new cases. Thus, these approaches offer powerful medical decision-making potential due to their ability to rapidly identify low-level signatures of disease from large datasets and quickly apply them at scale. This hierarchical approach makes deep learning ideally suited to derive novel insights from high volumes of audio/voice data. Our primary objective was to develop a long short-term memory network (LSTM) model and a convolutional neural network (CNN) model to predict dementia status from the FHS voice recordings, without manual feature processing of the audio content. As a secondary objective, we processed the model-derived salient features and reported the distribution of time spent by individuals on various neuropsychological tests and these tests’ relative contributions to dementia assessment.
The voice data consists of digitally recorded neuropsychological examinations administered to participants from the Framingham Heart Study (FHS), which is a community-based longitudinal observational study that was initiated in 1948 and consists of several generations of participants. FHS began with the initial recruitment of the Original Cohort (Gen 1) and added the Offspring Cohort in 1971 (Gen 2), the Omni Cohort in 1994 (OmniGen 1), a Third Generation Cohort in 2002 (Gen 3), a New Offspring Spouse Cohort in 2003 (NOS), and a Second Generation Omni Cohort in 2003 (OmniGen 2). The neuropsychological examinations consist of multiple tests that assess memory, attention, executive function, language, reasoning, visuoperceptual skills, and premorbid intelligence. Further information and details can be found in Au et al., including the lists of all the neuropsychological tests included in each iteration of the FHS neuropsychological battery.
Cognitive status determination
The cognitive status of the participants over time was diagnosed via the FHS dementia diagnostic review panel [6, 7]. The panel consists of at least one neuropsychologist and at least one neurologist. The panel reviews neuropsychological and neurological exams, medical records, and family interviews for each participant. Selection for dementia review is based on whether participants have shown evidence of cognitive decline, as has been previously described.
The panel creates a cognitive timeline for each participant that provides the participant’s cognitive status on a given date over time. To label each participant’s cognitive status at the time of each recording, we selected the diagnosis date closest to the recording that fell either on or before the recording date or within 180 days after it. If the closest date of diagnosis was more than 180 days after the recording, but the participant was determined to be cognitively normal on that date, we labeled that participant as cognitively normal. Dementia diagnosis was based on criteria from the Diagnostic and Statistical Manual of Mental Disorders, fourth edition (DSM-IV) and the NINCDS-ADRDA criteria for Alzheimer’s dementia. The study was approved by the Institutional Review Boards of Boston University Medical Center and all participants provided written consent.
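The labeling rule above can be sketched as a small helper. This is a hypothetical illustration, not FHS code; `label_recording` and the status codes are our own names:

```python
from datetime import date

def label_recording(recording_date, diagnoses, window_days=180):
    """Assign a cognitive status label to a recording.

    `diagnoses` is a list of (diagnosis_date, status) pairs from the
    review panel's cognitive timeline.  The diagnosis closest in time
    supplies the label if it falls on/before the recording or within
    `window_days` after it; a "NC" (normal) diagnosis more than
    `window_days` after the recording still counts as normal.
    """
    # overall closest diagnosis to the recording date
    closest = min(diagnoses, key=lambda d: abs((d[0] - recording_date).days))
    delta = (closest[0] - recording_date).days
    if delta <= window_days:      # on/before the recording, or <=180 days after
        return closest[1]
    if closest[1] == "NC":        # normal later than 180 days still counts
        return "NC"
    return None                   # otherwise the recording stays unlabeled
```

For example, a recording one month before an MCI diagnosis would be labeled MCI, while a recording a full year before it would remain unlabeled.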
Digital voice recordings
FHS began to digitally record the audio of neuropsychological examinations in 2005. Overall, FHS has 9786 digital voice recordings from 5449 participants. It must be noted that not all participants underwent dementia review, and repeat recordings were available for some participants. For this study, we selected only those participants who underwent dementia review and whose cognitive status was available (normal/MCI/dementia). The details of how participants are flagged for dementia review have been previously described. In total, we obtained 1264 recordings from 656 participants; the recordings are 73.13 (± 22.96) min in duration on average, ranging from 8.43 to 210.82 min. There are 483 recordings of participants with normal cognition (NC; 291 participants), 451 recordings of participants with mild cognitive impairment (MCI; 309 participants), 934 recordings of non-demented participants (NDE; 507 participants), and 330 recordings of participants with dementia (DE; 223 participants) (Tables 1 and S1). A single participant may have several recordings with different cognitive statuses; for example, one participant could have NC at their first recording, MCI at their second, and DE at their third. The groups of participants with a recording labeled NC, MCI, or DE are therefore not mutually exclusive and will not necessarily add up to the overall 656 participants. The recordings were obtained in the WAV format and downsampled to 8 kHz.
We observed that the FHS participants on average spent different amounts of time to complete specific neuropsychological tests, and these times varied as a function of their cognitive status for most of the tests (Fig. 1). For example, almost all participants spent the most time completing the Boston Naming Test (BNT). During this test, participants with DE spent significantly more time (611.1 ± 260.2 s) than participants who were not demented (NDE; 390.7 ± 167.4 s). This observation also held when the time spent by participants with DE was compared specifically with those with NC (405.9 ± 176.8 s) or with those who had MCI (321.2 ± 93.8 s). However, there was no statistically significant difference between the times taken for the BNT by participants with NC and MCI. A similar pattern was observed for a few other neuropsychological tests including “Visual Reproductions Delayed Recall,” “Verbal Paired Associates Recognition,” “Copy Clock,” “Trails A,” and “Trails B” (Table 2). We also observed no statistically significant differences in the times taken by the participants with NC, MCI, DE, and NDE on a few other neuropsychological tests including “Logical Memory Immediate Recall,” “Digit Span Forward,” “Similarities,” “Verbal Fluency (FAS),” “Finger Tapping,” “Information (WAIS-R),” and “Cookie Theft” (Table 2).
To preserve as much useful information as possible, the Mel-frequency cepstral coefficients (MFCCs) were extracted during the data processing stage. MFCCs are the coefficients that collectively make up the Mel-frequency cepstrum, which serves as an important acoustic feature in many speech processing applications, particularly in medicine [10,11,12,13,14,15]. Leveraging the nonlinear Mel scale brings the representation much closer to the response of the human auditory system and therefore renders it an ideal feature for speech-related tasks. Each FHS audio recording was first split into short frames with a window size of 25 ms (i.e., 200 sample points at the 8000-Hz sampling rate) and a stride length of 10 ms. For each frame, the periodogram estimate of the power spectrum was calculated by 256-point discrete Fourier transformation. Then, a filterbank of 26 triangular filters evenly distributed with respect to the Mel scale was applied to each frame. We then applied a discrete cosine transformation to the logarithm of all filterbank energies. Note that the 26 filters yield 26 coefficients, but in practice only the 2nd–13th coefficients are believed to be useful; we loosely followed this convention, replacing the first coefficient with the total frame energy, which might contain helpful information on the entire frame.
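As a rough illustration, the extraction steps described above (25 ms frames with a 10 ms stride at 8 kHz, a 256-point periodogram, 26 triangular Mel filters, a DCT keeping 13 coefficients with the first replaced by total frame energy) can be sketched in NumPy/SciPy. The Hamming window and exact filterbank edge placement are assumptions not specified in the text:

```python
import numpy as np
from scipy.fftpack import dct

def mel(f):        # Hz -> Mel
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):    # Mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=8000, n_fft=256, n_filt=26, n_coef=13):
    # 25 ms frames (200 samples) with a 10 ms (80-sample) stride
    flen, fstep = int(0.025 * sr), int(0.010 * sr)
    n_frames = 1 + (len(signal) - flen) // fstep
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(flen)   # windowing is an assumption
    # periodogram estimate of the power spectrum (256-point DFT)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 26 triangular filters evenly spaced on the Mel scale
    pts = mel_inv(np.linspace(mel(0), mel(sr / 2), n_filt + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    energies = np.log(np.maximum(power @ fbank.T, 1e-10))
    coefs = dct(energies, type=2, axis=1, norm="ortho")[:, :n_coef]
    # replace the first coefficient with the log of the total frame energy
    coefs[:, 0] = np.log(np.maximum(power.sum(axis=1), 1e-10))
    return coefs   # shape: (n_frames, n_coef)
```

One second of 8 kHz audio (8000 samples) yields 98 frames of 13 coefficients under these settings.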
Hierarchical long short-term memory network model
Recurrent neural networks (RNN) have long been used for capturing complex patterns in sequential data. Because the durations of the FHS recordings averaged more than 1 h and corresponded to hundreds of thousands of MFCCs, a single RNN may not be able to memorize and identify patterns across such long sequences. We therefore developed a two-level hierarchical long short-term memory (LSTM) model to associate the voice recordings with dementia status (Fig. 2A). Our previous work has shown how the unique architecture of the LSTM model can tackle long sequences of data. Also, as a popular variant in the RNN family, the LSTM is well suited to overcoming issues such as vanishing gradients and long-term dependencies compared to other RNN frameworks.
A 1-h recording may yield hundreds of thousands of temporally ordered MFCC vectors, while the memory capacity of the LSTM model is empirically limited to only a few thousand vectors. We first grouped every 2000 consecutive MFCC vectors into segments without overlap. For each segment, the low-level LSTM took 10 MFCCs at a time and moved on to the next 10 MFCCs until the end. We then collected the last hidden state as the low-level feature vector for the segment. After processing all the segments one by one, the collection of low-level feature vectors formed another sequence, which was then fed into the high-level LSTM to generate a high-level feature vector summarizing the entire recording. Note that the hierarchical design ensured that the two-level LSTM architecture was not overwhelmed by sequences beyond its practical limitation. For the last step of the LSTM scheme, a multilayer perceptron (MLP) network was used to estimate the probability of dementia based on the summarized vector. Both the low-level and the high-level LSTM shared the same hyperparameters: the hidden state dimension was 64 and the initial states were all set to zeros. The MLP comprised a 64-dimensional hidden layer and an output layer, along with a nonlinear activation function. The output layer was then followed by a softmax function to project the results onto the probability space.
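A minimal PyTorch sketch of this two-level design, under the stated settings (segments of 2000 MFCC vectors, 10 vectors per LSTM step, 64-dimensional hidden states). The layer names and the batching of segments through the low-level LSTM are our own choices, and recordings shorter than one segment are not handled:

```python
import torch
import torch.nn as nn

class HierarchicalLSTM(nn.Module):
    """Two-level LSTM sketch: low level summarizes 2000-vector segments,
    high level summarizes the sequence of segment features, an MLP with
    softmax maps the summary to class probabilities."""
    def __init__(self, n_mfcc=13, hidden=64, seg=2000, step=10):
        super().__init__()
        self.seg, self.step = seg, step
        # low level reads one segment, 10 MFCC vectors per time step
        self.low = nn.LSTM(n_mfcc * step, hidden, batch_first=True)
        self.high = nn.LSTM(hidden, hidden, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))

    def forward(self, x):                 # x: (T, n_mfcc), one recording
        T = (x.shape[0] // self.seg) * self.seg   # drop the ragged tail
        # (n_segs, seg/step, step * n_mfcc): segments as a batch
        segs = x[:T].reshape(-1, self.seg // self.step,
                             self.step * x.shape[1])
        _, (h_low, _) = self.low(segs)    # last hidden state per segment
        feats = h_low.squeeze(0).unsqueeze(0)     # (1, n_segs, hidden)
        _, (h_high, _) = self.high(feats)         # recording summary
        return torch.softmax(self.mlp(h_high.squeeze(0)), dim=-1)
```

Processing all segments as one batch through the low-level LSTM is equivalent to processing them one by one, since each segment starts from zero initial states.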
Convolutional neural network model
For comparison with the LSTM model, we designed a one-dimensional convolutional neural network (CNN) model for dementia classification (Fig. 2B). The stem structure of the CNN model consisted of 7 stacked convolutional blocks. Within each block, there were 2 convolutional layers, 1 max-pooling layer, and 1 nonlinearity function. All the convolutional layers had a filter size of 3, a stride size of 1, and a padding size of 1. For the pooling layers in the first 6 blocks, we used max pooling with a filter size of 4 and a stride size of 4. The last layer used global average pooling to tackle audio recordings of variable lengths. By applying global average pooling, all the information gets transformed to a fixed-length feature vector, making it straightforward for classification. The CNN stem structure was then followed by a linear classifier comprising 1 convolutional layer and 1 softmax layer. Note that the filter size of this convolutional layer was set to the exact size of the output feature vector.
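A PyTorch sketch of this architecture, with channel widths assumed (the text does not specify them). Applying the classifier convolution before averaging is mathematically equivalent to averaging first, because both operations are linear; doing it in this order conveniently exposes the per-step class evidence that the paper uses as a saliency map:

```python
import torch
import torch.nn as nn

class AudioCNN(nn.Module):
    """1-D CNN sketch: 7 blocks of two kernel-3 convolutions, max pooling
    of 4 in the first six blocks, a convolutional classifier, and global
    average pooling over time for the recording-level prediction."""
    def __init__(self, n_mfcc=13, width=64, n_classes=2):
        super().__init__()
        blocks, ch = [], n_mfcc
        for i in range(7):
            blocks += [nn.Conv1d(ch, width, 3, stride=1, padding=1),
                       nn.Conv1d(width, width, 3, stride=1, padding=1),
                       nn.ReLU()]
            if i < 6:                          # pooling in the first 6 blocks
                blocks.append(nn.MaxPool1d(4, stride=4))
            ch = width
        self.stem = nn.Sequential(*blocks)
        # per-step class evidence; averaged over time, this is the prediction
        self.classifier = nn.Conv1d(width, n_classes, 1)

    def forward(self, x):                      # x: (1, n_mfcc, T), any long T
        saliency = self.classifier(self.stem(x))   # (1, n_classes, T / 4**6)
        logits = saliency.mean(dim=2)              # global average pooling
        return torch.softmax(logits, dim=1), saliency
```

The six pooling layers downsample the frame sequence by a factor of 4^6 = 4096, which is what makes each output step of the saliency map cover minutes of audio.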
We derived the saliency maps based on the intermediate result right before the global average pooling step of the CNN model. The intermediate result was composed of two vectors which signified DE[+] and DE[−] prediction, respectively. For simplicity, we only used the DE[+] vector. Since the recording-level prediction was determined by the average of the saliency map which also preserved temporal structure, it allowed us to observe finer aspects of the prediction by revealing the values assigned to each short period. For our CNN model settings, each value corresponded to roughly 2 and a half minutes of an original recording. Note that the length of the periods is implied by the CNN model parameters; altering the stride size, the kernel size, the number of stacked CNN blocks, etc. may result in different lengths. To align the saliency map with the original recording for further analysis, we extended its size to the total number of seconds via nearest neighbor interpolation.
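The nearest-neighbor alignment of the coarse saliency vector to a per-second timeline might look like the following (the function name is ours):

```python
import numpy as np

def align_saliency(saliency, total_seconds):
    """Stretch a coarse saliency vector (one value per CNN output step,
    roughly 2.5 min each) to one value per second of the recording via
    nearest-neighbor interpolation, so each second can later be matched
    to the neuropsychological test administered at that moment."""
    # fractional source position for the center of each target second
    src = (np.arange(total_seconds) + 0.5) * len(saliency) / total_seconds
    idx = np.clip(src.astype(int), 0, len(saliency) - 1)
    return saliency[idx]
```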
Salient administered fractions
The CNN model tasked with classifying between participants with normal cognition and dementia was tested on 81 participants who have transcripts for 123 of their 183 total recordings. The 60 recordings without transcripts were excluded from the saliency analysis but were included in the test dataset. Of the recordings with transcripts, there were 44 recordings of participants with normal cognition and 79 recordings of participants with dementia. The transcribed recordings were divided into subgroups based on the time spent by the participant on each neuropsychological test. As a result, for each second of a given recording, the DE[+] saliency value and the current neuropsychological test are known. To calculate the salient administered fraction (SAF) for a given neuropsychological test, we counted the number of seconds for that test that also had a DE[+] saliency value greater than zero and divided it by the total number of seconds for that test. For example, if the Boston Naming Test (BNT) had 90 s with DE[+] saliency values greater than zero and was administered for 100 s, the SAF[+] would be 0.90. Similarly, we produced SAF[−] in the same way, except for periods of time where the DE[+] saliency value was less than or equal to zero. The SAF[+] was calculated for every administered neuropsychological test in each recording of a demented participant whom the model classified as demented (true positive). The SAF[−] was calculated in the same way, except for recordings of participants with normal cognition whom the model classified as having normal cognition (true negative).
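The SAF computation reduces to counting seconds; a sketch with hypothetical names, where `test_spans` maps each test to its start and end second within the recording:

```python
import numpy as np

def salient_administered_fraction(per_second_saliency, test_spans,
                                  positive=True):
    """Compute SAF[+] (or SAF[-] with positive=False) for each test.

    `per_second_saliency` holds one DE[+] saliency value per second of
    the recording; `test_spans` maps test name -> (start_s, end_s).
    SAF[+] is the fraction of the test's seconds with saliency > 0,
    SAF[-] the fraction with saliency <= 0.
    """
    safs = {}
    for test, (start, end) in test_spans.items():
        seg = per_second_saliency[start:end]
        hits = (seg > 0) if positive else (seg <= 0)
        safs[test] = hits.sum() / len(seg)
    return safs
```

Mirroring the example in the text: 90 salient seconds out of a 100-second BNT administration gives SAF[+] = 0.90.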
Data organization, model performance, and statistical analysis
The dataset for this study includes digital voice recordings from September 2005 to March 2020 from the subset of FHS participants who came to dementia review. The models were implemented using PyTorch 1.4 and constructed on a workstation with a GeForce RTX 2080 Ti graphics processing unit. The Adam optimizer with learning rate = 1e−4 and betas = (0.99, 0.999) was applied to train both the LSTM and the CNN models. A portion of the participants along with their recordings were kept aside for independent model testing (Figure S1). Note that manual transcriptions were available on these cases, which allowed us to perform saliency analysis. Using the remaining data, the models were trained using 5-fold cross-validation. We split the data at the participant level for each fold, so that all of a given participant’s recordings were included in the same fold. We acknowledge that this does not split the data into exactly even folds, because participants may have different numbers of recordings; however, each participant generally had a similar distribution of recordings, which mitigated this effect.
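The participant-level split can be sketched as follows (a hypothetical helper; the actual FHS identifiers and split code are not shown in the paper):

```python
import random
from collections import defaultdict

def participant_level_folds(recordings, k=5, seed=0):
    """Split recordings into k folds at the participant level, so that
    all of one participant's recordings land in the same fold.

    `recordings` is a list of (participant_id, recording_id) pairs.
    Folds may be slightly uneven, since participants have different
    numbers of recordings.
    """
    by_pid = defaultdict(list)
    for pid, rec in recordings:
        by_pid[pid].append((pid, rec))
    pids = sorted(by_pid)
    random.Random(seed).shuffle(pids)       # reproducible shuffle
    folds = [[] for _ in range(k)]
    for i, pid in enumerate(pids):
        folds[i % k].extend(by_pid[pid])    # round-robin over participants
    return folds
```

This is the same grouping behavior scikit-learn offers via `GroupKFold`, with the participant ID as the group key.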
We also tested the model performance on audio segments of shorter lengths. From the test data, we randomly extracted 5-min, 10-min, and 15-min recordings from the participants and grouped them based on the audio length. Both the LSTM and CNN models trained on the full audio recordings were used to predict on these short audio segments. Note that only one segment (5 min or 10 min or 15 min) was extracted with replacement per recording in one round of testing. This process was repeated five times and results were reported.
We evaluated the differences in means of the durations of all three cognitive statuses using a pairwise t-test. The performance of the machine learning models was presented as mean and standard deviation over the model runs. We generated receiver operating characteristic (ROC) and precision-recall (PR) curves based on the cross-validated model predictions. For each ROC and PR curve, we also computed the area under curve (AUC) values. Additionally, we computed model accuracy, balanced accuracy, sensitivity, specificity, F1 score, weighted F1 score, and Matthews correlation coefficient on the test data.
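For the binary DE-versus-NC and DE-versus-NDE tasks, sensitivity, specificity, and balanced accuracy follow directly from the confusion matrix; a minimal sketch, assuming both classes are present in the test labels:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, and balanced accuracy for a binary
    task (1 = DE, 0 = not), computed from hard predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sens = tp / (tp + fn)                  # true positive rate
    spec = tn / (tn + fp)                  # true negative rate
    return {"sensitivity": sens, "specificity": spec,
            "balanced_accuracy": (sens + spec) / 2}
```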
Both the LSTM and the CNN models that were trained and validated on the FHS voice recordings demonstrated consistent performance across the different data splits used for 5-fold cross-validation (Fig. 3). For the classification of demented versus normal participants, the LSTM model took 20 min to fully train (8 epochs and batch size of 4) and 14 s to predict on a test case; the CNN model took 106 min to fully train (32 epochs and batch size of 4) and 13 s to predict on a test case. For the classification of demented versus non-demented participants, the LSTM model took 187 min to fully train (32 epochs and batch size of 4) and 20 s to predict on a test case; the CNN model took 427 min to fully train (64 epochs and batch size of 4) and 20 s to predict on a test case. In general, we observed that the CNN model tested on the full recordings performed better on most metrics for the classification problem focused on distinguishing participants with DE from those with NC (Table 3, Figure S). The only exception was that the LSTM model’s mean specificity was higher than the mean specificity of the CNN model. For the classification problem focused on distinguishing participants with DE from those who were NDE, both models performed evenly (Table 3, Figure S3). The LSTM model’s sensitivity was higher, but the CNN model’s specificity was higher than its counterpart’s. Interestingly, in some cases, model performance on the shorter segments was higher than on the full audio recordings. For the classification problem focused on distinguishing participants with DE from those with NC, we found improved LSTM model performance on both 10- and 15-min recordings, based on a few metrics. For the same classification problem, the CNN model’s performance on the full audio recordings was mostly the highest, with the 5-min segments’ mean sensitivity being the exception.
For the classification problem focused on distinguishing participants with DE from those who were NDE, the LSTM model’s performance on the full recordings was the highest based on all the metrics. For the same classification problem, the CNN model’s performance on the full audio recordings was mostly the highest, with the 5-min segments’ mean sensitivity and the 10-min mean F1 score being the exceptions.
We computed the average (± standard deviation) of SAF[+] and SAF[−] derived from the CNN model (Table 4). The positive SAFs were calculated for all true positive recordings and the negative SAFs were calculated for all true negative recordings. For example, the SAF[+] for the “Verbal paired associates recognition” test was 0.88 ± 0.26. This indicates that on average 88% of the duration of the “Verbal paired associates recognition” tests administered in true positive recordings also intersected with segments of time that the model found to be DE[+] salient. Also, the SAF[−] for the “Verbal paired associates recognition” test was 0.32 ± 0.42. This indicates that on average 32% of the duration of the “Verbal paired associates recognition” tests administered in true negative recordings also intersected with segments of time that the model found to not be DE[+] salient. On the other hand, the SAF[+] for the “Command clock” test was 0.39 ± 0.45, indicating that only about 39% of the duration of the “Command clock” tests administered in true positive recordings also intersected with segments of time that the model found to be DE[+] salient. Also, the SAF[−] for the “Command clock” test was 0.76 ± 0.37. The rest of the average positive SAFs and average negative SAFs were also reported for the remaining neuropsychological tests as well as the number of true positive or true negative recordings that contained an administration of the given test (Table 4). We also computed average SAF[+] and SAF[−] for additional neuropsychological tests (Table S2), which were set aside due to the low number of samples. A schematic of the DE[+] saliency vectors used to generate SAF[+] and SAF[−] can be seen in Fig. 4. Each value in the saliency vector represented approximately 2 min and 30 s of a recording. 
Since the saliency vector covers the entire recording, each second of every neuropsychological test within a recording can be assigned a saliency vector value, which was then used to calculate SAF[+] and SAF[−].
Cognitive performance is affected by numerous factors that are independent of actual underlying cognitive status. Physical impairments (vision, hearing), mood (depression, anxiety), low education, cultural bias, and test-specific characteristics are just a few of the many variables that can lead to variable performance. Furthermore, neuropsychological tests that claim to test the same cognitive domains (e.g., memory, executive function, language) do not do so in a uniform manner. For example, when testing verbal memory, a paragraph recall test taps into a different set of underlying cognitive processes than a list learning task. Also, the presumption that a test measures only a single cognitive domain is naively simplistic, since every neuropsychological test involves multiple cognitive domains. Thus, across many factors, there is significant heterogeneity of cognitive performance that makes it difficult to differentiate between those who will become demented and those who will not, especially at the preclinical stage. In particular, traditional paper-and-pencil neuropsychological tests may not be sufficiently sensitive to detect when subtle change begins. While cognitive complaints serve as a surrogate preclinical measure of decline, there is an inherent bias in self-report. Given this complex landscape, digital voice recordings of neuropsychological tests provide a data source relatively independent of those limitations. To our knowledge, our study is the first to demonstrate that such a continuous stream of data is also amenable to automated analysis for the evaluation of cognitive status.
Digital health technologies in general and voice in particular are increasingly being evaluated as potential screening tools for depression [18,19,20,21] and various neurodegenerative diseases such as Parkinson’s disease [22,23,24,25]. Recently, potential opportunities for developing digital biomarkers based on mobile/wearable devices for AD were outlined [26, 27]. Our study is unique in focusing on two deep learning methods that rely on a hands-free approach for processing voice recordings to predict dementia status. The advantage of our approach is threefold. The first is the limited need to extensively process any of the voice recordings before sending them as inputs to the neural networks. This is a major advantage because it minimizes the burden of generating accurate transcriptions and/or handcrafted features, which generally take time to develop and rely on experts who are not readily available. This aspect places us in a unique position compared to previously published work that depended on derived measures [28, 29]. Second, our approach can process audio recordings of variable lengths, which means that one does not have to format or select a specific window of the audio recording for analysis. This important strength underscores the generalizability of our work because one can process voice recordings containing various combinations of neuropsychological tests that are not bounded within a time frame. Finally, our approach allows for the identification of audio segments that are highly correlated with the outcome of interest. The advantage of doing this is that it provides a “window” into the machine learning black box; we can go back to the recordings and identify the various speech patterns or segments of the neuropsychological tests which point to a high probability of disease risk, and understand their contextual significance.
The CNN architecture allowed us to generate the saliency vectors by utilizing the parameters of the final classification layer. Simply put, a temporal saliency vector for each specific case could be obtained by calculating the weighted sum of the output feature vectors from the last convolutional block in the CNN, indicating how each portion of the recording contributed to either positive or negative prediction. We then aligned the saliency vectors with the recording timeline to further analyze the speech signatures to understand if there were any snippets of the neuropsychological testing that often were correlated with the output class label. From examining the transcriptions that exist for a portion of the dataset, we were able to identify which neuropsychological tests were occurring during any given time in a recording, and then calculate the positive SAFs. This implies that the neuropsychological tests found in these segments may be presenting a test in which the participant’s voice in their response has a signal related to their cognition. For example, the SAF[+] is high for the “Verbal paired associates recognition” test, which could mean that the participants’ audio signals during this test highly influenced the model performance. This result could also imply that most participants diagnosed with dementia may have had explicit episodic memory deficits. The exact connections between the voice signals in the segments identified by the saliency maps and their clinical relevance are worth exploring in the future.
All the FHS voice recordings contained audio content from two distinct speakers, the interviewee (participant) and the interviewer (examiner). We did not attempt to discern speaker-specific audio content, as our models processed the entire audio recording at once. This choice was intentional because we wanted to first evaluate whether deep learning can predict the dementia status of the participant without detailed feature engineering on the audio recordings. Future work could focus on processing these signals to recognize speaker-related differences and interactions, periods of silence, and other nuances to make the audio recordings more amenable to deep learning. Also, additional studies could integrate the audio data with other routinely collected information that requires no additional processing (i.e., demographics) to augment model performance. An important point to note is that such studies need to be conducted with the goal of creating scalable solutions across the globe, particularly for those regions where technical or advanced clinical expertise is not readily available. This means that users at the point-of-care may not be in the best position to manually process voice recordings or any other data to derive features that can be fed into computer models. Since our deep learning models do not require data preprocessing or handcrafted features, our approach can serve as a potential screening tool without the need for expert-level input. Our current findings serve as a first step towards such a solution with broader outreach.
The models were developed using the FHS voice recordings, which come from a single population cohort in the New England area of the United States. Despite demonstrating consistent model performance using a rigorous 5-fold cross-validation approach, our models still need to be validated on data from external cohorts to confirm their generalizability. Currently, we do not have access to any other cohort that has voice recordings of neuropsychological exams. Therefore, our study findings need to be interpreted in light of this limitation and with the hope of evaluating them further in the future. Due to the lack of available data in some cases, and because not all participants took all types of neuropsychological tests, we were able to generate the distribution of times spent for only a portion of the tests. It must also be noted that the number of NC, MCI, DE, and NDE participants varied for each neuropsychological test. Additionally, there may be outside factors affecting the amount of time it took a participant to complete a neuropsychological test that are not represented in the boxplots. For example, a normal participant could finish the BNT quickly and perform well, whereas administration of the BNT to a participant with dementia could be abruptly stopped due to an inability to complete the test or for other reasons. Therefore, the amount of time spent administering the BNT could be similar in those two cases, but for different reasons. Nevertheless, statistical tests were performed to quantify the pairwise differences on all the available neuropsychological exams, which gave us the flexibility to report both the differences that were statistically significant and those that were not. Also, while it is possible that the interviewer’s behavior can influence the interviewee’s response, all the interviewers are professionally trained to administer the neuropsychological tests uniformly.
Finally, we must note that some participants who were classified as NC at the baseline assessment showed subtle changes in cognition sufficient to warrant a dementia review, and this may have affected model performance.
Our proposed deep learning approaches (LSTM and CNN) for processing voice recordings in an automated fashion allowed us to classify the dementia status of FHS participants. Approaches that rely minimally on neuropsychological expertise, audio transcription, or manual feature engineering can pave the way toward real-time screening tools in dementia care, especially in resource-limited settings.
Availability of data and materials
Python scripts and sample data are made available on GitHub (https://github.com/vkola-lab/azrt2021). Data in this study cannot be shared publicly due to regulations of local ethical committees. Data might be made available to researchers upon request. All requests will be evaluated based on institutional and departmental policies.
Libon DJ, Swenson R, Ashendorf L, Bauer RM, Bowers D. Edith Kaplan and the Boston process approach. Clin Neuropsychol. 2013;27(8):1223–33. https://doi.org/10.1080/13854046.2013.833295.
Hinton G. Deep learning-a technology with the potential to transform health care. JAMA. 2018;320(11):1101–2. https://doi.org/10.1001/jama.2018.11100.
Tsao CW, Vasan RS. Cohort profile: the Framingham Heart Study (FHS): overview of milestones in cardiovascular epidemiology. Int J Epidemiol. 2015;44(6):1800–13. https://doi.org/10.1093/ije/dyv337.
Au R, Piers RJ, Devine S. How technology is reshaping cognitive assessment: lessons from the Framingham Heart Study. Neuropsychology. 2017;31(8):846–61. https://doi.org/10.1037/neu0000411.
Jak AJ, Preis SR, Beiser AS, Seshadri S, Wolf PA, Bondi MW, et al. Neuropsychological criteria for mild cognitive impairment and dementia risk in the Framingham Heart Study. J Int Neuropsychol Soc. 2016;22(9):937–43. https://doi.org/10.1017/S1355617716000199.
McGrath ER, Beiser AS, DeCarli C, Plourde KL, Vasan RS, Greenberg SM, et al. Blood pressure from mid- to late life and risk of incident dementia. Neurology. 2017;89(24):2447–54. https://doi.org/10.1212/WNL.0000000000004741.
Satizabal CL, Beiser AS, Chouraki V, Chene G, Dufouil C, Seshadri S. Incidence of dementia over three decades in the Framingham Heart Study. N Engl J Med. 2016;374(6):523–32. https://doi.org/10.1056/NEJMoa1504327.
McKhann G, Drachman D, Folstein M, Katzman R, Price D, Stadlan EM. Clinical diagnosis of Alzheimer’s disease: report of the NINCDS-ADRDA Work Group under the auspices of Department of Health and Human Services Task Force on Alzheimer’s Disease. Neurology. 1984;34(7):939–44. https://doi.org/10.1212/WNL.34.7.939.
Yuan J, Maserejian N, Liu Y, Devine S, Gillis C, Massaro J, et al. Severity distribution of Alzheimer’s disease dementia and mild cognitive impairment in the Framingham Heart Study. J Alzheimers Dis. 2021;79(2):807–17. https://doi.org/10.3233/JAD-200786.
Chauhan S, Wang P, Sing Lim C, Anantharaman V. A computer-aided MFCC-based HMM system for automatic auscultation. Comput Biol Med. 2008;38(2):221–33. https://doi.org/10.1016/j.compbiomed.2007.10.006.
Deng M, Meng T, Cao J, Wang S, Zhang J, Fan H. Heart sound classification based on improved MFCC features and convolutional recurrent neural networks. Neural Netw. 2020;130:22–32. https://doi.org/10.1016/j.neunet.2020.06.015.
Jung SY, Liao CH, Wu YS, Yuan SM, Sun CT. Efficiently classifying lung sounds through depthwise separable CNN models with fused STFT and MFCC features. Diagnostics. 2021;11(4).
Kuresan H, Samiappan D, Masunda S. Fusion of WPT and MFCC feature extraction in Parkinson’s disease diagnosis. Technol Health Care. 2019;27(4):363–72. https://doi.org/10.3233/THC-181306.
Muheidat F, Harry Tyrer W, Popescu M. Walk identification using a smart carpet and Mel-Frequency Cepstral Coefficient (MFCC) features. Annu Int Conf IEEE Eng Med Biol Soc. 2018;2018:4249–52. https://doi.org/10.1109/EMBC.2018.8513340.
Nogueira DM, Ferreira CA, Gomes EF, Jorge AM. Classifying heart sounds using images of motifs, MFCC and temporal features. J Med Syst. 2019;43(6):168. https://doi.org/10.1007/s10916-019-1286-5.
Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.
Wollacott AM, Xue C, Qin Q, Hua J, Bohnuud T, Viswanathan K, et al. Quantifying the nativeness of antibody sequences using long short-term memory networks. Protein Eng Des Sel. 2019;32(7):347–54. https://doi.org/10.1093/protein/gzz031.
Gonzalez GM, Costello CR, La Tourette TR, Joyce LK, Valenzuela M. Bilingual telephone-assisted computerized speech-recognition assessment: is a voice-activated computer program a culturally and linguistically appropriate tool for screening depression in English and Spanish? Cult Divers Ment Health. 1997;3(2):93–111. https://doi.org/10.1037/1099-9809.3.2.93.
Kim HG, Geppert J, Quan T, Bracha Y, Lupo V, Cutts DB. Screening for postpartum depression among low-income mothers using an interactive voice response system. Matern Child Health J. 2012;16(4):921–8. https://doi.org/10.1007/s10995-011-0817-6.
Munoz RF, McQuaid JR, Gonzalez GM, Dimas J, Rosales VA. Depression screening in a women’s clinic: using automated Spanish- and English-language voice recognition. J Consult Clin Psychol. 1999;67(4):502–10. https://doi.org/10.1037/0022-006X.67.4.502.
Ozkanca Y, Ozturk MG, Ekmekci MN, Atkins DC, Demiroglu C, Ghomi RH. Depression screening from voice samples of patients affected by Parkinson’s disease. Digit Biomark. 2019;3(2):72–82. https://doi.org/10.1159/000500354.
Arora S, Visanji NP, Mestre TA, Tsanas A, AlDakheel A, Connolly BS, et al. Investigating voice as a biomarker for leucine-rich repeat kinase 2-associated Parkinson’s disease. J Parkinsons Dis. 2018;8(4):503–10. https://doi.org/10.3233/JPD-181389.
Postuma RB. Voice changes in prodromal Parkinson’s disease: is a new biomarker within earshot? Sleep Med. 2016;19:148–9. https://doi.org/10.1016/j.sleep.2015.08.019.
Tracy JM, Ozkanca Y, Atkins DC, Hosseini Ghomi R. Investigating voice as a biomarker: deep phenotyping methods for early detection of Parkinson’s disease. J Biomed Inform. 2020;104:103362. https://doi.org/10.1016/j.jbi.2019.103362.
Arora S, Baghai-Ravary L, Tsanas A. Developing a large scale population screening tool for the assessment of Parkinson’s disease using telephone-quality voice. J Acoust Soc Am. 2019;145(5):2871–84. https://doi.org/10.1121/1.5100272.
Kourtis LC, Regele OB, Wright JM, Jones GB. Digital biomarkers for Alzheimer’s disease: the mobile/wearable devices opportunity. NPJ Digit Med. 2019;2(1). https://doi.org/10.1038/s41746-019-0084-2.
Gold M, Amatniek J, Carrillo MC, Cedarbaum JM, Hendrix JA, Miller BB, et al. Digital technologies as biomarkers, clinical outcomes assessment, and recruitment tools in Alzheimer’s disease clinical trials. Alzheimers Dement. 2018;4(1):234–42. https://doi.org/10.1016/j.trci.2018.04.003.
Eyigoz E, Mathur S, Santamaria M, Cecchi G, Naylor M. Linguistic markers predict onset of Alzheimer’s disease. EClinicalMedicine. 2020;28:100583. https://doi.org/10.1016/j.eclinm.2020.100583.
Thomas JA, Burkhardt HA, Chaudhry S, Ngo AD, Sharma S, Zhang L, et al. Assessing the utility of language and voice biomarkers to predict cognitive impairment in the Framingham Heart Study Cognitive Aging Cohort Data. J Alzheimers Dis. 2020;76(3):905–22. https://doi.org/10.3233/JAD-190783.
This project was supported in part by the Karen Toffler Charitable Trust, a subaward (32307-93) from the NIDDK Diabetic Complications Consortium grant (U24-DK115255), a Strategically Focused Research Network (SFRN) Center Grant (20SFRN35460031) from the American Heart Association, a Hariri Research Award from the Hariri Institute for Computing and Computational Science & Engineering at Boston University, the Framingham Heart Study's National Heart, Lung, and Blood Institute contract (N01-HC-25195; HHSN268201500001I), and NIH grants (R01-AG062109, R21-CA253498, R01-AG008122, R01-AG016495, R01-AG033040, R01-AG054156, R01-AG049810, U19-AG068753, and R01-GM135930). Additional support was provided by Boston University's Affinity Research Collaboratives program, the Boston University Alzheimer's Disease Center (P30-AG013846), the National Science Foundation under grants DMS-1664644, CNS-1645681, and IIS-1914792, and the Office of Naval Research under grant N00014-19-1-2571.
Ethics approval and consent to participate
No ethics approval or participant consent was obtained because this study was based on retrospective data.
Consent for publication
All authors have approved the manuscript for publication.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Demographics of the participants who were non-demented (NDE, i.e., individuals with normal cognition (NC) and mild cognitive impairment (MCI)) at the time of the voice recordings. ApoE data was unavailable for one Gen 1 participant, thirteen Gen 2 participants, and one New Offspring Cohort (NOS) participant; MMSE data was not collected for all Gen 3, OmniGen 2, and NOS participants.
The dataset was first split into two parts such that a portion of the participants, along with their recordings, was kept aside for independent model testing. The models were trained on the remaining data using 5-fold cross-validation. We split the data at the participant level, so that all of a given participant's recordings fell within the same fold.
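A participant-level split of this kind can be sketched as follows. This is a minimal illustration rather than the study's actual code; the `participant_level_folds` helper and the representation of `recordings` as (participant_id, recording_id) pairs are assumptions for the example.

```python
import random
from collections import defaultdict

def participant_level_folds(recordings, n_folds=5, seed=0):
    """Assign cross-validation folds at the participant level, so that
    every recording from a given participant lands in the same fold.
    `recordings` is a list of (participant_id, recording_id) pairs."""
    participants = sorted({pid for pid, _ in recordings})
    rng = random.Random(seed)
    rng.shuffle(participants)
    # Round-robin assignment of shuffled participants to folds.
    fold_of = {pid: i % n_folds for i, pid in enumerate(participants)}
    folds = defaultdict(list)
    for pid, rec in recordings:
        folds[fold_of[pid]].append((pid, rec))
    return dict(folds)
```

Splitting by participant rather than by recording prevents leakage: two recordings of the same person never appear in both the training and validation portions of a fold.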
Long short-term memory (LSTM) networks were used to classify participants with normal cognition versus those with dementia, and to classify non-demented versus demented participants. The models were trained on full audio recordings, and performance was reported on audio samples of variable lengths extracted from the test data (see Figure S1). Plots (A) and (B) show the ROC and PR curves for the LSTM model on the normal cognition versus dementia task, and plots (C) and (D) show the ROC and PR curves for the LSTM model on the non-demented versus demented task.
Convolutional neural network (CNN) models were used to classify participants with normal cognition versus those with dementia, and to classify non-demented versus demented participants. The models were trained on full audio recordings, and performance was reported on audio samples of variable lengths extracted from the test data (see Figure S1). Plots (A) and (B) show the ROC and PR curves for the CNN model on the normal cognition versus dementia task, and plots (C) and (D) show the ROC and PR curves for the CNN model on the non-demented versus demented task.
For the neuropsychological tests that had too few samples, the mean and standard deviation of the salient administered fraction (SAF) for true positive (SAF[+]) and true negative (SAF[-]) cases are listed in descending order of the SAF[+] value. SAF[+] is calculated by summing the time spent on a given neuropsychological test that intersects segments of time that are DE[+] salient and dividing by the total time spent on that test; SAF[-] is calculated analogously using the segments that are not DE[+] salient. The numbers of samples for SAF[+] and SAF[-] indicate the numbers of true positive and true negative recordings that contain each neuropsychological test.
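The SAF calculation described above reduces to interval overlap. A minimal sketch, assuming test administrations and salient segments are given as non-overlapping (start, end) time pairs in seconds; this is an illustration of the described computation, not the study's actual code.

```python
def overlap(a, b):
    """Length of the overlap between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def salient_administered_fraction(test_intervals, salient_intervals):
    """SAF: fraction of the time spent on a neuropsychological test that
    falls inside DE[+]-salient segments of the recording.
    Both arguments are lists of (start, end) times in seconds."""
    total = sum(end - start for start, end in test_intervals)
    if total == 0:
        return 0.0
    hit = sum(overlap(t, s) for t in test_intervals for s in salient_intervals)
    return hit / total
```

If the DE[+]-salient segments and their complement partition the recording, SAF[-] for the same test is simply 1 minus SAF[+].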
About this article
Cite this article
Xue, C., Karjadi, C., Paschalidis, I.C. et al. Detection of dementia on voice recordings using deep learning: a Framingham Heart Study. Alz Res Therapy 13, 146 (2021). https://doi.org/10.1186/s13195-021-00888-3