What were the characteristics of the databases involved in reported studies?
Twenty-seven different databases were used across the 52 studies; the Pitt corpus and the ADReSS database appeared most often. Fourteen studies used the Pitt corpus from DementiaBank, and 19 studies used the ADReSS database.
The 27 databases covered 11 languages. Twenty-five databases were monolingual, in languages including Spanish, Chinese, English, Hungarian, Italian, Japanese, Brazilian Portuguese, and Swedish. Two databases were multilingual. For example, AZTIAHO included English, French, Spanish, Catalan, Basque, Chinese, Arabic, and Portuguese.
Across 29 databases, the labels were AD (Alzheimer’s disease), MCI (mild cognitive impairment), and HC (healthy control): 11 databases contained only AD and HC labels, 7 contained only MCI and HC labels, and 11 contained AD, MCI, and HC labels.
So far, the databases in the reported studies have been small, limited to one or a few languages, and unevenly distributed. In addition, most were built for cross-sectional studies rather than cohort studies.
What deep learning model architectures were included in reported studies?
Four types of deep learning methods were applied in the selected papers: FNN, CNN, LSTM, and attention mechanism-based models. Figure 3 shows the number of occurrences of each method. The models were generally basic; the embeddings they extracted were then collected for classification.
How were these deep learning model architectures used in reported studies?
The use of deep learning can be divided into three categories. First, models trained on a large database were used directly to extract embeddings, which were then passed to machine learning classifiers. Second, models were pre-trained on a large database and then fine-tuned on dementia-related databases. In some cases, self-training and data augmentation methods were used during pre-training. Third, deep learning models were built and trained from scratch on dementia-related databases.
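The first of these categories (a frozen feature extractor followed by a conventional classifier) can be sketched as follows. This is a minimal, dependency-free illustration and not the method of any reviewed study: `frozen_embed` is a hypothetical stand-in for a pre-trained encoder such as BERT, a nearest-centroid rule is swapped in for classifiers such as SVM, and the four transcripts are invented toy data.

```python
import math
from collections import defaultdict

FILLERS = {"um", "uh", "er"}


def frozen_embed(transcript):
    """Hypothetical stand-in for a frozen pre-trained encoder.

    A real pipeline would return an embedding from a large pre-trained
    model; three simple numbers keep this sketch self-contained.
    """
    words = transcript.lower().split()
    n = len(words)
    return [
        len(set(words)) / n,                    # lexical diversity
        sum(w in FILLERS for w in words) / n,   # filler rate
        sum(len(w) for w in words) / n / 10.0,  # scaled mean word length
    ]


def fit_centroids(vectors, labels):
    """Train the downstream classifier: one mean vector per label."""
    grouped = defaultdict(list)
    for vec, label in zip(vectors, labels):
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids, vec):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# Invented toy transcripts; labels follow the AD/HC convention of the text.
train = [
    ("the boy is taking cookies from the jar", "HC"),
    ("the mother is drying dishes at the sink", "HC"),
    ("um the boy uh the the thing um there", "AD"),
    ("uh she is uh she is um washing", "AD"),
]

# Step 1: extract features with the frozen encoder (it is never updated).
features = [frozen_embed(text) for text, _ in train]
# Step 2: train only the lightweight classifier on those features.
centroids = fit_centroids(features, [label for _, label in train])

print(predict(centroids, frozen_embed("the girl is reaching for the cookie jar")))
```

In the second category the encoder weights would also be updated on the dementia-related data, and in the third there would be no pre-trained encoder at all.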
What classification performance has been achieved?
The performance advantages of deep learning compared to the traditional method
Balagopalan, A. et al. evaluated several classification models on the ADReSS dataset, including SVM, NB, RF, FNN, and BERT. According to their results, the FNN achieved an average accuracy of 77.08% on the ADReSS test set over 3 runs, which is higher than RF and NB but lower than the 81.25% average accuracy of the SVM classifier. BERT, however, gave the best classification result, with an accuracy of 83.32% [54]. Deep learning has also achieved better results on acoustic features, not only on linguistic features. Bertini, F. et al. used an autoencoder to extract unsupervised features from audio data and then used an FNN to achieve 93.3% classification accuracy on the Pitt dataset, which is better than the results obtained by traditional machine learning methods such as SVM, NB, and RF [33].
In AD detection, deep learning methods can improve the performance of classification models compared with traditional machine learning methods.
In addition, Fig. 4 uses box plots to compare methods with and without pre-training on the SS-PD-CT2 task, evaluated on a test set. It shows that pre-training is more effective than training models from scratch.
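As a rough illustration of what such a box-plot comparison summarizes, the sketch below computes the five statistics a box plot draws for two groups. It uses only the ADReSS accuracies quoted later in this section (refs. [44], [55], [53], [51], [50] for pre-trained approaches and [34, 60], [86], [36], [35], [37] for training from scratch), which is an assumption made for illustration and is not the full data behind Fig. 4. The function name `five_number_summary` is hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return the statistics a box plot draws for one group."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "min": min(values),
        "q1": q1,
        "median": statistics.median(values),
        "q3": q3,
        "max": max(values),
    }


# Accuracies (%) quoted in this section; a subset chosen for illustration.
pretrained = [90.0, 89.6, 89.58, 91.67, 85.4]
from_scratch = [83.33, 64.58, 74.55, 84.0, 84.0, 72.92]

for name, group in [("pre-trained", pretrained), ("from scratch", from_scratch)]:
    print(name, five_number_summary(group))
```

On this subset the median of the pre-trained group is higher than that of the from-scratch group, which is consistent with the trend described for Fig. 4.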
Performance difference based on different tasks
In terms of task selection, SS tasks generally work better than other tasks. In 2017 and 2018, Lopez-De-Ipina, K. et al. studied AD detection on VF and SS tasks, mainly using acoustic features. The detection accuracies on the SS tasks were higher than the result on the VF task [72].
SS tasks can be divided into several different subtasks, including PD, Conversation/interview, and Recall.
Most PD studies were based on the ADReSS or Pitt database: 21 studies used the ADReSS database and 11 studies used the Pitt database. The ADReSS test set was the same across studies; more than 75% of the studies exceeded 80% detection accuracy, and the best result reached 91.67%. On the Pitt database, cross-validation accuracy exceeded 80% in 85% of the studies, and the best result reached 91.25%. Ten reported studies contained conversation tasks [14, 16, 26, 27, 39, 42, 56, 64, 75, 76].
Although different databases were used, high accuracy was achieved under cross-validation: 85% of the studies exceeded 85% accuracy, and the best result reached 95%.
Four studies involved Recall tasks, and all of them reached 80% accuracy.
Comparisons of methods for the ADReSS Challenge
The ADReSS Challenge is the most recent internationally representative speech-based AD detection competition, held at Interspeech in 2020–2021. Its main objective is to provide a benchmark dataset of spontaneous speech that is acoustically pre-processed and balanced in terms of age and gender, and to define a shared task through which different approaches to AD recognition in spontaneous speech can be compared. The top five participating teams mainly used pre-training, applying deep learning techniques in two ways.
The first way is to pre-train a deep learning architecture on large datasets and then fine-tune it on the ADReSS dataset. Saltz, P. et al. [44], Yuan, J. et al. [55], and Zhu, Y. et al. [53] pre-trained and then fine-tuned BERT-, ERNIE-, and Longformer-based model architectures, reaching 90%, 89.6%, and 89.58% accuracy on the ADReSS test set, respectively. In terms of features, Saltz, P. et al. and Yuan, J. et al. used linguistic embeddings only, and Zhu, Y. et al. used acoustic and linguistic embeddings. In addition, Saltz, P. et al. used augmented data during the training stage, Yuan, J. et al. encoded pauses into the transcript and then obtained an embedding vector for classification, and Zhu, Y. et al. used Longformer-based transfer learning.
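The idea of encoding pauses into a transcript before embedding it can be illustrated with a small sketch. This is not the implementation of [55]: the function name `encode_pauses`, the punctuation tokens, and the duration thresholds are all assumptions chosen for illustration, and the word timings are invented.

```python
def encode_pauses(words, min_pause=0.05, medium=0.5, long=2.0):
    """Insert pause tokens between time-aligned words.

    words: list of (token, start_sec, end_sec), sorted by start time.
    The thresholds (in seconds) are illustrative assumptions.
    """
    out = []
    prev_end = None
    for token, start, end in words:
        if prev_end is not None:
            gap = start - prev_end  # silence before this word
            if gap >= long:
                out.append("...")   # long pause
            elif gap >= medium:
                out.append(".")     # medium pause
            elif gap >= min_pause:
                out.append(",")     # short pause
        out.append(token)
        prev_end = end
    return " ".join(out)


# Invented word timings, e.g. as produced by a forced aligner.
aligned = [
    ("the", 0.0, 0.2),
    ("boy", 0.3, 0.5),
    ("is", 1.2, 1.3),
    ("taking", 3.6, 4.0),
    ("cookies", 4.02, 4.5),
]
print(encode_pauses(aligned))
```

The resulting string can be tokenized by a text encoder like any other transcript, so pause information reaches the model without changing its architecture.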
The second way is to extract features with a deep learning architecture and then train traditional machine learning classifiers on the extracted features. Syed, Z. S. et al. [51] combined traditional linguistic features with linguistic embeddings extracted from a pre-trained BERT-based model, trained classifiers through ensemble learning, and fused their outputs by majority voting, eventually reaching 91.67% accuracy on the ADReSS test set. Haulcy, R. et al. [50] extracted linguistic embeddings from BERT and classified them with an SVM or RF classifier, achieving 85.4% accuracy.
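Majority-voting fusion of several classifiers can be sketched in a few lines. This is a generic illustration, not the ensemble of [51]; the function name `majority_vote` and the three prediction lists are hypothetical.

```python
from collections import Counter


def majority_vote(per_model_predictions):
    """Fuse label predictions from several models, sample by sample.

    per_model_predictions: list of equally long label lists, one per model.
    With an odd number of models and two classes, ties cannot occur.
    """
    fused = []
    for votes in zip(*per_model_predictions):
        label, _ = Counter(votes).most_common(1)[0]
        fused.append(label)
    return fused


# Hypothetical outputs of three classifiers on four test recordings.
model_a = ["AD", "HC", "AD", "HC"]
model_b = ["AD", "AD", "HC", "HC"]
model_c = ["HC", "AD", "AD", "HC"]

print(majority_vote([model_a, model_b, model_c]))
```

Each fused label is the one predicted by most models, so an error made by a single classifier is outvoted when the other two agree.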
In addition, some other text-based pre-trained models work well. For example, BERT, parts of BERT, or BERT-based adaptation models [46, 47, 54, 65] reached accuracies between 81% and 84.51%. Beyond text-based pre-trained models, audio- and image-based pre-trained models have also been explored in speech-based AD detection. Chlasta, K. et al. [48] trained a modified VGGNet architecture to extract acoustic embeddings, while Gauder, L. et al. [49] trained the wav2vec 2.0 framework to extract acoustic embedding vectors; both added modified CNN modules for classification, reaching 62.5% and 78.9% accuracy, respectively.
Another training method in the ADReSS Challenge is training from scratch. With traditional linguistic and acoustic features, architectures such as FNN [34, 60], attention mechanism-based LSTM [86], and CNN-LSTM [36] reached 83.33%, 64.58%, and 74.55% accuracy, respectively. After duration features were added, a BiLSTM with highway layers, a CNN-BiLSTM-attention-based architecture [35], and a dense layer with GRU model [37] reached 84%, 84%, and 72.92% accuracy, respectively.
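To put the reported test accuracies in perspective, the sketch below converts them into counts of correctly classified recordings. It assumes that the ADReSS 2020 test partition contains 48 recordings, a figure that is not stated in this section, so the counts should be read as illustrative. The names `TEST_SET_SIZE`, `implied_correct`, and `step_size` are hypothetical.

```python
TEST_SET_SIZE = 48  # assumption: size of the ADReSS 2020 test partition


def implied_correct(accuracy_percent, n=TEST_SET_SIZE):
    """Number of correct predictions implied by an accuracy on n samples."""
    return round(accuracy_percent / 100 * n)


def step_size(n=TEST_SET_SIZE):
    """Accuracy change (in percentage points) caused by one sample."""
    return 100 / n


# Accuracies (%) quoted in this section.
for acc in [91.67, 89.58, 83.33, 72.92, 64.58]:
    print(acc, "->", implied_correct(acc), "of", TEST_SET_SIZE)
```

Under that assumption, one recording is worth about 2.08 percentage points, so the gap between 91.67% and 89.58% corresponds to a single recording.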
When clinical data are limited, choosing a proper pre-training task and fine-tuning the model are important and effective for disease classification. Generally, CNN-based architectures extract local information, and LSTM- or BERT-based models extract temporal information. Specifically, a general way to build an AD classification framework is to pre-train a speech or text encoder on a large speech or text corpus, use the attention mechanism to map the correspondence, and then fine-tune the model on an AD or MCI dataset.
The algorithms and performances for detecting MCI
As an intermediate transition state between the normal aging process and mild AD, MCI plays an important role in the early screening of AD. Among the screened papers, 16 performed MCI detection experiments. Eleven of the 16 papers distinguished MCI patients from healthy people, while the rest addressed three-way classification of AD patients, MCI patients, and cognitively normal elders.
For the classification of MCI versus cognitively normal subjects, Lindsay, Hali et al. [38] used three different pre-trained models (fastText, spaCy, Wiki2Vec) to extract word embeddings and then used an SVM classifier to predict labels in different languages (French, German, Dutch), achieving 66%, 68%, and 69% AUC, respectively. For three-way classification of AD, MCI, and HC, Rodrigues Makiuch, M. et al. [39] used a gated convolutional neural network (GCNN) and achieved an accuracy of 60.6% on 40 s of speech data.
MCI manifests as mild cognitive decline. Compared with AD patients, most MCI patients have less severe memory loss and perform relatively normally on memory tests. As the screened papers show, detecting MCI patients from speech is more difficult than distinguishing AD patients from cognitively normal elders. Few studies currently address MCI detection, so it is of great value to further explore methods of detecting MCI with deep learning techniques.
What were the mainstreams and limitations of reported studies?
The mainstreams and limitations of these selected studies were mainly reflected in language tasks, data modalities, extracted features, and model performance.
Language tasks
Various databases were built to collect speech from AD patients and healthy people using various tasks. From the databases introduced in section 4.2 of this article, the current mainstream language tasks are semantic verbal fluency tasks, spontaneous speech tasks, and some reading tasks.
Semantic verbal fluency tasks include animal, vegetable, and location naming tasks. Spontaneous speech tasks comprise interviews or conversations, recall tasks, and picture description tasks.
From this, we can find that there are many kinds of language tasks, which makes it difficult for researchers to compare their research results.
Therefore, the Pitt corpus and the ADReSS database, both based on the picture description task, provide comparable, distribution-balanced data, and researchers have begun to focus on these two databases for AD classification tests.
However, both the Pitt corpus and the ADReSS database are in English and contain a small amount of data, so current research is also limited to a certain extent.
Data modalities
Based on our table in the “Deep learning techniques” section, researchers used speech, text, or both to conduct experiments, and some compared classification results on the same evaluation test set.
The current research trend is to obtain more characteristic information by combining multimodal data. Different modalities have different representations, so their information partly overlaps and is partly complementary, and it can interact in a variety of ways. Researchers need no longer be limited to the speech and text information of AD patients. Improving the accuracy of the overall decision by integrating multimodal data such as eye movement data, writing data, and gait performance is also an interesting topic that needs further investigation.
Extracted features
Traditional linguistic and acoustic features were mostly handcrafted and were therefore explainable. Deep learning-based feature extraction and classification techniques achieved high accuracy for AD classification but lack interpretability.
Deep learning-based feature extraction methods need large-scale data, and the extracted features are hard to define precisely and vary with the scale of the data. In addition, the tasks chosen to pre-train models for feature extraction, for example ASR or BERT, have not been fully compared and analyzed for AD classification tasks.
Model performance
The sections “How were these deep learning model architectures used in reported studies?” and “What classification performance has been achieved?” present the deep learning model architectures and training strategies adopted by the selected papers. In current studies, researchers use pre-trained models to address the shortage of training data in AD detection and achieve good results. Most speech-based AD detection using deep learning methods achieves an accuracy of about 85%. In the ADReSS challenge, some researchers have achieved an accuracy of nearly 90% using pre-trained models. However, traditional cognitive impairment screening scales, such as MMSE or MoCA, can usually achieve a screening accuracy of more than 93% [5]. Therefore, as a more convenient AD detection method, speech-based deep learning technology needs to be further improved.