Comparative analysis of machine learning algorithms for multi-syndrome classification of neurodegenerative syndromes



The entry of artificial intelligence into medicine is pending. Several methods have been used for predictions from structured neuroimaging data, yet no study has compared them in this context.


Multi-class prediction is key for building computational aid systems for differential diagnosis. We compared support vector machine, random forest, gradient boosting, and deep feed-forward neural networks for the classification of different neurodegenerative syndromes based on structural magnetic resonance imaging.

Design, setting, and participants

Atlas-based volumetry was performed on multi-centric T1-weighted MRI data from 940 subjects, i.e., 124 healthy controls and 816 patients with ten different neurodegenerative diseases, leading to a multi-diagnostic multi-class classification task with eleven different classes.



Main outcomes and measures

Cohen’s kappa, accuracy, and F1-score to assess model performance.


Overall, the neural network produced both the best performance measures and the most robust results. The smaller classes however were better classified by either the ensemble learning methods or the support vector machine, while performance measures for small classes were comparatively low, as expected. Diseases with regionally specific and pronounced atrophy patterns were generally better classified than diseases with widespread and rather weak atrophy.

Conclusions and relevance

Our study furthermore underlines the necessity of larger data sets but also calls for a careful consideration of different machine learning methods that can handle the type of data and the classification task best.

Key points

Question: This study compares the different machine learning methods for predicting several neurodegenerative syndromes.

Findings: In the comparison of support vector machine, random forest, gradient boosting, and deep feed-forward neural networks, the neural networks performed best for the classification of different neurodegenerative syndromes based on pre-structured volume measures.

Meaning: Even with pre-structured data, deep neural networks are most promising.


In light of the demographic shift and the pending shortage of resources in healthcare systems across the globe, computer-aided methods are to shoulder some of the challenges. Supportive technology will find its way into the clinic to assist physicians in finding the correct diagnosis [1]. The implementation of artificial intelligence into clinical routine is happening already, and it is a matter of time until medical decisions will rely on algorithms in conjunction with the experience of physicians.

In case of neurodegenerative syndromes, brain imaging can render important MRI-morphological biomarkers in the form of atrophy patterns. While some focal atrophy patterns are quite disease-specific [2,3,4,5,6,7,8] leading even to incorporation into diagnostic criteria [9,10,11], neuroimaging findings for other diseases might be less conclusive [12]. However, it requires highly trained and specialized neuroradiologists to correctly detect and interpret the signs—an expertise that is not available ubiquitously.

For analyzing the complex multivariate and nonlinear relationships in high dimensional data derived from MRI data, machine learning algorithms are superior to standard inferential statistics [13, 14]. For the classification of neurological and psychiatric diseases, support vector machines (SVM) based on imaging-derived data have been the most popular method [14]. SVMs have proven to be a suitable approach at least in binary differentiations of patients from healthy controls [13,14,15]. A few studies further used SVM to differentiate disease entities from each other—a more complex approach that simulates the process of differential diagnosis. In a previous study, we assessed the performance of SVM to differentiate two dementia syndromes from each other [16, 17]. In another study, SVM was used to classify various parkinsonian syndromes based on the results of volumetric MRI analysis [18]. While SVM produced satisfactory results, other methodological approaches were not assessed further.

In recent years, deep learning methods have become more and more popular for pattern recognition tasks such as the classification of image and text data, but also of structured data [19]. Deep learning methods process data on several levels. In this way, more and more abstract representations are generated up to the class as the most abstract form of representation [19]. Deep neural networks (DNNs) in particular have proven to be highly proficient in predicting diagnoses based on imaging data of the eye, skin, or lung [20,21,22] and will most likely become a key component of imaging diagnostics in the future. Hopefully, these advanced models will be able to capture more complex atrophy patterns in the human brain than SVM approaches and might assist radiologists with their assessment in the future.

Accordingly, in this work we compare these models for the classification of neurodegenerative syndromes based on atlas-based volume measures in a large dataset including numerous diseases. Besides DNN and SVM, we apply two ensemble learning methods, random forest (RF) and gradient boosting (GB), that have proven proficient in many classification challenges dealing with similar data [14, 23]. Preprocessing the data into structured data via atlas-based volumetry is useful for clinical purposes, because it normalizes the data, thereby reduces inter-center variability, guarantees a complete anonymization of the data, and decreases computing time when training the models.

The syndromes considered in this study all belong to the neurodegenerative disease spectrum ranging from Alzheimer’s disease (AD), frontotemporal lobar degeneration with its subtypes behavioral variant frontotemporal dementia (bvFTD), and primary progressive aphasias (PPA) with the three subforms—semantic variant (svPPA), nonfluent-agrammatic variant (nfvPPA), and logopenic variant (lvPPA)—to atypical Parkinson syndromes such as corticobasal syndrome (CBS), progressive supranuclear palsy (PSP), and multiple system atrophy with cerebellar features (MSA-C), as well as MSA with predominant parkinsonism (MSA-P) and idiopathic Parkinson’s disease (PD).

This use case is exemplary for imaging-derived structural data and can be transferred to other use cases of the biomedical sciences. By including ten different neurodegenerative diseases beside a control cohort, our approach mirrors best the work of radiologists in clinical routine, i.e., firstly categorizing a brain scan as normal or abnormal and secondly defining the neurodegenerative entity in the differential diagnostic process. We hypothesize (i) that neurodegenerative diseases can be classified with reasonable accuracy from structural brain imaging data, in particular, if they are characterized by specific atrophy patterns, and (ii) that DNNs perform better than SVM.


Subjects and demographic characteristics

The study included multi-centric data from 940 subjects, i.e., 124 healthy controls and 816 patients from the German Research Consortium of Frontotemporal Lobar Degeneration [24] and from the German Atypical Parkinson Consortium Study Group [18, 25]. The patient cohort consisted of 72 patients with AD, 146 patients with bvFTD, 26 patients with CBS, 30 patients with lvPPA, 21 patients with MSA-C, 60 patients with MSA-P, 58 patients with nfvPPA, 203 patients with PD, 154 patients with PSP, and 46 patients with svPPA.
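As a quick check on the class sizes reported above, the following Python sketch sums the cohort and computes the accuracy that a trivial classifier would reach by always predicting the largest class. The dictionary `class_sizes`, the abbreviation `HC` for healthy controls, and the majority-class baseline are illustrative additions and not part of the study's analysis.

```python
# Class sizes as reported in the text (healthy controls plus ten syndromes).
class_sizes = {
    "HC": 124, "AD": 72, "bvFTD": 146, "CBS": 26, "lvPPA": 30,
    "MSA-C": 21, "MSA-P": 60, "nfvPPA": 58, "PD": 203, "PSP": 154,
    "svPPA": 46,
}

n_total = sum(class_sizes.values())
n_patients = n_total - class_sizes["HC"]

# Accuracy of a hypothetical baseline that always predicts the largest class.
largest_class = max(class_sizes, key=class_sizes.get)
majority_baseline = class_sizes[largest_class] / n_total

print(n_total, n_patients, largest_class, round(majority_baseline, 3))
```

The ratio between the largest and the smallest class (203 vs. 21 subjects) illustrates the imbalance that any overall performance measure has to account for.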

Figure 1 and Table 1 provide an overview of the age and gender distribution of the study cohort. Age distribution was compared with the Kruskal-Wallis test and post hoc with a Wilcoxon rank-sum test between all pairs of samples (Bonferroni-corrected). Patients with AD were significantly older than patients with bvFTD (p < 0.05). Patients with PSP were significantly older than healthy controls (p < 0.001) and patients with MSA-C (p < 0.05), MSA-P (p < 0.001), PD (p < 0.05), bvFTD (p < 0.001), and svPPA (p < 0.001). Furthermore, patients with bvFTD were significantly younger than patients with nfvPPA (p < 0.001). Also, patients with svPPA were significantly younger than patients with nfvPPA (p < 0.05).

Fig. 1

Violin plot of the age and gender distribution of the cohort sample. The dashed line indicates the mean, and the dotted line indicates the standard deviation. AD, Alzheimer’s disease; bvFTD, behavioral variant frontotemporal dementia; CBS, corticobasal syndrome; lvPPA, logopenic variant primary progressive aphasia; MSA-C, multiple system atrophy (cerebellar dysfunction subtype); MSA-P, multiple system atrophy (parkinsonian subtype); nfvPPA, nonfluent variant primary progressive aphasia; PD, Parkinson’s disease; PSP, progressive supranuclear palsy; svPPA, semantic variant primary progressive aphasia

Table 1 Demographic characteristics for patients and healthy controls

Gender distribution was tested pairwise with the Fisher test (Bonferroni-corrected) post hoc if the chi-square test indicated significant differences (chi-square = 38.855, p < 0.001). The gender distribution significantly differed between patients with bvFTD and PD (p < 0.001), MSA-P (p < 0.05), and PSP (p < 0.05). Furthermore, there was a significant difference in gender distribution between patients with PD and svPPA (p < 0.05).

The study was conducted according to the Declaration of Helsinki. It was approved by the local ethics committees of all participating centers. Patients, participants, caregivers, or legal representatives gave written informed consent for the study.

Imaging acquisition and analysis

Standardized structural MRI head scans were acquired multi-centrically at German university hospitals. Every subject underwent a T1-weighted three-dimensional (1-mm isovoxel resolution) magnetization-prepared rapid gradient echo (MPRAGE) brain MRI scan [18, 24, 26]. The MPRAGE sequence was converted to ANALYZE 7.5 format, and the file names were pseudonymized before further processing. Whereas standardized operating procedures (SOPs) have been applied throughout the data acquisition including MRI in the German Research Consortium of Frontotemporal Lobar Degeneration, no sequence adjustment or homogenization between the centers was done in the German Atypical Parkinson Consortium Study. Instead, the MPRAGE sequence from the clinical routine at each center was used (for further information on MRI parameters, see papers and supplemental materials [4, 17, 18, 25, 27]). Atlas-based volumetric analysis of the MPRAGE sequence data was done using the LONI Probabilistic Brain Atlas (LPBA40) [28], and further masks were derived from this atlas. The volume measures of the atlas structures served as the input vector for the models. A detailed description of all image processing steps and the 63 atlas structures included can be found in [18]. Before being used as predictive features, all volume results were corrected for intracranial volume (ICV).

Training and evaluation of classifiers

In order to reduce the bias of the existing sampling distribution, we used a 5-fold cross-validation with the full dataset: models were trained on 80% of the data (four folds), and the remaining 20% (one fold) served for testing. The folds were selected randomly, and the experiments were repeated ten times. Thus, we trained and evaluated 50 models of each type (see Fig. 2). In each training iteration, we optimized the learning and hyperparameters of the RF model, the GB, and the SVM using Bayesian optimization. In contrast to a grid search or a random search, Bayesian optimization searches sequentially, so that each step takes the results of all previous steps into account, which leads to better optimization results [29]. During the optimization, overfitting was reduced by an inner 5-fold cross-validation, in which the training data was split into training and validation data. The optimization was run for 120 iterations.
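The repeated cross-validation scheme can be illustrated with a short Python sketch. It is a minimal illustration using only the standard library, not the study's code: the function name `repeated_kfold`, the seeding scheme, and the way folds are cut from the shuffled indices are assumptions made for this example.

```python
import random

def repeated_kfold(n_samples, n_splits=5, n_repeats=10, base_seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated random k-fold CV."""
    for repeat in range(n_repeats):
        # A different seed per repetition changes the fold composition.
        rng = random.Random(base_seed + repeat)
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Cut the shuffled indices into n_splits folds of near-equal size.
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for fold, test_idx in enumerate(folds):
            train_idx = [i for j, f in enumerate(folds) if j != fold for i in f]
            yield repeat, fold, train_idx, test_idx

splits = list(repeated_kfold(n_samples=940))
print(len(splits))                           # 50 train/test splits
print(len(splits[0][2]), len(splits[0][3]))  # 752 training and 188 test subjects
```

With 940 subjects, each of the 50 splits trains on 752 subjects and tests on 188. The inner cross-validation used during hyperparameter optimization would apply the same idea to the training indices of each split.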

Fig. 2

Design of the repeated 5-fold cross-validation. These experiments are repeated ten times, with the individual folds composed differently in each repetition. The random compositions are controlled by the random seeds

The kernel applied in the SVM algorithm is critical for how it functions, and the learning and hyperparameters to be optimized depend on the selected kernel. For this reason, the optimization of the SVM was done separately with linear, sigmoid, polynomial, and radial basis function kernels. The linear kernel provided the best SVM performance, so we considered only the SVM with a linear kernel, for which the regularization parameter C, which penalizes misclassifications, was optimized.

For the RF classifier, we optimized the maximum depth of the tree, the number of features to consider, the minimum number of samples required for a leaf node, and the minimum number of samples required to split an internal node. In case of GB optimization, we additionally optimized the learning rate.

We used a feed-forward DNN with 72 neurons in the input layer and 90 neurons in each of the two hidden layers. In order to prevent overfitting, we used a dropout rate [30] of 45% for the neurons of the hidden layers and early stopping [31]. For the weight update, we used the Adam optimizer [32] with categorical cross-entropy as the loss function, where the optimizer was initialized with α = 0.001, β1 = 0.9, β2 = 0.999, and ϵ = 10⁻⁸. Training was done batchwise with batches of 30 patients. More complex architectures provided worse or similar classification results in our experiments.
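To make the role of the optimizer settings concrete, the following Python sketch applies one Adam update [32] to a single scalar weight, using the initialization values given above as defaults. It is an illustration of the update rule only; the function name `adam_step` and the toy gradient are assumptions, and the sketch does not reproduce the study's training code.

```python
import math

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of the squared gradient
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy example: weight 1.0, gradient 0.5, first update step.
theta, m, v = adam_step(theta=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(theta)
```

In the first step the bias-corrected ratio is close to the sign of the gradient, so the weight moves by roughly α = 0.001 regardless of the gradient's magnitude.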

The evaluation was done classwise considering the recall, the precision, and the F1-score. We do not report an overall recall, precision, or F1-score, because the bias included in the actual class distribution limits the suitability of these aggregated scores for model evaluation [33]. The F1-score is therefore not used for the overall model evaluation, but to show the distribution of predicted classes. To reduce this bias and evaluate the overall model performance, we use Cohen’s kappa coefficient κ [34], which is defined as follows:

$$\kappa =\frac{p_o-p_e}{1-p_e}$$

where \(p_o\) denotes the observed agreement between the ground truth syndrome classification y and the predicted syndrome \(\hat{y}\), and \(p_e\) the agreement expected by chance given the class frequencies in y and \(\hat{y}\). In addition, we show the accuracy, which allows a more humanly interpretable evaluation, but without considering the class imbalance. The accuracy is calculated for the total number of elements n in the test fold and has the following form:

$$\mathrm{accuracy}\left(y,\hat{y}\right)=\frac{1}{n}\sum_{i=1}^{n}1\left({\hat{y}}_i={y}_i\right)$$

where \(1(\cdot)\) is the indicator function, which equals 1 if the predicted syndrome of element i matches its ground truth and 0 otherwise.


To better understand the classification process of each model, we used a novel technique, namely “local interpretable model-agnostic explanations” (LIME), which explains the relationship between the components (here, brain regions) that are used for the classification and its predicted class (here, the syndrome) [35]. This method allowed the direct comparison of the decision process of all four models.


In the following, the model performance is presented with its classification results and the features, i.e., the brain regions, that have the highest value for the classification decision.

Comparison of classification models

The parameter settings for the tree-based methods (i.e., GB and RF) were found through the optimization processes after around 80 iterations, while the configuration for the SVM with linear kernel required only around 20 iterations. In the training of the DNN, the validation loss decreased until convergence, which was reached after about 100 epochs on average. Training was stopped if the validation score remained stable for 20 epochs. The convergence and the subsequent overfitting on the training data are shown in Fig. 3.
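The stopping rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: "stable" is interpreted as no improvement of the validation loss over its best value so far, the function name `early_stopping_epoch` and the `min_delta` parameter are hypothetical, and the validation curve is synthetic.

```python
def early_stopping_epoch(val_losses, patience=20, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # New best validation loss: reset the patience counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Synthetic validation curve: improves for 100 epochs, then stays worse.
val_losses = [1.0 - 0.005 * e for e in range(100)] + [0.6] * 40
print(early_stopping_epoch(val_losses))  # 120
```

With a patience of 20 epochs, a model whose validation loss stops improving after epoch 100 is halted at epoch 120.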

Fig. 3

Averaged training loss and validation loss of the DNN. For the consideration of overfitting, the early stopping was dispensed with for this recording

Model-wise performance measures can be found in Table 2. Among the models evaluated in this study, the DNN rendered the best classification results, producing a Cohen’s kappa score slightly larger than 0.4 as well as a total model accuracy of approximately 0.5. The second-best performance was obtained with SVM, followed by GB and RF. Furthermore, the variability over the 50 trained models was lowest for the DNN, which is reflected by the lowest standard deviation. This indicates that the DNN was the most reliable of the models across repetitions.

Model-wise performance measures are shown separately for each of the classes, i.e., diseases, in Table 3. Whereas some diseases such as PSP, svPPA, MSA-P, bvFTD, and PD reached relatively high classification performance, other classes reached middle values, i.e., healthy controls and AD, and others relatively low performance such as lvPPA, MSA-C, and nfvPPA. Of note, CBS was characterized by the lowest performance results. The order of the model-wise performance quality across the whole cohort (DNN > SVM > GB > RF) was also observed for AD and bvFTD, whereas the other classes showed a more complex picture.

Table 2 Metrics for model comparison

Importance of brain regions

The LIME method allowed us to assess the contribution of each brain region for classifying each syndrome within a model. A complete listing with the weighting of all brain regions for all models is publicly available in the project repository. In the interest of greater clarity, we display the five most important brain regions for all models for three selected pathologies with well-known atrophy patterns (i.e., AD, PSP, and svPPA; see Table 4). Note that the weighting of brain regions was averaged over all patients that were classified correctly by the respective model. All models independently identified the key regions, such as the midbrain for PSP, the inferior temporal gyrus on the left side for svPPA, and the hippocampus for AD.

Table 3 Class-wise performance metrics for multi-syndrome classification


In this work, we compared several well-established machine learning algorithms (i.e., DNN, GB, RF, and SVM) to predict the diagnosis out of numerous different neurodegenerative syndromes on the basis of pre-structured, atlas-based volumetric brain MRI data. In agreement with our hypothesis, we show that neurodegenerative diseases can be classified from structural brain imaging data, in particular, if they are characterized by specific atrophy patterns. Here, DNN showed a moderate performance, whereas the three other models showed a fair performance according to Cohen’s kappa scores. Although reasonable for this ambitious clinical question, the results did not reach the substantial or even almost perfect classification performance achieved in comparisons of single neurodegenerative diseases vs. controls [1, 17, 27, 36,37,38]. This important difference between the diagnostic (disease vs. control) and differential diagnostic (disease vs. disease) approach might be related to etiological overlap between clinical syndromes, unspecific atrophy patterns for some diseases, and even the fact that single patients might show different syndromes in the course of the disease. These severe limitations, to be addressed in future studies, hamper the translation of multi-syndrome classifiers to clinical settings to date. In the following, we will discuss our results in more detail.

Structuring imaging data for machine learning approaches

Pre-structuring of the data with atlas-based volumetry had some clear advantages. It allowed easy assessment of particular brain regions as contributing factors for the diagnosis on an individual level as well as across syndromes, thereby increasing the interpretability of the respective model. Moreover, data could be normalized individually by adjusting to the subject’s intracranial volume. Atlas-based volumetry may also be superior to voxel-based morphometry, because the impact of different centers, scanner types, protocols, and applied parameters seems to be decreased by the processing steps in atlas-based volumetry; this hypothesis has to be validated in future studies. Furthermore, using volumetry data allowed for the training of a model on a single CPU core with 6 GB of RAM. In contrast, the training of a convolutional neural network (CNN) with raw imaging data [39], which is the state-of-the-art method for image classification, requires machines with at least one 12-GB GPU or, in the case of 3D MRI volumes, a server with several GPUs [40]. Finally, pre-structuring increased the anonymity of the data, a general benefit that facilitates central data aggregation without risking the exposure of privacy-sensitive medical information.

We were not able to conduct the same experiment with raw imaging data because we did not have access to the raw images. Despite all the advantages listed above, pre-structured imaging data precludes data augmentation of raw imaging data, a powerful strategy to increase the amount of training data and thereby boost model performance. Furthermore, and perhaps more importantly, predefined feature extraction might lead to a loss of valuable information, which is a clear limitation of our study.

Comparison of machine learning models

In line with the literature [41], our results indicate that the DNN with a simple feed-forward architecture is the superior method for this kind of classification task, closely followed by the SVM as illustrated in Table 2. While neural networks became the state-of-the-art method for the processing of imaging data and text data, DNNs have also been shown to outperform tree-based methods as well as SVM on structured data [42]. However, it is informative to take a closer look at model performance and model robustness for every single class individually, especially considering the size of the class and the specificity of atrophy patterns, respectively (see Table 3). The DNN performed best (high F1-score and high robustness) in large classes (e.g., PD, bvFTD, AD, and PSP) where there was sufficient data for the model loss to converge. Generally, classes with smaller sample sizes expectedly led to models with weaker performance measures. GB and SVM seemed to perform best for smaller classes (e.g., MSA-C, lvPPA, CBS), while RF rendered the best robustness for smaller classes. The high robustness of RF in this case might be due to the prediction ensembles, while the superior performance of GB and SVM over the DNN might reflect that those models need less data than neural networks. Notably, classes with more specific atrophy patterns (e.g., svPPA and AD) were also best predicted by the DNN despite the comparatively small sample size, possibly due to the faster convergence of the loss function. As expected, diseases with regionally specific and pronounced atrophy patterns such as svPPA, AD, and PSP were generally better classified than diseases with widespread and rather weak atrophy such as CBS (see Fig. 4). The confusion matrices in Fig. 4 give an overview of the class-specific performance of the different methods and show that the DNN has a reasonable performance for all classes.

Fig. 4

Confusion matrix for every classification model. The values are averaged over all folds and given as a percentage of the true label. The coloration depends on the percentage of predicted cases (the scale goes from the highest value (dark blue) to 0 (white)). The matrices are row-normalized, so that the precision is shown in the cells where the row and column classes match

In conclusion, the larger the dataset, the better the performance, and it was with larger amounts of data that the DNNs clearly showed their superiority with respect to classification performance as well as robustness. However, the point of convergence is the critical factor for good performance, and determining it requires a balanced validation set.


The validation was not only performed by using the prediction score, but also the standard deviation of the prediction scores as a measure of robustness. Generally, the standard deviation of the model performance depends on the training dataset used, which is why we chose k-fold cross validation [43] instead of a leave-one-out cross-validation. In contrast to a leave-one-out cross-validation, a k-fold cross-validation changes the class distribution in the training dataset over the different experiments, which affects the model training. When a leave-one-out cross-validation is performed, a class imbalance in the dataset always exists in a similar ratio (with the exception of the validation instance) and is therefore reflected in a lower model quality. The highest overall robustness was observed for the DNN while the ensemble methods in turn were least robust, possibly due to their general propensity to overfit the models.

Model performance

Both recall and precision are class-wise measures and are thus independent of the number of true negatives, which are over-represented in a multiclass problem and thereby inflate measures contingent on the true negatives. The F1-score is a combination of both precision and recall and is supposed to give a more holistic measure of class-wise model performance.
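Because precision and recall only use true positives, false positives, and false negatives, they can be read directly off a confusion matrix. The following Python sketch shows this for a toy three-class problem; the function name `classwise_metrics`, the class labels, and the counts are made up for illustration and are not study data.

```python
def classwise_metrics(confusion, labels):
    """Per-class precision, recall, and F1 from a confusion matrix.

    confusion[i][j] counts cases with true class i predicted as class j.
    True negatives are never used.
    """
    metrics = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                  # true class k, predicted otherwise
        fp = sum(row[k] for row in confusion) - tp   # predicted k, true class differs
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1)
    return metrics

# Toy three-class example with made-up counts.
labels = ["A", "B", "C"]
confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]
results = classwise_metrics(confusion, labels)
for label, (p, r, f1) in results.items():
    print(label, round(p, 2), round(r, 2), round(f1, 2))
```

In this toy example, class B has a precision of 0.75 but a recall of only 0.6, which the F1-score summarizes as about 0.67.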

For the overall model performance, accuracy is a popular measure, which we included in the reported metrics. However, in the case of a multiclass problem with a large imbalance, accuracy is not able to provide an honest reflection of the overall model performance. For this reason, we limit the consideration to the Cohen’s kappa score for the overall model evaluation (Table 2), because this score allows a normalization by the size of the respective class [34]. For the interpretation of the Cohen’s kappa score, the following scheme can be used: 0–0.20 as slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1 as almost perfect agreement [44]. According to this scheme, the DNN shows a moderate performance and the three other models a fair performance. The confusion matrix (Fig. 4) further visualizes how the DNN performs better for all different classes in comparison with the tree-based methods, where the model overfits towards the larger classes such as PD, PSP, and bvFTD.
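The following Python sketch computes Cohen's kappa in its standard form, i.e., observed agreement corrected for the agreement expected by chance from the class frequencies, and maps the result to the verbal categories quoted above. The function names, the toy labels, and the choice to reject values outside 0 to 1 are assumptions made for this illustration.

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    # Observed agreement: fraction of matching labels (equals the accuracy).
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement from the marginal class frequencies.
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)

def interpret_kappa(kappa):
    """Map a kappa value between 0 and 1 to the categories quoted in the text [44]."""
    if not 0.0 <= kappa <= 1.0:
        raise ValueError("this sketch only covers kappa values between 0 and 1")
    scheme = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
              (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, name in scheme:
        if kappa <= upper:
            return name

# Toy example with made-up labels (not study data).
y_true = ["PD"] * 4 + ["PSP"] * 3 + ["AD"] * 3
y_pred = ["PD", "PD", "PD", "PSP", "PSP", "PSP", "PD", "AD", "AD", "PSP"]
kappa = cohens_kappa(y_true, y_pred)
print(round(kappa, 3), interpret_kappa(kappa))  # 0.545 moderate
```

In the toy example the accuracy is 0.7, but after removing the chance agreement of 0.34 the kappa drops to about 0.55, which shows why the two measures can differ noticeably.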

Feature importance

To better understand the decision-making process of every model, we extracted the feature importance with the LIME method. LIME explains individual predictions by fitting an interpretable surrogate model that locally approximates the predictive behavior of the classifier in use.

Despite the differences in performance metrics, all methods were able to reproduce well-known atrophy patterns of respective syndromes (see Table 4). Note that unlike in binary disease-vs.-healthy classification tasks, the interpretation of the feature importance resulting from a multiclass classification problem is more ambiguous. The “important features” listed in Table 4 merely reflect which brain regions were most important to differentiate the respective diagnosis from all other diagnoses included in the classification task.

Table 4 Brain regions with the highest weighting, i.e., importance, for classification


While the use of volumetry data simplifies the task of classification, it simultaneously limits the classification basis to atrophy patterns only and excludes brain tissue that has no effect on atrophy. The two-stage approach consisting of the volumetry calculation and the classification of the diseases also carries the risk of error summation, which can lead to increased prediction error compared to approaches that use the original data. Our study results might be limited by the unbalanced dataset, i.e., varying numbers of subjects per group. Although this variability reflects, at least partly, differences in prevalence and data availability, the findings of our study should be validated in future more comprehensive, better balanced, and preferably international cohorts. Such external validation is needed to improve model generalization.


In conclusion, we found the DNN to be the best method to assess imaging-derived structured data. However, the performance of different methods largely depends on the dataset and the underlying classification problem. To select the optimal method, one should test and validate several methods and consider the available computing resources. Despite the mentioned advantages of pre-structuring brain data, our future work will extend to the application of CNN [39] to raw MRI data as well, for which remarkable results have previously been achieved for the diagnosis of smaller numbers of neurodegenerative diseases [36, 45,46,47,48,49]. This addresses the aforementioned limitations; however, challenges arise in data privacy when processing the data and in the increased demand for training data, which requires further data acquisition given how rarely some of the neurodegenerative syndromes are covered.

Availability of data and materials

Project repository:


  1. Woo CW, Chang LJ, Lindquist MA, Wager TD. Building better biomarkers: brain models in translational neuroimaging. Nat Neurosci. 2017;20(3):365–77.

  2. Laakso MP, Soininen H, Partanen K, et al. MRI of the hippocampus in Alzheimer’s disease: sensitivity, specificity, and analysis of the incorrectly classified subjects. Neurobiol Aging. 1998;19:23–31.

  3. Schroeter ML, Neumann J. Combined imaging markers dissociate Alzheimer’s disease and frontotemporal lobar degeneration - an ALE meta-analysis. Front Aging Neurosci. 2011;3:10.

  4. Albrecht F, Bisenius S, Morales SR, Neumann J, Schroeter ML. Disentangling the neural correlates of corticobasal syndrome and corticobasal degeneration with systematic and quantitative ALE meta-analyses. NPJ Parkinsons Dis. 2017;3:12.

  5. Albrecht F, Bisenius S, Neumann J, Whitwell J, Schroeter ML. Atrophy in midbrain & cerebral/cerebellar pedunculi is characteristic for progressive supranuclear palsy - a double-validation whole-brain meta-analysis. Neuroimage Clin. 2019;22:101722.

  6. Bisenius S, Neumann J, Schroeter ML. Response to the letter on ‘Validating new diagnostic imaging criteria for primary progressive aphasia via anatomical likelihood estimation meta-analyses’. Eur J Neurol. 2016;23:e52–3.

  7. Schroeter ML, Stein T, Maslowski N, Neumann J. Neural correlates of Alzheimer’s disease and mild cognitive impairment: a systematic and quantitative meta-analysis involving 1351 patients. Neuroimage. 2009;47:1196–206.

  8. Schroeter ML, Laird AR, Chwiesko C, et al. Conceptualizing neuropsychiatric diseases with multimodal data-driven meta-analyses - the case of behavioral variant frontotemporal dementia. Cortex. 2014;57:22–37.

  9. Gorno-Tempini ML, Hillis AE, Weintraub S, et al. Classification of primary progressive aphasia and its variants. Neurology. 2011;76:1006–14.

  10. McKhann GM, Knopman DS, Chertkow H, et al. The diagnosis of dementia due to Alzheimer’s disease: recommendations from the National Institute on Aging-Alzheimer’s Association workgroups on diagnostic guidelines for Alzheimer’s disease. Alzheimers Dement. 2011;7:263–9.

  11. Rascovsky K, Hodges JR, Knopman D, et al. Sensitivity of revised diagnostic criteria for the behavioural variant of frontotemporal dementia. Brain. 2011;134:2456–77.

  12. Albrecht F, Ballarini T, Neumann J, Schroeter ML. FDG-PET hypometabolism is more sensitive than MRI atrophy in Parkinson’s disease: a whole-brain multimodal imaging meta-analysis. Neuroimage Clin. 2019;21:101594.

  13. Dwyer DB, Falkai P, Koutsouleris N. Machine learning approaches for clinical psychology and psychiatry. Annu Rev Clin Psychol. 2018;14:91–118.

  14. Arbabshirani MR, Plis S, Sui J, Calhoun VD. Single subject prediction of brain disorders in neuroimaging: promises and pitfalls. Neuroimage. 2017;145:137–65.

  15. Pellegrini E, Ballerini L, Hernandez MDCV, et al. Machine learning of neuroimaging for assisted diagnosis of cognitive impairment and dementia: a systematic review. Alzheimers Dement (Amst). 2018;10:519–35.

  16. Dukart J, Mueller K, Barthel H, Villringer A, Sabri O, Schroeter ML. Meta-analysis based SVM classification enables accurate detection of Alzheimer’s disease across different clinical centers using FDG-PET and MRI. Psychiatry Res. 2013;212:230–6.

  17. Bisenius S, Mueller K, Diehl-Schmid J, et al. Predicting primary progressive aphasias with support vector machine approaches in structural MRI data. Neuroimage Clin. 2017;14:334–43.

  18. Huppertz HJ, Möller L, Südmeyer M, et al. Differentiation of neurodegenerative parkinsonian syndromes by volumetric magnetic resonance imaging analysis and support vector machine classification. Mov Disord. 2016;31:1506–17.

  19. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44.

  20. Yu KH, Zhang C, Berry GJ, et al. Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features. Nat Commun. 2016;7:12474.

  21. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542:115–8.

  22. De Fauw J, Ledsam JR, Romera-Paredes B, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med. 2018;24:1342–50.

  23. Raczko E, Zagajewski B. Comparison of support vector machine, random forest and neural network classifiers for tree species classification on airborne hyperspectral APEX images. Eur J Remote Sens. 2017;50(1):144–54.

  24. Otto M, Ludolph AC, Landwehrmeyer B, et al. German consortium for frontotemporal lobar degeneration. Nervenarzt. 2011;82:1002–5.

  25. Möller L, Kassubek J, Südmeyer M, et al. Manual MRI morphometry in parkinsonian syndromes. Mov Disord. 2017;32:778–82.

  26. Brant-Zawadzki M, Gillan GD, Nitz WR. MP RAGE: a three-dimensional T1-weighted, gradient-echo sequence–initial experience in the brain. Radiology. 1992;182(3):769–75.

  27. Ballarini T, Albrecht F, Mueller K, et al. Disentangling brain functional network remodeling in corticobasal syndrome - a multimodal MRI study. Neuroimage Clin. 2020;25:102112.

  28. Shattuck DW, Mirza M, Adisetiyo V, et al. Construction of a 3D probabilistic atlas of human cortical structures. Neuroimage. 2008;39:1064–80.

  29. Frazier PI. A tutorial on Bayesian optimization. arXiv. 2018; arXiv:1807.02811 [stat.ML].

  30. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15

  31. Prechelt L. Automatic early stopping using cross validation: quantifying the criteria. Neural Netw. 1998;11:761–7.

  32. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv. 2015; arXiv:1412.6980 [cs.LG].

  33. Powers DMW. What the F-measure doesn’t measure: features, flaws, fallacies and fixes. arXiv. 2015; arXiv:1503.06410 [cs.IR].

  34. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20(1):37–46.

  35. Ribeiro MT, Singh S, Guestrin C. Why should I trust you?: explaining the predictions of any classifier. arXiv. 2016; arXiv:1602.04938 [cs.LG].

  36. Albrecht F, Mueller K, Ballarini T, et al. Unraveling corticobasal syndrome and alien limb syndrome with structural brain imaging. Cortex. 2019;117:33–40.

  37. Meyer S, Mueller K, Stuke K, et al. Predicting behavioral variant frontotemporal dementia with pattern classification in multi-center structural MRI data. Neuroimage Clin. 2017;14:656–62.

  38. Mueller K, Jech R, Bonnet C, et al. Disease-specific regions outperform whole-brain approaches in identifying progressive supranuclear palsy: a multicentric MRI study. Front Neurosci. 2017;11:100.

  39. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Commun ACM. 2017;60(6):84–90.

  40. Wen J, Thibeau-Sutre E, Diaz-Melo M, et al. Convolutional neural networks for classification of Alzheimer’s disease: overview and reproducible evaluation. arXiv. 2019; arXiv:1904.07773 [cs.LG].

  41. Huang GB, Chen YQ, Babri HA. Classification ability of single hidden layer feedforward neural networks. IEEE Trans Neural Netw. 2000;11:799–801.

  42. Bengio Y, Grandvalet Y. Bias in estimating the variance of K-fold cross-validation. In: Statistical modeling and analysis for complex data problems. Springer-Verlag. p. 75–95.

  43. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–74.

  44. Farooq A, Anwar SM, Awais M, Rehman S. A deep CNN based multi-class classification of Alzheimers disease using MRI. In: 2017 IEEE International Conference on Imaging Systems and Techniques (IST): IEEE; 2017.

  45. Sarraf S, DeSouza DD, Anderson J, Tofighi G. DeepAD: Alzheimer’s disease classification via deep convolutional neural networks using MRI and fMRI; 2016.

  46. Yuan L, Wei X, Shen H, Zeng LL, Hu D. Multi-center brain imaging classification using a novel 3D CNN approach. IEEE Access. 2018;6:49925–34.

  47. Basaia S, Agosta F, Wagner L, et al. Automated classification of Alzheimer’s disease and mild cognitive impairment using a single MRI and deep neural networks. Neuroimage Clin. 2019;21:101645.

  48. Tapiola T, Pennanen C, Tapiola M, et al. MRI of hippocampus and entorhinal cortex in mild cognitive impairment: a follow-up study. Neurobiol Aging. 2008;29:31–8.

  49. Grandvalet Y, Bengio Y. Entropy regularization. In: Semi-supervised learning: The MIT Press; 2006. p. 151–68.



See the “Funding” section.


This work was supported by the German Federal Ministry of Education and Research (BMBF) by a grant given to the German FTLD Consortium (FKZ O1GI1007A), by the German Research Foundation DFG (SCHR 774/5-1), by the Parkinson’s Disease Foundation (PDF-IRG-1307), and by the Michael J. Fox Foundation (MJFF-11362). Günter Höglinger was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy within the framework of the Munich Cluster for Systems Neurology (EXC 2145 SyNergy – ID 390857198) and the NOMIS foundation (FTLD project).

Author information

Authors and Affiliations




LL, SN, and MLS designed the research study and analytical approaches. AM, SAS, KFs, KFl, HJ, JK, ML, JP, AS, MSy, AD, JDS, MO, KE, EH, RHR, AS, MSü, WO, JK, and GH contributed to the data. LL, SN, HJH, JR, KM, AV, and MLS contributed essentially to the data analysis and interpretation. LL and SN wrote the first draft of the manuscript, which was revised by MLS. All authors contributed to the paper and agreed to the final version of the manuscript.

Corresponding authors

Correspondence to Leonie Lampe or Matthias L. Schroeter.

Ethics declarations

Ethics approval and consent to participate


Consent for publication

All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. A copy of this licence is available from Creative Commons. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Lampe, L., Niehaus, S., Huppertz, HJ. et al. Comparative analysis of machine learning algorithms for multi-syndrome classification of neurodegenerative syndromes. Alz Res Therapy 14, 62 (2022).
