Improving 3D convolutional neural network comprehensibility via interactive visualization of relevance maps: evaluation in Alzheimer’s disease

Table 3 Group separation performance for hippocampus volume and the convolutional neural network models

Sample	Hippocampus volume (residuals)		3D convolutional neural network
	Balanced accuracy (mean ± SD)	*AUC*	Balanced accuracy (mean ± SD)	*AUC* (mean ± SD)
ADNI-GO/2
MCI vs. CN	*(70.0% ± 6.8%)*	*(0.773 ± 0.091)*	*(74.5% ± 6.2%)*	*(0.785 ± 0.078)*
AD vs. CN	*(84.4% ± 3.6%)*	*(0.945 ± 0.024)*	*(88.9% ± 5.3%)*	*(0.949 ± 0.029)*
MCI⁺ vs. CN⁻	*(75.6% ± 7.1%)*	*(0.831 ± 0.080)*	*(86.7% ± 10.3%)*	*(0.925 ± 0.071)*
AD⁺ vs. CN⁻	*(86.2% ± 4.2%)*	*(0.954 ± 0.025)*	*(94.9% ± 3.8%)*	*(0.985 ± 0.017)*
ADNI-3
MCI vs. CN	62.8% (63.1% ± 1.4%)	0.683	63.1% (63.6% ± 1.5%)	0.684 (0.677 ± 0.020)
AD vs. CN	83.4% (83.4% ± 0.4%)	0.917	84.4% (81.7% ± 2.9%)	0.913 (0.899 ± 0.013)
MCI⁺ vs. CN⁻	69.1% (69.2% ± 2.7%)	0.791	69.8% (68.3% ± 4.4%)	0.810 (0.742 ± 0.024)
AD⁺ vs. CN⁻	83.6% (82.0% ± 1.8%)	0.882	80.2% (75.5% ± 4.2%)	0.830 (0.828 ± 0.028)
AIBL
MCI vs. CN	67.4% (67.6% ± 0.5%)	0.741	68.2% (67.3% ± 2.7%)	0.763 (0.749 ± 0.012)
AD vs. CN	84.1% (85.3% ± 1.5%)	0.927	85.0% (82.3% ± 3.0%)	0.950 (0.926 ± 0.007)
MCI⁺ vs. CN⁻	78.5% (78.8% ± 0.9%)	0.874	75.4% (73.6% ± 3.1%)	0.828 (0.814 ± 0.022)
AD⁺ vs. CN⁻	87.2% (89.1% ± 2.4%)	0.976	88.3% (85.3% ± 3.3%)	0.978 (0.958 ± 0.011)
DELCODE
MCI vs. CN	69.0% (69.0% ± 9.6%)	0.774	71.0% (69.7% ± 2.6%)	0.775 (0.772 ± 0.017)
AD vs. CN	88.4% (86.4% ± 3.0%)	0.943	85.5% (80.5% ± 4.0%)	0.953 (0.938 ± 0.013)
MCI⁺ vs. CN⁻	77.4% (77.8% ± 0.7%)	0.867	72.2% (74.9% ± 3.5%)	0.840 (0.830 ± 0.017)
AD⁺ vs. CN⁻	88.2% (87.6% ± 1.8%)	0.954	83.3% (82.2% ± 4.0%)	0.968 (0.956 ± 0.012)

Reported values are for the single model trained on the whole ADNI-GO/2 dataset. Values in parentheses are the mean and standard deviation across the ten models trained in the tenfold cross-validation procedure and indicate the variability of the measures. Values for the ADNI-GO/2 sample (in italics) may be biased because the respective test subsamples were used to determine the optimal model during training; they are still reported to allow a better comparison of model performance across samples.
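As an illustration of how entries of this kind can be derived, the following minimal sketch computes balanced accuracy and AUC for one binary group comparison and summarizes per-fold values as mean ± SD. It is not the authors' code. The function names (`balanced_accuracy`, `auc`, `mean_sd`), the label coding (1 = patient group, 0 = control group), and the use of the sample standard deviation are assumptions made for this example; the source does not state which SD estimator was used.

```python
from statistics import mean, stdev


def balanced_accuracy(y_true, y_pred):
    """Average of sensitivity and specificity for binary labels.

    Assumed coding: 1 = patient group (e.g. AD), 0 = control group (CN).
    """
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = sum(1 for t in y_true if t == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (true_pos / n_pos + true_neg / n_neg)


def auc(y_true, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    Equals the fraction of (patient, control) pairs in which the patient
    receives the higher score; ties count as half a pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_sd(per_fold_values):
    """Mean and sample standard deviation across cross-validation models.

    Assumption: sample SD (n - 1 denominator); the table footnote does not
    say which estimator was used.
    """
    return mean(per_fold_values), stdev(per_fold_values)


if __name__ == "__main__":
    # Hypothetical toy labels and scores, not data from the study.
    labels = [1, 1, 1, 0, 0, 0, 0]
    scores = [0.9, 0.7, 0.4, 0.2, 0.3, 0.1, 0.6]
    preds = [1 if s >= 0.5 else 0 for s in scores]
    print(f"balanced accuracy: {balanced_accuracy(labels, preds):.3f}")
    print(f"AUC: {auc(labels, scores):.3f}")
```

Balanced accuracy is a natural choice for comparisons with unequal group sizes, since it weights sensitivity and specificity equally regardless of how many participants fall into each group.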

ISSN: 1758-9193