Collection of GWAS SNPs from large-scale studies
In this study, we assembled multiple single-nucleotide polymorphisms (SNPs) associated with AD from 15 large-scale GWAS studies in diverse population groups, conducted between 2007 and 2019 (Table S1). Some of the collected SNPs may represent the same genetic signal due to the use of overlapping samples across studies. To avoid this bias, we filtered the collected SNPs to remove redundant genetic signals (Supplementary Methods). To maximize genetic signals based on the omnigenic hypothesis [20], we adopted a loose threshold (P < 1 × 10−5) to collect and filter AD SNPs and finally obtained 106 unique GWAS SNPs for downstream analyses.
Construction of human protein–protein interactome
To build a comprehensive human protein–protein interactome, we assembled data from 15 common resources with multiple levels of experimental evidence (Supplementary Methods). Specifically, we focused on high-quality protein–protein interactions (PPIs) with the following five types of experimental data: (1) binary PPIs tested by high-throughput yeast-two-hybrid (Y2H) systems; (2) kinase-substrate interactions by literature-derived low-throughput and high-throughput experiments; (3) literature-curated PPIs identified by affinity purification followed by mass spectrometry (AP-MS), Y2H, and by literature-derived low-throughput experiments, and protein three-dimensional structures; (4) signaling network by literature-derived low-throughput experiments; (5) protein complex data (see Supplementary Methods). The genes were mapped to their Entrez ID based on the National Center for Biotechnology Information (NCBI) database (www.ncbi.nlm.nih.gov), and duplicated pairs were removed. Collectively, the integrated human interactome included 351,444 PPIs connecting 17,706 unique proteins. More details are provided in our recent studies [15, 16].
Collection of functional genomics data
We collected the distal regulatory element (DRE)-promoter links inferred from two studies. The first study was the capture Hi-C study of a lymphoblastoid cell line (GM12878) and we obtained 1,618,000 DRE-promoter links predicted from GM12878 [21]. The second dataset we used was from the Functional Annotation of the Mammalian Genome 5 (FANTOM5) project [22], in which cap analysis of gene expression (CAGE) technology was employed to infer enhancer-promoter links across multiple human tissues. We downloaded FANTOM5 data and obtained 66,899 enhancer-promoter links [22].
Disease-associated genes
Open Targets database refers to a comprehensive platform for therapeutic target identification and validation [23]. We collected 527 AD-associated genes (Table S2) from the Open Targets database (accessed in September, 2019).
Seed genes with experimental evidence for Alzheimer’s disease
We further collected 144 AD seed genes having either genetic, experimental, or functional evidence reported in large-scale GWAS studies, AD transgenic animal models, or human-derived samples (Table S2). These genes are involved in pathobiology of amyloidosis, tauopathy, or both, and genes characterizing other AD pathological hypotheses including neuroinflammation, vascular dementia, and other pathobiological pathways (Supplementary Methods).
Brain-specific gene expression
We downloaded RNA-Seq data (transcripts per million, TPM) of 31 tissues from GTEx V8 release (accessed on March 31, 2020, https://www.gtexportal.org/home/). We defined those genes with counts per million (CPM) ≥ 0.5 in over 90% of samples as tissue-expressed genes and the other genes as tissue-unexpressed. To quantify the expression significance of tissue-expressed gene i in tissue t, we calculated the average expression 〈E(i)〉 and the standard deviation δE(i) of a gene’s expression across all tissues evaluated. The significance of gene expression in tissue t is defined as
$${z}_E\left(i,t\right)=\left(E\left(i,t\right)-\left\langle E(i)\right\rangle \right)/{\delta}_E(i)$$
(1)
The details have been described in previous studies [15, 16].
Gene expression from single-cell/nucleus transcriptomics
We collected mouse single-cell/nucleus RNA sequencing (sc/snRNA-Seq) data in 5XFAD brain samples versus controls from two recent studies (GSE98969 and GSE140511) [24, 25]. We also collected human snRNA-Seq datasets on AD patient brain tissues from two publications [26, 27]. The first set of human snRNA-Seq data contains 10 frozen post-mortem human brain tissues from both entorhinal cortex (EC) and superior frontal gyrus (SFG) regions. This dataset has been deposited in the AD knowledge portal (https://adknowledgeportal.synapse.org, Synapse ID: syn21788402). The raw data were deposited on Gene Expression Omnibus (GEO ID: GSE147528), which contains astrocytes, excitatory neurons, inhibitory neurons, and microglia cells [27]. We also assembled human snRNA-Seq data in AD cases versus controls with entorhinal cortex samples across six major brain cell types (GEO ID: GSE138852): microglia, astrocytes, neurons, oligodendrocytes, oligodendrocyte progenitor cells (OPCs), and endothelial cells [26].
The original sc/snRNA-Seq datasets were downloaded from the GEO database (www.ncbi.nlm.nih.gov/geo/), and detailed information of these datasets is provided in Table S3. The analyses were completed with Seurat (v3.1.5), scran (v1.16.0), scater (v1.16.1) packages in R with steps complied with the original literature [24,25,26,27]. Data were normalized using a scaling factor of 10,000, and all differential gene expression analyses were conducted by function FindMarkers in the Seurat R package with parameter test.use = “MAST.” All mouse genes were further mapped to unique human-orthologous genes using the Mouse Genome Informatics (MGI) database (Eppig et al., 2017). Details of processing of sc/snRNA-seq data and quality control are provided in Supplementary Methods and our recent study [18].
Gene expression from microarray
We collected human microarray data in AD cases versus healthy controls with human brain samples from two independent datasets (GSE29378 and GSE84422) [28, 29]. We also collected mouse microarray data from AD transgenic mouse vs. controls, including brain microglia of 5XFAD mice from 2 independent datasets (GSE65067 and GSE74615) [30, 31], and brain hippocampus of Tg4510 mice (GSE53480 and GSE57583) [32].
The original microarray datasets were obtained from Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo). Detailed information of these 6 GEO datasets is provided in Table S3. All raw expression data were log2 transformed, and all samples were quantile normalized together. Probe IDs in each dataset were mapped to NCBI Entrez IDs, and probes mapping to multiple genome regions or without corresponding entrez IDs were deleted. The items were imported to R statistical processing environment using a LIMMA/Bioconductor package. All the mouse genes were further transferred into unique human-orthologous genes using the MGI database [33]. Genes with threshold fold change (FC) > 1.2 were defined as exhibiting differential expression and prioritized as predicted AD risk genes.
Bulk RNA sequencing data
We collected 2 RNA-seq datasets from brain or brain microglia of 5XFAD mice [34]. In addition, we obtained 4 RNA-seq datasets from brain microglia of Tg4510 mice across different months [M] age (2M, 4M, 6M, and 8M) [35]. Differential expression analysis was performed using DESeq [36], while threshold for significance of differential expression was set to FDR < 0.05 using Benjamini-Hochberg’s method. After mapping mouse genes to human-orthologous gene [33], we obtained 6 differentially expressed gene sets.
Proteomic data in AD models
In total, 10 proteomic datasets were assembled from 3 types of AD transgenic mouse models in two recent publications [37, 38]. The first study performed global quantitative proteomic analysis in hAPP and hAPP-PS1 mouse models at young (3 month [M]) and old ages (12 M) [37]. We obtained four sets of DEPs (hAPP_3M, hAPP_12M, hAPP-PS1_3M, and hAPP-PS1_12M) after merging the DEPs from different brain regions. The second study performed quantitative proteomics to uncover molecular and functional signatures in the hippocampus of three types of transgenic mice [38]. Two of these mouse lines, including ADLPAPT (4M, 7M, 10M) that carry three human transgenes (APP, PSEN1, and tau) and hAPP-PS1 (4M, 7M, 10M) mouse, were used in this study. After mapping mouse genes to human-orthologous gene [33], we obtained 10 sets of DEPs.
Enrichment analysis
Differentially expressed gene/protein (DEG/DEP) sets from multiple data sources were collected for enrichment analysis using Fisher’s exact test. This included a total of 6 bulk RNA-seq datasets and 10 proteomic datasets from 4 types of AD transgenic mouse models, including 5XFAD, Tg4510, ADLPAPT, and hAPP (see Supplementary Methods).
AD risk gene prediction
We utilized a Bayesian model selection method, adapted from our recent work [17], to predict ARGs. Specifically, we collected at most 20 genes in the 2-Mb region centered at a GWAS index SNP as the candidates for that particular locus. Assigning L as the number of GWAS loci, and we then denoted a vector of genes with length L, each being from one of the L GWAS loci, as (X1, …, XL), and termed it as candidate risk gene set (CRGS). Assigning N to represent the biological network, we then calculated P (X1,…, XL|N) with the goal to select a CRGS with maximum posterior probability. Computationally, it is not feasible to enumerate all possible gene combinations, and we therefore adopted a Gibbs sampling algorithm to transition the problem into a single-dimensional sampling procedure. For example, when sampling the risk gene from candidates at the Lth locus, we assumed that the risk genes at all other L-1 loci had been selected, and the sampling probability for a gene at the Lth locus was computed as conditional on the L-1 risk genes, based on its closeness to other L-1 risk genes in the network. For each candidate gene XL at the Lth locus, we assigned M1 to represent the event that XL is the risk gene at locus L, M0 represent the event that XL is not the risk gene at locus L, and X-L to represent all the selected risk genes in the other L-1 loci. The Bayesian model selection can be depicted as
$$\frac{P\left({M}_1|{X}_{-L},\kern0.5em N\right)}{P\left({M}_0|{X}_{-L},\kern0.5em N\right)}=\frac{P\left({M}_1\right)}{P\left({M}_0\right)}\frac{P\left({X}_{-L}|{M}_1,\kern0.5em N\right)}{P\left({X}_{-L}|{M}_0,\kern0.5em N\right)}$$
(2)
where \(\frac{P\left({X}_{-L}|{M}_1,\kern0.5em N\right)}{P\left({X}_{-L}|{M}_0,\kern0.5em N\right)}\) is a Bayesian Factor (BF) measuring the closeness between X-L and XL in network N and \(\frac{P\left({M}_1\right)}{P\left({M}_0\right)}\) is prior odds. The prior odds reflect the prior knowledge whether XL is a risk gene or not and we assumed P(M1) = P(M0) in this study.
In regard to \(\frac{P\left({X}_{-L}|{M}_1,\kern0.5em N\right)}{P\left({X}_{-L}|{M}_0,\kern0.5em N\right)}\), we adopted the random walk with restart (RWR) algorithm to calculate the BF. Starting from any node ni in a predefined network N, the walker faces two options at each step: either moving to a direct neighbor with a probability 1 − r or jumping back to ni with a probability r. The fixed parameter r is called the restart probability in RWR, and r was set as 0.3 in this study [17]. Let W be the adjacency matrix that decides which neighbor to be moved to, and qt be the reaching probability of all nodes at step t. The RWR algorithm is formalized as
$${q}_{t+1}=\left(1-r\right)W{q}_t+r{s}_{n_i}$$
(3)
\({s}_{n_i}\) is a vector with the ith element as 1 and 0 for others, which means th starting node is ni. Following the equation, qt can be updated step by step until |qt + 1 − qt|2 < Trwr, where Trwr is a predefined threshold. We set Trwr as 1 × 10−6 [17]. The adjacency matrix W represents the distance between any two nodes in the network, and we adopted the same network and strategy in our previous work to calculate W. We calculated P(X−L| M1, N) based on W. We mapped XL to the rows of W and X−L to the columns of W, and obtained a vector with the same length as X−L. The sum of the vector was calculated as P(X−L| M1, N). In this study, we assumed P(X−L| M0, N) to be the same for all different candidate genes. Through the Bayesian model selection equation,
$$\frac{P\left({M}_1|{X}_{-L},\kern0.5em N\right)}{P\left({M}_0|{X}_{-L},\kern0.5em N\right)}=\frac{P\left({M}_1\right)}{P\left({M}_0\right)}\frac{P\left({X}_{-L}|{M}_1,\kern0.5em N\right)}{P\left({X}_{-L}|{M}_0,\kern0.5em N\right)}$$
(4)
we obtained a value for each candidate genes at locus L. We used these values as sampling for Gibbs sampling to choose a risk gene for locus L. We then repeated the sampling across the remaining loci and iterated the sampling process until convergence. Specially, in each round of Gibbs sampling, we calculated the sampling frequency for each candidate gene. The frequency was compared with that of the previous round, and if the sum of squares of frequency differences across all selected genes was smaller than a predefined threshold (1 × 10−4 used in this study), then the sampling procedure was halted. Based on the sampling, we are able to assess the confidence of candidates being risk genes.
Construction of drug-target network
We integrated six commonly used resources to collect high-quality physical drug-target interactions for FDA-approved drugs. We obtained biophysical drug-target interactions using reported binding affinity data: inhibition constant/potency (Ki), dissociation constant (Kd), median effective concentration (EC50), or median inhibitory concentration (IC50) ≤ 10 μM. First, we extracted the bioactivity data from the DrugBank database (v4.3) [39], the Therapeutic Target Database (TTD, v4.3.02) [40], and the PharmGKB database [41]. To improve data quality, we pooled only those items that satisfied the following four criteria: (i) binding affinities, including Ki, Kd, IC50, or EC50, ≤ 10 μM; (ii) the target protein has a unique UniProt accession number; (iii) proteins marked as “reviewed” in the UniProt database; and (iv) proteins are from Homo sapiens. Totally, we constructed a drug-target network including 15,367 physical drug-target interactions (edges), which connected 1608 FDA-approved drug nodes and 2251 unique human target nodes (Table S4).
Description of network proximity
Given the set of disease proteins (A), the set of drug targets (B), then the closest distance dAB measured by the average shortest path length of all the nodes to the other module in the human protein–protein interactome can be defined as:
$$\left\langle {d}_{AB}\right\rangle =\frac{1}{\left|\left|A\right|\right|+\left\Vert B\right\Vert}\left(\sum_{a\in A}{\mathit{\min}}_{b\in B}d\left(a,b\right)+\sum_{b\in B}{\mathit{\min}}_{a\in A}d\left(a,b\right)\right)$$
(5)
where d(a, b) denotes to the shortest path length between protein a and drug target b.
To calculate the significance of the network distance between a given drug and disease module, we constructed a reference distance distribution corresponding to the expected distance between two randomly selected groups of proteins of the same size and degree distribution as the original disease proteins and drug targets in the network. This procedure was run 1000 times. The mean \(\overline{d}\) and standard deviation (σd) of the reference distribution were used to caluculate a z-score (zd) by converting an observed (non-Euclidean) distance d to a normalized distance.
Pharmacoepidemiologic validation
Patient cohort preparation
The pharmacoepidemiology study utilized the MarketScan Medicare Supplementary database from 2012 to 2017. The dataset included individual-level diagnosis codes, procedure codes, and pharmacy claims for 7.23 million U.S. older adults (i.e., age ≥ 65 to be eligible for Medicare benefits) per year, which represents approximately 14% of the 46 million retirees with Medicare benefits. Pharmacy prescriptions of pioglitazone, febuxostat, atenolol, nadolol, sotalol, and glipizide were identified by using RxNorm and National Drug Code (NDC). For a subject exposed to the aforementioned drugs, a drug episode is defined as the time between drug initiation and drug discontinuation. Specifically, drug initiation is defined as the first day of drug supply (i.e., first prescription date). Drug discontinuation is defined as the last day of drug supply (i.e., last prescription date + days of supply) accompanied by no drug supply for the next 60 days. Gaps of less than 60-day of drug supply were allowed within a drug episode. For example, the pioglitazone cohort included the first pioglitazone episode for each subject, as well as the glipizide cohort. Further, we excluded observations that started within 180 days of insurance enrollment. For the final cohorts, demographic variables including age, gender and geographical location were collected. Additionally, diagnoses of hypertension (HTN), type 2 diabetics (T2D), and coronary artery disease (CAD), defined by The International Statistical Classification of Diseases (ICD) 9/10 codes (Supplementary Methods, Table S5), before drug initiation, were collected. These variables were specifically selected to address potential confounding biases. Lastly, a control cohort was selected from patients not exposed to that drug (i.e., pioglitazone). Specifically, non-exposures were matched to the exposures (ratio 1:4) by initiation time of the drug, enrollment history, age, and gender. The geographical location, diagnoses of HTN, T2D, and CAD were collected for the control cohort as well.
Outcome measurement
The outcome was time from drug initiation to diagnosis of AD, which was defined by using the ICD codes (Supplementary Methods). For the control cohort, the corresponding drug (i.e., pioglitazone) episode’s starting date was used as the starting time. For pioglitazone and glipizide cohorts, observations without diagnose of AD were censored at the end of drug episodes. Observations without diagnosis of AD were censored at the corresponding pioglitazone episode’s end date (Fig. S1).
Propensity score estimation
We define Location = region of residence (i.e., North East, North Central, South, and West), T2D = type 2 diabetes, HT = hypertension, and CAD = coronary artery disease.
The propensity score of taking repurposing drug vs. a comparator drug was estimated by the following logistic regression model:
$$logit[Propensity\ Score] \sim Intercept+ Age+Gender+Location+T2D+HT+CAD.$$
(6)
Stratified Cox models were used to compare the AD risks. For repurposing drug vs. comparator drug or control, the analyses were stratified (n strata = 10) by the estimated propensity score. The propensity score adjusted Cox model is
$$log[Hazard]\sim Strata[log(Baseline\ Hazard) |Propensity \ Score]+1[Repurposing \ drug \ yes].$$
(7)
For repurposing drug vs. control, the analyses were stratified based on the subgroups defined by gender, T2D diagnose, HT diagnoses, and CAD diagnoses.
Statistical analysis
Survival curves for time to AD were estimated using a Kaplan-Meier estimator. Additionally, propensity score stratified survival analysis was conducted to investigate the risk of AD between pioglitazone users and pioglitazone non-users, febuxostat users and febuxostat non-users, atenolol users and atenolol non-users, nadolol users and nadolol non-users, and sotalol users and sotalol non-users. In addition, we conducted new comparison studies to calculate the risk of AD between pioglitazone users and glipizide users. For each comparison, the propensity score of taking each drug was estimated by using a logistic regression model in which covariates included age, gender, geographical location, T2D diagnosis, HTN diagnosis, and CAD diagnoses. Furthermore, propensity score stratified Cox-proportional hazards models were used to conduct statistical inference for the hazard ratios (HR) of developing AD between cohorts.
Experimental validation
Reagents
Pioglitazone was acquired from Topscience. Lipopolysaccharides (LPS) (Cat# L2880) and 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide (MTT) were obtained from Sigma-Aldrich. Antibodies against Phospho-GSK3B-Y216 (Cat# AP0261), GSK3B (Cat# A2081), and CDK5 (Cat# A5730) were purchased from ABclonal Technology. CDK5-Phospho-Tyr15 (Cat# YP0380) was obtained from Immunoway (Plano, Texas, USA). All other reagents were purchased from Sigma-Aldrich unless otherwise specified.
Cell viability
Human microglia HMC3 cells were purchased from American Type Culture Collection (ATCC, Manassas, VA). Cell viability was detected by MTT method. In total, 5000 cells/well were plated in 96-well plates for 12 h, and then treated with pioglitazone for 48 h. After treatment, MTT solution was added to the cells to a final concentration of 1 mg/mL, and the mixture was allowed to incubate at 37 °C for 4 h. The supernatant was removed, and precipitates were dissolved in DMSO. Absorbance was measured at 570 nm using a Synergy H1 microplate reader (BioTek Instruments, Winooski, VT, USA).
Western blot analysis
HMC3 cells were pre-treated with pioglitazone (3 μM or 10 μM) and DMSO (control vehicle), and followed with 1 μg/mL LPS for 30 min. Cells were harvested, washed with cold PBS, and then lysed with RIPA Lysis Buffer with 1% Protease Inhibitor (Cat# P8340, Sigma-Aldrich). Total protein concentrations were measured using a standard BCA protein assay kit (Bio-Rad, CA, USA), according to the manufacturer’s manual. Samples were electrophoresed by sodium dodecyl sulfate-polyacrylamide gel electrophoresis (SDS-PAGE), then blotted onto a polyvinylidene difluoride (PVDF; EMD Millipore, Darmstadt, Germany) membrane. After transferring, membranes were probed with specific primary antibodies (1:1000) at 4 °C overnight. Specific protein bands were detected using a chemiluminescence reagent after hybridization with a horseradish peroxidase (HRP)-conjugated secondary antibody (1:3000).