Data collection and preprocessing
AD data sets
A data set is defined as either (1) genes/proteins/metabolites that are differentially expressed in AD patients/mice vs. controls, or (2) genes that have known associations with AD risk from the literature or other databases. We retrieved expression data sets underlying AD pathogenesis that capture transcriptomics (microarray, bulk or single-cell RNA-Seq) and proteomics across human, mouse, and other model organisms (e.g., fruit fly and Caenorhabditis elegans). All samples in the data sets were derived from total brain, specific brain regions (including hippocampus, cortex, and cerebellum), or brain-derived single cells, such as microglial cells. For some of the expression data sets, the differentially expressed genes/proteins were obtained from the original publications (main or supplemental tables). For data sets without available differential expression results, the original brain microarray/RNA-Seq data were obtained from Gene Expression Omnibus (GEO) [17] and differential expression analysis was performed using GEO2R [18], which applies the limma R package [19] to the sample groups defined by the user. All differentially expressed genes identified in mouse were further mapped to unique human-orthologous genes using the NCBI HomoloGene database (https://www.ncbi.nlm.nih.gov/homologene). The details for all the data sets, including organism, genetic model (for mouse), brain region, cell type (for single-cell RNA-Seq), PubMed ID, GEO ID, and source (e.g., supplemental table or GEO2R), can be found in Table S1.
Genes and proteins
We retrieved the gene information from the HUGO Gene Nomenclature Committee (HGNC, https://www.genenames.org/) [20], including gene symbol, name, type (e.g., coding and non-coding), chromosome, synonyms, and identification (ID) mapping in various other databases such as NCBI Gene, ENSEMBL, and UniProt. All proteins from the AD proteomics data sets were mapped to genes using the mapping information from HGNC.
Single-nucleotide polymorphisms (SNPs)
We found 3321 AD-associated genetic records for 1268 genes mapped to 1629 SNPs, by combining results from GWAS Catalog (https://www.ebi.ac.uk/gwas/) [21] using the trait “Alzheimer’s disease” and published studies. The PubMed IDs for the genetic evidence are provided in AlzGPS.
Tissue expression specificity
We downloaded RNA-Seq data (transcripts per million, TPM) across 33 human tissues from the GTEx v8 release (accessed on March 31, 2020, https://gtexportal.org/home/). For each tissue (e.g., brain), we defined genes with counts per million (CPM) ≥ 0.5 in over 90% of samples as tissue-expressed genes and all other genes as tissue-unexpressed. To quantify the expression significance of tissue-expressed gene i in tissue t, we calculated the average expression 〈E(i)〉 and the standard deviation δ_E(i) of the gene’s expression across all included tissues. The significance of gene expression in tissue t is defined as:
$$ {z}_E\left(i,t\right)=\frac{E\left(i,t\right)-\left\langle E(i)\right\rangle }{\delta_E(i)} $$
(1)
Data for multiple brain regions were available from GTEx v8. We combined the data of these brain regions when comparing the brain expression specificity vs. other tissues. In addition, we further computed the expression specificity across 13 different brain regions. Both tissue expression specificity and brain region expression specificity results for the genes are available in AlzGPS.
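As an illustration of Eq. 1, the following minimal Python sketch computes z_E(i, t) for a single gene from its expression level in each tissue. The function name, tissue names, and expression values are hypothetical, and the use of the population standard deviation is an assumption, because the text does not state which estimator was used.

```python
from statistics import mean, pstdev


def tissue_specificity(expr_by_tissue):
    """Return {tissue: z_E(i, t)} for one gene, following Eq. 1.

    expr_by_tissue maps tissue name -> expression of the gene in that tissue.
    Assumption: the population standard deviation is used for delta_E(i).
    """
    values = list(expr_by_tissue.values())
    mu = mean(values)          # <E(i)>, average expression across tissues
    sigma = pstdev(values)     # delta_E(i), spread across tissues
    if sigma == 0:
        # Identical expression everywhere: no tissue stands out.
        return {tissue: 0.0 for tissue in expr_by_tissue}
    return {tissue: (e - mu) / sigma for tissue, e in expr_by_tissue.items()}


# Hypothetical expression values for one gene (not GTEx data).
example = {"brain": 30.0, "liver": 10.0, "lung": 10.0, "heart": 10.0}
z = tissue_specificity(example)
```

In this toy input the gene has a positive z score in brain and negative z scores elsewhere, which is the pattern that marks a gene as brain specific.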
Drugs
We retrieved drug information from the DrugBank database (v4.3) [22], including name, type, group (approved, investigational, etc.), Simplified Molecular-Input Line Entry System (SMILES), and Anatomical Therapeutic Chemical (ATC) code(s). We also evaluated the pharmacokinetic properties (such as blood–brain barrier [BBB] penetration) of the drugs using admetSAR [23, 24].
Drug literature information for AD treatment
For the top 300 repurposable drugs (i.e., drugs with the highest number of significant proximities to the AD data sets), we manually searched and curated the literature for their therapeutic efficacy against AD using PubMed. In addition to the title, journal, and PubMed ID, we summarized the types (clinical and non-clinical), experimental settings (e.g., mouse/human and transgenic line for non-clinical studies; patient groups, randomization type, length, and control type for clinical studies), and results of these studies. In total, we found 292 studies for 147 drugs.
Drug-target network
To build a high-quality drug-target network, several databases were accessed, including the DrugBank database (v4.3) [22], Therapeutic Target Database (TTD) [25], PharmGKB database, ChEMBL (v20) [26], BindingDB [27], and IUPHAR/BPS Guide to PHARMACOLOGY [28]. Only biophysical drug-target interactions involving human proteins were included. To ensure data quality, we kept only interactions that have inhibition constant/potency (Ki), dissociation constant (Kd), median effective concentration (EC50), or median inhibitory concentration (IC50) ≤ 10 μM. The final drug-target network contains 21,965 interactions among 2892 drugs and 2847 human targets/proteins.
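The inclusion criteria above can be sketched as a simple filter. This is a minimal illustration only: the record layout (keys `organism`, `type`, `value_uM`), the identifiers, and the example values are hypothetical and do not reflect the schema of any of the source databases.

```python
AFFINITY_TYPES = {"Ki", "Kd", "EC50", "IC50"}  # measurement types named in the text
CUTOFF_UM = 10.0                               # 10 micromolar threshold


def keep_interaction(record):
    """Return True if a drug-target record passes the stated quality filter.

    `record` is a hypothetical dict with keys: organism, type, value_uM.
    """
    return (
        record["organism"] == "Homo sapiens"
        and record["type"] in AFFINITY_TYPES
        and record["value_uM"] <= CUTOFF_UM
    )


# Made-up records for illustration (not real binding data).
records = [
    {"drug": "D1", "target": "P1", "organism": "Homo sapiens", "type": "Ki", "value_uM": 0.5},
    {"drug": "D1", "target": "P2", "organism": "Homo sapiens", "type": "IC50", "value_uM": 25.0},
    {"drug": "D2", "target": "P3", "organism": "Mus musculus", "type": "Kd", "value_uM": 1.0},
]
network = [r for r in records if keep_interaction(r)]
```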
Clinical trials
The AD interventional clinical trials, together with their phase, posted date, status, and agent(s), were retrieved from https://clinicaltrials.gov. Drugs were mapped to DrugBank IDs. The proposed mechanism and therapeutic purpose were taken from Cummings et al. [29, 30].
Human protein interactome
We used our previously built high-quality comprehensive human protein interactome which contains 351,444 unique protein-protein interactions (PPIs, edges) among 17,706 proteins (nodes) [11, 12, 31, 32]. Briefly, five types of evidence were considered for building the interactome: physical PPIs from protein three-dimensional (3D) structures, binary PPIs revealed by high-throughput yeast-two-hybrid (Y2H) systems, kinase-substrate interactions by literature-derived low-throughput or high-throughput experiments, signaling networks by literature-derived low-throughput experiments, and literature-curated PPIs identified by affinity purification followed by mass spectrometry (AP-MS), Y2H, or by literature-derived low-throughput experiments. The details are provided in our previous studies [11, 12, 31, 32].
Network proximity quantification of drugs and AD data sets
To quantify the associations between drugs and AD-related gene sets from the data sets, we adopted the “closest” network proximity measure:
$$ \left\langle {d}_{AB}\right\rangle =\frac{1}{\left|A\right|+\left|B\right|}\left(\sum \limits_{a\in A}{\min}_{b\in B}d\left(a,b\right)+\sum \limits_{b\in B}{\min}_{a\in A}d\left(a,b\right)\right) $$
(2)
where d(a, b) is the shortest path length between gene a in gene list A (drug targets) and gene b in gene list B (AD genes). To evaluate whether such proximity was significant, we performed z score normalization using a permutation test of 1000 random experiments. In each random experiment, the proximity was measured between two randomly generated gene lists with degree distributions similar to those of A and B. The z score was calculated as:
$$ {z}_d=\frac{d-\overline{d}}{\sigma_d} $$
(3)
Here, d is the observed proximity, and d̄ and σ_d are the mean and standard deviation of the proximities from the random experiments. The P value was calculated from the permutation test. Drug-data set pairs with z_d < −1.5 and P < 0.05 were considered significantly proximal. In addition to network proximity, we calculated two additional metrics, overlap coefficient C and Jaccard index J, to quantify the overlap and similarity of A and B:
$$ C=\frac{\left|A\cap B\right|}{\min \left(\left|A\right|,\left|B\right|\right)} $$
(4)
$$ J=\frac{\left|A\cap B\right|}{\left|A\cup B\right|} $$
(5)
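Eqs. 2 to 5 can be sketched in Python with NetworkX as follows. This is a simplified illustration, not the exact procedure used here. The function names are hypothetical. Random gene lists are drawn by replacing each gene with a random node of exactly the same degree, with replacement, which stands in for the "similar degree distribution" sampling. The one-sided empirical P value (fraction of random proximities at or below the observed one) is also an assumption.

```python
import random
from statistics import mean, pstdev

import networkx as nx


def closest_proximity(G, A, B):
    """Eq. 2: mean closest shortest-path distance between node lists A and B."""
    dist = {n: nx.single_source_shortest_path_length(G, n) for n in set(A) | set(B)}
    inf = float("inf")  # used when two nodes are not connected
    a_to_b = sum(min(dist[a].get(b, inf) for b in B) for a in A)
    b_to_a = sum(min(dist[b].get(a, inf) for a in A) for b in B)
    return (a_to_b + b_to_a) / (len(A) + len(B))


def degree_matched_sample(G, nodes, rng):
    """Replace each node by a random node of the same degree (simplification)."""
    by_degree = {}
    for n, d in G.degree():
        by_degree.setdefault(d, []).append(n)
    return [rng.choice(by_degree[G.degree(n)]) for n in nodes]


def proximity_z(G, A, B, n_perm=1000, seed=0):
    """Eq. 3: z score of the observed proximity against a permutation null."""
    rng = random.Random(seed)
    d_obs = closest_proximity(G, A, B)
    null = [
        closest_proximity(G, degree_matched_sample(G, A, rng),
                          degree_matched_sample(G, B, rng))
        for _ in range(n_perm)
    ]
    mu, sigma = mean(null), pstdev(null)
    z = (d_obs - mu) / sigma if sigma > 0 else 0.0
    p = sum(d <= d_obs for d in null) / n_perm  # assumed one-sided empirical P
    return d_obs, z, p


def overlap_coefficient(A, B):
    """Eq. 4: shared genes relative to the smaller list."""
    A, B = set(A), set(B)
    return len(A & B) / min(len(A), len(B))


def jaccard_index(A, B):
    """Eq. 5: shared genes relative to the union of both lists."""
    A, B = set(A), set(B)
    return len(A & B) / len(A | B)
```

A drug-data set pair would then be called significantly proximal when the returned z is below −1.5 and the returned P is below 0.05, as stated above.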
Generation of gene/protein networks
We offer three types of networks in AlzGPS: a brain-specific neighborhood (EGO) network for each gene, a largest connected component (LCC) network for each data set, and an inferred mechanism-of-action (MOA) network for each significantly proximal drug-data set pair. The three networks differ in the inclusion criteria for the nodes (genes/proteins). The edges are PPIs colored by their types (e.g., 3D, Y2H, and literature). In all networks, nodes are colored by whether they can be targeted by the drugs in our database.
For the EGO networks, we filtered genes by their brain expression and generated networks only for genes that were considered to be expressed in brain according to the GTEx data. We used the ego_graph function from NetworkX [33] to generate the EGO networks, each of which is centered on a gene of interest. We incorporated the tissue specificity of the genes (indicated in the network by the node size) into the visualization tool, to allow users to further filter the network to show only the genes that have positive brain specificity.
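A minimal sketch of the EGO network construction with NetworkX's ego_graph is given below. The helper names, the radius of 1, the placeholder gene labels (G1 to G4), and the toy edges and z scores are assumptions for illustration; they are not taken from the interactome or from GTEx.

```python
import networkx as nx


def brain_ego_network(G, gene, brain_expressed, radius=1):
    """Return the EGO network centered on `gene`, or None if the gene is not
    brain-expressed or not in the interactome. radius=1 is an assumption."""
    if gene not in brain_expressed or gene not in G:
        return None
    return nx.ego_graph(G, gene, radius=radius)


def keep_positive_specificity(ego, center, brain_z):
    """Keep the center gene plus neighbors with positive brain specificity."""
    keep = [n for n in ego if n == center or brain_z.get(n, 0.0) > 0]
    return ego.subgraph(keep).copy()


# Toy interactome with placeholder gene labels (not real PPIs).
ppi = nx.Graph([("G1", "G2"), ("G1", "G3"), ("G3", "G4")])
ego = brain_ego_network(ppi, "G1", brain_expressed={"G1", "G2", "G3"})
filtered = keep_positive_specificity(ego, "G1", {"G2": 1.2, "G3": -0.4})
```

Here `ego` contains G1 and its direct neighbors, and `filtered` mimics the user-side option of showing only genes with positive brain specificity.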
An LCC network was generated for each AD data set using the subgraph function from NetworkX. For the MOA networks, we examined the connections (PPIs) among the drug targets and the data sets.
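The LCC step can be sketched as follows: take the subgraph of the interactome induced by the genes of a data set, then keep its largest connected component. The function name and the toy graph with placeholder gene labels are hypothetical. Combining `subgraph` with `connected_components` is one straightforward way to obtain the LCC and is assumed here.

```python
import networkx as nx


def lcc_network(G, gene_set):
    """Return the largest connected component among `gene_set` within G."""
    present = [n for n in gene_set if n in G]   # genes missing from G are skipped
    sub = G.subgraph(present)
    if sub.number_of_nodes() == 0:
        return sub.copy()
    largest = max(nx.connected_components(sub), key=len)
    return sub.subgraph(largest).copy()


# Toy interactome with placeholder gene labels (not real PPIs).
ppi = nx.Graph([("G1", "G2"), ("G2", "G3"), ("G4", "G5"), ("G6", "G7")])
data_set_genes = {"G1", "G2", "G3", "G4", "G5", "G9"}
lcc = lcc_network(ppi, data_set_genes)
```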
Website implementation
AlzGPS was implemented with the Django v3.1.0 framework (www.djangoproject.com). The website frontend was implemented with HTML, CSS, and JavaScript and was designed to be highly interactive and integrative. It uses AJAX to asynchronously acquire data in JSON format based on user requests and dynamically updates the interface. Because the data are served in JSON format, this architecture can also be integrated into end users’ own pipelines. Network visualizations were implemented using Cytoscape.js [34].
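To illustrate the kind of JSON payload such a frontend could consume, the sketch below serializes a small network into the elements layout accepted by Cytoscape.js, where each node or edge is an object with a `data` field and edges carry `source` and `target`. The function name and the `targetable` and `ppi_type` attributes are hypothetical and do not describe the actual AlzGPS responses.

```python
import json


def to_cytoscape_elements(nodes, edges):
    """Build a Cytoscape.js-style elements list.

    nodes: dict mapping node id -> whether a drug in the database targets it
    edges: iterable of (source, target, ppi_type) tuples
    """
    elements = [{"data": {"id": n, "targetable": t}} for n, t in nodes.items()]
    for source, target, ppi_type in edges:
        elements.append({
            "data": {
                "id": f"{source}-{target}",
                "source": source,
                "target": target,
                "ppi_type": ppi_type,  # e.g., 3D, Y2H, literature
            }
        })
    return elements


# Placeholder genes and a made-up edge, for illustration only.
payload = json.dumps({
    "elements": to_cytoscape_elements({"G1": True, "G2": False},
                                      [("G1", "G2", "Y2H")])
})
```

A server-side view would return a string like `payload` in response to an AJAX request, and a script in a user's own pipeline could parse the same string with any JSON library.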