PRS distribution
The PRS aggregates the effects of multiple genetic markers identified by GWAS. Generally, the PRS is expected to be higher in cases than in controls, indicating a higher genetic risk for the disorder, but the difference in mean PRS between case and control samples may be small. It is important to note that the PRS calculated for an individual does not provide an absolute measure of risk and is meaningless except in relation with the distribution of PRS in cases and non-cases in the underlying population.
The polygenic risk score for individual j ∈ {1, …, Nind} is \( {PRS}_j=\frac{1}{N_{snps}}{\sum}_{i=1}^{N_{snps}}{g}_{ij}{\beta}_i \), where Nind and Nsnps are the numbers of individuals and of SNPs contributing to the PRS, respectively, gij ∈ {0, 1, 2} is the genotype of SNP i for individual j, and βi is the effect size (logarithm of the odds ratio or logistic regression coefficient) of SNP i in an independent GWAS for the disease. The sample mean and variance are
$$ m(PRS)=\frac{1}{N_{ind}}{\sum}_{j=1}^{N_{ind}}{PRS}_j\kern1em \mathrm{and}\kern1em \operatorname{var}(PRS)=\frac{1}{N_{ind}}{\sum}_{j=1}^{N_{ind}}{\left({PRS}_j-m(PRS)\right)}^2. $$
(1)
Estimation of PRS distribution parameters for unscreened controls
Our calculations require the distribution parameters of the PRS in cases and non-cases (putative non-affected controls). If the mean m0 and variance \( {\sigma}_0^2 \) of the PRS distribution in non-cases are unknown as unscreened population controls are used, they can be inferred from the means m1, mp and variances \( {\sigma}_1^2 \), \( {\sigma}_p^2 \) of the PRS distributions in cases and in the population, respectively, and the disease prevalence K as
$$ {m}_0=\frac{m_p-K\ {m}_1}{1-K} $$
and
$$ {\sigma}_0^2=\frac{\sigma_p^2-K{\sigma}_1^2}{1-K}-\frac{K{\left({m}_p-{m}_1\right)}^2}{{\left(1-K\right)}^2} $$
(see Supplemental Note 1 for details).
Estimation of the probability of disease development
By Bayesian inversion, a raw probability \( \hat{P} \) to be affected by the disease can be inferred from an individual’s PRS value x and the distribution densities of PRS in cases, p1, and in controls, p0, as
$$ \hat{P}(x)=\frac{K\ {p}_1(x)}{K\ {p}_1(x)+\left(1-K\right){p}_0(x)} $$
(2)
However, \( \hat{P} \) cannot be directly interpreted as a probability of disease. Logistic regression from case/control samples gives the probability of disease in the logistic model
$$ P(x)=\frac{1}{1+{e}^{-\left(\alpha +\beta x\right)}} $$
(3)
with coefficients α, β arising as regression parameters from the maximum likelihood estimate. We use linear regression with the logit link function, taking as data the log odds ratio corresponding to (2),
$$ y=\log\ \frac{K\ {p}_1(x)}{\left(1-K\right){p}_0(x)}, $$
(4)
at every PRS value x and the joint probability density of PRS in the population, pp(x) = Kp1(x) + (1 − K)p0(x), as weight. For normal densities p1 and p0, the coefficients α, β can be expressed as
$$ \alpha =\log \frac{K{\sigma}_0}{\left(1-K\right){\sigma}_1}+\frac{1}{2}\left(\left({r}_0-1\right)K+\left(1-{r}_1\right)\left(1-K\right)\right)-{m}_p\beta, \beta =\frac{m_1-{m}_0}{\sigma_p^2}\left(K\left(1-K\right)\left(\frac{r_0+{r}_1}{2}-1\right)+K\frac{\sigma_1^2}{\sigma_0^2}+\left(1-K\right)\frac{\sigma_0^2}{\sigma_1^2}\right) $$
(5)
where \( {r}_1=\frac{\sigma_0^2+{\left({m}_1-{m}_0\right)}^2}{\sigma_1^2} \) , \( {r}_0=\frac{\sigma_1^2+{\left({m}_1-{m}_0\right)}^2}{\sigma_0^2} \) , mp = Km1 + (1 − K)m0, and\( {\sigma}_p^2=K\ {\sigma}_1^2+\left(1-K\right){\sigma}_0^2+K\left(1-K\right){\left({m}_1-{m}_0\right)}^2 \) (see Supplemental Note 2 for details).
Formulae (5) determine the parameters of the logistic probability model (3) from the disease prevalence and the parameters of the distribution of PRS in cases and non-cases, dispensing with the need to obtain or simulate individual genotypes and perform logistic regression on the resulting PRS. They rely on the assumptions that the PRS distributions are normal and that the raw probability (2) represents well the fraction of cases for any value of PRS. For validation, we compared the outcome of (5) with the following three procedures of increasing abstraction, (a) simulation of genotypes in HWE with given MAF in cases and in non-cases and logistic regression of the resulting PRS, (b) sampling from normal distributions for PRS in cases and in non-cases with parameters \( {m}_0,{m}_1,{\sigma}_0^2,{\sigma}_1^2 \) and logistic regression, (c) sampling from the population distribution pp and linear regression of the raw log odds ratio (4).
Inclusion of rare variants in the probability
The effects of rare genetic variants with high (or medium) disease penetrance may be obscured if modelled as part of PRS including a large number of other SNPs, and the fraction of correctly identified cases carrying a rare mutation will be small in a sample and have little influence on the overall prediction accuracy. Therefore, it seems better to account for them at the level of the disease probability. Suppose we have the logistic regression model for the probability of disease PPRS in terms of the PRS by formulae (3) and (5), excluding the rare variant from the calculation of the PRS. An individual with PRS value x who carries a rare genetic variant with intrinsic probability prare to cause the disease has, assuming the effects of the rare variant and of the polygenic risk are independent, the probability of disease
$$ P(x)={P}_{PRS}(x)+{p}_{rare}\left(1-{P}_{PRS}(x)\right)=\frac{1+{p}_{rare}\ {e}^{-\left(\alpha +\beta x\right)}}{1+{e}^{-\left(\alpha +\beta x\right)}} $$
where x is the PRS for the individual. For very rare variant alleles that do not affect the disease prevalence K in the population, the intrinsic probability can be estimated as
$$ {p}_{rare}=\frac{K\ \left( OR-1\right)}{K\ \left( OR-1\right)+1}, $$
where OR is the odds ratio (see Supplemental Note 3). The probability P(x) takes values between prare and 1, reflecting the liability of the rare variant to cause the disease even in absence of polygenic risk. In case of several rare variants with mutually independent effect and intrinsic probabilities prare, 1, …, prare, ν, the above formula can be applied with \( {p}_{rare}=1-\prod \limits_{j=1}^{\nu}\left(1-{p}_{rare,j}\right) \). However, due to the assumption of very small allele frequencies, it is unlikely that an individual would carry more than one independent rare variant.
Inclusion of APOE
It may be advantageous to treat a high-effect common variant such as APOE separately from the PRS. The distributions in cases and non-cases of a PRS formed from SNPs excluding APOE can be assumed to be approximately equal for carriers and non-carriers of the APOE risk allele. Considering formulae (5), the probability of disease as a function of PRS will then differ between the groups only due to the higher disease prevalence in carriers of the risk allele. Applying (5) with the disease prevalence for the different APOE genotypes, separate probability curves are obtained. The prevalence in different genotype groups is not usually directly available but can be inferred as follows from the overall prevalence K, the overall allele frequency f and the odds ratio OR for the variant, under the assumption of HWE both in the general population and in the subpopulation of non-cases. This assumption is justified when the disease prevalence in the population is low (e.g. 2% for AD), but problematic when it is high [24] (e.g. major depression 30%). The prevalence K0, K1 and K2 for carriers of non-risk homozygotes, heterozygotes and risk homozygotes, respectively, can be calculated as
$$ {K}_0=1-\frac{{\left(1-f-\nu \right)}^2}{\left(1-K\right){\left(1-f\right)}^2},{K}_1=1-\frac{\left(1-f-\nu \right)\left(\nu +f-K\right)}{\left(1-K\right)f\left(1-f\right)},{K}_2=1-\frac{{\left(\nu +f-K\right)}^2}{\left(1-K\right){f}^2}, $$
where
\( \nu =\frac{b-\sqrt{b^2-16\left(1- OR\right)\left(1-f\right)K}}{4\ \left(1- OR\right)} \) with b = 2 (1 + (1 − OR)(K − f)) (see Supplemental Note 4).
Standardisation of the probability curve
PRSs calculated from different sets of SNPs cannot be directly compared. We therefore standardise the PRS axis by expressing the PRS in terms of standard deviations difference from the population mean,
$$ {x}_{st}=\frac{x-{m}_p}{\sigma_p}, $$
where x is the PRS and xst is the standardised PRS variable.
Simulated and real data
Firstly, we simulated independent genotypes in a sample of 10,000 cases and 10,000 controls and used previously published effect sizes for genome-wide significant SNPs [2, 15]. We calculated an Oligogenic Risk Score (ORS) in the simulated sample using only 39 genome-wide significant SNPs (Supplementary Table 2, adopted from [15]), excluding the APOE proxy SNP (rs429358). The PRS was calculated for 10,039 SNPs, including the above 39 genome-wide significant SNPs and further 10,000 SNPs pruned for LD with r2 = 0.1 and allele frequencies and effect sizes taken from (2).
Secondly, to illustrate the probability of disease in the presence of rare variants, we used effect sizes for rare variants corresponding to the APP, SORL1, TREM2, ABI3 and PLCG2 genes [6, 7, 25]. We used the distribution parameters m0, m1 and \( {\sigma}_0^2 \),\( {\sigma}_1^2 \) for ORS and PRS as reported in [23] and calculated the disease probabilities with the suggested formulae. To demonstrate the APOE modelling with our approach, we also took the distribution parameters of APOE, ORS and PRS from the real case/control study [23] (Supplemental Table 1).
The simulations and probability calculations were implemented with R-statistical software. The codes (Simulations.R and Probability.R) can be downloaded from https://github.com/DRI-Cardiff/AD-probability/.