In ELSA, we have provided the genetic data for detailed analyses of a wide range of age-related traits and outcomes in association with the genetic factors in our large and phenotypically well-characterised sample of older people born in England.
Because there may be some researchers who wish to employ genetic data in their research without access to the necessary tools or knowledge to carry out robust quality control or imputation of the untyped genotypes, we have carried out all of these within the team and made the data available for all to use.
Therefore, several genetic data products derived from our sample are available as detailed below.
If you have any questions about the ELSA genetic data products, contact the ELSA Team and we will be happy to help you.
Directly genotyped data
The genome-wide genotyping was performed at University College London (UCL) Genomics in 2013-2014, funded by the ESRC.
This involved genotyping of 7,597 ELSA participants of European ancestry using the Illumina HumanOmni2.5 BeadChips (HumanOmni2.5-4v1, HumanOmni2.5-8v1.3), which measure ~2.5 million markers that capture genomic variation down to 2.5% minor allele frequency. Genotyping was performed in two batches.
Allele frequencies were compared between the batches after filtering for 5% of missingness. The correlation was calculated between the batches for a number of chromosomes and exceeded 99%.
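The batch comparison can be illustrated with a minimal sketch. It assumes genotypes coded as 0/1/2 allele counts with None for a missing call; the function names and the toy data layout are hypothetical and are not taken from the ELSA pipeline.

```python
from math import sqrt

def allele_frequency(genotypes, max_missing=0.05):
    """Frequency of the counted allele for one SNP.

    genotypes: list of 0/1/2 allele counts, None for a missing call.
    Returns None when the SNP fails the missingness filter.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    if 1 - len(called) / len(genotypes) > max_missing:
        return None
    return sum(called) / (2 * len(called))

def pearson_r(x, y):
    """Pearson correlation of two equal-length lists (both must vary)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def batch_frequency_correlation(batch_a, batch_b, max_missing=0.05):
    """Correlate allele frequencies over SNPs that pass the filter in both batches.

    batch_a, batch_b: dict mapping SNP id -> list of genotypes.
    """
    freqs_a, freqs_b = [], []
    for snp in sorted(batch_a.keys() & batch_b.keys()):
        fa = allele_frequency(batch_a[snp], max_missing)
        fb = allele_frequency(batch_b[snp], max_missing)
        if fa is not None and fb is not None:
            freqs_a.append(fa)
            freqs_b.append(fb)
    return pearson_r(freqs_a, freqs_b)
```

In practice the correlation would be computed per chromosome over many thousands of SNPs; the sketch only shows the order of operations (filter on missingness, then correlate).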
After post-genotyping quality assurance, such as excluding ethnic outliers (self-reported) and duplicates, the GWAS data were available for a total of 7,412 ELSA participants and 2,230,767 SNPs.
For more information, contact the ELSA Team.
Quality controlled genetic data
Using methods employed in the Health and Retirement Study, we carried out robust quality control of the genotyped genetic data in ELSA.
Quality control was performed using PLINK, R and VCFtools. The full QC procedure is depicted in Figure 1.
QC based on individual level. Samples for whom the recorded sex was inconsistent with genetic sex were removed. Duplicated samples and cryptic relatedness between each pair of participants were evaluated using pairwise genome-wide estimates of three coefficients corresponding to the probabilities of two individuals sharing 0, 1 or 2 alleles identical by descent (IBD). There are two methods for estimating IBD probabilities: the method of moments and the method of maximum likelihood. Both methods have been shown to give very similar results; thus, we report results from the method of moments implemented in PLINK 1.9.
IBD was estimated using autosomal SNPs. An IBD value of 1 indicates duplicates or monozygotic twins, 0.5 indicates first-degree relatives, and 0.25 and 0.125 indicate second-degree and third-degree relatives, respectively. Owing to genotyping error, linkage disequilibrium (LD) and population structure, some variation around these theoretical values is expected.
Therefore, it is common practice to remove one individual from each pair with an IBD value of >0.2, which is approximately halfway between the expected IBD for third- and second-degree relatives. We identified pairs of individuals with an IBD value of >0.2 and excluded one of each pair at random.
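As a minimal sketch of this relatedness filter, the overall IBD proportion for a pair can be derived from the three sharing probabilities as P(IBD=2) + 0.5 × P(IBD=1), which is the quantity PLINK reports as PI_HAT. The function names, the tuple layout and the fixed random seed below are assumptions for illustration only.

```python
import random

def pi_hat(z0, z1, z2):
    """Proportion of alleles shared IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return z2 + 0.5 * z1

def samples_to_exclude(pairs, threshold=0.2, seed=0):
    """Pick one sample at random from every pair above the threshold.

    pairs: iterable of (id1, id2, z0, z1, z2) tuples.
    Returns the set of sample ids to drop.
    """
    rng = random.Random(seed)  # fixed seed so the choice is reproducible
    excluded = set()
    for id1, id2, z0, z1, z2 in pairs:
        if pi_hat(z0, z1, z2) <= threshold:
            continue
        if id1 in excluded or id2 in excluded:
            continue  # this pair is already broken by an earlier exclusion
        excluded.add(rng.choice([id1, id2]))
    return excluded
```

Skipping pairs that already contain an excluded sample avoids dropping more individuals than necessary when one person is related to several others.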
QC based on SNP level. Heterozygosity refers to carrying two different alleles of a specific SNP. Excessive heterozygosity may imply sample contamination, while less heterozygosity than expected may imply inbreeding.
In ELSA, the checks for heterozygosity were performed on a set of SNPs that were not highly correlated. To generate this list, we excluded four regions that are known to contain clusters of highly correlated SNPs. These were the lactase gene (LCT) (chromosome 2, 12578740 to 135837195 bp), human leukocyte antigen (HLA) (chromosome 6, 25500000 to 33500000 bp) and two inversion regions located on 8p23.1 (chromosome 8, 81305000 to 1200000 bp) and 17q21.31 (chromosome 17, 40900000 to 45000000 bp) [8]. We then pruned the SNPs using the ‘10 5 0.1’ parameters.
These pruning parameters define a sliding-window method that considers blocks of 10 SNPs, removes SNPs with r² >0.10 and then shifts the window by 5 SNPs. Individuals with an extremely low or high heterozygosity score (>3 standard deviations from the mean) were removed. Further, genotyped data with a call rate of <98% were removed. SNPs on sex chromosomes and SNPs with a minor allele frequency (MAF) of <0.01 were excluded. SNPs whose genotype distributions deviated significantly from Hardy-Weinberg equilibrium (HWE) (p<10⁻⁴) or that had missingness >0.02 were also removed.
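The SNP-level thresholds can be illustrated with a short sketch that drops a SNP when its call rate is below 98%, its MAF is below 0.01, or its HWE p-value is below 10⁻⁴. This is an assumption-laden illustration: it uses a 1-degree-of-freedom chi-square test for HWE, whereas PLINK uses an exact test, and the function names are hypothetical.

```python
from math import erfc, sqrt

def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) test of Hardy-Weinberg equilibrium from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: nothing to test
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom
    return erfc(sqrt(chi2 / 2))

def passes_snp_qc(genotypes, min_maf=0.01, max_missing=0.02, hwe_p=1e-4):
    """genotypes: list of 0/1/2 allele counts, None for a missing call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    if 1 - len(called) / len(genotypes) > max_missing:
        return False  # call rate below 98%
    freq = sum(called) / (2 * len(called))
    if min(freq, 1 - freq) < min_maf:
        return False  # too rare
    n_bb, n_ab, n_aa = (called.count(k) for k in (0, 1, 2))
    return hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= hwe_p
```

The chi-square approximation is poor for rare variants and small samples, which is one reason an exact test is normally preferred in production pipelines.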
Finally, to ensure a large overlap between the GWAS summary statistics (i.e., base file) and the ELSA (i.e., target) data, we converted all platform-specific IDs (i.e., kgp IDs) to rsIDs. However, not all kgp IDs could be updated; SNPs whose kgp IDs could not be updated were removed.
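The identifier update amounts to a lookup with removal of unmapped entries. The sketch below assumes a hypothetical mapping from platform-specific IDs to rsIDs held in a dictionary; the IDs shown in the usage note are invented placeholders.

```python
def update_snp_ids(snp_ids, kgp_to_rsid):
    """Rename platform-specific IDs to rsIDs and report what could not be mapped.

    snp_ids: list of SNP identifiers as they appear in the genotype file.
    kgp_to_rsid: dict mapping platform-specific id -> rsID.
    Returns (renamed_ids, dropped_ids).
    """
    renamed, dropped = [], []
    for snp in snp_ids:
        if snp.startswith("rs"):
            renamed.append(snp)               # already an rsID
        elif snp in kgp_to_rsid:
            renamed.append(kgp_to_rsid[snp])  # successfully updated
        else:
            dropped.append(snp)               # no mapping: remove the SNP
    return renamed, dropped
```

For example, `update_snp_ids(["rs123", "kgp1", "kgp2"], {"kgp1": "rs456"})` keeps `rs123`, renames `kgp1` to `rs456`, and drops `kgp2`.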
Population structure. To investigate population structure, we used principal components analysis (PCA) [9] implemented in PLINK 1.9. We used the PCA approach with two aims: first, to identify individuals who deviated from their self-reported ethnic population (i.e., ethnic outliers), and second, to provide sample eigenvectors for adjusting for possible population stratification in the association analyses [9,10].
It has been shown that in PCA, the usefulness of certain principal components (PCs) may be limited by clusters of highly correlated SNPs at specific locations in whole-genome arrays, such as the LCT, HLA, 8p23.1 and 17q21.31 regions [4,8]. To address this pitfall, the SNPs that were used in PCA were selected by LD pruning from an initial pool consisting of all autosomal SNPs with a missing call rate <5% and MAF >5%. In addition, the 2q21 (LCT), HLA, 8p23, and 17q21.31 regions were excluded from this initial pool.
The LD pruning process, using all unrelated ELSA participants, selected 147,070 SNPs with all pairs having r² <0.1 in a sliding 10 Mb window. PCs were obtained using PLINK; we retained the top 10 PCs to account for any ancestry differences in genetic structure that could bias results. Initially, we performed PCA on all study subjects; however, visual inspection of the PC distributions highlighted the presence of ancestral admixture in 65 individuals. We removed these outliers and re-calculated the PCs using the updated sample.
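To show what a principal component of genotype data is, the sketch below computes only the first PC by power iteration on standardised genotypes. It is a pure-Python illustration under stated assumptions (no missing calls, a common centre-and-scale standardisation, hypothetical function names); the ELSA PCs themselves were obtained with PLINK.

```python
import random
from math import sqrt

def standardise_genotypes(genotypes):
    """Centre each SNP at 2p and scale by sqrt(2p(1-p)).

    genotypes: list of SNP rows, each a list of 0/1/2 counts per sample.
    Monomorphic SNPs are skipped because they carry no information.
    """
    rows = []
    for snp in genotypes:
        p = sum(snp) / (2 * len(snp))
        if p == 0.0 or p == 1.0:
            continue
        sd = sqrt(2 * p * (1 - p))
        rows.append([(g - 2 * p) / sd for g in snp])
    return rows

def first_pc(genotypes, n_iter=100, seed=0):
    """Sample loadings on the first PC, found by power iteration."""
    x = standardise_genotypes(genotypes)
    n_samples = len(x[0])
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(n_samples)]
    for _ in range(n_iter):
        # Multiply v by X^T X without forming the sample-by-sample matrix
        xv = [sum(a * b for a, b in zip(row, v)) for row in x]
        v = [sum(row[i] * s for row, s in zip(x, xv)) for i in range(n_samples)]
        norm = sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    return v
```

Samples from genetically distinct groups receive loadings of opposite sign or clearly different magnitude on the leading PCs, which is what makes plots of the PCs useful for spotting outliers.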
Figure 1. QC steps that were undertaken as part of quality control in ELSA
1 Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American journal of human genetics 81, 559-575, doi:10.1086/519795 (2007).
2 Danecek, P. et al. The variant call format and VCFtools. Bioinformatics (Oxford, England) 27, 2156-2158, doi:10.1093/bioinformatics/btr330 (2011).
3 Huff, C. D. et al. Maximum-likelihood estimation of recent shared ancestry (ERSA). Genome Res 21, 768-774, doi:10.1101/gr.115972.110 (2011).
4 Laurie, C. C. et al. Quality control and quality assurance in genotypic data for genome-wide association studies. Genet Epidemiol 34, 591-602, doi:10.1002/gepi.20516 (2010).
5 Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7, doi:10.1186/s13742-015-0047-8 (2015).
6 Anderson, C. A. et al. Data quality control in genetic case-control association studies. Nat Protoc 5, 1564-1573, doi:10.1038/nprot.2010.116 (2010).
7 Marees, A. T. et al. A tutorial on conducting genome-wide association studies: Quality control and statistical analysis. Int J Methods Psychiatr Res 27, e1608, doi:10.1002/mpr.1608 (2018).
8 Novembre, J. et al. Genes mirror geography within Europe. Nature 456, 98-101, doi:10.1038/nature07331 (2008).
9 Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904-909, doi:10.1038/ng1847 (2006).
10 Wang, D. et al. Comparison of methods for correcting population stratification in a genome-wide association study of rheumatoid arthritis: principal-component analysis versus multidimensional scaling. BMC Proc 3 Suppl 7, S109 (2009).
11 Kunkle, B. W. et al. Genetic meta-analysis of diagnosed Alzheimer's disease identifies new risk loci and implicates Abeta, tau, immunity and lipid processing. Nature genetics 51, 414-430, doi:10.1038/s41588-019-0358-2 (2019).
To estimate genotypes that were not assayed, imputation was performed on the Michigan Imputation Server, running SHAPEIT for pre-phasing and Minimac3 for imputation [3,4], using the Haplotype Reference Consortium (HRC.r1-1.GRCh37) [1,5] as the reference panel. All variants were aligned to human genome build 19 (hg19).
After imputation, we required very high imputation quality (INFO>0.95) and low missingness (<1%) for further quality control. We limited our analyses to variants genotyped or imputed with HWE P-value>10− . We further applied stringent pruning to remove markers in high linkage disequilibrium (r²>0.1) and excluded genomic regions of high linkage disequilibrium. To investigate population structure, we chose less correlated SNPs for principal components analysis.
The SNP pruning was performed following the procedure:
i) Consider a window of 50 SNPs
ii) Calculate linkage disequilibrium between each pair of SNPs in the window
iii) Remove one of a pair of SNPs if the linkage disequilibrium is greater than 0.5
iv) Shift the window 5 SNPs forward and
v) Repeat the procedure.
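The procedure above can be sketched as a small function. This is a simplified illustration, not PLINK's implementation: linkage disequilibrium is measured as the squared correlation of 0/1/2 genotype counts, the later SNP of an offending pair is always the one removed, missing calls are not handled, and the function names are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP is uncorrelated with everything
    return sxy * sxy / (sxx * syy)

def ld_prune(genotypes, window=50, step=5, r2_max=0.5):
    """Return the indices of SNPs kept after sliding-window pruning.

    genotypes: list of SNP rows in chromosomal order, each a list of
    0/1/2 counts per sample.
    """
    removed = set()
    start = 0
    while start < len(genotypes):
        stop = min(start + window, len(genotypes))
        in_window = [i for i in range(start, stop) if i not in removed]
        for pos, a in enumerate(in_window):
            if a in removed:
                continue
            for b in in_window[pos + 1:]:
                if b not in removed and r_squared(genotypes[a], genotypes[b]) > r2_max:
                    removed.add(b)  # simplification: always drop the later SNP
        start += step  # shift the window forward
    return [i for i in range(len(genotypes)) if i not in removed]
```

Calling `ld_prune(genotypes, window=50, step=5, r2_max=0.5)` mirrors the five steps listed above; a stricter threshold such as `r2_max=0.1` retains fewer SNPs.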
Altogether, 1,083,252 autosomal SNPs remained after the SNP pruning and were used to run principal components analysis. The top 10 principal components were retained to account for any ancestry differences in genetic structure that could potentially bias the results.
After the sample quality control, 7,179,780 variants and 7,183 samples were kept.
Genetic imputation using the reference from 1000 Genomes Project
Prior to the method described above, genetic imputation was carried out using the reference panel from the 1000 Genomes Project.
1 Das, S. et al. Next-generation genotype imputation service and methods. Nat Genet 48, 1284-1287, doi:10.1038/ng.3656 (2016).
2 Delaneau, O., Zagury, J. F. & Marchini, J. Improved whole-chromosome phasing for disease and population genetic studies. Nature methods 10, 5-6, doi:10.1038/nmeth.2307 (2013).
3 Fuchsberger, C., Abecasis, G. R. & Hinds, D. A. minimac2: faster genotype imputation. Bioinformatics (Oxford, England) 31, 782-784, doi:10.1093/bioinformatics/btu704 (2015).
4 Howie, B., Fuchsberger, C., Stephens, M., Marchini, J. & Abecasis, G. R. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nature genetics 44, 955-959, doi:10.1038/ng.2354 (2012).
5 McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nature genetics 48, 1279-1283, doi:10.1038/ng.3643 (2016).
6 Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904-909, doi:10.1038/ng1847 (2006).
Polygenic Score Data (PGS)
Polygenic scores (PGSs) are usually constructed as a weighted sum of allele counts [1-3] and are presented as continuous scores. They are specific to each individual and represent an individual's load of the common variants that are associated with the trait under study. PGSs are increasingly used to predict disease risks. This is usually done through linear regression analyses in which the PGS for a given trait is used as a predictor of an outcome, adjusting for various covariates, which usually include age, gender and principal components to account for any ancestry differences in genetic structure that could bias results.
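A weighted sum of allele counts can be written in a few lines. The sketch below is illustrative only: the SNP identifiers and weights are invented, the function names are hypothetical, and it does not reproduce the HRS or ELSA scoring pipeline.

```python
from math import sqrt

def polygenic_score(allele_counts, weights):
    """Weighted sum of effect-allele counts for one individual.

    allele_counts: dict mapping SNP id -> 0/1/2 count of the effect allele.
    weights: dict mapping SNP id -> effect size from a GWAS.
    Only SNPs present in both dictionaries contribute.
    """
    shared = allele_counts.keys() & weights.keys()
    return sum(weights[snp] * allele_counts[snp] for snp in shared)

def z_scores(scores):
    """Standardise raw scores to mean 0 and standard deviation 1."""
    n = len(scores)
    mean = sum(scores) / n
    sd = sqrt(sum((s - mean) ** 2 for s in scores) / (n - 1))
    return [(s - mean) / sd for s in scores]
```

Standardising the raw scores within the sample is a common convenience so that regression coefficients can be read per standard deviation of the PGS.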
Another popular way of using PGSs is to derive a binary predictor from the continuous PGS, where the top 10% or 20% of the PGS distribution is coded as the “high risk” group and the remainder as the “low risk” group, based on an individual's loading for the common SNPs. In turn, genomic prediction of disease risks might have implications for designing more individualised preventive or screening strategies for patients. For example, earlier screening for breast cancer may be warranted for those with a high genetic risk for the disease as measured by the PGS.
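The binary coding described above can be sketched as follows. The function name and the handling of ties (tied scores at the cutoff are all coded as high risk) are assumptions made for illustration.

```python
def high_risk_flags(scores, top_fraction=0.10):
    """Code the top fraction of scores as 1 ("high risk") and the rest as 0.

    scores: list of continuous PGS values, one per individual.
    top_fraction: share of the sample to flag, e.g. 0.10 or 0.20.
    """
    n_high = max(1, round(len(scores) * top_fraction))
    cutoff = sorted(scores, reverse=True)[n_high - 1]
    return [1 if s >= cutoff else 0 for s in scores]
```

Dichotomising a continuous score discards information, so the binary predictor is usually reported alongside, rather than instead of, the continuous PGS.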
Furthermore, PGSs have been shown to be suitable for a number of scientific aims beyond risk prediction, including identification of shared aetiology among traits using analytical tools such as GCTA (Genome-wide Complex Trait Analysis), testing for genome-wide G*E and G*G interactions, Mendelian Randomisation to infer causal relationships, and patient stratification and sub-phenotyping [5,8].
Thus, PGSs not only represent an individual genetic prediction of phenotypes but also open possibilities for interrogating a wide range of hypotheses via association testing.
In ELSA, the methods employed for creating PGSs are those outlined by the Health and Retirement Study (HRS). This was done in order to harmonise research across age-related longitudinal studies by adopting a consistent methodology for creating PGSs. By making these PGSs publicly available, we hope to facilitate their wide use among ELSA data users.
PGSs for each phenotype are based on a single, replicated genome-wide association study (GWAS). These scores will be updated as sufficiently large GWAS are published for new phenotypes or as updated meta-analyses for existing phenotypes are released. The Polygenic Score Data and a detailed report describing the methods employed and the full list of PGSs available in ELSA can be found here.
For any other questions related to PGSs in ELSA, please contact the ELSA Team.
Polygenic indices (PGIs)
As a resource for researchers, Becker et al. (2021) used a consistent methodology to construct PGIs for 47 phenotypes in 11 datasets. To maximize the PGIs’ prediction accuracies, PGIs were constructed using genome-wide association studies (some of which are novel) from multiple data sources, including 23andMe and UK Biobank.
Anyone wishing to utilise these PGIs as described in the paper (excluding the data from the ELSA study), will be able to download the scores with corresponding documents at the end of February 2021. This page will be updated as soon as they are available.
1 So, H. C. & Sham, P. C. Improving polygenic risk prediction from summary statistics by an empirical Bayes approach. Scientific reports 7, 41262, doi:10.1038/srep41262 (2017).
2 Purcell, S. M. et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748-752, doi:10.1038/nature08185 (2009).
3 Dudbridge, F. Power and predictive accuracy of polygenic risk scores. PLoS Genet 9, e1003348, doi:10.1371/journal.pgen.1003348 (2013).
4 Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904-909, doi:10.1038/ng1847 (2006).
5 Mavaddat, N. et al. Prediction of breast cancer risk based on profiling with common genetic variants. Journal of the National Cancer Institute 107, doi:10.1093/jnci/djv036 (2015).
6 Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. American journal of human genetics 88, 76-82, doi:10.1016/j.ajhg.2010.11.011 (2011).
7 Mullins, N. et al. Polygenic interactions with environmental adversity in the aetiology of major depressive disorder. Psychological medicine 46, 759-770, doi:10.1017/s0033291715002172 (2016).
8 Natarajan, P. et al. Polygenic Risk Score Identifies Subgroup With Higher Burden of Atherosclerosis and Greater Relative Benefit From Statin Therapy in the Primary Prevention Setting. Circulation 135, 2091-2101, doi:10.1161/circulationaha.116.024436 (2017).
9 Wray, N. R. et al. Research review: Polygenic methods and their application to psychiatric traits. Journal of child psychology and psychiatry, and allied disciplines 55, 1068-1087, doi:10.1111/jcpp.12295 (2014).
10 Ware, E. B. et al. Method of Construction Affects Polygenic Score Prediction of Common Human Trait. bioRxiv, 1-13 (2017).
Genetic data linked to phenotypic data
Information coming soon.