Genetics

ELSA provides genetic data for detailed analyses of a wide range of age-related traits and outcomes in association with genetic factors in our large and phenotypically well-characterised sample of older people born in England.

Because there may be some researchers who wish to employ genetic data in their research without access to the necessary tools or knowledge to carry out robust quality control or imputation of the untyped genotypes, we have carried out all of these within the team and made the data available for all to use.

Therefore, several genetic data products derived from our sample are available as detailed below.

If you have any questions about the ELSA genetic data products, contact the ELSA Team and we will be happy to help you.

Available data

Directly genotyped data for 7,412 ELSA participants and 2,230,767 SNPs.

These data are for researchers who wish to use directly genotyped genome-wide genotypes without imputation or without having to undertake specific quality control steps.

Having access to raw genotypes gives researchers an opportunity to employ their own quality-control steps and choose their preferred methods for imputing genotypes that were not assayed.

Access

These data are accessed via the European Genome-phenome Archive (EGA).

In detail

The genome-wide genotyping was performed at University College London (UCL) Genomics in 2013-2014, funded by the ESRC.

This involved genotyping 7,597 ELSA participants of European ancestry using Illumina HumanOmni2.5 BeadChips (HumanOmni2.5-4v1, HumanOmni2.5-8v1.3), which measure ~2.5 million markers that capture genomic variation down to 2.5% minor allele frequency. Genotyping was performed in two batches.

Allele frequencies were compared between the batches after filtering for 5% of missingness. The correlation was calculated between the batches for a number of chromosomes and exceeded 99%.
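The batch comparison described above can be illustrated with a minimal sketch. It assumes genotypes are coded as 0/1/2 allele counts with missing calls stored as NaN; the function names and the simulated data are hypothetical and are not part of the ELSA pipeline.

```python
import numpy as np

def allele_frequencies(genotypes, max_missing=0.05):
    """Return per-SNP allele frequencies and a mask of SNPs passing the filter.

    genotypes: samples x SNPs array of 0/1/2 allele counts, np.nan = missing.
    """
    missing_rate = np.isnan(genotypes).mean(axis=0)
    keep = missing_rate <= max_missing           # 5% missingness filter
    freqs = np.nanmean(genotypes, axis=0) / 2.0  # mean allele count / 2
    return freqs, keep

def batch_frequency_correlation(batch_a, batch_b, max_missing=0.05):
    """Pearson correlation of allele frequencies over SNPs kept in both batches."""
    freq_a, keep_a = allele_frequencies(batch_a, max_missing)
    freq_b, keep_b = allele_frequencies(batch_b, max_missing)
    keep = keep_a & keep_b
    return float(np.corrcoef(freq_a[keep], freq_b[keep])[0, 1])

# Simulated data (an assumption for illustration): two batches drawn from
# the same underlying allele frequencies.
rng = np.random.default_rng(0)
true_freq = rng.uniform(0.05, 0.5, size=200)
batch_a = rng.binomial(2, true_freq, size=(300, 200)).astype(float)
batch_b = rng.binomial(2, true_freq, size=(300, 200)).astype(float)
batch_a[:150, 0] = np.nan  # SNP 0 fails the missingness filter in batch A

correlation = batch_frequency_correlation(batch_a, batch_b)
print(f"Between-batch allele frequency correlation: {correlation:.3f}")
```

A correlation close to 1 indicates that the two batches yield consistent allele frequencies for the SNPs that pass the missingness filter.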

After post-genotyping quality assurance, such as excluding ethnic outliers (self-reported) and duplicates, the GWAS data were available for a total of 7,412 ELSA participants and 2,230,767 SNPs.

For more information, contact the ELSA Team.

Using methods employed in the Health and Retirement Study, we carried out robust quality control of the genotyped genetic data in ELSA.

In total, 7,183 samples (96.9% of 7,412 original cohort) and 1,372,240 (61.5% of 2,230,767) variants remained after quality control.

These data are for researchers who would like to utilise directly genotyped genome-wide genotypes but do not have access to the necessary tools or knowledge to carry out the robust quality controls necessary for these types of studies. We have carried out quality controls and quality assurance of our GWAS data and made these available to all ELSA data users.

Access

These data are accessed directly from the ELSA Team.

In detail

Quality control was performed using PLINK [1], R and VCFtools [2]. The full QC procedure is depicted in Figure 1.

QC based on individual level. Samples for whom the recorded sex phenotype was inconsistent with genetic sex were removed. Duplicated samples and cryptic relatedness between each pair of participants were evaluated using pairwise genome-wide estimates of three coefficients corresponding to the probabilities that two individuals share 0, 1 or 2 alleles that are identical by descent [3]. There are two methods for estimating the identical by descent (IBD) probabilities: the method of moments and the method of maximum likelihood. Both methods have been shown to give very similar results [4]; thus, we report results from the method of moments implemented in PLINK 1.9 [5].

IBD values were estimated using autosomal SNPs, where IBD=1 indicates duplicates or monozygotic twins, IBD=0.5 indicates first-degree relatives, and IBD=0.25 and IBD=0.125 indicate second-degree and third-degree relatives, respectively [6]. Owing to genotyping error, linkage disequilibrium (LD) and population structure, some variation around these theoretical values is expected.

Therefore, it is common practice to remove one individual from each pair with an IBD value of >0.2, which is approximately halfway between the expected IBD for third- and second-degree relatives [7]. We identified individuals with an IBD value of >0.2 and excluded one of each pair at random.
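As an illustration of how the three IBD-sharing probabilities combine into a single relatedness value, the sketch below computes the proportion of the genome shared IBD as P(IBD=2) + 0.5 × P(IBD=1), which is how PLINK defines its PI_HAT statistic, and then drops one member of each related pair at random. The function names, sample identifiers and example probabilities are hypothetical.

```python
import random

def pi_hat(z1, z2):
    """Proportion IBD from P(IBD=1) and P(IBD=2): z2 + 0.5 * z1."""
    return z2 + 0.5 * z1

def samples_to_remove(pairs, threshold=0.2, seed=0):
    """Pick one individual at random from each pair above the threshold.

    pairs: iterable of (id_a, id_b, z0, z1, z2) tuples.
    """
    rng = random.Random(seed)
    removed = set()
    for id_a, id_b, _z0, z1, z2 in pairs:
        if pi_hat(z1, z2) <= threshold:
            continue  # pair is not considered related
        if id_a in removed or id_b in removed:
            continue  # the pair is already broken by an earlier removal
        removed.add(rng.choice([id_a, id_b]))
    return removed

# Hypothetical pairwise estimates: duplicates, full siblings, unrelated.
example_pairs = [
    ("A", "B", 0.0, 0.0, 1.0),     # PI_HAT = 1.0
    ("C", "D", 0.25, 0.5, 0.25),   # PI_HAT = 0.5
    ("E", "F", 1.0, 0.0, 0.0),     # PI_HAT = 0.0
]
removed = samples_to_remove(example_pairs)
print(sorted(removed))
```

Fixing the random seed makes the exclusion list reproducible, which helps when the same sample list has to be regenerated later.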

QC based on SNP level. Heterozygosity refers to carrying two different alleles of a specific SNP. Excessive heterozygosity may indicate sample contamination, while lower heterozygosity than expected may indicate inbreeding [7].

In ELSA, the checks for heterozygosity were performed on a set of SNPs that were not highly correlated. To generate this list, we excluded four regions that are known to contain clusters of highly correlated SNPs. These were the lactase gene (LCT) (chromosome 2, 12578740 to 135837195 bp), human leukocyte antigen (HLA) (chromosome 6, 2550000 to 3350000 bp) and two inversion regions located on 8p23.1 (chromosome 8, 81305000 to 1200000 bp) and 17q21.31 (chromosome 17, 40900000 to 45000000 bp) [8]. We then pruned the SNPs using the ‘10 5 0.1’ parameters.

These pruning parameters use a sliding-window method that considers blocks of 10 SNPs, removes SNPs with r2 >0.10 and then shifts the window by 5 SNPs. Individuals with an extremely low or high heterozygosity score (>3 standard deviations from the mean) were removed. Further, genotyped data with a call rate of <98% were removed. SNPs on sex chromosomes and SNPs with a minor allele frequency (MAF) of <0.01 were excluded. SNPs whose genotype distributions deviated significantly from Hardy-Weinberg equilibrium (HWE) (p<10⁻⁴) and SNPs with missingness >0.02 were also removed.
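The heterozygosity, MAF and missingness filters can be sketched as follows. This is an illustrative example only: it assumes genotypes coded as 0/1/2 allele counts with NaN for missing calls, it skips the LD-pruning and HWE steps, and the function names and simulated data are hypothetical.

```python
import numpy as np

def heterozygosity_outliers(genotypes, n_sd=3.0):
    """Flag samples whose heterozygosity rate is > n_sd SDs from the mean."""
    called = ~np.isnan(genotypes)
    het_rate = (genotypes == 1).sum(axis=1) / called.sum(axis=1)
    z_scores = (het_rate - het_rate.mean()) / het_rate.std()
    return np.abs(z_scores) > n_sd

def snp_filter(genotypes, min_maf=0.01, max_missing=0.02):
    """Keep SNPs with MAF >= min_maf and missingness <= max_missing."""
    missing = np.isnan(genotypes).mean(axis=0)
    allele_freq = np.nanmean(genotypes, axis=0) / 2.0
    maf = np.minimum(allele_freq, 1.0 - allele_freq)
    return (maf >= min_maf) & (missing <= max_missing)

# Simulated data (an assumption for illustration).
rng = np.random.default_rng(0)
genotypes = rng.binomial(2, 0.3, size=(200, 500)).astype(float)
genotypes[0, :] = 1.0       # sample 0: heterozygous everywhere (contamination-like)
genotypes[:, 1] = 0.0       # SNP 1: monomorphic, so MAF = 0
genotypes[:20, 2] = np.nan  # SNP 2: 10% missing calls

outlier_flags = heterozygosity_outliers(genotypes)
snp_keep = snp_filter(genotypes)
print(outlier_flags.sum(), "sample(s) flagged;", snp_keep.sum(), "SNPs kept")
```

Because the mean and standard deviation are computed on the full sample, a single extreme sample inflates the standard deviation slightly; with real data it can be worth inspecting the distribution before fixing the cut-off.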

Finally, to ensure a large overlap between the GWAS summary statistics (i.e., base file) and the ELSA (i.e., target) data, we have converted all present platform specific ids (i.e., kgps) to rsids. However, not all kgps were able to be successfully updated; those SNPs for which the kgps were not updated were removed.

Population structure. To investigate population structure, we used principal components analysis (PCA) [9] implemented in PLINK 1.9 [5]. We used the PCA approach with two aims: first, to identify individuals who deviated from the ethnic population they self-reported belonging to (i.e., ethnic outliers), and second, to provide sample eigenvectors that can then be used to adjust for possible population stratification in association analyses [9,10].

It has been shown that in PCA, the usefulness of certain principal components (PCs) may be limited by clusters of highly correlated SNPs at specific locations, such as the LCT, HLA, 8p23.1 and 17q21.31 [4,8] in whole-genome arrays [8]. To address this pitfall, the SNPs that were used in PCA were selected by LD pruning from an initial pool consisting of all autosomal SNPs with a missing call rate <5% and MAF >5%. In addition, the 2q21 (LCT), HLA, 8p23, and 17q21.31 regions were excluded from this initial pool.

The LD pruning process, using all unrelated ELSA participants, selected 147,070 SNPs with all pairs having r2 <0.1 in a sliding 10 Mb window. PCs were obtained using PLINK software; we retained the top 10 PCs to account for any ancestry differences in genetic structures that could bias results [9]. Initially, we performed PCA on all study subjects; however, visual inspection of the distribution of the PCs highlighted the presence of ancestral admixture in 65 individuals. We removed these outliers and re-calculated the PCs using the updated sample.
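To make the PCA step concrete, the following sketch derives principal components from a standardised genotype matrix using a singular value decomposition. It is a simplified stand-in for the PLINK implementation: it assumes complete 0/1/2 genotypes for already-pruned SNPs, and the function name and simulated populations are hypothetical.

```python
import numpy as np

def top_principal_components(genotypes, n_components=10):
    """Return sample scores on the top principal components.

    genotypes: samples x SNPs array of 0/1/2 allele counts, no missing calls.
    """
    freq = genotypes.mean(axis=0) / 2.0
    polymorphic = (freq > 0) & (freq < 1)  # drop SNPs with zero variance
    g = genotypes[:, polymorphic]
    p = freq[polymorphic]
    # Centre each SNP at 2p and scale by its expected binomial SD.
    standardized = (g - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    u, s, _ = np.linalg.svd(standardized, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Simulated data (an assumption for illustration): two groups of 50 samples
# with unrelated allele frequencies, mimicking strong population structure.
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.1, 0.9, size=300)
freq_b = rng.uniform(0.1, 0.9, size=300)
genotypes = np.vstack([
    rng.binomial(2, freq_a, size=(50, 300)),
    rng.binomial(2, freq_b, size=(50, 300)),
]).astype(float)

pcs = top_principal_components(genotypes)
print(pcs.shape)  # one row per sample, one column per retained PC
```

In this toy example the first component separates the two simulated groups, which is the same pattern that visual inspection of PC plots is used to detect in real data.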

Download the detailed report

Figure 1. QC steps that were undertaken as part of quality control in ELSA (Download the image)

References:

1 Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American journal of human genetics 81, 559-575, doi:10.1086/519795 (2007).

2 Danecek, P. et al. The variant call format and VCFtools. Bioinformatics (Oxford, England) 27, 2156-2158, doi:10.1093/bioinformatics/btr330 (2011).

3 Huff, C. D. et al. Maximum-likelihood estimation of recent shared ancestry (ERSA). Genome Res 21, 768-774, doi:10.1101/gr.115972.110 (2011).

4 Laurie, C. C. et al. Quality control and quality assurance in genotypic data for genome-wide association studies. Genet Epidemiol 34, 591-602, doi:10.1002/gepi.20516 (2010).

5 Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7, doi:10.1186/s13742-015-0047-8 (2015).

6 Anderson, C. A. et al. Data quality control in genetic case-control association studies. Nat Protoc 5, 1564-1573, doi:10.1038/nprot.2010.116 (2010).

7 Marees, A. T. et al. A tutorial on conducting genome-wide association studies: Quality control and statistical analysis. Int J Methods Psychiatr Res 27, e1608, doi:10.1002/mpr.1608 (2018).

8 Novembre, J. et al. Genes mirror geography within Europe. Nature 456, 98-101, doi:10.1038/nature07331 (2008).

9 Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904-909, doi:10.1038/ng1847 (2006).

10 Wang, D. et al. Comparison of methods for correcting population stratification in a genome-wide association study of rheumatoid arthritis: principal-component analysis versus multidimensional scaling. BMC Proc 3 Suppl 7, S109 (2009).

11 Kunkle, B. W. et al. Genetic meta-analysis of diagnosed Alzheimer's disease identifies new risk loci and implicates Abeta, tau, immunity and lipid processing. Nature genetics 51, 414-430, doi:10.1038/s41588-019-0358-2 (2019).

Data available

Polygenic Score Data (PGS) are available for a number of behavioural, emotional and health-related phenotypes.

Access

These data are available to download from the UK Data Service: https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=8773

In detail

Polygenic scores (PGSs) are usually constructed from a weighted sum of allele counts [1-3] and are presented as continuous scores. They are specific to each individual and represent an individual's load of the common variants that are associated with the trait under study. PGSs are increasingly used to predict disease risks [1]. This is usually done through linear regression analyses in which the PGS for a given trait is used as a predictor of an outcome, adjusting the analyses for various covariates, which usually include age, gender and principal components to account for any ancestry differences in genetic structures that could bias results [4].
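A minimal sketch of the weighted-sum construction and the covariate-adjusted regression is given below. The weights, covariates and outcome are simulated for illustration, and the function names are hypothetical; this is not the procedure used to build the ELSA PGSs.

```python
import numpy as np

def polygenic_score(allele_counts, weights):
    """Weighted sum of allele counts: one score per individual."""
    return allele_counts @ weights

def fit_linear_model(outcome, pgs, covariates):
    """Ordinary least squares of outcome on intercept, PGS and covariates.

    Returns coefficients ordered as [intercept, PGS, covariate(s)...].
    """
    design = np.column_stack([np.ones(len(outcome)), pgs, covariates])
    coefs, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coefs

# Worked example: two individuals, three SNPs.
small_counts = np.array([[0, 1, 2], [2, 0, 1]])
small_weights = np.array([0.1, -0.2, 0.3])
small_scores = polygenic_score(small_counts, small_weights)  # 0.4 and 0.5

# Simulated cohort (an assumption for illustration).
rng = np.random.default_rng(1)
n_people, n_snps = 2000, 50
counts = rng.binomial(2, rng.uniform(0.1, 0.5, n_snps), size=(n_people, n_snps))
weights = rng.normal(0.0, 0.1, n_snps)
pgs = polygenic_score(counts, weights)
pgs_std = (pgs - pgs.mean()) / pgs.std()  # standardise for interpretability
age = rng.uniform(50, 90, n_people)
outcome = 1.0 + 0.5 * pgs_std + 0.02 * age + rng.normal(0.0, 1.0, n_people)

coefs = fit_linear_model(outcome, pgs_std, age)
print(f"Estimated PGS effect per SD: {coefs[1]:.2f}")
```

In practice the covariate matrix would also carry gender and the ancestry principal components as extra columns alongside age.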

Another popular way of using PGSs is to derive a binary predictor from the continuous PGS, where the top 10% or 20% of the PGS distribution is coded as the “high risk” group and the remainder is coded as the “low risk” group, based on an individual's loading for the common SNPs. In turn, genomic prediction of disease risks might have implications for designing more individualised preventive or screening strategies for patients [1]. For example, earlier screening for breast cancer may be warranted for those having a high genetic risk for the disease as measured by the PGS [5].
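The dichotomisation described above can be sketched in a few lines. The function name and the example scores are hypothetical, and the cut-off is taken from the sample's own PGS distribution.

```python
import numpy as np

def high_risk_indicator(pgs, top_fraction=0.10):
    """Code the top fraction of the PGS distribution as 1, the rest as 0."""
    cutoff = np.quantile(pgs, 1.0 - top_fraction)
    return (pgs >= cutoff).astype(int)

# Hypothetical scores for 100 individuals.
example_pgs = np.arange(100, dtype=float)
top_10 = high_risk_indicator(example_pgs, top_fraction=0.10)
top_20 = high_risk_indicator(example_pgs, top_fraction=0.20)
print(top_10.sum(), top_20.sum())
```

Dichotomising discards information relative to the continuous score, so the binary predictor is best treated as a complement to, not a replacement for, analyses of the continuous PGS.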

Furthermore, PGSs have been shown to be suitable for a number of scientific aims beyond the risk prediction including identification of shared aetiology among traits using such an analytical tool as GCTA (Genome-wide Complex Trait Analysis) [6], testing for genome-wide G*E and G*G interactions [7], Mendelian Randomisation to infer causal relationships, and for patient stratification and sub-phenotyping [5,8].

Thus, PGSs not only represent an individual genetic prediction of phenotypes but also open possibilities for interrogating a wide range of hypotheses via association testing.

Polygenic scores

In ELSA, the methods employed for creating PGSs are those outlined by the Health and Retirement Study (HRS) [10]. This was done in order to harmonise the research across age-related longitudinal studies by adopting a consistent methodology for creating PGSs. By making these PGSs publicly available, it is hoped that they will facilitate wide use among the ELSA data users.

PGSs for each phenotype are based on a single, replicated genome-wide association study (GWAS). These scores will be updated as sufficiently large GWAS are published for new phenotypes or as updated meta-analyses for existing phenotypes are released. Polygenic Score Data and a detailed report describing the methods employed and the full list of PGSs available in ELSA can be found on the UKDS website:

https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=8773

For any other questions related to PGSs in ELSA, please contact the ELSA Team.

Polygenic index (PGIs) as presented in Becker et al (2021) Nat Hum Behav

As a resource for researchers, Dr Aysu Okbay, Professor Daniel Benjamin, Professor David Cesarini, and Dr Patrick Turley, in collaboration with colleagues from 11 cohorts, used a consistent methodology to construct polygenic indexes (PGIs) for 35 phenotypes in 11 datasets. To maximize the PGIs’ prediction accuracies, PGIs were constructed using genome-wide association studies, some of which are novel, from multiple data sources, including 23andMe and UK Biobank. It is hoped that these PGIs would present a theoretical framework to help interpret analyses involving PGIs; thus, the PGIs for 35 phenotypes, excluding the data from the ELSA study, are made available along with a user guide.

These data contain: 36 single-trait PGIs, 35 multi-trait PGIs, and 20 PCs. Information on how these PGIs have been derived can be found here: Becker J et al. Resource profile and user guide of the Polygenic Index Repository. Nat Hum Behav. 2021 Dec;5(12):1744-1758. doi: 10.1038/s41562-021-01119-3. Epub 2021 Jun 17. PMID: 34140656; PMCID: PMC8678380. Please note that in this dataset, the PGI for educational attainment (single trait) is now based on EA4 (Okbay et al., 2022) rather than EA3 (Lee et al., 2018) and, following EA4, it was constructed using SBayesR instead of LDpred.

Polygenic index (PGIs) Datafile

References:

1 So, H. C. & Sham, P. C. Improving polygenic risk prediction from summary statistics by an empirical Bayes approach. Scientific reports 7, 41262, doi:10.1038/srep41262 (2017).

2 Purcell, S. M. et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748-752, doi:10.1038/nature08185 (2009).

3 Dudbridge, F. Power and predictive accuracy of polygenic risk scores. PLoS Genet 9, e1003348, doi:10.1371/journal.pgen.1003348 (2013).

4 Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904-909, doi:10.1038/ng1847 (2006).

5 Mavaddat, N. et al. Prediction of breast cancer risk based on profiling with common genetic variants. Journal of the National Cancer Institute 107, doi:10.1093/jnci/djv036 (2015).

6 Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. American journal of human genetics 88, 76-82, doi:10.1016/j.ajhg.2010.11.011 (2011).

7 Mullins, N. et al. Polygenic interactions with environmental adversity in the aetiology of major depressive disorder. Psychological medicine 46, 759-770, doi:10.1017/s0033291715002172 (2016).

8 Natarajan, P. et al. Polygenic Risk Score Identifies Subgroup With Higher Burden of Atherosclerosis and Greater Relative Benefit From Statin Therapy in the Primary Prevention Setting. Circulation 135, 2091-2101, doi:10.1161/circulationaha.116.024436 (2017).

9 Wray, N. R. et al. Research review: Polygenic methods and their application to psychiatric traits. Journal of child psychology and psychiatry, and allied disciplines 55, 1068-1087, doi:10.1111/jcpp.12295 (2014).

10 Ware, E. B. et al. Method of Construction Affects Polygenic Score Prediction of Common Human Trait. bioRxiv, 1-13 (2017).

Data available

For those researchers who wish to utilise cleaned genetic data in which genotypes that were not assayed have been estimated through imputation, we conducted imputation of the untyped genotypes using the Haplotype Reference Consortium (HRC.r1-1.GRCh37) as the reference panel and subsequently carried out robust quality control of the imputed data.

Access

Imputed data, either raw or quality controlled, are available directly from the ELSA team.

In detail

To estimate genotypes that were not assayed, imputation was performed on the Michigan Imputation Server [1], running SHAPEIT for pre-phasing [2] and Minimac3 for imputation [3,4], using the Haplotype Reference Consortium (HRC.r1-1.GRCh37) [1,5] as the reference panel. All variants were aligned to human genome build 19 (hg19).

After imputation, we required very high imputation quality (INFO>0.95) and low missingness (<1%) for further quality control. We limited our analyses to variants genotyped or imputed with HWE P-value>10⁻⁵. We further applied stringent pruning to remove markers in high linkage disequilibrium (r2>0.1) and excluded genomic regions of high linkage disequilibrium. In order to investigate population structure, we chose less correlated SNPs for principal components analysis.

The SNP pruning was performed following the procedure:

i) Consider a window of 50 SNPs

ii) Calculate linkage disequilibrium between each pair of SNPs in the window

iii) Remove one of a pair of SNPs if the linkage disequilibrium is greater than 0.5

iv) Shift the window 5 SNPs forward and

v) Repeat the procedure.
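The sliding-window procedure listed above can be sketched as follows, using the squared Pearson correlation between allele counts as the linkage disequilibrium measure. This is a simplified illustration and not the PLINK implementation: it assumes complete 0/1/2 genotypes for polymorphic SNPs, it always drops the later SNP of a correlated pair, and the function name and simulated data are hypothetical.

```python
import numpy as np

def ld_prune(genotypes, window=50, step=5, r2_threshold=0.5):
    """Return a boolean mask of SNPs retained after sliding-window pruning.

    genotypes: samples x SNPs array of 0/1/2 allele counts, no missing calls,
    with every SNP polymorphic (so that correlations are defined).
    """
    n_snps = genotypes.shape[1]
    keep = np.ones(n_snps, dtype=bool)
    for start in range(0, n_snps, step):               # iv) shift by `step`
        stop = min(start + window, n_snps)             # i) window of SNPs
        idx = [j for j in range(start, stop) if keep[j]]
        if len(idx) < 2:
            continue
        # ii) pairwise LD (r squared) within the window
        r2 = np.corrcoef(genotypes[:, idx], rowvar=False) ** 2
        for a in range(len(idx)):
            if not keep[idx[a]]:
                continue
            for b in range(a + 1, len(idx)):
                # iii) remove one SNP of any pair above the threshold
                if keep[idx[b]] and r2[a, b] > r2_threshold:
                    keep[idx[b]] = False
    return keep

# Simulated data (an assumption for illustration): 20 independent SNPs,
# with SNP 1 made a perfect copy of SNP 0.
rng = np.random.default_rng(0)
genotypes = rng.binomial(2, 0.5, size=(200, 20)).astype(float)
genotypes[:, 1] = genotypes[:, 0]

pruned_mask = ld_prune(genotypes)
print(pruned_mask.sum(), "of", pruned_mask.size, "SNPs retained")
```

The same function covers other parameter sets, for example `ld_prune(genotypes, window=10, step=5, r2_threshold=0.1)` for a ‘10 5 0.1’ style prune.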

Altogether, 1,083,252 autosomal SNPs remained after the SNP pruning and were used to run principal components analysis. The top 10 principal components were retained to account for any ancestry differences in genetic structures that could potentially bias the results [6].

After the sample quality control, 7,179,780 variants and 7,183 samples were kept.

More detail can be downloaded here

Genetic imputation using the reference from 1000 Genomes Project

Prior to the method described above, genetic imputation was carried out using the reference panel from the 1000 Genomes Project.

Download the report

References

1 Das, S. et al. Next-generation genotype imputation service and methods. Nat Genet 48, 1284-1287, doi:10.1038/ng.3656 (2016).

2 Delaneau, O., Zagury, J. F. & Marchini, J. Improved whole-chromosome phasing for disease and population genetic studies. Nature methods 10, 5-6, doi:10.1038/nmeth.2307 (2013).

3 Fuchsberger, C., Abecasis, G. R. & Hinds, D. A. minimac2: faster genotype imputation. Bioinformatics (Oxford, England) 31, 782-784, doi:10.1093/bioinformatics/btu704 (2015).

4 Howie, B., Fuchsberger, C., Stephens, M., Marchini, J. & Abecasis, G. R. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nature genetics 44, 955-959, doi:10.1038/ng.2354 (2012).

5 McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nature genetics 48, 1279-1283, doi:10.1038/ng.3643 (2016).

6 Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904-909, doi:10.1038/ng1847 (2006).

Professor James Banks

Professor David Batty

Kate Coughlin
