In ELSA, we provide genetic data to support detailed analyses of a wide range of age-related traits and outcomes, and of their association with genetic factors, in our large and phenotypically well-characterised sample of older people born in England.
Because some researchers may wish to use genetic data without having the tools or expertise needed for robust quality control or imputation of untyped genotypes, we have carried out these steps within the team and made the resulting data available for all to use.
Therefore, several genetic data products derived from our sample are available as detailed below.
If you have any questions about the ELSA genetic data products, contact the ELSA Team and we will be happy to help you.
Directly genotyped data
The genome-wide genotyping was performed at University College London (UCL) Genomics in 2013-2014.
This involved genotyping 7,597 ELSA participants of European ancestry using Illumina HumanOmni2.5 BeadChips (HumanOmni2.5-4v1, HumanOmni2.5-8v1.3), which measure ~2.5 million markers that capture genomic variation down to 2.5% minor allele frequency. Genotyping was performed in two batches.
Allele frequencies were compared between the batches after filtering at 5% missingness. The correlation between the batches was calculated for a number of chromosomes and exceeded 99%.
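The batch comparison can be illustrated with a minimal sketch. It assumes genotypes coded as 0/1/2 copies of one allele with None for a missing call, that the 5% missingness filter is applied per SNP within each batch, and that the correlation is a Pearson correlation of per-SNP allele frequencies. The function names and data layout are hypothetical and are not the ELSA team's actual pipeline.

```python
from math import sqrt


def missing_rate(genotypes):
    """Proportion of missing calls (None) for one SNP."""
    return sum(g is None for g in genotypes) / len(genotypes)


def allele_frequency(genotypes):
    """Frequency of the counted allele from 0/1/2 codes, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    return sum(called) / (2 * len(called))


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def batch_frequency_correlation(batch1, batch2, max_missing=0.05):
    """Correlate allele frequencies across batches.

    batch1, batch2: dicts mapping a SNP id to a list of genotype codes.
    SNPs exceeding max_missing in either batch are skipped (assumed filter).
    """
    freqs1, freqs2 = [], []
    for snp in sorted(batch1.keys() & batch2.keys()):
        g1, g2 = batch1[snp], batch2[snp]
        if missing_rate(g1) > max_missing or missing_rate(g2) > max_missing:
            continue
        freqs1.append(allele_frequency(g1))
        freqs2.append(allele_frequency(g2))
    return pearson(freqs1, freqs2)
```

A correlation close to 1 across SNPs indicates that the two batches give consistent allele frequency estimates.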
After post-genotyping quality assurance, such as excluding (self-reported) ethnic outliers and duplicates, GWAS data were available for a total of 7,412 ELSA participants and 2,230,767 SNPs.
For more information, contact the ELSA Team.
Quality controlled genetic data
Using methods employed in the Health and Retirement Study, we carried out robust quality control of the genotyped genetic data in ELSA.
Quality control was performed using PLINK, R and VCFtools. The full QC procedure is depicted in Figure 1.
Samples were removed based on call rate (<0.99), suspected non-European ancestry, autosomal heterozygosity deviation (retaining samples with |Fhet|<0.2), and relatedness. SNPs were excluded if the minor allele frequency (MAF) was <0.01%, if more than 2% of genotype data were missing, or if the Hardy-Weinberg equilibrium (HWE) P-value was <10⁻⁴.
Non-autosomal markers, including those on chromosome X, were also removed, as were indels.
In total, 7,183 samples (96.9% of the original 7,412) and 1,372,240 variants (61.5% of 2,230,767) remained after quality control.
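As an illustration only, the threshold-based filters described above can be written as two predicates. This is a sketch under stated assumptions: the per-sample and per-SNP statistics are taken as already computed (for example by PLINK), the MAF threshold is used as written in the text (0.01%, a proportion of 0.0001), and the ancestry and relatedness checks are not modelled. All function and parameter names are hypothetical.

```python
# Chromosome labels treated as autosomal (assumed string coding "1".."22").
AUTOSOMES = {str(i) for i in range(1, 23)}


def keep_sample(call_rate, f_het, min_call_rate=0.99, max_abs_fhet=0.2):
    """Keep a sample with adequate call rate and |Fhet| below the bound."""
    return call_rate >= min_call_rate and abs(f_het) < max_abs_fhet


def keep_snp(maf, missing_rate, hwe_p, chrom, is_indel,
             min_maf=0.0001, max_missing=0.02, min_hwe_p=1e-4):
    """Keep an autosomal, non-indel SNP that passes MAF, missingness and HWE."""
    return (
        chrom in AUTOSOMES
        and not is_indel
        and maf >= min_maf
        and missing_rate <= max_missing
        and hwe_p >= min_hwe_p
    )


def percent_retained(kept, total):
    """Percentage retained, rounded to one decimal place."""
    return round(100 * kept / total, 1)
```

Applying `percent_retained` to the counts reported above reproduces the stated figures: 7,183 of 7,412 samples is 96.9%, and 1,372,240 of 2,230,767 variants is 61.5%.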
Figure 1. QC steps undertaken as part of quality control in ELSA
To estimate genotypes that were not assayed, imputation was performed on the Michigan Imputation Server [5], running SHAPEIT for pre-phasing [6] and Minimac3 for imputation [7,8], using the Haplotype Reference Consortium (HRC.r1-1.GRCh37) [5,9] as the reference panel. All variants were aligned to human genome build 19 (hg19).
After imputation, we required very high imputation quality (INFO>0.95) and low missingness (<1%) for further quality control. We limited our analyses to variants, genotyped or imputed, with HWE P-value>10⁻⁵. To investigate population structure, we selected less correlated SNPs for principal components analysis: we applied stringent pruning to remove markers in high linkage disequilibrium (r²>0.1) and excluded genomic regions of high linkage disequilibrium.
SNP pruning was performed using the following procedure:
i) Consider a window of 50 SNPs
ii) Calculate linkage disequilibrium between each pair of SNPs in the window
iii) Remove one of a pair of SNPs if the linkage disequilibrium is greater than 0.5
iv) Shift the window 5 SNPs forward
v) Repeat the procedure.
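The windowed procedure above can be sketched in a few lines. This is an illustrative implementation, not the code used by the ELSA team: it assumes linkage disequilibrium is measured as the squared Pearson correlation of 0/1/2 genotype dosages with no missing values, and it always drops the later SNP of a correlated pair, a choice the text does not specify. The function names are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two SNPs' genotype dosages."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_x * var_y)


def ld_prune(genotypes, window=50, step=5, threshold=0.5):
    """Return indices of SNPs kept after sliding-window LD pruning.

    genotypes: list of per-SNP dosage lists, ordered by genomic position.
    """
    n_snps = len(genotypes)
    removed = set()
    start = 0
    while start < n_snps:
        # SNPs still present in the current window
        in_window = [i for i in range(start, min(start + window, n_snps))
                     if i not in removed]
        for pos, i in enumerate(in_window):
            if i in removed:
                continue
            for j in in_window[pos + 1:]:
                if j in removed:
                    continue
                if r_squared(genotypes[i], genotypes[j]) > threshold:
                    removed.add(j)  # assumed rule: drop the later SNP
        start += step  # shift the window forward
    return [i for i in range(n_snps) if i not in removed]
```

With the defaults, the sketch follows the window size (50), step (5) and threshold (0.5) listed in the steps above.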
Altogether, 1,083,252 autosomal SNPs remained after SNP pruning and were used to run principal components analysis. The top 10 principal components were retained to account for any ancestry differences in genetic structure that could potentially bias the results.
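To show what the principal components represent, the sketch below computes leading components of a small genotype matrix. The text does not state which software or algorithm was used, so this is an assumption-laden illustration: each SNP is standardised, a sample-by-sample relationship matrix is formed, and its leading eigenvectors are found by power iteration with deflation. All names are hypothetical, and real analyses would use dedicated tools.

```python
from math import sqrt


def standardise(genotypes):
    """Centre and scale each SNP; genotypes[s][m] is sample s at SNP m.

    Returns a list of SNP columns; monomorphic SNPs are dropped.
    """
    n, m = len(genotypes), len(genotypes[0])
    columns = []
    for j in range(m):
        col = [genotypes[i][j] for i in range(n)]
        mean = sum(col) / n
        sd = sqrt(sum((g - mean) ** 2 for g in col) / n)
        if sd > 0:
            columns.append([(g - mean) / sd for g in col])
    return columns


def relationship_matrix(columns, n):
    """Sample-by-sample matrix: average product of standardised genotypes."""
    m = len(columns)
    return [[sum(c[i] * c[j] for c in columns) / m for j in range(n)]
            for i in range(n)]


def top_components(matrix, k, iterations=500):
    """Leading k eigenvectors by power iteration with deflation."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    components = []
    for _ in range(k):
        v = [1.0 + 0.01 * i for i in range(n)]  # fixed, non-constant start
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        wv = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(a * b for a, b in zip(v, wv))
        components.append(v)
        for i in range(n):  # remove the component just found
            for j in range(n):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return components
```

Each returned vector holds one score per sample. Samples with similar ancestry receive similar scores, which is why the leading components can serve as covariates in downstream analyses.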
After the sample quality control, 7,179,780 variants and 7,183 samples were kept.
Genetic imputation using the reference from 1000 Genomes Project
Prior to the method described above, genetic imputation was carried out using the reference panel from the 1000 Genomes Project.
Polygenic Score Data (PGS)