Using methods employed in the Health and Retirement Study,(https://hrs.isr.umich.edu/data-products/genetic-data) we carried out robust quality control of the genotyped genetic data in ELSA.
In total, 7,183 samples (96.9% of 7,412 original cohort) and 1,372,240 (61.5% of 2,230,767) variants remained after quality control.
These data are intended for researchers who would like to use directly genotyped genome-wide data but do not have access to the tools or expertise needed to carry out the robust quality control that these types of studies require. We have carried out quality control and quality assurance of our GWAS data and made the results available to all ELSA data users.
Access
These data are accessed directly from the ELSA Team.(mailto:elsa@ucl.ac.uk)
In detail
Quality control was performed using PLINK [1], R and VCFtools [2]. The full QC procedure is depicted in Figure 1.
QC based on individual level. Samples for whom the recorded sex phenotype was inconsistent with genetic sex were removed. Duplicated samples and cryptic relatedness between each pair of participants were evaluated using pairwise genome-wide estimates of three coefficients corresponding to the probabilities that two individuals share 0, 1 or 2 alleles identical by descent [3]. There are two methods for estimating the identical by descent (IBD) probabilities: the method of moments and the method of maximum likelihood. Both methods have been shown to give very similar results [4]; thus, we report results from the method of moments implemented in PLINK 1.9 [5].
IBD was estimated using autosomal SNPs. An IBD value of 1 indicates duplicates or monozygotic twins, a value of 0.5 indicates first-degree relatives, and values of 0.25 and 0.125 indicate second-degree and third-degree relatives, respectively [6]. Owing to genotyping error, linkage disequilibrium (LD) and population structure, some variation around these theoretical values is expected.
Therefore, it is common practice to remove one individual from each pair with an IBD value of >0.2, which is close to the midpoint (0.1875) between the expected IBD for third- and second-degree relatives [7]. We identified pairs of individuals with an IBD value of >0.2 and excluded one of each pair at random.
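The relatedness filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the ELSA pipeline: it assumes the "IBD value" is the proportion of alleles shared IBD (reported by PLINK as PI_HAT, that is P(IBD=2) + 0.5 × P(IBD=1)), and the function names, sample IDs and example pairs are hypothetical.

```python
import random


def pi_hat(z0, z1, z2):
    """Proportion of alleles shared IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return z2 + 0.5 * z1


def samples_to_exclude(pairs, threshold=0.2, seed=0):
    """Pick one sample at random from every pair with pi_hat > threshold.

    pairs: iterable of (id1, id2, z0, z1, z2) tuples (hypothetical layout).
    Returns the set of sample IDs to drop.
    """
    rng = random.Random(seed)  # fixed seed so the choice is reproducible
    excluded = set()
    for id1, id2, z0, z1, z2 in pairs:
        if pi_hat(z0, z1, z2) <= threshold:
            continue  # pair is not closely related
        if id1 in excluded or id2 in excluded:
            continue  # pair already broken by an earlier exclusion
        excluded.add(rng.choice([id1, id2]))
    return excluded


# Made-up pairs with the theoretical IBD probabilities for each relationship.
pairs = [
    ("A", "B", 0.00, 0.00, 1.00),  # duplicates or monozygotic twins
    ("C", "D", 0.25, 0.50, 0.25),  # full siblings (first degree)
    ("E", "F", 0.75, 0.25, 0.00),  # first cousins (third degree)
]
to_drop = samples_to_exclude(pairs)
```

With these example values, one sample is dropped from each of the first two pairs, while the third pair (expected value 0.125) falls below the 0.2 cut-off and is kept intact.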
QC based on SNP level. Heterozygosity refers to carrying of two different alleles of a specific SNP. Excessive heterozygosity may imply a sample contamination, while less heterozygosity than expected may imply inbreeding [7].
In ELSA, the checks for heterozygosity were performed on a set of SNPs that were not highly correlated. To generate this list of SNPs, we excluded four regions that are known to contain clusters of highly correlated SNPs. These were the Lactase Gene (LCT) (chromosome 2, 12578740 to 135837195 bp), human leukocyte antigen (HLA) (chromosome 6, 2550000 to 3350000 bp) and two inversion regions located on 8p23.1 (chromosome 8, 81305000 to 1200000 bp) and 17q21.31 (chromosome 17, 40900000-45000000 bp) [8]. We then pruned the SNPs using the ‘10 5 0.1’ parameters.
These pruning parameters use a sliding-window method that considers blocks of 10 SNPs, removes SNPs with r² >0.10, and then shifts the window by 5 SNPs. Individuals with an extremely low or high heterozygosity score (>3 standard deviations from the mean) were removed. Further, genotyped data with a call rate of <98% were removed. SNPs on sex chromosomes and SNPs with a minor allele frequency (MAF) of <0.01 were excluded. SNPs whose genotype distributions deviated significantly from Hardy-Weinberg equilibrium (HWE) (p<10⁻⁴) and SNPs with missingness >0.02 were also removed.
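To make two of these thresholds concrete, the sketch below flags heterozygosity outliers (>3 standard deviations from the mean) and builds a SNP keep-mask from missingness and MAF. It is an illustration under stated assumptions rather than the procedure actually run: genotypes are assumed to be held in a NumPy array coded 0/1/2 with NaN for missing calls, a simple heterozygosity rate stands in for whatever score PLINK reported, the HWE test is omitted, and all function names are hypothetical.

```python
import numpy as np


def heterozygosity_outliers(G, n_sd=3.0):
    """Flag samples whose heterozygosity rate is > n_sd SDs from the mean.

    G: (samples x SNPs) array coded 0/1/2, with np.nan for missing calls.
    Returns a boolean array, True for outlying samples.
    """
    called = ~np.isnan(G)
    het_rate = (G == 1).sum(axis=1) / called.sum(axis=1)
    z = (het_rate - het_rate.mean()) / het_rate.std()
    return np.abs(z) > n_sd


def snp_keep_mask(G, max_missing=0.02, min_maf=0.01):
    """True for SNPs with missingness <= max_missing and MAF >= min_maf."""
    missing = np.isnan(G).mean(axis=0)
    alt_freq = np.nanmean(G, axis=0) / 2.0   # frequency of the counted allele
    maf = np.minimum(alt_freq, 1.0 - alt_freq)
    return (missing <= max_missing) & (maf >= min_maf)


# Simulated data: 200 samples, 500 SNPs, allele frequency 0.3.
rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(200, 500)).astype(float)
G[0, :] = 1.0  # mimic a contaminated sample: heterozygous at every SNP

flagged = heterozygosity_outliers(G)
keep = snp_keep_mask(G)
```

In this simulation only the artificially heterozygous first sample is flagged, and every SNP passes the missingness and MAF thresholds.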
Finally, to ensure a large overlap between the GWAS summary statistics (i.e., base file) and the ELSA (i.e., target) data, we converted all platform-specific IDs (i.e., kgps) to rsIDs. However, not all kgps could be updated; SNPs whose kgps could not be updated were removed.
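The ID update amounts to a lookup with removal of unmatched entries, which might look like the following sketch. The mapping table, the example IDs and the function name are made up for illustration, and the sketch assumes that variants already carrying an rsID are kept unchanged.

```python
def update_variant_ids(variant_ids, kgp_to_rsid):
    """Replace kgp IDs with rsIDs; drop variants with no known rsID.

    variant_ids: list of variant IDs as they appear on the array.
    kgp_to_rsid: dict mapping platform-specific IDs to rsIDs.
    Returns (updated_ids, dropped_ids).
    """
    updated, dropped = [], []
    for vid in variant_ids:
        if vid.startswith("rs"):
            updated.append(vid)               # already an rsID: keep as-is
        elif vid in kgp_to_rsid:
            updated.append(kgp_to_rsid[vid])  # successful update
        else:
            dropped.append(vid)               # no mapping: remove the SNP
    return updated, dropped


# Hypothetical IDs, for illustration only.
mapping = {"kgp0000001": "rs0000001"}
ids = ["rs0000002", "kgp0000001", "kgp0000003"]
updated, dropped = update_variant_ids(ids, mapping)
```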
Population structure. To investigate population structure, we used principal components analysis (PCA) [9] implemented in PLINK 1.9 [5]. We used the PCA approach with two aims: first, to identify those individuals who deviated from the ethnic population they self-reported to be (i.e., ethnic outliers), and second, to provide sample eigenvectors that can then be used to adjust for possible population stratification in association analyses [9,10].
It has been shown that in PCA, the usefulness of certain principal components (PCs) may be limited by clusters of highly correlated SNPs at specific locations in whole-genome arrays, such as the LCT, HLA, 8p23.1 and 17q21.31 regions [4,8]. To address this pitfall, the SNPs that were used in PCA were selected by LD pruning from an initial pool consisting of all autosomal SNPs with a missing call rate <5% and MAF >5%. In addition, the 2q21 (LCT), HLA, 8p23, and 17q21.31 regions were excluded from this initial pool.
The LD pruning process, using all unrelated ELSA participants, selected 147,070 SNPs with all pairs having r² <0.1 in a sliding 10 Mb window. PCs were obtained using PLINK software; we retained the top 10 PCs to account for any ancestry differences in genetic structures that could bias results [9]. Initially, we performed PCA on all study subjects; however, visual inspection of the distribution of the PCs highlighted the presence of ancestral admixture in 65 individuals. We removed these outliers and re-calculated the PCs using the updated sample.
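For readers unfamiliar with how PCs are derived from genotypes, the following NumPy sketch computes the top PCs from a standardised genotype matrix via singular value decomposition. It is a simplified stand-in for the PLINK computation, not a reproduction of it: it assumes a complete (no missing calls) matrix coded 0/1/2, uses a plain allele-frequency standardisation, and the function name and simulated populations are hypothetical.

```python
import numpy as np


def top_pcs(G, n_pcs=10):
    """Return (samples x n_pcs) principal component scores.

    G: (samples x SNPs) array coded 0/1/2 with no missing values.
    Each SNP is centred on 2p and scaled by sqrt(2p(1-p)), where p is the
    sample allele frequency; monomorphic SNPs are skipped.
    """
    p = G.mean(axis=0) / 2.0
    poly = (p > 0) & (p < 1)
    Z = (G[:, poly] - 2 * p[poly]) / np.sqrt(2 * p[poly] * (1 - p[poly]))
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :n_pcs] * S[:n_pcs]


# Simulate two populations of 40 samples whose allele frequencies differ.
rng = np.random.default_rng(1)
n_snps = 5000
freq_a = rng.uniform(0.1, 0.9, size=n_snps)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.1, size=n_snps), 0.05, 0.95)
G = np.vstack([
    rng.binomial(2, freq_a, size=(40, n_snps)),
    rng.binomial(2, freq_b, size=(40, n_snps)),
]).astype(float)

pcs = top_pcs(G, n_pcs=10)
pc1 = pcs[:, 0]  # in this simulation PC1 separates the two populations
```

Plotting the first two columns of `pcs` against each other is the kind of visual inspection in which ancestry outliers would appear as points away from the main cluster; the retained columns can then be entered as covariates in association models.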
Download the detailed report(https://www.ucl.ac.uk/epidemiology-health-care/sites/epidemiology_health_care/files/pgs_2020_qc.pdf)
Figure 1. QC steps that were undertaken as part of quality control in ELSA (Download the image)(https://www.ucl.ac.uk/epidemiology-health-care/sites/epidemiology_health_care/files/quality_control_in_elsa_figure1.pdf)
https://static.wixstatic.com/media/540eba_367b2ecf1f9842edb1936b35c0658851~mv2.png
References:
1 Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American journal of human genetics 81, 559-575, doi:10.1086/519795 (2007).
2 Danecek, P. et al. The variant call format and VCFtools. Bioinformatics (Oxford, England) 27, 2156-2158, doi:10.1093/bioinformatics/btr330 (2011).
3 Huff, C. D. et al. Maximum-likelihood estimation of recent shared ancestry (ERSA). Genome Res 21, 768-774, doi:10.1101/gr.115972.110 (2011).
4 Laurie, C. C. et al. Quality control and quality assurance in genotypic data for genome-wide association studies. Genet Epidemiol 34, 591-602, doi:10.1002/gepi.20516 (2010).
5 Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7, doi:10.1186/s13742-015-0047-8 (2015).
6 Anderson, C. A. et al. Data quality control in genetic case-control association studies. Nat Protoc 5, 1564-1573, doi:10.1038/nprot.2010.116 (2010).
7 Marees, A. T. et al. A tutorial on conducting genome-wide association studies: Quality control and statistical analysis. Int J Methods Psychiatr Res 27, e1608, doi:10.1002/mpr.1608 (2018).
8 Novembre, J. et al. Genes mirror geography within Europe. Nature 456, 98-101, doi:10.1038/nature07331 (2008).
9 Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904-909, doi:10.1038/ng1847 (2006).
10 Wang, D. et al. Comparison of methods for correcting population stratification in a genome-wide association study of rheumatoid arthritis: principal-component analysis versus multidimensional scaling. BMC Proc 3 Suppl 7, S109 (2009).
11 Kunkle, B. W. et al. Genetic meta-analysis of diagnosed Alzheimer's disease identifies new risk loci and implicates Abeta, tau, immunity and lipid processing. Nature genetics 51, 414-430, doi:10.1038/s41588-019-0358-2 (2019).