Scalable probabilistic PCA for large-scale genetic variation data


Authors: Aman Agrawal ^aff001; Alec M. Chiu ^aff002; Minh Le ^aff003; Eran Halperin ^aff002 ^aff003; Sriram Sankararaman ^aff002 ^aff003
Author affiliations: Department of Computer Science, Indian Institute of Technology, Delhi, India ^aff001; Bioinformatics Interdepartmental Program, University of California, Los Angeles, California, United States of America ^aff002; Department of Computer Science, University of California, Los Angeles, California, United States of America ^aff003; Department of Human Genetics, University of California, Los Angeles, California, United States of America ^aff004; Department of Anesthesiology and Perioperative Medicine, University of California, Los Angeles, California, United States of America ^aff005; Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, California, United States of America ^aff006; Institute of Precision Health, University of California, Los Angeles, California, United States of America ^aff007
Published in: Scalable probabilistic PCA for large-scale genetic variation data. PLoS Genet 16(5): e1008773. doi:10.1371/journal.pgen.1008773
Category: Research Article
doi: https://doi.org/10.1371/journal.pgen.1008773

Abstract

Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in about thirty minutes. To illustrate the utility of computing PCs in large samples, we leveraged the population structure inferred by ProPCA within White British individuals in the UK Biobank to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4.
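To illustrate how a probabilistic generative model can yield principal components, the following is a minimal sketch of the standard EM iteration for PCA in the zero-noise limit. It is not ProPCA's implementation and omits whatever makes ProPCA scalable. The function name `em_pca`, its parameters, and the dense NumPy representation are assumptions made for illustration only.

```python
import numpy as np

def em_pca(Y, k, n_iter=100, seed=0):
    """Illustrative EM iteration for PCA (zero-noise limit of a latent-variable model).

    Y      : (n_features, n_samples) array whose rows have already been mean-centered.
    k      : number of principal components to recover.
    Returns an orthonormal basis (n_features, k) spanning the estimated top-k subspace.
    """
    rng = np.random.default_rng(seed)
    # Random starting point for the loading matrix C (hypothetical initialization).
    C = rng.standard_normal((Y.shape[0], k))
    for _ in range(n_iter):
        # E-step: latent coordinates X = (C^T C)^{-1} C^T Y
        X = np.linalg.solve(C.T @ C, C.T @ Y)
        # M-step: loadings C = Y X^T (X X^T)^{-1}
        C = np.linalg.solve(X @ X.T, X @ Y.T).T
    # The columns of C span the principal subspace; orthonormalize them.
    Q, _ = np.linalg.qr(C)
    return Q
```

Each iteration needs only products of the data matrix with thin matrices of width k, which is why iterations of this kind are attractive when the data matrix is large. The returned basis spans the same subspace as the top k left singular vectors of Y; rotating it into individual PCs would require one further small decomposition, which is not shown here.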

Keywords:

Algorithms – Genome-wide association studies – Genomic signal processing – Genomics statistics – Molecular genetics – Principal component analysis – Singular value decomposition – Variant genotypes
