Scalable probabilistic PCA for large-scale genetic variation data

Autoři: Aman Agrawal aff001;  Alec M. Chiu aff002;  Minh Le aff003;  Eran Halperin aff002;  Sriram Sankararaman aff002;  Eran Halperin aff003;  Sriram Sankararaman aff003
Působiště autorů: Department of Computer Science, Indian Institute of Technology, Delhi, India aff001;  Bioinformatics Interdepartmental Program, University of California, Los Angeles, California United States of America aff002;  Bioinformatics Interdepartmental Program, University of California, Los Angeles, California, United States of America aff002;  Department of Computer Science, University of California, Los Angeles, California United States of America aff003;  Department of Computer Science, University of California, Los Angeles, California, United States of America aff003;  Department of Human Genetics, University of California, Los Angeles, California United States of America aff004;  Department of Human Genetics, University of California, Los Angeles, California, United States of America aff004;  Department of Anesthesiology and Perioperative Medicine, University of California, Los Angeles, California United States of America aff005;  Department of Anesthesiology and Perioperative Medicine, University of California, Los Angeles, California, United States of America aff005;  Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, California United States of America aff006;  Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, California, United States of America aff006;  Institute of Precision Health, University of California, Los Angeles, California, United States of America aff007
Vyšlo v časopise: Scalable probabilistic PCA for large-scale genetic variation data. PLoS Genet 16(5): e32767. doi:10.1371/journal.pgen.1008773
Kategorie: Research Article
doi: 10.1371/journal.pgen.1008773


Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in about thirty minutes. To illustrate the utility of computing PCs in large samples, we leveraged the population structure inferred by ProPCA within White British individuals in the UK Biobank to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4.

Klíčová slova:

Algorithms – Genome-wide association studies – Genomic signal processing – Genomics statistics – Molecular genetics – Principal component analysis – Singular value decomposition – Variant genotypes


1. Novembre J, Ramachandran S. Perspectives on human population structure at the cusp of the sequencing era. Annual review of genomics and human genetics. 2011;12:245–274. 21801023

2. Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko A, Auton A, et al. Genes mirror geography within Europe. Nature. 2008;456(7219):274.

3. Yang WY, Novembre J, Eskin E, Halperin E. A model-based approach for analysis of spatial structure in genetic data. Nature genetics. 2012;44(6):725–731. doi: 10.1038/ng.2285 22610118

4. Baran Y, Quintela I, Carracedo Á, Pasaniuc B, Halperin E. Enhanced localization of genetic samples through linkage-disequilibrium correction. The American Journal of Human Genetics. 2013;92(6):882–894. doi: 10.1016/j.ajhg.2013.04.023 23726367

5. Price AL, Zaitlen NA, Reich D, Patterson N. New approaches to population stratification in genome-wide association studies. Nature reviews Genetics. 2010;11(7):459. doi: 10.1038/nrg2813 20548291

6. Patterson N, Price AL, Reich D. Population Structure and Eigenanalysis. PLoS Genetics. 2006;2(12):e190+. doi: 10.1371/journal.pgen.0020190 17194218

7. Hanis CL, Chakraborty R, Ferrell RE, Schull WJ. Individual admixture estimates: disease associations and individual risk of diabetes and gallbladder disease among Mexican-Americans in Starr County, Texas. Am J Phys Anthropol. 1986;70(4):433–441. 3766713

8. Pritchard J, Stephens M, Donnelly P. Inference of Population Structure Using Multilocus Genotype Data. Genetics. 2000;155:945–959. 10835412

9. Chen C, Durand E, Forbes F, François O. Bayesian clustering algorithms ascertaining spatial population structure: a new computer program and a comparison study. Molecular Ecology Resources. 2007;7(5):747–756.

10. Engelhardt BE, Stephens M. Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis. PLoS genetics. 2010;6(9):e1001117. doi: 10.1371/journal.pgen.1001117 20862358

11. Jolliffe IT. Principal Component Analysis and Factor Analysis. In: Principal component analysis. Springer; 1986. p. 115–128.

12. Galinsky KJ, Bhatia G, Loh PR, Georgiev S, Mukherjee S, Patterson NJ, et al. Fast principal-component analysis reveals convergent evolution of ADH1B in Europe and East Asia. The American Journal of Human Genetics. 2016;98(3):456–472. doi: 10.1016/j.ajhg.2015.12.022 26924531

13. Abraham G, Qiu Y, Inouye M. FlashPCA2: principal component analysis of biobank-scale genotype datasets. Bioinformatics. 2017;. 28475694

14. Prive F, Aschard H, Ziyatdinov A, Blum M. Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr. Bioinformatics. 2018;34(16):2781–2787. doi: 10.1093/bioinformatics/bty185 29617937

15. Bose A, Kalantzis V, Kontopoulou EM, Elkady M, Paschou P, Drineas P. TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes. Bioinformatics. 2019;35(19):3679–3683. 30957838

16. Chang C, Chow C, Tellier L, Vattikuti S, Purcell S, Lee J. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7. doi: 10.1186/s13742-015-0047-8 25722852

17. Price A, Patterson N, Plenge R, Weinblatt M, Shadick N, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics. 2006;38:904–909. 16862161

18. Canela-Xandri O, Law A, Gray A, Woolliams JA, Tenesa A. A new tool called DISSECT for analysing large genomic data sets using a Big Data approach. Nature communications. 2015;6:10162. doi: 10.1038/ncomms10162 26657010

19. Roweis ST. EM algorithms for PCA and SPCA. In: Advances in neural information processing systems; 1998. p. 626–632.

20. Tipping ME, Bishop CM. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 1999;61(3):611–622.

21. Liberty E, Zucker SW. The mailman algorithm: A note on matrix–vector multiplication. Information Processing Letters. 2009;109(3):179–182.

22. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z 30305743

23. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics. 2007;81(3):559–575. doi: 10.1086/519795 17701901

24. The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56–65.

25. Shriver MD, Kennedy GC, Parra EJ, Lawson HA, Sonpar V, Huang J, et al. The genomic distribution of population substructure in four populations using 8,525 autosomal SNPs. Human genomics. 2004;1(4):274. doi: 10.1186/1479-7364-1-4-274 15588487

26. Tian C, Plenge RM, Ransom M, Lee A, Villoslada P, Selmi C, et al. Analysis and application of European genetic substructure using 300 K SNP information. PLoS genetics. 2008;4(1):e4. doi: 10.1371/journal.pgen.0040004 18208329

27. Wiegering A, Ruther U, Gerhardt C. The ciliary protein Rpgrip1l in development and disease. Dev Biol. 2018;442(1):60–68. 30075108

28. Delous M, Baala L, Salomon R, Laclef C, Vierkotten J, Tory K, et al. The ciliary gene RPGRIP1L is mutated in cerebello-oculo-renal syndrome (Joubert syndrome type B) and Meckel syndrome. Nature Genetics. 2007;39:875–881. 17558409

29. Devuyst O, Arnould VJ. Mutations in RPGRIP1L: extending the clinical spectrum of ciliopathies. Nephrology Dialysis Transplantation. 2008;23(5):1500–1503.

30. Khanna H, Davis EE, Murga-Zamalloa CA, Estrada-Cuzcano A, Lopez I, den Hollander AI, et al. A common allele in RPGRIP1L is a modifier of retinal degeneration in ciliopathies. Nature Genetics. 2009;41(6):739–45. doi: 10.1038/ng.366 19430481

31. Aschard H, Vilhjálmsson B, Greliche N, Morange P, Trégouët D, Kraft P. Maximizing the power of principal-component analysis of correlated phenotypes in genome-wide association studies. AJHG. 2014;94(5):662–76.

32. Korneev K, Atretkhany K, Drutskaya M, Grivennikov S, Kuprash D, Nedospasov S. TLR-signaling and proinflammatory cytokines as drivers of tumorigenesis. Cytokine. 2017;89:127–135. 26854213

33. Mockenhaupt F, Cramer J, Hamann L, Stegemann M, Eckert J, Oh N, et al. Toll-like receptor (TLR) polymorphisms in African children: Common TLR-4 variants predispose to severe malaria. PNAS. 2006;103(1):177–182. doi: 10.1073/pnas.0506803102 16371473

34. Van der Graaf C, Netea M, Morré S, Den Heijer M, Verweij P, Van der Meer J, et al. Toll-like receptor 4 Asp299Gly/Thr399Ile polymorphisms are a risk factor for Candida bloodstream infection. European Cytokine Network. 2006;17(1):29–34. 16613760

35. Field Y, Boyle EA, Telis N, Gao Z, Gaulton KJ, Golan D, et al. Detection of human adaptation during the past 2000 years. Science. 2016;354(6313):760–764. doi: 10.1126/science.aag0776 27738015

36. Albers, McVean. Dating genomic variants and shared ancestry in population-scale sequencing data. bioRxiv. 2019.

37. Wu Y, Sankararaman S. A scalable estimator of SNP heritability for biobank-scale data. Bioinformatics. 2018;34(13):i187–i194. doi: 10.1093/bioinformatics/bty253 29950019

38. Halko N, Martinsson PG, Tropp JA. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review. 2011;53(2):217–288.

39. Mathieson I, McVean G. Differential confounding of rare and common variants in spatially structured populations. Nature genetics. 2012;44(3):243. doi: 10.1038/ng.1074 22306651

40. Hellenthal G, Auton A, Falush D. Inferring Human Colonization History Using a Copying Model. PLoS Genet. 2008;4(5):e1000078. doi: 10.1371/journal.pgen.1000078 18497854

41. Li N, Stephens M. Modeling Linkage Disequilibrium and Identifying Recombination Hotspots Using Single-Nucleotide Polymorphism Data. Genetics. 2003;165(4):2213–2233. 14704198

42. Wen X, Stephens M. Using linear predictors to impute allele frequencies from summary or pooled genotype data. The annals of applied statistics. 2010;4(3):1158. 21479081

43. Schein AI, Saul LK, Ungar LH. A generalized linear model for principal component analysis of binary data. In: AISTATS. vol. 3; 2003. p. 10.

44. Li W, Cerise J, Yang Y, Han H. Application of t-SNE to human genetic data. J Bioinform Comput Biol. 2017;15(4):1750017. 28718343

45. Becht E, McInnes L, Healy J, Dutertre C, Kwok I, Ng L, et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol. 2019;37:38–44.

46. Anderson TW, Rubin H. Statistical inference in factor analysis. In: Proceedings of the third Berkeley symposium on mathematical statistics and probability. vol. 5; 1956. p. 111–150.

47. Szlam A, Tulloch A, Tygert M. Accurate Low-Rank Approximations Via a Few Iterations of Alternating Least Squares. SIAM Journal on Matrix Analysis and Applications. 2017;38(2):425–433.

48. Lehoucq RB, Sorensen DC. Deflation techniques for an implicitly restarted Arnoldi iteration. SIAM Journal on Matrix Analysis and Applications. 1996;17(4):789–821.

49. Manichaikul A, Mychaleckyj J, Rich S, Daly K, Sale M, Chen W. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010;26(22):2867–2873. doi: 10.1093/bioinformatics/btq559 20926424

Genetika Reprodukční medicína

Článek vyšel v časopise

PLOS Genetics

2020 Číslo 5

Nejčtenější v tomto čísle

Tomuto tématu se dále věnují…


Zvyšte si kvalifikaci online z pohodlí domova

Kurz originály vs. generika
nový kurz

Klinická farmakokinetika betablokátorů

Současné možnosti terapie osteoartrózy
Autoři: MUDr. Jakub Holešovský

Preferovaná úlevová léčba Asthma Bronchiale
Autoři: PharmDr. Petr Sedlák

Inhibitory karboanhydrázy v léčbě glaukomu
Autoři: MUDr. Petr Výborný, CSc., FEBO

Všechny kurzy
Zapomenuté heslo

Nemáte účet?  Registrujte se

Zapomenuté heslo

Zadejte e-mailovou adresu se kterou jste vytvářel(a) účet, budou Vám na ni zaslány informace k nastavení nového hesla.


Nemáte účet?  Registrujte se