Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models

Authors: Sahir R. Bhatnagar aff001; Yi Yang aff003; Tianyuan Lu aff004; Erwin Schurr aff006; JC Loredo-Osti aff007; Marie Forest aff008; Karim Oualkacha aff009; Celia M. T. Greenwood aff001
Author affiliations: Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montréal, Québec, Canada aff001; Department of Diagnostic Radiology, McGill University, Montréal, Québec, Canada aff002; Department of Mathematics and Statistics, McGill University, Montréal, Québec, Canada aff003; Quantitative Life Sciences, McGill University, Montréal, Québec, Canada aff004; Lady Davis Institute, Jewish General Hospital, Montréal, Québec, Canada aff005; Department of Medicine, McGill University, Montréal, Québec, Canada aff006; Department of Mathematics and Statistics, Memorial University, St. John’s, Newfoundland and Labrador, Canada aff007; École de Technologie Supérieure, Montréal, Québec, Canada aff008; Département de Mathématiques, Université du Québec à Montréal, Montréal, Québec, Canada aff009; Gerald Bronfman Department of Oncology, McGill University, Montréal, Québec, Canada aff010; Department of Human Genetics, McGill University, Montréal, Québec, Canada aff011
Published in: Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models. PLoS Genet 16(5): e1008766. doi:10.1371/journal.pgen.1008766
Category: Research Article


Complex traits are known to be influenced by a combination of environmental factors and rare and common genetic variants. However, detection of such multivariate associations can be compromised by low statistical power and confounding by population structure. Linear mixed effects models (LMMs) can account for correlations due to relatedness but have not been applicable in high-dimensional (HD) settings where the number of fixed effect predictors greatly exceeds the number of samples. False positives or false negatives can result from two-stage approaches, where the residuals estimated from a null model adjusted for the subjects’ relationship structure are subsequently used as the response in a standard penalized regression model. To overcome these challenges, we develop a general penalized LMM with a single random effect, called ggmix, for simultaneous SNP selection and adjustment for population structure in high-dimensional prediction models. We develop a blockwise coordinate descent algorithm with automatic tuning parameter selection which is highly scalable, computationally efficient and has theoretical guarantees of convergence. Through simulations and three real data examples, we show that ggmix leads to more parsimonious models with better prediction accuracy than either the two-stage approach or principal component adjustment. Our method performs well even in the presence of highly correlated markers, and when the causal SNPs are included in the kinship matrix. ggmix can be used to construct polygenic risk scores and select instrumental variables in Mendelian randomization studies. Our algorithms are implemented in an R package available on CRAN (
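To make the idea of a penalized LMM with a single random effect more concrete, the following is a minimal Python sketch and not the ggmix implementation. It assumes a covariance of the form sigma^2 (eta * Phi + (1 - eta) * I), where Phi is a kinship matrix and eta is a variance ratio; this parameterisation, the function names (`rotate`, `weighted_lasso_cd`), the toy data and the fixed value of `eta` are all assumptions made for illustration. Only the fixed-effects step is shown: the data are rotated by the eigenvectors of the kinship matrix, after which a weighted lasso is solved by cyclic coordinate descent. Updating the variance parameters and choosing the tuning parameter automatically are omitted.

```python
import numpy as np

def soft_threshold(z, t):
    """Scalar soft-thresholding operator used by the lasso update."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def rotate(X, y, kinship, eta):
    """Rotate X and y by the eigenvectors of the kinship matrix.

    Assumed covariance (up to sigma^2): eta * kinship + (1 - eta) * I.
    After rotation the observations are uncorrelated with variances
    1 + eta * (d_i - 1), so the weights are the reciprocals.
    """
    d, U = np.linalg.eigh(kinship)
    w = 1.0 / (1.0 + eta * (d - 1.0))
    return U.T @ X, U.T @ y, w

def weighted_lasso_cd(X, y, w, lam, n_iter=200, tol=1e-8):
    """Minimise 0.5 * sum_i w_i (y_i - x_i' beta)^2 + lam * ||beta||_1
    by cyclic coordinate descent (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                               # current residual y - X beta
    denom = (w[:, None] * X ** 2).sum(axis=0)  # weighted column norms
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if denom[j] == 0.0:
                continue
            # weighted inner product with the partial residual
            rho = np.dot(w * X[:, j], r) + denom[j] * beta[j]
            new = soft_threshold(rho, lam) / denom[j]
            delta = new - beta[j]
            if delta != 0.0:
                r -= delta * X[:, j]
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

# Hypothetical toy data: two subpopulations with different allele frequencies.
rng = np.random.default_rng(0)
n, p = 60, 200
pop = np.repeat([0, 1], n // 2)
freq = np.where(pop[:, None] == 0, 0.3, 0.6) * np.ones((n, p))
G = rng.binomial(2, freq).astype(float)

sd = G.std(axis=0)
sd[sd == 0.0] = 1.0                    # guard against monomorphic columns
Xs = (G - G.mean(axis=0)) / sd         # centred and scaled genotypes
kinship = Xs @ Xs.T / p                # simple empirical kinship estimate

# Three causal SNPs plus a subpopulation shift in the trait mean.
y = Xs[:, :3] @ np.array([1.0, -1.0, 0.5]) + 0.5 * pop
y = y + rng.normal(scale=0.5, size=n)
y = y - y.mean()

X_rot, y_rot, w = rotate(Xs, y, kinship, eta=0.5)   # eta fixed by assumption
lam = 0.3 * np.max(np.abs(X_rot.T @ (w * y_rot)))   # fraction of the all-zero threshold
beta_hat = weighted_lasso_cd(X_rot, y_rot, w, lam)
print("selected SNP indices:", np.flatnonzero(beta_hat))
```

In a full blockwise scheme this coefficient update would alternate with updates of the variance parameters, which change the weights `w`; here the weights are computed once from the assumed `eta`. The contrast with the two-stage approach is that the penalty and the relatedness adjustment act on the same objective rather than on residuals from a separately fitted null model.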

Keywords:

Algorithms – Covariance – Genetic loci – Genome-wide association studies – Mathematical models – Molecular genetics – Simulation and modeling – Variant genotypes


Published in PLOS Genetics, 2020, Issue 5