Accounting for long-range correlations in genome-wide simulations of large cohorts

Authors: Dominic Nelson aff001;  Jerome Kelleher aff002;  Aaron P. Ragsdale aff001;  Claudia Moreau aff003;  Gil McVean aff002;  Simon Gravel aff001
Author affiliations: McGill University and Genome Québec Innovation Centre, McGill University, Montréal, Québec, Canada aff001;  Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, United Kingdom aff002;  Centre Intersectoriel en Santé Durable, Université du Québec à Chicoutimi, Saguenay, Québec, Canada aff003
Published in: Accounting for long-range correlations in genome-wide simulations of large cohorts. PLoS Genet 16(5): e1008619. doi:10.1371/journal.pgen.1008619
Category: Research Article
doi: 10.1371/journal.pgen.1008619


Coalescent simulations are widely used to examine the effects of evolution and demographic history on the genetic makeup of populations. Thanks to recent progress in algorithms and data structures, simulators such as msprime now provide genome-wide simulations for millions of individuals. However, this software relies on classic coalescent theory and its assumptions that sample sizes are small and that the region being simulated is short. Here we show that coalescent simulations of long regions of the genome exhibit large biases in identity-by-descent (IBD), long-range linkage disequilibrium (LD), and ancestry patterns, particularly when the sample size is large. We present a Wright-Fisher extension to msprime, and show that it produces more realistic distributions of IBD, LD, and ancestry proportions, while also addressing more subtle biases of the coalescent. Further, these extensions are more computationally efficient than state-of-the-art coalescent simulations when simulating long regions, including whole-genome data. For shorter regions, efficiency can be maintained via a hybrid model that simulates the recent past under the Wright-Fisher model and switches to coalescent simulation for the distant past.
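
The hybrid idea described above can be illustrated with a toy sketch. The code below is not msprime's implementation and does not reproduce the paper's method: it tracks only the number of ancestral lineages at a single non-recombining locus, so it cannot show the IBD, LD, or ancestry effects discussed in the abstract. The function name `hybrid_tmrca` and its parameters are hypothetical, chosen for illustration. It assumes a constant diploid population of `pop_size` individuals (so `2 * pop_size` gene copies per generation), runs a discrete-time Wright-Fisher step for the first `wf_generations` generations, and then switches to the standard coalescent.

```python
import random


def hybrid_tmrca(n_lineages, pop_size, wf_generations, rng):
    """Time to the most recent common ancestor, in generations, for
    n_lineages sampled at one non-recombining locus (toy model)."""
    n_copies = 2 * pop_size  # assumption: diploid, constant population size
    k = n_lineages
    t = 0.0

    # Phase 1 (recent past): discrete-time Wright-Fisher.
    # Every lineage picks a parental gene copy uniformly at random;
    # lineages that pick the same copy merge, so several mergers
    # (and mergers of more than two lineages) can occur per generation.
    while k > 1 and t < wf_generations:
        k = len({rng.randrange(n_copies) for _ in range(k)})
        t += 1

    # Phase 2 (distant past): standard coalescent.
    # With k lineages, the waiting time to the next pairwise merger is
    # exponential with rate C(k, 2) / (2N) per generation.
    while k > 1:
        rate = k * (k - 1) / 2 / n_copies
        t += rng.expovariate(rate)
        k -= 1

    return t


if __name__ == "__main__":
    rng = random.Random(42)
    times = [hybrid_tmrca(10, 100, 50, rng) for _ in range(1000)]
    print(sum(times) / len(times))
```

Setting `wf_generations` to 0 gives a pure coalescent run, and a very large value gives a pure Wright-Fisher run, which makes it easy to compare the two regimes for different sample sizes in this simplified setting.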

Keywords:

DNA recombination – Effective population size – Genetic polymorphism – Genome evolution – Linkage disequilibrium – Population genetics – Population size – Simulation and modeling



This article appeared in PLOS Genetics, 2020, Issue 5.