Genotype imputation using the Positional Burrows Wheeler Transform
Authors:
Simone Rubinacci aff001; Olivier Delaneau aff001; Jonathan Marchini aff003
Author affiliations:
aff001: Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
aff002: Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland
aff003: Regeneron Genetics Center, Tarrytown, New York, USA
Published in:
Genotype imputation using the Positional Burrows Wheeler Transform. PLoS Genet 16(11): e1009049. doi:10.1371/journal.pgen.1009049
Category:
Research Article
doi:
https://doi.org/10.1371/journal.pgen.1009049
Abstract
Genotype imputation is the process of predicting unobserved genotypes in a sample of individuals using a reference panel of haplotypes. Over the last 10 years, reference panels have grown more than 100-fold in size. Larger reference panels improve imputation accuracy at markers with low minor allele frequencies but pose ever-increasing computational challenges for imputation methods. Here we present IMPUTE5, a genotype imputation method that can scale to reference panels with millions of samples. The method builds on an observation made in IMPUTE2: accuracy is optimized by using a custom subset of haplotypes when imputing each individual. IMPUTE5 achieves fast, accurate, and memory-efficient imputation by selecting haplotypes using the Positional Burrows Wheeler Transform (PBWT). By using the PBWT data structure at genotyped markers, IMPUTE5 identifies locally best-matching haplotypes and long identical-by-state segments. The method then uses the selected haplotypes as conditioning states within the IMPUTE model. Using the HRC reference panel, which has ∼65,000 haplotypes, we show that IMPUTE5 is up to 30x faster than MINIMAC4 and up to 3x faster than BEAGLE5.1, and uses less memory than both of these methods. Using simulated reference panels, we show that IMPUTE5 scales sub-linearly with reference panel size. For example, keeping the number of imputed markers constant, increasing the reference panel size from 10,000 to 1 million haplotypes requires less than twice the computation time. As the reference panel grows, IMPUTE5 is able to use a smaller number of reference haplotypes, thus reducing computational cost.
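To make the idea of PBWT-based haplotype selection more concrete, the following is a minimal Python sketch, not the IMPUTE5 implementation. It assumes biallelic markers coded 0/1 and a single phased target haplotype. The function names (pbwt_orderings, select_states) and the n_neighbors parameter are hypothetical and chosen only for illustration. The sketch sorts the reference haplotypes together with the target by their reversed prefixes at each marker, using the standard stable-partition update of the PBWT, and collects the reference haplotypes that sit next to the target in that ordering. It does not track match lengths, so it illustrates only the "locally best matching" part of the selection.

```python
def pbwt_orderings(haps):
    """Yield, for each marker k, haplotype indices sorted by reversed prefix up to and including k."""
    order = list(range(len(haps)))
    for k in range(len(haps[0])):
        # A stable partition by the allele at marker k keeps the previous ordering
        # within each allele class, which is the PBWT positional prefix array update.
        zeros = [i for i in order if haps[i][k] == 0]
        ones = [i for i in order if haps[i][k] == 1]
        order = zeros + ones
        yield order


def select_states(ref_haps, target, n_neighbors=1):
    """Return reference haplotype indices adjacent to the target in the PBWT order at any marker."""
    haps = ref_haps + [target]      # assumption: the target is sorted alongside the panel
    t = len(ref_haps)               # index of the target in the combined list
    selected = set()
    for order in pbwt_orderings(haps):
        pos = order.index(t)
        lo = max(0, pos - n_neighbors)
        hi = min(len(order), pos + n_neighbors + 1)
        selected.update(i for i in order[lo:hi] if i != t)
    return sorted(selected)


# Toy panel of four reference haplotypes over six markers (made-up data).
ref = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
]
target = [0, 0, 0, 1, 1, 1]
states = select_states(ref, target, n_neighbors=1)
print(states)
```

In this toy example the selected set includes haplotype 2, which is identical to the target. Only the selected subset, rather than the full panel, would then be passed on as conditioning states, which is why the cost of the downstream model need not grow in proportion to the panel size.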
Keywords:
Algorithms – Consortia – Gene mapping – Genome-wide association studies – Genomics – Genotyping – Haplotypes – Hidden Markov models
Article published in the journal
PLOS Genetics
2020 Issue 11