Stability of SARS-CoV-2 phylogenies

Authors: Yatish Turakhia aff001; Nicola De Maio aff003; Bryan Thornlow aff001; Landen Gozashti aff001; Robert Lanfear aff005; Conor R. Walker aff003; Angie S. Hinrichs aff002; Jason D. Fernandes aff001; Rui Borges aff008; Greg Slodkowicz aff009; Lukas Weilguny aff003; David Haussler aff001; Nick Goldman aff003; Russell Corbett-Detig aff001
Author affiliations: Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, United States of America aff001; Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, United States of America aff002; European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, United Kingdom aff003; Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, United States of America aff004; Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, ACT, Australia aff005; Department of Genetics, University of Cambridge, Cambridge, United Kingdom aff006; Howard Hughes Medical Institute, University of California, Santa Cruz, CA, United States of America aff007; Institut für Populationsgenetik, Vetmeduni Vienna, Wien, Austria aff008; MRC Laboratory of Molecular Biology, Cambridge, United Kingdom aff009
Published in: Stability of SARS-CoV-2 phylogenies. PLoS Genet 16(11): e1009175. doi:10.1371/journal.pgen.1009175
Category: Research Article
doi: 10.1371/journal.pgen.1009175


The SARS-CoV-2 pandemic has led to unprecedented, nearly real-time genetic tracing due to the rapid community sequencing response. Researchers immediately leveraged these data to infer the evolutionary relationships among viral samples and to study key biological questions, including whether host viral genome editing and recombination are features of SARS-CoV-2 evolution. This global sequencing effort is inherently decentralized and must rely on data collected by many labs using a wide variety of molecular and bioinformatic techniques. There is thus a strong possibility that systematic errors associated with lab- or protocol-specific practices affect some sequences in the repositories. We find that some recurrent mutations in reported SARS-CoV-2 genome sequences have been observed predominantly or exclusively by single labs, co-localize with commonly used primer binding sites, and are more likely to affect the protein-coding sequences than other similarly recurrent mutations. We show that their inclusion can affect phylogenetic inference on scales relevant to local lineage tracing and can make it appear as though there has been an excess of recurrent mutation or recombination among viral lineages. We suggest how samples can be screened and problematic variants removed, and we plan to regularly inform the scientific community with our updated results as more SARS-CoV-2 genome sequences are shared. We also develop tools for comparing and visualizing differences among very large phylogenies, and we show that consistent clade- and tree-based comparisons can be made between phylogenies produced by different groups. These will facilitate evolutionary inferences and comparisons among phylogenies produced for a wide array of purposes.
Building on the SARS-CoV-2 Genome Browser at UCSC, we present a toolkit to compare, analyze, and combine SARS-CoV-2 phylogenies, to find and remove potential sequencing errors, and to establish a widely shared, stable clade structure for more accurate scientific inference and discourse.
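
The observation that some recurrent mutations are reported predominantly or exclusively by single labs can be illustrated with a small screening sketch. This is not the authors' pipeline, which works on phylogenies. It is a simplified per-sample tally, and the function name `flag_lab_associated`, the thresholds `min_count` and `min_fraction`, and the toy mutation and lab labels are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def flag_lab_associated(observations, min_count=3, min_fraction=0.8):
    """Return {mutation: (top_lab, fraction)} for mutations dominated by one lab.

    observations: iterable of (mutation, lab) pairs, one pair per sample
    that carries the mutation. Thresholds are illustrative assumptions.
    """
    # Count, for every mutation, how many carrying samples each lab submitted.
    labs_by_mutation = defaultdict(Counter)
    for mutation, lab in observations:
        labs_by_mutation[mutation][lab] += 1

    flagged = {}
    for mutation, lab_counts in labs_by_mutation.items():
        total = sum(lab_counts.values())
        if total < min_count:
            continue  # too few observations to judge
        top_lab, top_count = lab_counts.most_common(1)[0]
        fraction = top_count / total
        if fraction >= min_fraction:
            flagged[mutation] = (top_lab, fraction)
    return flagged


# Made-up toy data: mut_1 is spread across labs, mut_2 comes almost
# entirely from a single submitting lab.
observations = [
    ("mut_1", "lab_A"), ("mut_1", "lab_B"), ("mut_1", "lab_C"), ("mut_1", "lab_A"),
    ("mut_2", "lab_X"), ("mut_2", "lab_X"), ("mut_2", "lab_X"),
    ("mut_2", "lab_X"), ("mut_2", "lab_X"), ("mut_2", "lab_Y"),
]
flagged = flag_lab_associated(observations)
```

With this toy input only `mut_2` is flagged. A flagged site is merely a candidate for closer inspection or masking; genuine local transmission can also concentrate a real mutation in one lab's submissions.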

Keywords:

Alleles – Genomics – Microbial mutation – Phylogenetic analysis – Phylogenetics – SARS CoV 2 – Trees – Viral evolution


Article published in the journal: PLOS Genetics, 2020, Issue 11