China COVID19 Mutation - On the origin and continuing evolution of SARS-CoV-2
Tuesday, March 10, 2020
Open BLOG above >>>> CLICK TITLE to Open >>>>>On the origin and continuing evolution of SARS-CoV-2ABSTRACT The SARS-CoV-2 epidemic started in late December 2019 in Wuhan, China, and has since impacted a large portion of China and raised major global concern. Herein, we investigated the extent of molecular divergence between SARS-CoV-2 and other related coronaviruses. Although we found only 4% variability in genomic nucleotides between SARS-CoV-2 and a bat SARS-related coronavirus (SARSr-CoV; RaTG13), the difference at neutral sites was 17%, suggesting the divergence between the two viruses is much larger than previously estimated. Our results suggest that the development of new variations in functional sites in the receptor-binding domain (RBD) of the spike seen in SARS-CoV-2 and viruses from pangolin SARSr-CoVs are likely caused by mutations and natural selection besides recombination. Population genetic analyses of 103 SARS-CoV-2 genomes indicated that these viruses evolved into two major types (designated L and S), that are well defined by two different SNPs that show nearly complete linkage across the viral strains sequenced to date......📷
RESEARCH ARTICLE MICROBIOLOGOn the origin and continuing evolution of SARS-CoV-2Xiaolu Tang1,7, Changcheng Wu1,7, Xiang Li2,3,4,7, Yuhe Song2,5,7, Xinmin Yao1, Xinkai Wu1, Yuange Duan1, Hong Zhang1, Yirong Wang1, Zhaohui Qian6, Jie Cui2,3,*, and Jian Lu1,*1. State Key Laboratory of Protein and Plant Gene Research, Center for Bioinformatics, School of Life Sciences, Peking University, Beijing, 100871, China2. CAS Key Laboratory of Molecular Virology & Immunology, Institut Pasteur of Shanghai, Chinese Academy of Sciences, China3. Center for Biosafety Mega-Science, Chinese Academy of Sciences, China4. University of Chinese Academy of Sciences, China5. School of Life Sciences, Shanghai University, China6. NHC Key Laboratory of Systems Biology of Pathogens, Institute of Pathogen Biology, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing7. These authors contributed equally to this work.*Corresponding authors:Jian Lu, Email: LUJ@pku.edu.cn Jie Cui, Email: jcui@ips.ac.cnABSTRACT The SARS-CoV-2 epidemic started in late December 2019 in Wuhan, China, and has since impacted a large portion of China and raised major global concern. Herein, we investigated the extent of molecular divergence between SARS-CoV-2 and other related coronaviruses. Although we found only 4% variability in genomic nucleotides between SARS-CoV-2 and a bat SARS-related coronavirus (SARSr-CoV; RaTG13), the difference at neutral sites was 17%, suggesting the divergence between the two viruses is much larger than previously estimated. Our results suggest that the development of new variations in functional sites in the receptor-binding domain (RBD) of the spike seen in SARS-CoV-2 and viruses from pangolin SARSr-CoVs are likely caused by mutations and natural selection besides recombination. Population genetic analyses of 103 SARS-CoV-2 genomes indicated that these viruses evolved into two major types (designated L and S), that are well defined by two different SNPs that show nearly complete linkage across the viral strains sequenced to date. Although the L type (~70%) is more prevalent than the S type (~30%), the S type was found to be the ancestral version. Whereas the L type was more prevalent in the early stages of the outbreak in Wuhan, the frequency of the L type decreased after early January 2020. Human intervention may have placed more severe selective pressure on the L type, which might be more aggressive and spread more quickly. On the other hand, the S type, which is evolutionarily older and less aggressive, might have increased in relative frequency due to relatively weaker selective pressure. These findings strongly support an urgent need for further immediate, comprehensive studies that combine genomic data, epidemiological data, and chart records of the clinical symptoms of patients with coronavirus disease 2019 (COVID-19).Keywords: SARS-CoV-2, virus, molecular evolution, population genetics Received: 25-Feb-2020; Revised: 28-Feb-2020; Accepted: 29-Feb-2020.INTRODUCTION The coronavirus disease 2019 (COVID-19) epidemic started in late December 2019 in Wuhan, the capital of Central China's Hubei Province. Since then, it has rapidly spread across China and in other countries, raising major global concerns. The etiological agent is a novel coronavirus, SARS-CoV-2, named for the similarity of its symptoms to those induced by the severe acute respiratory syndrome. As of February 28, 2020, 78,959 cases of SARS-CoV-2 infection have been confirmed in China, with 2,791 deaths. Worryingly, there have also been more than 3,664 confirmed cases outside of China in 46 countries and areas (https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports/),raising significant doubts about the likelihood of successful containment. Further, the genomic sequences of SARS-CoV-2 viruses isolated from a number of patients share sequence identity higher than 99.9%, suggesting a very recent host shift into humans [1-3].Coronaviruses are naturally hosted and evolutionarily shaped by bats [4, 5]. Indeed, it has been postulated that most of the coronaviruses in humans are derived from the bat reservoir [6, 7]. Unsurprisingly, several teams have recently confirmed the genetic similarity between SARS-CoV-2 and a bat betacoronavirus of the sub-genus Sarbecovirus [8-13]. The whole-genome sequence identity of the novel virus has 96.2% similarity to a bat SARS-related coronavirus (SARSr-CoV; RaTG13) collected in Yunnan province, China [2, 14], but is not very similar to the genomes of SARS-CoV (about 79%) or MERS-CoV (about 50%) [1, 15]. It has also been confirmed that the SARS-CoV-2 uses the same receptor, the angiotensin converting enzyme II (ACE2), as the SARS-CoV [11]. Although the specific route of transmission from natural reservoirs to humans remains unclear [5, 13], several studies have shown that pangolins may have provided a partial spike gene to SARS-CoV-2; the critical functional sites in the spike protein of SAR-CoV-2 are nearly identical to one identified in a virus isolated from a pangolin [16-18].Despite these recent discoveries, several fundamental issues related to the evolutionary patterns and driving forces behind this outbreak of SARS-CoV-2 remain unexplored [19]. Herein, we investigated the extent of molecular divergence between SARS-CoV-2 and other related coronaviruses and carried out population genetic analyses of 103 sequenced genomes of SARS-CoV-2. This work provides new insights into the factors driving the evolution of SARS-CoV-2 and its pattern of spread through the human population.RESULTSMolecular phylogeny and divergence between SARS-CoV-2 and related coronaviruses. For each annotated ORF in the reference genome of SARS-CoV-2 (NC_045512), we extracted the orthologous sequences in human SARS-CoV, four batSARS-related coronaviruses (SARSr-CoV: RaTG13, ZXC21, ZC45, and BM48-31), one Pangolin SARSr-CoV from Guangdong (GD) [17], and six Pangolin SARSr-CoV genomes from Guangxi (GX) [18] (Table S1). We aligned the coding sequences (CDSs) based on the protein alignments (see Materials and Methods). Most ORFs annotated from SARS-CoV-2 were found to be conserved in other viruses, except for ORF8 and ORF10 (Table 1). The protein sequence of SARS-CoV-2 ORF8 shared very low similarity with sequences in SARS-CoV and BM48-31, and ORF10 had a premature stop codon in both SARS-CoV and BM48-31 (Fig. S1). A one-base deletion caused a frame-shift mutation in ORF10 of ZXC21 (Fig. S1).To investigate the phylogenetic relationships between these viruses at the genomic scale, we concatenated coding regions (CDSs) of the nine conserved ORFs (orf1ab, E, M, N, S, ORF3a, ORF6, ORF7a, and ORF7b) and reconstructed the phylogenetic tree using the synonymous sites (Fig. 1A). We also used CODEML in the PAML [20] to infer the ancestral sequence of each node and calculated the dN (nonsynonymous substitutions per nonsynonymous site), dS (synonymous substitutions per synonymous site), and dN/dS (ω) values for each branch (Fig. 1A). In parallel, we also calculated the pairwise dN, dS, and ω values between SARS-CoV-2 and another virus (Table 1).The genome-wide phylogenetic tree indicated that SARS-CoV-2 was closest to RaTG13, followed by GD Pangolin SARSr-CoV, then by GX Pangolin SARSr-CoVs, then by ZC45 and ZXC21, then by human SARS-CoV, and finally by BM48-31(Fig. 1A). Notably, we found that the nucleotide divergence at synonymous sites between SARS-CoV-2 and other viruses was much higher than previously anticipated. For example, although the overall genomic nucleotides overall differ ~4% between SARS-CoV-2 and RaTG13, the genomic average dS was 0.17, which means the divergence at the neutral sites is 17% between these two viruses (Table 1). This is because the nonsynonymous sites are usually under stronger negative selection than synonymous sites, and calculating sequence differences without separating these two classes of sites may underestimate the extent of molecular divergence by several folds.Notably, the dS value varied considerably across genes in SARS-CoV-2 and the other viruses analyzed. In particular, the spike gene (S) consistently exhibited larger dS values than other genes (Table 1). This pattern became clear when we calculated the dS value for each branch in Fig. 1A for the spike gene versus the concatenated sequences of the remaining genes (Fig. S2). In each branch, the dS of spike was 2.22 ± 1.35 (mean ± SD) times as large as that of the other genes. This extremely elevated dS value of spike could be caused either by a high mutation rate or by natural selection that favors synonymous substitutions. Synonymoussubstitutions may serve as another layer of genetic regulation, guiding the efficiency of mRNA translation by changing codon usage [21]. If positive selection is the driving force for the higher synonymous substation rate seen in spike, we expect the frequency of optimal codons (FOP) of spike to be different from that of other genes. However, our codon usage bias analysis (Table S2) suggests the FOP of spike was only slightly higher than that of the genomic average (0.717 versus 0.698, see Materials and Methods). Thus, we believe that the elevated synonymous substitution rate measured in spike is more likely caused by higher mutational rates; however, the underlying molecular mechanism remains unclear.Both SARS-CoV and SARS-CoV-2 bind to ACE2 through the RBD of spike protein in order to initiate membrane fusion and enter human cells [1, 2, 22-26]. Five out of the six critical amino acid (AA) residues in RBD were different between SARS-CoV-2 and SARS-CoV (Fig. 1B), and a 3D structural analysis indicated that the spike of SARS-CoV-2 has a higher binding affinity to ACE2 than SARS-CoV [23]. Intriguingly, these same six critical AAs are identical between GD Pangolin-CoV and SARS-CoV-2 [16]. In contrast, although the genomes of SARS-CoV-2 and RaTG13 are more similar overall, only one out of the six functional sites are identical between the two viruses (Fig. 1B). It has been proposed that the SARS-CoV-2 RBD region of the spike protein might have resulted from recent recombination events in pangolins [16-18]. Although several ancient recombination events have been described in spike [27, 28], it also seems likely that the identical functional sites in SARS-CoV-2 and GD Pangolin-CoV may actually the result of coincidental convergent evolution [18].If the functional AA residues in the SARS-CoV-2 RBD region were acquired from GD Pangolin-CoV in a very recent recombination event, we would expect the nucleotide sequences of this region to be nearly identical between the two viruses. However, for the CDS sequences that span five critical AA sites in the SARS-CoV-2 spike (ranging from codon 484 to 507, covering five adjacent functional sites: F486, Q493, S494, N501, and Y505; Fig. S3), we estimated dS = 0.411, dN = 0.019, and ω= 0.046 between SARS-CoV-2 and GD Pangolin-CoV. By assuming the synonymous substitution rate (u) of 1.67-4.67 x 10-3/site/year, as estimated in SARS-CoV [29], the recombination/introgression, if it occurred at all, would be estimated to happen approximately 19.8-55.4 years ago. Here, the formulawas used to calculate divergence time; note that the increased mutational rate of spike was considered for this calculation. Thus, it seems very unlikely that SARS-CoV-2 originated from the GD Pangolin-CoV due to a very recent recombination event. Alternatively, it seems more likely that a high mutation rate in spike, coupled with strong natural selection, has shaped the identical functional AA residues between these two viruses, as proposed previously [18]. Although these sites are maintained in SARS-CoV-2 and GDPDF VERSIONhttps://academic.oup.com/nsr/advance-article-pdf/doi/10.1093/nsr/nwaa036/32757241/nwaa036.pdf
Comments