Wolfram Research

Genetic Sequences for the SARS-CoV-2 Coronavirus

Released nucleotide sequences of SARS-CoV-2 (the virus that causes COVID-19, formerly known as 2019-nCoV), including location, collection time, and similar supporting data

Details

This data is imported from the National Center for Biotechnology Information (NCBI) and formatted for computation.
Properties provided with each sequence include: "Accession", "Species", "Genus", "Family", "Length", "GeographicLocation", "Host", "Sequence", "CollectionDate", "NucleotideStatus", "GenBankTitle", and "IsolationSource".
Additional content elements include:
"CollectionTimeline" a TimelinePlot of when the sequences were collected
"ReleaseTimeline" a TimelinePlot of when the sequences were released
"AffectedLocations" a world map showing where these sequences were collected

Examples

Basic Examples

Get a Dataset containing rows for each sequence:

In[1]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"]
Out[1]=
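
The Dataset can be narrowed to a few of the properties listed under Details. This is a minimal sketch that assumes those property names are the column keys, as the other examples on this page suggest; the variable name sequences and the choice of columns are illustrative only:

```wolfram
(* Load the Dataset once; this needs network access to the Wolfram Data Repository *)
sequences = ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"];

(* Keep every row but only a few illustrative columns *)
sequences[All, {"Accession", "Length", "GeographicLocation", "CollectionDate"}]
```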

Return the latest date a sequence was released:

In[2]:=
ResourceData[
  "Genetic Sequences for the SARS-CoV-2 Coronavirus"][Max, \
"ReleaseDate"]
Out[2]=

Count the different sequence lengths provided; the length corresponds well to the part of the viral genome that was sequenced:

In[3]:=
ResourceData[
  "Genetic Sequences for the SARS-CoV-2 Coronavirus"][Counts, \
"Length"]
Out[3]=

The lengths of sequences break down into two categories, corresponding to more complete sequences versus specific genetic regions:

In[4]:=
ResourceData[
  "Genetic Sequences for the SARS-CoV-2 Coronavirus"][Histogram, \
"Length"]
Out[4]=
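
The two categories can also be counted directly. The sketch below assumes that "Length" is stored as a plain integer and uses a hypothetical cutoff of 29,000 nucleotides, chosen only because the complete genome is roughly 29,900 nucleotides long; neither the cutoff nor the variable names come from the resource itself:

```wolfram
(* Load the Dataset; this needs network access to the Wolfram Data Repository *)
sequences = ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"];

(* Assumed, illustrative threshold separating near-complete genomes from partial regions *)
lengthCutoff = 29000;

(* Group rows by whether they reach the cutoff, then count the rows in each group *)
sequences[GroupBy[#Length >= lengthCutoff &], Length]
```

The key True then marks the near-complete sequences and False the shorter, region-specific ones.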

Scope & Additional Elements

Get a timeline plot of collection dates:

In[5]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus", \
"CollectionTimeline"]
Out[5]=

See a timeline plot of release dates:

In[6]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus", \
"ReleaseTimeline"]
Out[6]=

Show the locations where the sequences were gathered:

In[7]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus", \
"AffectedLocations"]
Out[7]=

Visualizations

A phylogenetic tree comparison of complete genomes suggests that while clusters of samples from China, the United States, and Japan are very similar, later samples diverge as the virus spreads and mutates, with the greatest difference observed in a sample from South Korea. Dropping the trailing run of adenines from each sequence avoids arbitrary differences caused by varying poly(A) tail lengths, which may be sequencing artifacts and should not affect viral adaptivity:

In[8]:=
dropTrailingA[seq_] := StringReplace[seq, StartOfString ~~ Shortest[a__] ~~ ("A" ..) ~~ EndOfString :> a];
In[9]:=
Apply[ResourceFunction["PhylogeneticTreePlot"], Transpose[{dropTrailingA@First[#], Row@(Rest@#)} & /@ (Values /@ Normal[ResourceData[
        "Genetic Sequences for the SARS-CoV-2 Coronavirus"][
       Select[StringContainsQ[#GenBankTitle, "complete genome"] &], {"Sequence", "GeographicLocation", "CollectionDate"}]]
    )]
 ]
Out[9]=
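
To see what the tail-trimming helper does in isolation, it can be applied to short made-up strings. The definition is repeated from the cell above so that this sketch runs on its own; the toy inputs are not real viral data:

```wolfram
(* Same definition as in the visualization above, repeated so this cell is self-contained *)
dropTrailingA[seq_] := StringReplace[seq, StartOfString ~~ Shortest[a__] ~~ ("A" ..) ~~ EndOfString :> a];

(* The trailing run of A is removed, giving "ACGTTGC" *)
dropTrailingA["ACGTTGCAAAAA"]

(* Without a trailing A the pattern does not match and the string is returned unchanged *)
dropTrailingA["ACGTTGC"]
```

Because the pattern uses Shortest, the whole trailing run is removed, while adenines in the interior of the sequence are left alone.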

Analysis

Comparing the genetic differences by location suggests that China has seen the most viral evolution, but each location has its own distinct variants. Though the previous visualization shows the South Korean sample as the most divergent, all of its differences are single-nucleotide substitutions:

In[10]:=
originalReferenceSequence = dropTrailingA@
   Normal[ResourceData[
        "Genetic Sequences for the SARS-CoV-2 Coronavirus"][
       Select[#Accession === "NC_045512" &]]][[1]]["Sequence"];
In[11]:=
mutationSet = ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"][
     Select[StringContainsQ[#GenBankTitle, "complete genome"] &]][All,
     Append[#, <|
       "Mutations" -> (Rule @@@ (Select[
            SequenceAlignment[originalReferenceSequence, dropTrailingA@#Sequence], ListQ]))|>] &][
   All, {"GeographicLocation", "Mutations"}];
mutationsByLocation = Normal[mutationSet[GroupBy["GeographicLocation"], Catenate, "Mutations"][All, Grid[(List /@ Union[#]), Alignment -> Left] &]];
In[12]:=
Keys[mutationsByLocation]
Out[12]=
In[13]:=
mutationsByLocation[Entity["Country", "China"]]
Out[13]=
In[14]:=
mutationsByLocation[Entity["Country", "Australia"]]
Out[14]=
In[15]:=
mutationsByLocation[Entity["Country", "Japan"]]
Out[15]=
In[16]:=
mutationsByLocation[Entity["Country", "SouthKorea"]]
Out[16]=
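
The mutation extraction above relies on how SequenceAlignment reports differences: shared stretches come back as strings and differing stretches as two-element lists. The following sketch shows the same steps on two made-up strings; the names toyReference and toySample are hypothetical and the strings are not real viral data:

```wolfram
(* Toy reference and sample that differ at a single position *)
toyReference = "ACGTACGTAC";
toySample = "ACGTTCGTAC";

(* Shared stretches are returned as strings, differing stretches as {referencePart, samplePart} *)
toyAlignment = SequenceAlignment[toyReference, toySample];

(* Keep only the differing stretches and rewrite each pair as referencePart -> samplePart *)
toyMutations = Rule @@@ Select[toyAlignment, ListQ]
```

Each resulting rule maps a fragment of the reference to the fragment that replaces it in the sample, which is the form collected in the "Mutations" column above.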

Wolfram Research, "Genetic Sequences for the SARS-CoV-2 Coronavirus" from the Wolfram Data Repository (2020) https://doi.org/10.24097/wolfram.03304.data

License Information

Public Domain
