Wolfram Research

Genetic Sequences for the SARS-CoV-2 Coronavirus

Nucleotide sequences of the SARS-CoV-2 virus (formerly known as 2019-nCoV, the virus that causes COVID-19), including location, collection time and similar supporting data. (This data was imported and made computable at 6 am CDT on Oct. 28, 2020.)

Details

This data is imported from the National Center for Biotechnology Information (NCBI) and formatted for computation.
Properties provided with each sequence include: "Accession", "Length", "Authors", "Publications", "GeographicLocation", "DetailedGeographicLocation", "USState", "Host", "Sequence", "CollectionDate", "ReleaseDate", "InclusionDate", "SequenceType", "NucleotideStatus", "GenBankTitle", "IsolationSource" and "BioSample".
Additional content elements include:
"CollectionHistogram": a DateHistogram of when the sequences were collected
"ReleaseHistogram": a DateHistogram of when the sequences were released to the public
"InclusionHistogram": a DateHistogram of when the sequences were included in the source for this data
"AffectedLocations": a world map showing where these sequences were collected
"AlignmentDifferences": a Dataset containing alignment differences with the reference sequence

Examples

Basic Examples

Get a Dataset containing one row per sequence:

In[1]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"]
Out[1]=

Return the latest date a sequence was included:

In[2]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"][Max, "InclusionDate"]
Out[2]=

Count the different sequence lengths provided, which correspond well to the part of the virus that was sequenced:

In[3]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"][
  Counts, "Length"][ReverseSort]
Out[3]=

Though some small partial sequences are included, most are around the length of the complete viral genome:

In[4]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"][Histogram, "Length"]
Out[4]=

Most of these SARS-CoV-2 samples were collected from humans, but not all:

In[5]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"][
  Select[Not[MissingQ[#Host]] &]][Counts, "Host"]
Out[5]=

Scope & Additional Elements

Get a date histogram of collection dates:

In[6]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus", "CollectionHistogram"]
Out[6]=

See a date histogram of release dates:

In[7]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus", "ReleaseHistogram"]
Out[7]=

See a date histogram of inclusion dates:

In[8]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus", "InclusionHistogram"]
Out[8]=

Show the locations where the sequences were gathered:

In[9]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus", "AffectedLocations"]
Out[9]=

Obtain the available alignment differences with the reference sequence:

In[10]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus", "AlignmentDifferences"]
Out[10]=

Visualizations

A phylogenetic tree comparing the most common complete genome from each location shows clusters that are broadly distributed. Trailing runs of adenine are dropped first to avoid arbitrary differences from varying poly(A) RNA tail lengths, which may be sequencing artifacts and should not affect viral adaptivity:

In[11]:=
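(* Remove a trailing run of "A" characters (the poly(A) tail) from a sequence string *)
dropTrailingA[seq_] := StringReplace[seq, StartOfString ~~ Shortest[a__] ~~ ("A" ..) ~~ EndOfString :> a];
(* Group rows that agree on everything but their first element, keep the most common first element in each group, and drop rows whose second element is Missing *)
sampleListFirstByMostCommon[lists : {___List}] := DeleteCases[
   Prepend[#[[1, 2 ;; -1]], ReverseSortBy[Tally[First /@ #], Last][[1, 1]]] & /@ GatherBy[lists, Rest], {_, _Missing}];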
treePlot = Apply[ResourceFunction["PhylogeneticTreePlot"], Transpose[{dropTrailingA@First[#], Row@(Rest@#)} & /@ sampleListFirstByMostCommon[{#[[1]], #[[2]]} &@*Values /@ Normal[ResourceData[
         "Genetic Sequences for the SARS-CoV-2 Coronavirus"][
        Select[And[
           StringContainsQ[#GenBankTitle, "complete genome"], #Host ===
             Entity["Species", "Species:HomoSapiens"]] &], {"Sequence", "GeographicLocation"}]]]]
  ]
Out[13]=
In[14]:=
GeoGraphics[{MapIndexed[
   Splice[{ColorData[3][#2[[1]]], Splice[Polygon /@ #1]}] &,
   {{
Entity["Country", "Bahrain"], 
Entity["Country", "Belize"], 
Entity["Country", "Chile"], 
Entity["Country", "China"], 
Entity["Country", "CzechRepublic"], 
Entity["Country", "Finland"], 
Entity["Country", "Germany"], 
Entity["Country", "Greece"], 
Entity["Country", "Japan"], 
Entity["Country", "Kazakhstan"], 
Entity["Country", "Lebanon"], 
Entity["Country", "Nepal"], 
Entity["Country", "Netherlands"], 
Entity["Country", "NewZealand"], 
Entity["Country", "Nigeria"], 
Entity["Country", "Pakistan"], 
Entity["Country", "Poland"], 
Entity["Country", "Russia"], 
Entity["Country", "SaudiArabia"], 
Entity["Country", "Serbia"], 
Entity["Country", "Turkey"], 
Entity["Country", "UnitedStates"], 
Entity["Country", "Zambia"]}, {
Entity["Country", "Bangladesh"], 
Entity["Country", "Brazil"], 
Entity["Country", "Colombia"], 
Entity["Country", "EastTimor"], 
Entity["Country", "Egypt"], 
Entity["Country", "France"], 
Entity["Country", "Georgia"], 
Entity["Country", "Guam"], 
Entity["Country", "Guatemala"], 
Entity["Country", "Israel"], 
Entity["Country", "Jamaica"], 
Entity["Country", "Malaysia"], 
Entity["Country", "Morocco"], 
Entity["Country", "SouthAfrica"], 
Entity["Country", "SouthKorea"], 
Entity["Country", "Spain"], 
Entity["Country", "SriLanka"], 
Entity["Country", "Sweden"], 
Entity["Country", "Tunisia"], 
Entity["Country", "UnitedKingdom"], 
Entity["Country", "Uruguay"], 
Entity["Country", "Vietnam"], 
Entity["Country", "Taiwan"]}, {
Entity["Country", "Jordan"], 
Entity["Country", "Mexico"], 
Entity["Country", "Peru"]}, {
Entity["Country", "Ghana"], 
Entity["Country", "SierraLeone"]}, Splice[List /@ {
Entity["Country", "Australia"], 
Entity["Country", "HongKong"], 
Entity["Country", "India"], 
Entity["Country", "Iran"], 
Entity["Country", "Iraq"], 
Entity["Country", "Italy"], 
Entity["Country", "Malta"], 
Entity["Country", "Philippines"], 
Entity["Country", "PuertoRico"], 
Entity["Country", "Thailand"], 
Entity["Country", "Venezuela"]}]}]}]
Out[14]=
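
The two preprocessing steps behind the tree above, trimming the poly(A) tail and keeping the most common sequence per location, can be checked on toy input. The following is a minimal sketch under stated assumptions: the strings and labels are invented, `toyPairs` and `mostCommonByLabel` are hypothetical names, and `Commonest` stands in for the `Tally`-based selection used in `sampleListFirstByMostCommon`.

```wolfram
(* Same definition as in the cell above, repeated so this sketch is self-contained *)
dropTrailingA[seq_] := StringReplace[seq,
   StartOfString ~~ Shortest[a__] ~~ ("A" ..) ~~ EndOfString :> a];

(* A trailing run of "A" is removed; a string without one is left unchanged *)
trimmed = dropTrailingA /@ {"ACGTAAAA", "GATTACA", "ACGT"};
(* trimmed is {"ACGT", "GATTAC", "ACGT"} *)

(* Toy {sequence, label} pairs; the data are invented for illustration *)
toyPairs = {{"ACGT", "X"}, {"ACGT", "X"}, {"ACGA", "X"}, {"TTTT", "Y"}};

(* For each label, keep the sequence that occurs most often *)
mostCommonByLabel = KeyValueMap[
   {First[Commonest[#2[[All, 1]]]], #1} &,
   GroupBy[toyPairs, Last]];
(* mostCommonByLabel is {{"ACGT", "X"}, {"TTTT", "Y"}} *)
```

Trimming before comparison means two genomes that differ only in reported tail length are treated as the same sequence when the most common one is chosen.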

A similar visualization can be created for samples that include more detailed geographic information. In this visualization of the most common sequences reported for US states, clusters emerge that contain interesting regional blocks, as shown in the map below:

In[15]:=
Apply[ResourceFunction["PhylogeneticTreePlot"], Transpose[{dropTrailingA@First[#], Last[#]} & /@ sampleListFirstByMostCommon[{#[[1]], #[[2]]} &@*Values /@ Normal[ResourceData[
        "Genetic Sequences for the SARS-CoV-2 Coronavirus"][
       Select[And[
          StringContainsQ[#GenBankTitle, "complete genome"], #Host ===
            Entity["Species", "Species:HomoSapiens"],
          Not[MissingQ[#USState]]] &], {"Sequence", "USState"}]]]]
 ]
Out[15]=
In[16]:=
GeoGraphics[{MapIndexed[
   Splice[{ColorData[3][#2[[1]]], Splice[Polygon /@ #1]}] &, {{
Entity["AdministrativeDivision", {"Alabama", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Arizona", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Georgia", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Iowa", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Kansas", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Kentucky", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Massachusetts", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Missouri", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Montana", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Nebraska", "UnitedStates"}], 
Entity["AdministrativeDivision", {"NewHampshire", "UnitedStates"}], 
Entity["AdministrativeDivision", {"NewMexico", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Ohio", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Oregon", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Pennsylvania", "UnitedStates"}], 
Entity["AdministrativeDivision", {"RhodeIsland", "UnitedStates"}], 
Entity["AdministrativeDivision", {"SouthCarolina", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Vermont", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Virginia", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Washington", "UnitedStates"}]},
    {
Entity["AdministrativeDivision", {"Arkansas", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Colorado", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Connecticut", "UnitedStates"}], 
Entity["AdministrativeDivision", {
      "DistrictOfColumbia", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Florida", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Idaho", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Illinois", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Indiana", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Maine", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Maryland", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Michigan", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Minnesota", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Nevada", "UnitedStates"}], 
Entity["AdministrativeDivision", {"NewJersey", "UnitedStates"}], 
Entity["AdministrativeDivision", {"NewYork", "UnitedStates"}], 
Entity["AdministrativeDivision", {"NorthCarolina", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Oklahoma", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Tennessee", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Texas", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Utah", "UnitedStates"}], 
Entity["AdministrativeDivision", {"WestVirginia", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Wisconsin", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Wyoming", "UnitedStates"}]}, {
Entity["AdministrativeDivision", {"California", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Louisiana", "UnitedStates"}]}, Splice[List /@ {
Entity["AdministrativeDivision", {
        "Mississippi", "UnitedStates"}]}]}]}]
Out[16]=

Visualizing the similarity of the most common sequence by week of collection shows that similar times tend to cluster together, but there is some overlap (such as between the week of Dec. 30, 2019 and the week of Feb. 17, 2020), illustrating that the virus has shown not only evolution but also significant continuity:

In[17]:=
(* Redefine the helper so that rows whose week is built from a Missing date are dropped *)
sampleListFirstByMostCommon[lists : {___List}] := DeleteCases[
   Prepend[#[[1, 2 ;; -1]], ReverseSortBy[Tally[First /@ #], Last][[1, 1]]] & /@ GatherBy[lists, Rest], {_, DateObject[_Missing, "Week"]}];
treePlot = Apply[ResourceFunction["PhylogeneticTreePlot"], Transpose[{dropTrailingA@First[#], Row@(Rest@#)} & /@ sampleListFirstByMostCommon[{#[[1]], DateObject[#[[2]], "Week"]} &@*Values /@ Normal[ResourceData[
         "Genetic Sequences for the SARS-CoV-2 Coronavirus"][
        Select[And[
           StringContainsQ[#GenBankTitle, "complete genome"], #Host ===
             Entity["Species", "Species:HomoSapiens"]] &], {"Sequence", "CollectionDate"}]]]]
  ]
Out[18]=

Analysis

Using the provided alignment differences, we can see where along the viral genome changes have been detected over time. Although mutations are relatively uniformly distributed, some changes are measured more often than others:

In[19]:=
minDate = ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"][
    Select[StringContainsQ[#GenBankTitle, "complete genome"] &]][Min, "CollectionDate"];
accessionToDiffList = <|
   Rule @@@ Normal[Values /@ ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus",
        "AlignmentDifferences"]]|>;
positionCountPerCollectionDay = Sort[Flatten[
    With[{day = #[[1, 1]]}, Prepend[#, day] & /@ Tally[Flatten[Last /@ #]]] & /@ GatherBy[{Ceiling[
          QuantityMagnitude[
           DateDifference[minDate, #[[1]]]]], (#["Position"]) & /@ accessionToDiffList[#[[2]]]} & /@ Normal[Values /@ ResourceData[
            "Genetic Sequences for the SARS-CoV-2 Coronavirus"][
           Select[And[Not[MissingQ[#CollectionDate]], KeyExistsQ[accessionToDiffList, #Accession]] &]
           ][[All, {"CollectionDate", "Accession"}]]],
      First], 1]];
ListPlot3D[positionCountPerCollectionDay,
 AxesLabel -> {"Days of Reported Data", "Genetic Position", "Changes Counted\nat Position"}, PlotRange -> All
 ]
Out[22]=
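
The {day, position, count} triples passed to ListPlot3D above can be illustrated on a small scale. This is a minimal sketch with invented numbers; `toyRecords` and `toyCounts` are hypothetical names, and `GroupBy` with `KeyValueMap` is used here in place of the `GatherBy` construction in the cell above.

```wolfram
(* Toy records of the form {day offset, positions that differ from the reference}; values are invented *)
toyRecords = {{0, {241, 3037}}, {0, {241}}, {3, {23403}}};

(* For each day, tally how often each position was reported, giving {day, position, count} triples *)
toyCounts = Sort[Flatten[
    KeyValueMap[
     Function[{day, recs},
      {day, #[[1]], #[[2]]} & /@ Tally[Flatten[recs[[All, 2]]]]],
     GroupBy[toyRecords, First]],
    1]];
(* toyCounts is {{0, 241, 2}, {0, 3037, 1}, {3, 23403, 1}} *)
```

Each triple corresponds to one point in the 3D plot: the day of collection, the genome position, and the number of sequences collected that day that differ at that position.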

It is also possible to treat these genetic differences as lists of features:

In[23]:=
diffToStringFeature[diff_Association] := ToString[diff["Position"]] <> ":" <> diff["Reference"] <> ">" <> diff["Variation"];
diffToStringFeature[diffs : {___Association}] := Map[diffToStringFeature, diffs];
accessionToFeatureList = diffToStringFeature /@ accessionToDiffList;
accessionToFeatureList // Keys // First // accessionToFeatureList
Out[26]=
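
To make the feature format concrete, the conversion can be applied to hand-written differences. This is a small sketch: the two associations are invented, `toyDiffs` and `toyFeatureStrings` are hypothetical names, and the assumption that "Position" holds an integer while "Reference" and "Variation" hold strings is inferred from how the definitions above use them.

```wolfram
(* Same definitions as in the cell above, repeated so this sketch is self-contained *)
diffToStringFeature[diff_Association] :=
  ToString[diff["Position"]] <> ":" <> diff["Reference"] <> ">" <> diff["Variation"];
diffToStringFeature[diffs : {___Association}] := Map[diffToStringFeature, diffs];

(* Invented differences, assuming an integer position and string bases *)
toyDiffs = {
   <|"Position" -> 241, "Reference" -> "C", "Variation" -> "T"|>,
   <|"Position" -> 23403, "Reference" -> "A", "Variation" -> "G"|>};

toyFeatureStrings = diffToStringFeature[toyDiffs];
(* toyFeatureStrings is {"241:C>T", "23403:A>G"} *)
```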

This representation supports a fairly wide variety of analyses. Here, we find all of the genetic differences that always occur together in the sampled sequences, taking advantage of the fact that differences that always occur together must occur in the same number of sequences:

In[27]:=
featureIndex = <|
   Rule[#[[1, 1]], Last /@ #] & /@ GatherBy[
     Flatten[Function[{diffList}, {#, diffList} & /@ diffList] /@ Values[accessionToFeatureList], 1], First]|>;
termCountIndex = <|
   Rule[#[[1, 1]], Last /@ #] & /@ GatherBy[{Length[featureIndex[#]], #} & /@ Keys[featureIndex], First]|>;
doesAlwaysCoOccur[firstTerm_, secondTerm_] := AllTrue[featureIndex[firstTerm], MemberQ[#, secondTerm] &];
findCoOccurPairs[terms_List] := Module[{firstTermResults, remainingTerms = terms, firstTerm, restTerms},
  Flatten[Reap[While[remainingTerms =!= {},
      firstTerm = First[remainingTerms];
      restTerms = Rest[remainingTerms];
      firstTermResults = Rule[firstTerm, #] & /@ Select[restTerms, doesAlwaysCoOccur[firstTerm, #] &];
      Sow[firstTermResults];
      remainingTerms = Complement[restTerms, Last /@ firstTermResults];
      ]][[2, 1]], 1]
  ]
coOccurenceSets =
  WeaklyConnectedComponents[
   Flatten[
    findCoOccurPairs[termCountIndex[#]] & /@ Keys[termCountIndex], 1]];
ResourceFunction["NiceGrid"][
 Row[Sort[#], ", "] & /@ Take[coOccurenceSets, 5], Alignment -> Left]
Out[32]=
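
The counting shortcut can be seen on a toy example. The sketch below is not the method used in the cell above: instead of the pairwise check followed by WeaklyConnectedComponents, it groups features by how many sequences contain them and then treats features with identical supporting sets as always co-occurring. All names (`toyFeatures`, `toySupport`, `toyByCount`, `toyCoOccurring`) and all data are hypothetical.

```wolfram
(* Toy data, invented for illustration: accession -> list of difference features *)
toyFeatures = <|
   "seqA" -> {"241:C>T", "3037:C>T", "23403:A>G"},
   "seqB" -> {"241:C>T", "3037:C>T"},
   "seqC" -> {"23403:A>G"}|>;

(* For each feature, the accessions whose feature list contains it *)
toySupport = AssociationMap[
   Function[f, Keys[Select[toyFeatures, MemberQ[#, f] &]]],
   Union @@ Values[toyFeatures]];

(* Features can only always co-occur if they appear in the same number of sequences,
   so candidates are grouped by support size first *)
toyByCount = GroupBy[Keys[toySupport], Length[toySupport[#]] &];

(* Within each group, features with identical supporting sets always occur together *)
toyCoOccurring = Select[
   Flatten[GatherBy[#, toySupport[#] &] & /@ Values[toyByCount], 1],
   Length[#] > 1 &];
(* toyCoOccurring is {{"241:C>T", "3037:C>T"}} *)
```

Here all three features appear in two sequences, so the count alone does not separate them; the comparison of supporting sets then isolates the pair that appears in exactly the same sequences.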

Wolfram Research, "Genetic Sequences for the SARS-CoV-2 Coronavirus" from the Wolfram Data Repository (2020) https://doi.org/10.24097/wolfram.03304.data

License Information

Public Domain
