Genetic Sequences for the SARS-CoV-2 Coronavirus


Nucleotide sequences of SARS-CoV-2 (the virus that causes COVID-19, formerly known as 2019-nCoV), including location, collection time and similar supporting data

Details

This data is imported from the National Center for Biotechnology Information (NCBI) and formatted for computation.
Properties provided with each sequence include: "Accession", "Length", "Publications", "GeographicLocation", "DetailedGeographicLocation", "USState", "Host", "Sequence", "CollectionDate", "ReleaseDate", "SequenceType", "NucleotideStatus", "GenBankTitle", "IsolationSource", "BioSample", "PangoLineage" and "WHONamedVariant".
Additional content elements include:
"LatestData": a Dataset containing the most recently collected data
"CollectionHistogram": a DateHistogram of when the sequences were collected
"ReleaseHistogram": a DateHistogram of when the sequences were released to the public
"AffectedLocations": a world map showing where these sequences were collected
"SubmissionAuthors": a Dataset containing the accessions for each author list
"AlignmentDifferences": a Dataset containing alignment differences with the reference sequence
"ReferenceBioSequence": a BioSequence representing the reference SARS-CoV-2 genome

Examples

Basic Examples

Get a Dataset containing rows for the most recent sequences:

In[1]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus", "LatestData"]
Out[1]=

Get a Dataset containing rows for all sequences (this can take considerable time to download and expand):

In[2]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"]
Out[2]=

Return the latest date a sequence was released:

In[3]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"][Max, "ReleaseDate"]
Out[3]=

Tally the lengths of the sequences provided; a sequence's length corresponds closely to the part of the virus that was sequenced:

In[4]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"][
  Counts, "Length"][ReverseSort]
Out[4]=

Most of these SARS-CoV-2 samples were collected from humans, but not all:

In[5]:=
ReverseSort[
 ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"][
   Select[Not[MissingQ[#Host]] &]][Counts, "Host"]]
Out[5]=

Some of these genetic sequences correspond to named variants of interest, as designated by the World Health Organization (WHO):

In[6]:=
ReverseSort[
 ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"][
   Select[Not[MissingQ[#WHONamedVariant]] &]][Counts, "WHONamedVariant"]]
Out[6]=

Scope & Additional Elements

Get a date histogram of collection dates:

In[7]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus", "CollectionHistogram"]
Out[7]=

See a date histogram of release dates:

In[8]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus", "ReleaseHistogram"]
Out[8]=

Show the locations where the sequences were gathered:

In[9]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus", "AffectedLocations"]
Out[9]=

Obtain the available alignment differences with the reference sequence:

In[10]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus", "AlignmentDifferences"]
Out[10]=

Show the authors with the accessions of the sequences they submitted:

In[11]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus", "SubmissionAuthors"]
Out[11]=

Obtain the reference sequence as a biomolecular sequence:

In[12]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus", "ReferenceBioSequence"]
Out[12]=

Visualizations

A phylogenetic tree comparison of the most common complete genomes by location shows clusters that are broadly distributed. Trailing runs of adenine are dropped first, which avoids arbitrary differences from varying poly(A) RNA tail lengths; these may be sequencing artifacts and shouldn’t affect viral adaptivity:

In[13]:=
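(* Strip a trailing run of "A" characters (the poly(A) tail) from a sequence string *)
dropTrailingA[seq_] := StringReplace[seq, StartOfString ~~ Shortest[a__] ~~ ("A" ..) ~~ EndOfString :> a];
(* For rows of the form {sequence, key, ...}, group rows by key and keep the most common sequence per key; rows whose key is Missing are dropped *)
sampleListFirstByMostCommon[lists : {___List}] := DeleteCases[
   Prepend[#[[1, 2 ;; -1]], ReverseSortBy[Tally[First /@ #], Last][[1, 1]]] & /@ GatherBy[lists, Rest], {_, _Missing}];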
treePlot = Apply[ResourceFunction["PhylogeneticTreePlot"], Transpose[{dropTrailingA@First[#], Row@(Rest@#)} & /@ sampleListFirstByMostCommon[{#[[1]], #[[2]]} &@*Values /@ Normal[ResourceData[
         "Genetic Sequences for the SARS-CoV-2 Coronavirus"][
        Select[And[
           StringContainsQ[#GenBankTitle, "complete genome"], #Host ===
             Entity["Species", "Species:HomoSapiens"]] &], {"Sequence", "GeographicLocation"}]]]]
  ]
Out[15]=
In[16]:=
GeoGraphics[{MapIndexed[
   Splice[{ColorData[60][#2[[1]]], Splice[Polygon /@ #1]}] &,
   {{
Entity["Country", "Argentina"], 
Entity["Country", "China"], 
Entity["Country", "Chile"], 
Entity["Country", "Finland"], 
Entity["Country", "Germany"], 
Entity["Country", "Kenya"], 
Entity["Country", "Nepal"], 
Entity["Country", "NewZealand"], 
Entity["Country", "Spain"], 
Entity["Country", "UnitedStates"]}, {
Entity["Country", "Belize"], 
Entity["Country", "CzechRepublic"], 
Entity["Country", "Gambia"], 
Entity["Country", "Greece"], 
Entity["Country", "HongKong"], 
Entity["Country", "Israel"], 
Entity["Country", "Japan"], 
Entity["Country", "Kazakhstan"], 
Entity["Country", "Lebanon"], 
Entity["Country", "Nigeria"], 
Entity["Country", "Poland"], 
Entity["Country", "Russia"], 
Entity["Country", "Serbia"], 
Entity["Country", "Turkey"], 
Entity["Country", "Zambia"]}, {
Entity["Country", "Bangladesh"], 
Entity["Country", "Colombia"], 
Entity["Country", "Egypt"], 
Entity["Country", "France"], 
Entity["Country", "Georgia"], 
Entity["Country", "Guatemala"], 
Entity["Country", "Iraq"], 
Entity["Country", "Jamaica"], 
Entity["Country", "Mali"], 
Entity["Country", "Morocco"], 
Entity["Country", "Pakistan"], 
Entity["Country", "SouthAfrica"], 
Entity["Country", "SriLanka"], 
Entity["Country", "Taiwan"], 
Entity["Country", "Vietnam"]}, {
Entity["Country", "Benin"], 
Entity["Country", "Italy"], 
Entity["Country", "Jordan"], 
Entity["Country", "Libya"], 
Entity["Country", "Mexico"], 
Entity["Country", "Peru"]}, {
Entity["Country", "Austria"], 
Entity["Country", "Cambodia"], 
Entity["Country", "DominicanRepublic"], 
Entity["Country", "Ghana"], 
Entity["Country", "Guam"], 
Entity["Country", "PuertoRico"], 
Entity["Country", "Togo"], 
Entity["Country", "UnitedKingdom"], 
Entity["Country", "UnitedStatesVirginIslands"], 
Entity["Country", "WestBank"]}, {
Entity["Country", "India"], 
Entity["Country", "Portugal"]},
    {
Entity["Country", "EastTimor"], 
Entity["Country", "Malaysia"]}, Splice[List /@ {
Entity["Country", "Australia"], 
Entity["Country", "Bahrain"], 
Entity["Country", "Belarus"], 
Entity["Country", "Belgium"], 
Entity["Country", "Brazil"], 
Entity["Country", "Canada"], 
Entity["Country", "Denmark"], 
Entity["Country", "Ecuador"], 
Entity["Country", "Ethiopia"], 
Entity["Country", "Iran"], 
Entity["Country", "Malta"], 
Entity["Country", "Myanmar"], 
Entity["Country", "Netherlands"], 
Entity["Country", "NorthernMarianaIslands"], 
Entity["Country", "Philippines"], 
Entity["Country", "SaudiArabia"], 
Entity["Country", "SierraLeone"], 
Entity["Country", "SouthKorea"], 
Entity["Country", "Sweden"], 
Entity["Country", "Thailand"], 
Entity["Country", "Tunisia"], 
Entity["Country", "Uganda"], 
Entity["Country", "Uruguay"], 
Entity["Country", "Uzbekistan"], 
Entity["Country", "Venezuela"]}]}]}]
Out[16]=
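
The two helper functions used for the tree plot can be checked on a few made-up inputs before running them on the full dataset. The sketch below repeats their definitions so that it runs on its own; the toy sequences and location labels are illustrative assumptions, not values taken from the resource:

```wolfram
(* Same definitions as in the example above, repeated so this cell is self-contained *)
dropTrailingA[seq_] := StringReplace[seq, StartOfString ~~ Shortest[a__] ~~ ("A" ..) ~~ EndOfString :> a];
sampleListFirstByMostCommon[lists : {___List}] := DeleteCases[
   Prepend[#[[1, 2 ;; -1]], ReverseSortBy[Tally[First /@ #], Last][[1, 1]]] & /@ GatherBy[lists, Rest], {_, _Missing}];

(* A trailing run of "A" is removed; a string without one is returned unchanged *)
dropTrailingA /@ {"ACGTAAAA", "ACGT"}

(* Hypothetical {sequence, location} rows: one representative row per location survives *)
sampleListFirstByMostCommon[{{"AAC", "US"}, {"AAC", "US"}, {"AAG", "US"}, {"TTT", "FR"}}]
```

The first call should return {"ACGT", "ACGT"}, and the second {{"AAC", "US"}, {"TTT", "FR"}}, since "AAC" is the most frequent sequence among the rows labeled "US".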

A similar visualization can be created for samples where more detailed geographic information is supplied. In this visualization of the most common sequences reported for US states, clusters emerge that contain interesting regional blocks, as shown in the map below:

In[17]:=
Apply[ResourceFunction["PhylogeneticTreePlot"], Transpose[{dropTrailingA@First[#], Last[#]} & /@ sampleListFirstByMostCommon[{#[[1]], #[[2]]} &@*Values /@ Normal[ResourceData[
        "Genetic Sequences for the SARS-CoV-2 Coronavirus"][
       Select[And[
          StringContainsQ[#GenBankTitle, "complete genome"], #Host ===
            Entity["Species", "Species:HomoSapiens"],
          Not[MissingQ[#USState]]] &], {"Sequence", "USState"}]]]]
 ]
Out[17]=
In[18]:=
GeoGraphics[{MapIndexed[
   Splice[{ColorData[3][#2[[1]]], Splice[Polygon /@ #1]}] &, {{
Entity["AdministrativeDivision", {"Arizona", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Arkansas", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Connecticut", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Florida", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Indiana", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Kentucky", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Maryland", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Massachusetts", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Missouri", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Nevada", "UnitedStates"}], 
Entity["AdministrativeDivision", {"NewMexico", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Pennsylvania", "UnitedStates"}], 
Entity["AdministrativeDivision", {"RhodeIsland", "UnitedStates"}], 
Entity["AdministrativeDivision", {"SouthCarolina", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Tennessee", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Texas", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Virginia", "UnitedStates"}], 
Entity["AdministrativeDivision", {"WestVirginia", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Vermont", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Wyoming", "UnitedStates"}]}, {
Entity["AdministrativeDivision", {"Alabama", "UnitedStates"}], 
Entity["AdministrativeDivision", {"California", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Colorado", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Delaware", "UnitedStates"}], 
Entity["AdministrativeDivision", {
      "DistrictOfColumbia", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Georgia", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Idaho", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Iowa", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Kansas", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Louisiana", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Maine", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Michigan", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Minnesota", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Mississippi", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Montana", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Nebraska", "UnitedStates"}], 
Entity["AdministrativeDivision", {"NewHampshire", "UnitedStates"}], 
Entity["AdministrativeDivision", {"NewJersey", "UnitedStates"}], 
Entity["AdministrativeDivision", {"NewYork", "UnitedStates"}], 
Entity["AdministrativeDivision", {"NorthCarolina", "UnitedStates"}], 
Entity["AdministrativeDivision", {"NorthDakota", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Oklahoma", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Oregon", "UnitedStates"}], 
Entity["AdministrativeDivision", {"SouthDakota", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Utah", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Washington", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Wisconsin", "UnitedStates"}]}, {
Entity["AdministrativeDivision", {"Illinois", "UnitedStates"}], 
Entity["AdministrativeDivision", {"Ohio", "UnitedStates"}]}, Splice[List /@ {}]}]}]
Out[18]=

When the most common sequences are compared by month of collection, there are recurring overlaps (most significantly between December 2019 and February 2020), illustrating that the virus has shown not only evolution but also significant continuity. Since then, greater spread has led to further divergence:

In[19]:=
sampleListFirstByMostCommon[lists : {___List}] := DeleteCases[
   Prepend[#[[1, 2 ;; -1]], ReverseSortBy[Tally[First /@ #], Last][[1, 1]]] & /@ GatherBy[lists, Rest], {_, DateObject[_Missing, "Month"]}];
treePlot = Apply[ResourceFunction["PhylogeneticTreePlot"], Transpose[{dropTrailingA@First[#], Row@(Rest@#)} & /@ sampleListFirstByMostCommon[{#[[1]], DateObject[#[[2]], "Month"]} &@*Values /@ Normal[ResourceData[
         "Genetic Sequences for the SARS-CoV-2 Coronavirus"][
        Select[And[
           StringContainsQ[#GenBankTitle, "complete genome"], #Host ===
             Entity["Species", "Species:HomoSapiens"]] &], {"Sequence", "CollectionDate"}]]]]
  ]
Out[20]=

Analysis

Using the provided alignment differences, we can see where along the viral genome changes have been detected over time. While mutations are distributed relatively uniformly, some changes are clearly measured more often than others:

In[21]:=
minDate = ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"][
    Select[StringContainsQ[#GenBankTitle, "complete genome"] &]][Min, "CollectionDate"];
accessionToDiffList = <|
   Rule @@@ Normal[Values /@ ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus",
        "AlignmentDifferences"]]|>;
positionCountPerCollectionDay = Sort[Flatten[
    With[{day = #[[1, 1]]}, Prepend[#, day] & /@ Tally[Flatten[Last /@ #]]] & /@ GatherBy[{Ceiling[
          QuantityMagnitude[
           DateDifference[minDate, #[[1]]]]], (#["Position"]) & /@ accessionToDiffList[#[[2]]]} & /@ Normal[Values /@ ResourceData[
            "Genetic Sequences for the SARS-CoV-2 Coronavirus"][
           Select[And[Not[MissingQ[#CollectionDate]], KeyExistsQ[accessionToDiffList, #Accession]] &]
           ][[All, {"CollectionDate", "Accession"}]]],
      First], 1]];
ListPlot3D[positionCountPerCollectionDay,
 AxesLabel -> {"Days of Reported Data", "Genetic Position", "Changes Counted\nat Position"}, PlotRange -> All
 ]
Out[22]=
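
The nested expression above reduces each sequence to a day offset and a list of changed positions, then tallies the positions seen on each day. A minimal sketch of the same bookkeeping on hand-made records follows. It regroups the steps with GroupBy rather than GatherBy for readability, and the dates, positions and the names toyMinDate, toyRecords, dayOffset and toyTriples are all illustrative assumptions:

```wolfram
(* Hypothetical earliest collection date and {collection date, changed positions} records *)
toyMinDate = DateObject[{2019, 12, 1}];
toyRecords = {
   {DateObject[{2019, 12, 1}], {241, 3037}},
   {DateObject[{2020, 1, 15}], {241}},
   {DateObject[{2020, 1, 15}], {241, 23403}}};

(* Whole days elapsed since the earliest date *)
dayOffset[date_] := Ceiling[QuantityMagnitude[DateDifference[toyMinDate, date, "Day"]]];

(* Pool positions per day, tally them, and emit sorted {day, position, count} triples *)
toyTriples = Sort[Flatten[
    KeyValueMap[
     Function[{day, positions}, Prepend[#, day] & /@ Tally[positions]],
     GroupBy[toyRecords, (dayOffset[First[#]] &) -> Last, Flatten]], 1]]
```

Each triple has the {x, y, z} shape that ListPlot3D accepts, which is why the full example can pass its result straight to the plot.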

It is also possible to treat these genetic differences as lists of features:

In[23]:=
diffToStringFeature[diff_Association] := ToString[diff["Position"]] <> ":" <> diff["Reference"] <> ">" <> diff["Variation"];
diffToStringFeature[diffs : {___Association}] := Map[diffToStringFeature, diffs];
accessionToFeatureList = diffToStringFeature /@ accessionToDiffList;
accessionToFeatureList // Keys // First // accessionToFeatureList
Out[24]=
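
To illustrate the encoding, here is a sketch applied to two hand-written differences. The association keys match those used in the definitions above; the positions and bases are made-up values, toyDiffs is a hypothetical name, and the function is repeated so that the cell runs on its own:

```wolfram
(* Same definitions as above: encode one difference, or a list of them, as strings *)
diffToStringFeature[diff_Association] := ToString[diff["Position"]] <> ":" <> diff["Reference"] <> ">" <> diff["Variation"];
diffToStringFeature[diffs : {___Association}] := Map[diffToStringFeature, diffs];

(* Hypothetical differences against a reference sequence *)
toyDiffs = {
   <|"Position" -> 241, "Reference" -> "C", "Variation" -> "T"|>,
   <|"Position" -> 23403, "Reference" -> "A", "Variation" -> "G"|>};

diffToStringFeature[toyDiffs]
```

This should return {"241:C>T", "23403:A>G"}. Encoding each difference as a single string makes features easy to compare and to use as association keys.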

By doing so, it is possible to perform a fairly wide variety of analyses. Here, we determine all of the genetic differences that always occur together in the sampled sequences, taking advantage of the fact that differences that always occur together must occur in the same number of sequences:

In[25]:=
featureIndex = <|
   Rule[#[[1, 1]], Last /@ #] & /@ GatherBy[
     Flatten[Function[{diffList}, {#, diffList} & /@ diffList] /@ Values[accessionToFeatureList], 1], First]|>;
termCountIndex = <|
   Rule[#[[1, 1]], Last /@ #] & /@ GatherBy[{Length[featureIndex[#]], #} & /@ Keys[featureIndex], First]|>;
doesAlwaysCoOccur[firstTerm_, secondTerm_] := AllTrue[featureIndex[firstTerm], MemberQ[#, secondTerm] &];
findCoOccurPairs[terms_List] := Module[{firstTermResults, remainingTerms = terms, firstTerm, restTerms},
  Flatten[Reap[While[remainingTerms =!= {},
      firstTerm = First[remainingTerms];
      restTerms = Rest[remainingTerms];
      firstTermResults = Rule[firstTerm, #] & /@ Select[restTerms, doesAlwaysCoOccur[firstTerm, #] &];
      Sow[firstTermResults];
      remainingTerms = Complement[restTerms, Last /@ firstTermResults];
      ]][[2, 1]], 1]
  ]
coOccurenceSets =
  WeaklyConnectedComponents[
   Flatten[
    findCoOccurPairs[termCountIndex[#]] & /@ Keys[termCountIndex], 1]];
ResourceFunction["NiceGrid"][
 Row[Sort[#], ", "] & /@ Take[coOccurenceSets, 5], Alignment -> Left]
Out[26]=
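
The idea behind the co-occurrence search can also be seen on a small scale. Two features always occur together exactly when they are carried by the same set of sequences, so grouping features by their carrier set finds the same groups. The sketch below uses that alternative formulation, not the pairwise search above, on a made-up feature table; toyFeatures, carriers and coOccurring are hypothetical names, and the accessions and features are invented for illustration:

```wolfram
(* Hypothetical accession -> feature list table *)
toyFeatures = <|
   "seqA" -> {"241:C>T", "3037:C>T", "23403:A>G"},
   "seqB" -> {"241:C>T", "3037:C>T"},
   "seqC" -> {"23403:A>G", "28881:G>A"}|>;

(* Invert the table: feature -> sorted list of accessions that carry it *)
carriers = GroupBy[
   Flatten[KeyValueMap[Function[{acc, feats}, (# -> acc) & /@ feats], toyFeatures]],
   First -> Last, Sort];

(* Features with identical carrier lists always co-occur; keep groups of two or more *)
coOccurring = Select[Values[GroupBy[Keys[carriers], carriers[#] &]], Length[#] > 1 &]
```

Here only "241:C>T" and "3037:C>T" share a carrier list ({"seqA", "seqB"}), so the result should be {{"241:C>T", "3037:C>T"}}. Comparing only features with equal occurrence counts, as the full example does, is a way to prune candidate pairs before the more expensive pairwise check.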

Wolfram Research, "Genetic Sequences for the SARS-CoV-2 Coronavirus" from the Wolfram Data Repository (2021) https://doi.org/10.24097/wolfram.03304.data

License Information

Public Domain
