Wolfram Research

Protein Sequences for the SARS-CoV-2 Coronavirus

Protein sequences of the SARS-CoV-2 virus (the virus associated with the COVID-19 disease, formerly known as 2019-nCoV) including location, collection time and similar supporting data. (This data was imported and made computable at 3 pm CDT on July 2, 2020.)

Details

This data is imported from the National Center for Biotechnology Information (NCBI) and formatted for computation.
Properties provided with each sequence include: "Accession", "Length", "Authors", "Publications", "GeographicLocation", "DetailedGeographicLocation", "Host", "Sequence", "CollectionDate", "ReleaseDate", "InclusionDate", "GenBankTitle", "ProcessedTitle", "SequenceType", "ProteinStatus", "IsolationSource" and "BioSample".
Additional content elements include:
"CollectionHistogram": a DateHistogram of when the sequences were collected
"ReleaseHistogram": a DateHistogram of when the sequences were released to the public
"InclusionHistogram": a DateHistogram of when the sequences were included in the source for this data
"AffectedLocations": a world map showing where these sequences were collected

Examples

Basic Examples

Get a Dataset containing rows for each sequence:

In[1]:=
ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus"]
Out[1]=

Return the latest date a sequence was included:

In[2]:=
ResourceData[
  "Protein Sequences for the SARS-CoV-2 Coronavirus"][Max, \
"InclusionDate"]
Out[2]=

Count proteins by the reported description:

In[3]:=
ReverseSort[
 ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus"][
  Counts, "ProcessedTitle"]]
Out[3]=

Most of these protein sequences were collected from humans, but not all:

In[4]:=
ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus"][
  Select[Not[MissingQ[#Host]] &]][Counts, "Host"]
Out[4]=
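
The Select-then-Counts query used above can be tried offline on a small hand-built Dataset. This is only a minimal sketch: the accession strings and host values below are invented for illustration and are not taken from the resource.

```wolfram
(* Toy rows; "A1" to "A4" and the hosts are hypothetical placeholders *)
toyHosts = Dataset[{
    <|"Accession" -> "A1", "Host" -> "Homo sapiens"|>,
    <|"Accession" -> "A2", "Host" -> Missing["NotAvailable"]|>,
    <|"Accession" -> "A3", "Host" -> "Felis catus"|>,
    <|"Accession" -> "A4", "Host" -> "Homo sapiens"|>
    }];

(* Drop rows with a missing host, then count the remaining host values *)
toyHostCounts = toyHosts[Select[Not[MissingQ[#Host]] &]][Counts, "Host"]
```

The row with a missing host is removed before counting, so the toy result contains a count of 2 for "Homo sapiens" and 1 for "Felis catus".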

Scope & Additional Elements

Get a date histogram of collection dates:

In[5]:=
ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus", \
"CollectionHistogram"]
Out[5]=

See a date histogram of release dates:

In[6]:=
ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus", \
"ReleaseHistogram"]
Out[6]=

See a date histogram of inclusion dates:

In[7]:=
ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus", \
"InclusionHistogram"]
Out[7]=

Show the locations where the sequences were gathered:

In[8]:=
ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus", \
"AffectedLocations"]
Out[8]=

Visualizations

Most of the provided protein sequences come from the United States and Australia:

In[9]:=
BarChart[Take[
  Sort[ResourceData[
      "Protein Sequences for the SARS-CoV-2 Coronavirus"][
     Select[Not[MissingQ[#GeographicLocation]] &]][Counts, "GeographicLocation"]], -20], ChartLabels -> Automatic, BarOrigin -> Left, ImageSize -> 700]
Out[9]=

When we look at the geographic locations providing protein sequences with the most common title, "nucleocapsid phosphoprotein", these proportions are largely maintained:

In[10]:=
BarChart[Take[
  Sort[ResourceData[
      "Protein Sequences for the SARS-CoV-2 Coronavirus"][
     Select[
      And[Not[MissingQ[#GeographicLocation]], #ProcessedTitle === "nucleocapsid phosphoprotein"] &]][
    Counts, "GeographicLocation"]], -20],
 ChartLabels -> Automatic, BarOrigin -> Left, ImageSize -> 700]
Out[10]=

Most of the provided sequences come from Victoria, Australia, and from states of the United States:

In[11]:=
BarChart[Take[
  Sort[ResourceData[
      "Protein Sequences for the SARS-CoV-2 Coronavirus"][
     Select[Not[MissingQ[#DetailedGeographicLocation]] &]][Counts, "DetailedGeographicLocation"]], -20], ChartLabels -> Automatic, BarOrigin -> Left, ImageSize -> 700]
Out[11]=

When we look at the detailed geographic locations providing protein sequences with the most common title, "nucleocapsid phosphoprotein", these proportions are again largely maintained:

In[12]:=
BarChart[Take[
  Sort[ResourceData[
      "Protein Sequences for the SARS-CoV-2 Coronavirus"][
     Select[And[#ProcessedTitle === "nucleocapsid phosphoprotein", Not[MissingQ[#DetailedGeographicLocation]]] &]][Counts, "DetailedGeographicLocation"]], -20], ChartLabels -> Automatic, BarOrigin -> Left, ImageSize -> 700]
Out[12]=
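
The charts in this section keep the 20 largest counts with Take[Sort[...], -20], which assumes that at least 20 distinct locations survive the filter. For smaller subsets, one possible alternative is TakeLargest with UpTo, sketched here on invented counts (the numbers are placeholders, not values from the resource):

```wolfram
(* Invented location counts, for illustration only *)
toyCounts = <|"USA" -> 120, "Australia" -> 80, "India" -> 15, "China" -> 9|>;

(* UpTo[20] returns every entry when fewer than 20 exist, largest first *)
topCounts = TakeLargest[toyCounts, UpTo[20]];

(* Reverse to ascending order, matching the Sort-based ordering of the charts above *)
BarChart[Reverse[topCounts], ChartLabels -> Automatic, BarOrigin -> Left]
```

Only four locations exist in the toy data, so all four are charted instead of producing an error.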

Analysis

In this data, different submitting parties may title the same protein differently. By gathering all of the titles that share a sequence, we can see that "spike glycoprotein" and "surface glycoprotein" are sometimes used to label the same protein, while the 'structural' designation is applied to matrix, envelope, and nucleocapsid proteins:

In[13]:=
Select[Union[Last /@ #] & /@ GatherBy[ResourceData[
      "Protein Sequences for the SARS-CoV-2 Coronavirus"][[
     All, {"Sequence", "ProcessedTitle"}]][
    Select[And[Not[StringStartsQ[#ProcessedTitle, "Chain"]], Not[StringContainsQ[#ProcessedTitle, "partial"]], Not[StringContainsQ[#ProcessedTitle, "truncated"]]] &]], First], Length[#] > 1 &]
Out[13]=
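
The gathering step can be followed more easily on a handful of rows. The sketch below applies the same GatherBy and Union pattern to invented {sequence, title} pairs; the short sequences are placeholders, not real protein sequences.

```wolfram
(* Toy {sequence, title} pairs; the sequences are invented for illustration *)
toyPairs = {
   {"MFVFL", "surface glycoprotein"},
   {"MFVFL", "spike glycoprotein"},
   {"MSDNG", "nucleocapsid phosphoprotein"},
   {"MFVFL", "surface glycoprotein"}
   };

(* Group the pairs by sequence, then collect the distinct titles of each group *)
toyTitlesBySequence = Union[Last /@ #] & /@ GatherBy[toyPairs, First];

(* Keep only the sequences that carry more than one title *)
toySharedTitles = Select[toyTitlesBySequence, Length[#] > 1 &]
```

Only the first toy sequence appears under two different titles, so the result is a single group holding "spike glycoprotein" and "surface glycoprotein".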

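We can find unique candidate proteins by first eliminating partial, truncated and chain sequences, then taking the most common sequence for each name, and finally grouping the names that share a sequence, ordered from the most to the least common name: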

In[14]:=
mostCommonNameForMostCommonSequence = Dataset[<|"Most Common Name" -> First[#[[1]]], "Other Names" -> Rest[#[[1]]], "Most Common Sequence" -> #[[
       2]]|> & /@ ({Last /@ ReverseSortBy[#, #[[2]] &], #[[1, 1]]} & /@
      GatherBy[{Splice@
          First[ReverseSortBy[Tally[Last /@ #], Last]], #[[1, 1]]} & /@
        GatherBy[
        Normal[
         Values /@ ResourceData[
             "Protein Sequences for the SARS-CoV-2 Coronavirus"][
            Select[Not[
               StringContainsQ[#ProcessedTitle, "partial" | "Chain" | "truncated"]] &]
            ][[All, {"ProcessedTitle", "Sequence"}]]],
        First
        ], First])]
Out[14]=
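
The two-stage grouping can also be written with GroupBy, which some readers may find easier to trace. This is an alternative sketch on invented rows, not the code used to build the dataset above; the names toyRows, toyPerTitle and toyNamesBySequence are hypothetical.

```wolfram
(* Toy {title, sequence} rows; the sequences are invented for illustration *)
toyRows = {
   {"surface glycoprotein", "MFVFL"},
   {"surface glycoprotein", "MFVFL"},
   {"surface glycoprotein", "MFVFV"},
   {"spike glycoprotein", "MFVFL"},
   {"envelope protein", "MYSFV"}
   };

(* Stage 1: for each title, find its most common sequence and how often it occurs *)
toyPerTitle = KeyValueMap[
   Function[{title, rows},
    With[{top = First[ReverseSortBy[Tally[Last /@ rows], Last]]},
     <|"Title" -> title, "Sequence" -> First[top], "Count" -> Last[top]|>]],
   GroupBy[toyRows, First]];

(* Stage 2: group titles by that sequence, listing the most frequent title first *)
toyNamesBySequence = GroupBy[toyPerTitle, #Sequence &,
  Function[group, #Title & /@ ReverseSortBy[group, #Count &]]]
```

In the toy data, "surface glycoprotein" and "spike glycoprotein" share the same most common sequence, and "surface glycoprotein" is listed first because it occurs with that sequence more often.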

Given these common sequences, we can plot where they are found along the reference SARS-CoV-2 genome. To find an alignment, we translate the reference genome in each of the three forward reading frames (offsets of 0, 1 and 2 bases), align each protein sequence against every translation and keep the best alignment:

In[15]:=
codonTranslations = <|
   "TTT" -> "F", "TTC" -> "F", "TTA" -> "L", "TTG" -> "L",
   "TCT" -> "S", "TCC" -> "S", "TCA" -> "S", "TCG" -> "S",
   "TAT" -> "Y", "TAC" -> "Y", "TAA" -> "*", "TAG" -> "*",
   "TGT" -> "C", "TGC" -> "C", "TGA" -> "*", "TGG" -> "W",
   "CTT" -> "L", "CTC" -> "L", "CTA" -> "L", "CTG" -> "L",
   "CCT" -> "P", "CCC" -> "P", "CCA" -> "P", "CCG" -> "P",
   "CAT" -> "H", "CAC" -> "H", "CAA" -> "Q", "CAG" -> "Q",
   "CGT" -> "R", "CGC" -> "R", "CGA" -> "R", "CGG" -> "R",
   "ATT" -> "I", "ATC" -> "I", "ATA" -> "I", "ATG" -> "M",
   "ACT" -> "T", "ACC" -> "T", "ACA" -> "T", "ACG" -> "T",
   "AAT" -> "N", "AAC" -> "N", "AAA" -> "K", "AAG" -> "K",
   "AGT" -> "S", "AGC" -> "S", "AGA" -> "R", "AGG" -> "R",
   "GTT" -> "V", "GTC" -> "V", "GTA" -> "V", "GTG" -> "V",
   "GCT" -> "A", "GCC" -> "A", "GCA" -> "A", "GCG" -> "A",
   "GAT" -> "D", "GAC" -> "D", "GAA" -> "E", "GAG" -> "E",
   "GGT" -> "G", "GGC" -> "G", "GGA" -> "G", "GGG" -> "G"|>;
commonNameSequencePairs = {#[[1]], #[[3]]} & /@ Normal[Values /@ mostCommonNameForMostCommonSequence];
originalReferenceSequence = Normal[ResourceData[
       "Genetic Sequences for the SARS-CoV-2 Coronavirus"][
      Select[#Accession === "NC_045512" &]]][[1]]["Sequence"];
translateDNASeqWithOffset[seq_, offset_] := StringJoin[
   codonTranslations /@ StringPartition[StringDrop[seq, offset], 3]];
shiftTranslations = translateDNASeqWithOffset[originalReferenceSequence, #] & /@ Range[0, 2];
extractAligningRange[alignment : {__}] := Module[{workingAlignment, startCoord, endCoord},
   If[MatchQ[Last[alignment], {_, ""}],
    workingAlignment = Most[alignment],
    workingAlignment = alignment
    ];
   If[MatchQ[First[workingAlignment], {_, ""}],
    startCoord = StringLength[workingAlignment[[1, 1]]] + 1;
    workingAlignment = Rest[workingAlignment],
    startCoord = 1
    ];
   endCoord = Total[
      StringLength[If[StringQ[#], #, First[#]]] & /@ workingAlignment] + startCoord - 1;
   {startCoord, endCoord}
   ];
localLength[alignment_List] := Max[StringLength[StringJoin[Last /@ Select[alignment, ListQ]]],
   StringLength[
    StringJoin[First /@ Select[alignment[[2 ;; -2]], ListQ]]]];
findAligningRangeFromTranslations[seq_] := Module[{alignments, bestShiftPos, localLengths},
   alignments = SequenceAlignment[#, seq, Method -> "Local"] & /@ shiftTranslations;
   localLengths = localLength /@ alignments;
   bestShiftPos = First[Ordering[localLengths]];
   If[localLengths[[bestShiftPos]]/StringLength[seq] < 0.1,
    ((extractAligningRange[alignments[[bestShiftPos]]] + bestShiftPos - 1)*3) - 2,
    None
    ]
   ];
alignmentRanges = SortBy[DeleteCases[{#[[1]], findAligningRangeFromTranslations[#[[2]]]} & /@ commonNameSequencePairs, {_, None}], MinMax[Last[#]] &];
Column[Map[Function[{plotBatch},
   NumberLinePlot[Reverse[Interval /@ plotBatch[[All, 2]]], PlotLegends -> Reverse[plotBatch[[All, 1]]], PlotRange -> {1, 29903}, ImageSize -> 400, AxesLabel -> "base pairs"]
   ],
  Partition[alignmentRanges, UpTo[8]]
  ]]
Out[23]=
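
The idea of translating every reading frame and keeping the one that matches can be seen on a toy example. The sketch below is a simplification under stated assumptions: it uses an invented 13-base sequence, builds the standard codon table programmatically, and selects the frame by exact substring match instead of SequenceAlignment. All names starting with toy are hypothetical.

```wolfram
(* Standard genetic code in the conventional T, C, A, G base order *)
toyBases = {"T", "C", "A", "G"};
toyAminoAcids = Characters[
   "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"];
toyCodonTable = AssociationThread[StringJoin /@ Tuples[toyBases, 3], toyAminoAcids];

(* Translate one reading frame: drop `offset` bases, then read whole codons *)
toyTranslate[seq_String, offset_Integer] :=
  StringJoin[Lookup[toyCodonTable, StringPartition[StringDrop[seq, offset], 3]]];

(* Invented sequence; the codons ATG TTT GTT TAA begin at the second base *)
toySequence = "AATGTTTGTTTAA";
toyFrames = toyTranslate[toySequence, #] & /@ Range[0, 2]

(* Offsets of the frames whose translation contains the toy peptide "MFV" *)
toyMatchingOffsets = Pick[Range[0, 2], StringContainsQ[toyFrames, "MFV"]]
```

The three translations are "NVCL", "MFV*" and "CLF", so only the frame with an offset of 1 contains the peptide. Real protein sequences rarely match a translation exactly, which is why the code above relies on a local alignment and a mismatch threshold instead.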

Wolfram Research, "Protein Sequences for the SARS-CoV-2 Coronavirus" from the Wolfram Data Repository (2020) 

License Information

Public Domain
