Wolfram Research

Protein Sequences for the SARS-CoV-2 Coronavirus

Protein sequences of the SARS-CoV-2 virus (the virus associated with the COVID-19 disease, formerly known as 2019-nCoV) including location, collection time and similar supporting data. (This data was imported and made computable at 6 am CST on February 25, 2021.)

Details

This data is imported from the National Center for Biotechnology Information (NCBI) and formatted for computation.
Properties provided with each sequence include: "Accession", "Length", "Authors", "Publications", "GeographicLocation", "DetailedGeographicLocation", "USState", "Host", "Sequence", "CollectionDate", "ReleaseDate", "InclusionDate", "GenBankTitle", "Protein", "SequenceType", "ProteinStatus", "IsolationSource" and "BioSample".
Additional content elements include:
"CollectionHistogram": a DateHistogram of when the sequences were collected
"ReleaseHistogram": a DateHistogram of when the sequences were released to the public
"InclusionHistogram": a DateHistogram of when the sequences were included in the source for this data
"AffectedLocations": a world map showing where these sequences were collected

Examples

Basic Examples

Get a Dataset containing rows for each sequence:

In[1]:=
ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus"]
Out[1]=

Return the latest date a sequence was included:

In[2]:=
ResourceData[
  "Protein Sequences for the SARS-CoV-2 Coronavirus"][Max, \
"InclusionDate"]
Out[2]=

Count proteins by the reported description:

In[3]:=
ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus"][
  Counts, "Protein"][ReverseSort]
Out[3]=

Most of these protein sequences were collected from humans, but not all:

In[4]:=
ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus"][
  Select[Not[MissingQ[#Host]] &]][Counts, "Host"]
Out[4]=
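
The same Select-then-Counts pattern should carry over to the other properties listed under Details. As a minimal sketch, assuming the "USState" column holds Missing[...] wherever no state was recorded, the following tallies sequences by state, largest first:

```wolfram
(* Sketch: count sequences per US state, dropping rows without a state. *)
(* Assumption: "USState" is Missing[...] where no state was recorded. *)
ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus"][
   Select[Not[MissingQ[#USState]] &]][Counts, "USState"][ReverseSort]
```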

Scope & Additional Elements

Get a date histogram of collection dates:

In[5]:=
ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus", \
"CollectionHistogram"]
Out[5]=

See a date histogram of release dates:

In[6]:=
ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus", \
"ReleaseHistogram"]
Out[6]=

See a date histogram of inclusion dates:

In[7]:=
ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus", \
"InclusionHistogram"]
Out[7]=

Show the locations where the sequences were gathered:

In[8]:=
ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus", \
"AffectedLocations"]
Out[8]=

Visualizations

Most of the provided protein sequences come from the United States and Australia:

In[9]:=
BarChart[Take[
  Sort[ResourceData[
      "Protein Sequences for the SARS-CoV-2 Coronavirus"][
     Select[Not[MissingQ[#GeographicLocation]] &]][
    Counts, "GeographicLocation"]], -20],
 ChartLabels -> Automatic, BarOrigin -> Left, ImageSize -> 700]
Out[9]=

When we look at the geographic locations providing protein sequences with the most common title, “surface glycoprotein”, these proportions are largely maintained:

In[10]:=
BarChart[Take[
  Sort[ResourceData[
      "Protein Sequences for the SARS-CoV-2 Coronavirus"][
     Select[
      And[Not[MissingQ[#GeographicLocation]], #Protein === "surface glycoprotein"] &]][
    Counts, "GeographicLocation"]], -20],
 ChartLabels -> Automatic, BarOrigin -> Left, ImageSize -> 700]
Out[10]=

Most of the provided sequences come from regions in the United States and Australia:

In[11]:=
BarChart[Take[
  Sort[ResourceData[
      "Protein Sequences for the SARS-CoV-2 Coronavirus"][
     Select[Not[MissingQ[#DetailedGeographicLocation]] &]][
    Counts, "DetailedGeographicLocation"]], -20],
 ChartLabels -> Automatic, BarOrigin -> Left, ImageSize -> 700]
Out[11]=

When we look at the detailed geographic locations providing protein sequences with the most common title, “surface glycoprotein”, these proportions are again largely maintained:

In[12]:=
BarChart[Take[
  Sort[ResourceData[
      "Protein Sequences for the SARS-CoV-2 Coronavirus"][
     Select[
      And[#Protein === "surface glycoprotein",
       Not[MissingQ[#DetailedGeographicLocation]]] &]][
    Counts, "DetailedGeographicLocation"]], -20],
 ChartLabels -> Automatic, BarOrigin -> Left, ImageSize -> 700]
Out[12]=
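
The four charts in this section repeat one pattern: drop rows with a missing value, count by a column, keep the 20 largest counts and draw a horizontal bar chart. One way to factor that out is sketched below. The helper name topCountsChart and the variable proteinData are hypothetical and not part of the resource, and TakeLargest is swapped in for the Sort and Take combination used above:

```wolfram
(* Hypothetical helper, not part of the original resource. *)
(* Charts the n most frequent non-missing values of the column named key. *)
topCountsChart[ds_Dataset, key_String, n_Integer : 20] := BarChart[
   (* TakeLargest returns descending counts; Reverse restores ascending order *)
   Reverse[TakeLargest[
     Normal[ds[Select[Not[MissingQ[#[key]]] &]][Counts, key]], UpTo[n]]],
   ChartLabels -> Automatic, BarOrigin -> Left, ImageSize -> 700];

proteinData = 
  ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus"];

(* Counts by country or region *)
topCountsChart[proteinData, "GeographicLocation"]

(* Counts by detailed location, restricted to one protein *)
topCountsChart[
 proteinData[Select[#Protein === "surface glycoprotein" &]],
 "DetailedGeographicLocation"]
```

Loading the dataset once into a variable also avoids calling ResourceData again for every chart.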

Analysis

By grouping all of the submitted titles by their protein label, we can see that the same proteins are submitted under a wide variety of names:

In[13]:=
processProteinLabel[proteinLabel_String] := Module[{prot = proteinLabel},
   (* For "RecName:" titles, keep only the full recommended name *)
   If[StringStartsQ[prot, "RecName:"],
    prot = StringCases[prot,
       StartOfString ~~ "RecName: Full=" ~~ Shortest[fulltitle___] ~~
         ";" | EndOfString :> StringTrim[fulltitle]][[1]]
    ];
   (* Lowercase the title, fix known typos and strip qualifiers *)
   StringTrim@
    StringReplace[StringTrim@ToLowerCase[prot], {
      "membrance" -> "membrane",
      "2019-ncov s2 subunit,2019-ncov s2 subunit" -> "2019-ncov s2 subunit",
      "proteiin" -> "protein",
      ", partial" -> "",
      "[" ~~ ___ ~~ "]" -> "",
      "-" -> "",
      "," -> "",
      WordBoundary ~~ "protein" ~~ WordBoundary -> "",
      "partial" -> "",
      "chain " ~~ WordCharacter ~~ WordBoundary -> ""}]];
refseqs = DeleteDuplicates@
  ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus"][
   Select[#SequenceType == "RefSeq" &], {"Protein", "Sequence"}]; 
Dataset[Table[
   <|"NCBI Protein Label" -> s[[1]], "Submitted Names" -> DeleteDuplicatesBy[
      Union[processProteinLabel /@ DeleteMissing[
         Normal@ResourceData[
            "Protein Sequences for the SARS-CoV-2 Coronavirus"][Select[
            Or[#Protein == s[[1]], #Sequence == s[[2]]]
             &], "GenBankTitle"]]],
      ToLowerCase]
    |>
   , {s, Normal[refseqs]}
   ]
  ][ReverseSortBy[Length[#"Submitted Names"] &]]
Out[14]=
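
To see how the label cleanup collapses variant spellings, here is a toy illustration that applies three of the rules from processProteinLabel to a made-up title. The input string is hypothetical and not taken from the data; it should reduce to "membrane":

```wolfram
(* Toy illustration with a subset of the rules; the input title is made up. *)
StringTrim@StringReplace[ToLowerCase["Membrance Protein, partial"], {
    "membrance" -> "membrane",      (* fix a misspelling *)
    ", partial" -> "",              (* drop the "partial" qualifier *)
    WordBoundary ~~ "protein" ~~ WordBoundary -> ""}]  (* drop the generic word *)
```

Because StringReplace tries the rules in order at each position, the typo fixes are listed before the more general deletions.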

We can plot where these proteins are found along the reference SARS-CoV-2 genome. To properly find an alignment, we align the protein reference sequences with the translation of each potential frame shift and choose the best alignment:

In[15]:=
originalReferenceSequence = BioSequence["DNA", Normal[ResourceData[
        "Genetic Sequences for the SARS-CoV-2 Coronavirus"][
       Select[#Accession === "NC_045512" &]]][[1]]["Sequence"]];
translateDNASeqWithOffset[seq_BioSequence, offset_] := BioSequenceTranslate[
   BioSequenceModify[StringDrop[seq, offset], "DropIncompleteCodons"]];
shiftTranslations = translateDNASeqWithOffset[originalReferenceSequence, #] & /@ Range[0, 2];
extractAligningRange[alignment : {__}] := Module[{workingAlignment, startCoord, endCoord},
   If[MatchQ[Last[alignment], {_, ""}],
    workingAlignment = Most[alignment],
    workingAlignment = alignment
    ];
   If[MatchQ[First[workingAlignment], {_, ""}],
    startCoord = StringLength[workingAlignment[[1, 1]]] + 1;
    workingAlignment = Rest[workingAlignment],
    startCoord = 1
    ];
   endCoord = Total[StringLength[If[StringQ[#], #, First[#]]] & /@ workingAlignment] + startCoord - 1;
   {startCoord, endCoord}
   ];
localLength[alignment_List] := Max[StringLength[StringJoin[Last /@ Select[alignment, ListQ]]],
   StringLength[
    StringJoin[First /@ Select[alignment[[2 ;; -2]], ListQ]]]];
findAligningRangeFromTranslations[seq_] := Module[{alignments, bestShiftPos, localLengths},
   alignments = SequenceAlignment[#, seq, Method -> "Local"] & /@ shiftTranslations;
   localLengths = localLength /@ alignments;
   bestShiftPos = Last[Ordering[localLengths]];
   If[localLengths[[bestShiftPos]]/StringLength[seq] > 0.1,
    ((extractAligningRange[alignments[[bestShiftPos]]] + bestShiftPos - 1)*3) - 2,
    None
    ]
   ];
alignmentRanges = SortBy[
   DeleteCases[
    {#[[1]], findAligningRangeFromTranslations[#[[2]]]} & /@
     Values[Normal@refseqs], {_, None}],
   MinMax[Last[#]] &];
Column[Map[Function[{plotBatch},
   NumberLinePlot[Reverse[Interval /@ plotBatch[[All, 2]]],
    PlotLegends -> Reverse[plotBatch[[All, 1]]],
    PlotRange -> {1, 29903}, ImageSize -> 400,
    AxesLabel -> "base pairs"]
   ],
  Partition[alignmentRanges, UpTo[8]]
  ]]
Out[22]=
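
The frame-selection idea can be sketched on toy data. This is an illustration under stated assumptions, not the method used above: the strings are made up, and SmithWatermanSimilarity is swapped in as the score in place of the localLength helper. The reference peptide occurs verbatim in the second candidate, so the second frame should be selected:

```wolfram
(* Toy data, made up for illustration: three candidate frame translations *)
(* and a reference peptide that occurs verbatim in the second one. *)
toyFrames = {"LKRPTSSWQ", "MKTAYIAKQRQ", "ENGLHRKAT"};
toyReference = "KTAYIAKQ";

(* Score each frame with a local (Smith-Waterman) similarity. *)
toyScores = SmithWatermanSimilarity[#, toyReference] & /@ toyFrames;

(* Ordering sorts ascending, so the last position holds the best frame. *)
toyBestFrame = Last[Ordering[toyScores]]
```

A frame whose best score stays small relative to the length of the reference would be treated as having no usable alignment.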

Wolfram Research, "Protein Sequences for the SARS-CoV-2 Coronavirus" from the Wolfram Data Repository (2021) 

License Information

Public Domain
