Protein Sequences for the SARS-CoV-2 Coronavirus


Protein sequences of the SARS-CoV-2 virus (the virus associated with COVID-19, formerly known as 2019-nCoV), including location, collection time and similar supporting data.

Details

This data is imported from the National Center for Biotechnology Information (NCBI) and formatted for computation.
Properties provided with each sequence include: "Accession", "Length", "Publications", "GeographicLocation", "DetailedGeographicLocation", "USState", "Host", "Sequence", "CollectionDate", "ReleaseDate", "GenBankTitle", "Protein", "SequenceType", "ProteinStatus", "IsolationSource", "BioSample", "PangoLineage" and "WHONamedVariant".
Additional content elements include:
"LatestData": a Dataset containing the most recently collected data
"CollectionHistogram": a DateHistogram of when the sequences were collected
"ReleaseHistogram": a DateHistogram of when the sequences were released to the public
"AffectedLocations": a world map showing where these sequences were collected
"SubmissionAuthors": a Dataset containing the accessions for each author list
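
As a quick orientation, the following sketch pulls a handful of the properties listed above for the first few rows. It assumes the default content element is a Dataset whose columns are keyed by those property names; the variable name data is only illustrative.

```wolfram
(* load the full dataset; the first call can take some time *)
data = ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus"];

(* first five rows, restricted to a few of the documented properties *)
data[1 ;; 5, {"Accession", "Protein", "Length", "Host", "CollectionDate"}]
```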

Examples

Basic Examples

Get a Dataset containing rows for the most recently released sequences:

In[1]:=
ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus", "LatestData"]
Out[1]=

Obtain the number of rows for all sequences (the first call for all sequences can take some time):

In[2]:=
Length[ResourceData[
  "Protein Sequences for the SARS-CoV-2 Coronavirus"]]
Out[2]=

Return the latest date a sequence was released:

In[3]:=
ResourceData[
  "Protein Sequences for the SARS-CoV-2 Coronavirus"][Max, "ReleaseDate"]
Out[3]=

Count proteins by the reported description:

In[4]:=
ReverseSort[
 ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus"][
  Counts, "Protein"]]
Out[4]=

Most of these protein sequences are collected from humans, but not all:

In[5]:=
ReverseSort[
 ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus"][
   Select[Not[MissingQ[#Host]] &]][Counts, "Host"]]
Out[5]=

Some of these protein sequences correspond to named variants of interest as designated by the World Health Organization (WHO):

In[6]:=
ReverseSort[
 ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus"][
   Select[Not[MissingQ[#WHONamedVariant]] &]][Counts, "WHONamedVariant"]]
Out[6]=
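
To see which Pango lineages sit behind each WHO label, one possible follow-up groups the same filtered rows by "WHONamedVariant" and counts the "PangoLineage" values within each group. This is only a sketch; it assumes the "PangoLineage" column is populated for most of the rows that carry a WHO label.

```wolfram
(* keep rows with a WHO label, then count lineages within each label *)
ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus"][
  Select[Not[MissingQ[#WHONamedVariant]] &]][
 GroupBy["WHONamedVariant"], Counts, "PangoLineage"]
```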

Scope & Additional Elements

Get a date histogram of collection dates:

In[7]:=
ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus", "CollectionHistogram"]
Out[7]=

See a date histogram of release dates:

In[8]:=
ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus", "ReleaseHistogram"]
Out[8]=

Show the locations where the sequences were gathered:

In[9]:=
ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus", "AffectedLocations"]
Out[9]=

Obtain which accessions were provided by each submitter:

In[10]:=
ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus", "SubmissionAuthors"]
Out[10]=

Visualizations

Most of the provided protein sequences come from the United States and Australia:

In[11]:=
BarChart[Take[
  Sort[ResourceData[
      "Protein Sequences for the SARS-CoV-2 Coronavirus"][
     Select[Not[MissingQ[#GeographicLocation]] &]][Counts, "GeographicLocation"]], -20], ChartLabels -> Automatic, BarOrigin -> Left, ImageSize -> 700]
Out[11]=

When we look at the geographic locations providing protein sequences with the most common title, “surface glycoprotein”, these proportions are largely maintained:

In[12]:=
BarChart[Take[
  Sort[ResourceData[
      "Protein Sequences for the SARS-CoV-2 Coronavirus"][
     Select[
      And[Not[MissingQ[#GeographicLocation]], #Protein === "surface glycoprotein"] &]][
    Counts, "GeographicLocation"]], -20],
 ChartLabels -> Automatic, BarOrigin -> Left, ImageSize -> 700]
Out[12]=

Most of the provided sequences come from regions in the United States and Australia:

In[13]:=
BarChart[Take[
  Sort[ResourceData[
      "Protein Sequences for the SARS-CoV-2 Coronavirus"][
     Select[Not[MissingQ[#DetailedGeographicLocation]] &]][Counts, "DetailedGeographicLocation"]], -20], ChartLabels -> Automatic, BarOrigin -> Left, ImageSize -> 700]
Out[13]=

When we look at the detailed geographic locations providing protein sequences with the most common title, “surface glycoprotein”, these proportions are again largely maintained:

In[14]:=
BarChart[Take[
  Sort[ResourceData[
      "Protein Sequences for the SARS-CoV-2 Coronavirus"][
     Select[And[#Protein === "surface glycoprotein", Not[MissingQ[#DetailedGeographicLocation]]] &]][Counts, "DetailedGeographicLocation"]], -20], ChartLabels -> Automatic, BarOrigin -> Left, ImageSize -> 700]
Out[14]=

Analysis

By gathering all of the titles by their protein label, we can see that the same proteins are submitted under a wide variety of names:

In[15]:=
processProteinLabel[
   proteinLabel_String] := (processProteinLabel[proteinLabel] = Module[{prot = proteinLabel},
     If[StringStartsQ[prot, "RecName:"],
      prot = StringCases[prot, StartOfString ~~ "RecName: Full=" ~~ Shortest[fulltitle___] ~~
            ";" | EndOfString :> StringTrim[fulltitle]][[1]]
      ];
     StringTrim@
      StringReplace[StringTrim@ToLowerCase[prot], {
        (* the misspelled keys below match typos found in submitted titles *)
        "membrance" -> "membrane",
        "2019-ncov s2 subunit,2019-ncov s2 subunit" -> "2019-ncov s2 subunit",
        "proteiin" -> "protein",
        ", partial" -> "",
        StringExpression["[", BlankNullSequence[], "]"] -> "",
        "-" -> "", "," -> "",
        StringExpression[WordBoundary, "protein", WordBoundary] -> "",
        "partial" -> "",
        StringExpression["chain ", WordCharacter, WordBoundary] -> ""}]]);
refseqs = DeleteDuplicates@
  ResourceData["Protein Sequences for the SARS-CoV-2 Coronavirus"][
   Select[#SequenceType == "RefSeq" &], {"Protein", "Sequence"}]; 
Dataset[Table[
   <|"NCBI Protein Label" -> s[[1]], "Submitted Names" -> DeleteDuplicatesBy[
      Union[processProteinLabel /@ DeleteMissing[
         Normal@ResourceData[
            "Protein Sequences for the SARS-CoV-2 Coronavirus"][Select[
            Or[#Protein == s[[1]], #Sequence == s[[2]]]
             &], "GenBankTitle"]]],
      ToLowerCase]
    |>
   , {s, Normal[refseqs]}
   ]
  ][ReverseSortBy[Length[#"Submitted Names"] &]]
Out[16]=
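
The normalization idea can be illustrated on a smaller scale. The sketch below applies a reduced subset of the rules from processProteinLabel to two made-up titles (hypothetical examples, not taken from the dataset) that differ only in case, an organism tag, and a ", partial" suffix; normalizeLabel and titles are illustrative names.

```wolfram
(* simplified stand-in for processProteinLabel: lowercase, strip bracketed
   organism tags and ", partial", then trim surrounding whitespace *)
normalizeLabel[label_String] := StringTrim[
   StringReplace[ToLowerCase[label],
    {"[" ~~ Shortest[___] ~~ "]" -> "", ", partial" -> ""}]];

(* hypothetical submitted titles *)
titles = {"Surface Glycoprotein, partial",
   "surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]"};

DeleteDuplicates[normalizeLabel /@ titles]
```

Both titles reduce to "surface glycoprotein", so DeleteDuplicates returns a single entry. The full function adds further rules for the spelling errors and chain suffixes that appear in real submissions.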

We can plot where these proteins are found along the reference SARS-CoV-2 genome. To find each location, we translate the reference genome in each of the three possible reading frames (offsets of 0, 1 and 2 bases), align every protein reference sequence against all three translations, and keep the frame with the best alignment:

In[17]:=
originalReferenceSequence = ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus", "ReferenceBioSequence"];
translateDNASeqWithOffset[seq_BioSequence, offset_] := BioSequenceTranslate[
   BioSequenceModify[StringDrop[seq, offset], "DropIncompleteCodons"]];
shiftTranslations = translateDNASeqWithOffset[originalReferenceSequence, #] & /@ Range[0, 2];
extractAligningRange[alignment : {__}] := Module[{workingAlignment, startCoord, endCoord},
   If[MatchQ[Last[alignment], {_, ""}],
    workingAlignment = Most[alignment],
    workingAlignment = alignment
    ];
   If[MatchQ[First[workingAlignment], {_, ""}],
    startCoord = StringLength[workingAlignment[[1, 1]]] + 1;
    workingAlignment = Rest[workingAlignment],
    startCoord = 1
    ];
   endCoord = Total[StringLength[If[StringQ[#], #, First[#]]] & /@ workingAlignment] + startCoord - 1;
   {startCoord, endCoord}
   ];
localLength[alignment_List] := Max[StringLength[StringJoin[Last /@ Select[alignment, ListQ]]],
   StringLength[
    StringJoin[First /@ Select[alignment[[2 ;; -2]], ListQ]]]];
findAligningRangeFromTranslations[seq_] := Module[{alignments, bestShiftPos, localLengths},
   alignments = SequenceAlignment[#, seq, Method -> "Local"] & /@ shiftTranslations;
   localLengths = localLength /@ alignments;
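   (* localLength measures mismatched residues, so the smallest value marks the best frame *)
   bestShiftPos = First[Ordering[localLengths]];
   If[localLengths[[bestShiftPos]]/StringLength[seq] < 0.1,
    (* residue p in the frame with offset k spans bases 3p - 2 + k through 3p + k *)
    3*extractAligningRange[alignments[[bestShiftPos]]] + {-2, 0} + (bestShiftPos - 1),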
    None
    ]
   ];
alignmentRanges = SortBy[DeleteCases[{#[[1]], findAligningRangeFromTranslations[#[[2]]]} & /@ Values[Normal@refseqs], {_, None}], MinMax[Last[#]] &];
Column[Map[Function[{plotBatch},
   NumberLinePlot[Reverse[Interval /@ plotBatch[[All, 2]]], PlotLegends -> Reverse[plotBatch[[All, 1]]], PlotRange -> {1, 29903}, ImageSize -> 400, AxesLabel -> "base pairs"]
   ],
  Partition[alignmentRanges, UpTo[8]]
  ]]
Out[24]=
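
The frame-shift step can be tried on a short input. The sketch below translates a made-up 19-base DNA string (a hypothetical example, not part of the reference genome) in each of the three forward reading frames, using the same StringDrop, BioSequenceModify and BioSequenceTranslate calls as the code above; toy and frames are illustrative names.

```wolfram
(* hypothetical short DNA sequence with no stop codon in any forward frame *)
toy = BioSequence["DNA", "ATGGCCAAGGGTCCCTTTG"];

(* drop 0, 1, or 2 leading bases, trim any trailing partial codon, then translate *)
frames = Table[
   BioSequenceTranslate[
    BioSequenceModify[StringDrop[toy, k], "DropIncompleteCodons"]],
   {k, 0, 2}]
```

Each of the three results is a different peptide, which is why a protein has to be aligned against all three translations before its position on the genome can be determined.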

Wolfram Research, "Protein Sequences for the SARS-CoV-2 Coronavirus" from the Wolfram Data Repository (2021)  

License Information

Public Domain
