Wolfram Data Repository
Immediate Computable Access to Curated Contributed Data
Protein sequences of the SARS-CoV-2 virus (the virus associated with the COVID-19 disease, formerly known as 2019-nCoV) including location, collection time and similar supporting data
| "LatestData" | a Dataset containing the most recently collected data |
| "CollectionHistogram" | a DateHistogram of when the sequences were collected |
| "ReleaseHistogram" | a DateHistogram of when the sequences were released to the public |
| "AffectedLocations" | a world map showing where these sequences were collected |
| "SubmissionAuthors" | a Dataset containing the accessions for each author list |
Get a Dataset containing rows for the most recently released sequences:
| In[1]:= |
| Out[1]= | ![]() |
Obtain the number of rows for all sequences (the first call for all sequences can take some time):
| In[2]:= |
| Out[2]= |
Return the latest date a sequence was released:
| In[3]:= |
| Out[3]= |
Count proteins by the reported description:
| In[4]:= |
| Out[4]= | ![]() |
Most of these protein sequences were collected from humans, but not all:
| In[5]:= | ![]() |
| Out[5]= | ![]() |
Some of these protein sequences correspond to named variants of interest as designated by the World Health Organization (WHO):
| In[6]:= | ![]() |
| Out[6]= | ![]() |
Get a date histogram of collection dates:
| In[7]:= |
| Out[7]= | ![]() |
See a date histogram of release dates:
| In[8]:= |
| Out[8]= | ![]() |
Show the locations where the sequences were gathered:
| In[9]:= |
| Out[9]= | ![]() |
Obtain which accessions were provided by each submitter:
| In[10]:= |
| Out[10]= | ![]() |
Most of the provided protein sequences come from the United States and Australia:
| In[11]:= | ![]() |
| Out[11]= | ![]() |
When we look at the geographic locations providing protein sequences with the most common title, “surface glycoprotein”, these proportions are largely maintained:
| In[12]:= | ![]() |
| Out[12]= | ![]() |
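The comparison above amounts to computing each location's share of all rows and then recomputing the shares after filtering to a single title. A minimal Python sketch of that calculation follows. The row layout (location, title), the function name `location_shares`, and the toy rows are assumptions made for illustration only; they are not the repository's column names or data.

```python
from collections import Counter


def location_shares(rows, title=None):
    """Return each location's share of rows, optionally restricted to one title.

    `rows` is an iterable of (location, title) pairs (a hypothetical layout).
    """
    selected = [loc for loc, t in rows if title is None or t == title]
    counts = Counter(selected)
    total = sum(counts.values())
    # most_common() orders locations from largest to smallest share
    return {loc: n / total for loc, n in counts.most_common()}


# Toy rows, invented for illustration; not taken from the dataset
toy_rows = [
    ("Location A", "surface glycoprotein"),
    ("Location A", "surface glycoprotein"),
    ("Location A", "nucleocapsid phosphoprotein"),
    ("Location B", "surface glycoprotein"),
]

overall = location_shares(toy_rows)
surface_only = location_shares(toy_rows, title="surface glycoprotein")
```

Comparing `overall` with `surface_only` shows whether the ranking of locations holds once the rows are restricted to one protein title.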
Most of the provided sequences come from regions in the United States and Australia:
| In[13]:= | ![]() |
| Out[13]= | ![]() |
When we look at the detailed geographic locations providing protein sequences with the most common title, “surface glycoprotein”, these proportions are again largely maintained:
| In[14]:= | ![]() |
| Out[14]= | ![]() |
By gathering all of the titles by their protein label, we can see that the same proteins are submitted under a wide variety of names:
| In[15]:= | ![]() |
| Out[16]= | ![]() |
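The grouping step can be illustrated with a short Python sketch that collects the distinct titles seen under each protein label. The (label, title) layout, the function name `titles_by_label`, and the toy records are hypothetical; the dataset's actual field names are not shown on this page.

```python
from collections import defaultdict


def titles_by_label(records):
    """Map each protein label to the sorted list of distinct titles used for it."""
    grouped = defaultdict(set)
    for label, title in records:
        grouped[label].add(title)
    return {label: sorted(titles) for label, titles in grouped.items()}


# Toy records, invented for illustration; not taken from the dataset
toy_records = [
    ("S", "surface glycoprotein"),
    ("S", "spike glycoprotein"),
    ("S", "surface glycoprotein"),
    ("N", "nucleocapsid phosphoprotein"),
    ("N", "nucleocapsid protein"),
]

grouped_titles = titles_by_label(toy_records)
```

Using a set per label removes repeated titles, so the length of each list is the number of distinct names a protein was submitted under.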
We can plot where these proteins are found along the reference SARS-CoV-2 genome. To find the correct alignment, we align each reference protein sequence against the translation of the genome in each possible reading frame and keep the best alignment:
| In[17]:= | ![]() |
| Out[24]= | ![]() |
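The frame-selection idea can be sketched in Python as follows. This is not the notebook's Wolfram Language code: it swaps the scored sequence alignment for a simpler longest exact match (via the standard library's `difflib`), considers only the three forward reading frames, and uses a toy nucleotide string rather than the reference genome. The function names `translate` and `best_frame` are assumptions for illustration.

```python
from difflib import SequenceMatcher
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame):
    """Translate `dna` starting at offset `frame` (0, 1 or 2); unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def best_frame(dna, protein):
    """Return (frame, nt_start, nt_end, match_length) for the best-matching frame.

    "Best" here means the longest exact substring shared between the frame's
    translation and `protein`, a simplification of a scored alignment.
    """
    candidates = []
    for frame in range(3):
        translated = translate(dna, frame)
        matcher = SequenceMatcher(None, translated, protein, autojunk=False)
        m = matcher.find_longest_match(0, len(translated), 0, len(protein))
        nt_start = frame + 3 * m.a  # map amino-acid index back to nucleotides
        candidates.append((frame, nt_start, nt_start + 3 * m.size, m.size))
    return max(candidates, key=lambda c: c[3])


# Toy example: a 13-residue peptide encoded two bases into a short sequence
toy_protein = "MFVFLVLLPLVSS"
toy_dna = "GG" + "ATGTTTGTTTTTCTTGTTTTATTACCACTAGTCTCTAGT" + "TAA"
frame, nt_start, nt_end, length = best_frame(toy_dna, toy_protein)
```

The returned nucleotide start and end positions are what would be plotted along the genome; a real analysis would replace the exact-match score with a proper alignment score so that substitutions and gaps are tolerated.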
Wolfram Research, "Protein Sequences for the SARS-CoV-2 Coronavirus" from the Wolfram Data Repository (2021)
Public Domain