Wolfram Data Repository
Immediate Computable Access to Curated Contributed Data
Protein sequences of the SARS-CoV-2 virus (formerly known as 2019-nCoV, the virus associated with the COVID-19 disease), including location, collection time and similar supporting data
"LatestData" | a Dataset containing the most recently collected data |
"CollectionHistogram" | a DateHistogram of when the sequences were collected |
"ReleaseHistogram" | a DateHistogram of when the sequences were released to the public |
"AffectedLocations" | a world map showing where these sequences were collected |
"SubmissionAuthors" | a Dataset containing the accessions for each author list |
Get a Dataset containing rows for the most recently released sequences:
In[1]:= |
Out[1]= |
Obtain the number of rows for all sequences (the first call for all sequences can take some time):
In[2]:= |
Out[2]= |
Return the latest date a sequence was released:
In[3]:= |
Out[3]= |
Count proteins by the reported description:
In[4]:= |
Out[4]= |
Most of these protein sequences are collected from humans, but not all:
In[5]:= |
Out[5]= |
Some of these protein sequences correspond to named variants of interest as designated by the World Health Organization (WHO):
In[6]:= |
Out[6]= |
Get a date histogram of collection dates:
In[7]:= |
Out[7]= |
See a date histogram of release dates:
In[8]:= |
Out[8]= |
Show the locations where the sequences were gathered:
In[9]:= |
Out[9]= |
Obtain which accessions were provided by each submitter:
In[10]:= |
Out[10]= |
Most of the provided protein sequences come from the United States and Australia:
In[11]:= |
Out[11]= |
When we look at the geographic locations providing protein sequences with the most common title, “surface glycoprotein”, these proportions are largely maintained:
In[12]:= |
Out[12]= |
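The comparison of proportions described above can be sketched outside the Wolfram Language as follows. This is a minimal Python illustration, not the code from the cells on this page; the record fields `location` and `title` and the sample rows are hypothetical stand-ins for the dataset's actual columns.

```python
from collections import Counter

def location_shares(records, title=None):
    """Return the fraction of records per location, optionally for one title only.

    `records` is a list of dicts with hypothetical keys "location" and "title".
    """
    if title is not None:
        records = [r for r in records if r["title"] == title]
    counts = Counter(r["location"] for r in records)
    total = sum(counts.values())
    # An empty selection yields an empty mapping rather than a division by zero.
    return {loc: n / total for loc, n in counts.items()} if total else {}

# Made-up rows, for illustration only.
sample = [
    {"location": "United States", "title": "surface glycoprotein"},
    {"location": "United States", "title": "surface glycoprotein"},
    {"location": "Australia", "title": "surface glycoprotein"},
    {"location": "United States", "title": "nucleocapsid phosphoprotein"},
]

overall = location_shares(sample)
surface_only = location_shares(sample, title="surface glycoprotein")
```

Comparing `overall` with `surface_only` location by location shows whether the shares stay roughly the same after filtering to a single title.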
Most of the provided sequences come from regions in the United States and Australia:
In[13]:= |
Out[13]= |
When we look at the detailed geographic locations providing protein sequences with the most common title, “surface glycoprotein”, these proportions are again largely maintained:
In[14]:= |
Out[14]= |
By gathering all of the titles by their protein label, we can see that the same proteins are submitted under a wide variety of names:
In[15]:= |
Out[16]= |
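The grouping step can be sketched as below. This is an illustrative Python version under stated assumptions, not the code from the cells on this page: the keys `label` and `title` are hypothetical field names, and the sample rows are made up to show how one protein label can collect several submitted titles.

```python
from collections import defaultdict

def titles_by_label(records):
    """Collect the distinct titles submitted under each protein label.

    `records` is a list of dicts with hypothetical keys "label" and "title".
    Titles are lower-cased and stripped so trivial variants collapse together.
    """
    groups = defaultdict(set)
    for rec in records:
        groups[rec["label"]].add(rec["title"].strip().lower())
    return {label: sorted(titles) for label, titles in groups.items()}

# Made-up rows, for illustration only.
sample_titles = [
    {"label": "S", "title": "surface glycoprotein"},
    {"label": "S", "title": "Surface glycoprotein "},
    {"label": "S", "title": "spike glycoprotein"},
    {"label": "N", "title": "nucleocapsid phosphoprotein"},
    {"label": "N", "title": "nucleocapsid protein"},
]

grouped = titles_by_label(sample_titles)
```

Labels whose list holds more than one entry are the ones submitted under several names.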
We can plot where these proteins are found along the reference SARS-CoV-2 genome. To properly find an alignment, we align the protein reference sequences with the translation of each potential frame shift and choose the best alignment:
In[17]:= |
Out[24]= |
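The frame-selection idea can be sketched as follows. This is a hedged Python illustration, not the code from the cells on this page. It assumes only the three forward reading frames are tried, uses a simple Smith-Waterman local alignment with arbitrary match, mismatch and gap scores in place of whatever alignment function the original uses, and the function names are invented for this example.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna, frame):
    """Translate `dna` starting at offset `frame` (0, 1 or 2); unknown codons become X."""
    dna = dna.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def local_alignment(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman score of the best local alignment, plus its end index in `a`."""
    best_score, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            if score > best_score:
                best_score, best_end = score, i
        prev = cur
    return best_score, best_end

def best_frame(genome, protein):
    """Align `protein` against each forward frame and keep the highest-scoring one."""
    results = []
    for frame in range(3):
        score, end = local_alignment(translate(genome, frame), protein)
        # Convert the end residue back to a 0-based, exclusive nucleotide position.
        results.append((score, frame, frame + 3 * end))
    score, frame, nt_end = max(results)
    return {"frame": frame, "score": score, "nt_end": nt_end}
```

For a toy genome such as `"CC" + "ATGAAAACTGCTTATATTGCAAAGCAACGT" + "TAAGG"`, the protein `"MKTAYIAKQR"` only matches in full when translation starts at offset 2, so that frame scores highest. On a genome-sized input the quadratic alignment is slow, so a dedicated alignment library would be the practical choice.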
Wolfram Research, "Protein Sequences for the SARS-CoV-2 Coronavirus" from the Wolfram Data Repository (2021)
Public Domain