Protein sequences of the SARS-CoV-2 virus (the virus associated with the COVID-19 disease, formerly known as 2019-nCoV) including location, collection time and similar supporting data. (This data was imported and made computable at 6 am CST on February 25, 2021.)
Details
This data is imported from the National Center for Biotechnology Information (NCBI) and formatted for computation.
Properties provided with each sequence include: "Accession", "Length", "Authors", "Publications", "GeographicLocation", "DetailedGeographicLocation", "USState", "Host", "Sequence", "CollectionDate", "ReleaseDate", "InclusionDate", "GenBankTitle", "Protein", "SequenceType", "ProteinStatus", "IsolationSource" and "BioSample".
Additional content elements include:
"CollectionHistogram": a DateHistogram of when the sequences were collected
"ReleaseHistogram": a DateHistogram of when the sequences were released to the public
"InclusionHistogram": a DateHistogram of when the sequences were included in the source for this data
"AffectedLocations": a world map showing where these sequences were collected
Most of these protein sequences were collected from humans, but not all:
In[4]:=
Out[4]=
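The original input cell is not preserved here. As a rough illustration of the kind of tally involved, the following Python sketch counts records by their "Host" property. It is not the code from this page: the records and accession strings are made-up placeholders, and only the property names "Accession" and "Host" come from the description above.

```python
from collections import Counter

# Hypothetical stand-in records; the real data has one record per sequence.
records = [
    {"Accession": "EXAMPLE-1", "Host": "Homo sapiens"},
    {"Accession": "EXAMPLE-2", "Host": "Homo sapiens"},
    {"Accession": "EXAMPLE-3", "Host": "Felis catus"},
    {"Accession": "EXAMPLE-4", "Host": "Homo sapiens"},
]

# Count how many sequences were collected from each host species.
host_counts = Counter(record["Host"] for record in records)

# List hosts from most to least common.
for host, count in host_counts.most_common():
    print(host, count)
```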
Scope & Additional Elements
Get a date plot of collection dates:
In[5]:=
Out[5]=
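A date histogram of this kind amounts to binning dates into intervals and counting each bin. The Python sketch below bins a handful of made-up collection dates by calendar month; the dates are placeholders, and monthly bins are an assumption rather than the bin width used in the original plot.

```python
from collections import Counter
from datetime import date

# Hypothetical collection dates standing in for the "CollectionDate" property.
collection_dates = [
    date(2020, 1, 15),
    date(2020, 3, 2),
    date(2020, 3, 20),
    date(2020, 4, 1),
]

# Bin by (year, month) and count the sequences in each bin.
per_month = Counter((d.year, d.month) for d in collection_dates)

# Print a simple text histogram in chronological order.
for (year, month), count in sorted(per_month.items()):
    print(f"{year}-{month:02d} {'#' * count}")
```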
See a date histogram of release dates:
In[6]:=
Out[6]=
See a timeline plot of inclusion dates:
In[7]:=
Out[7]=
Show the locations where the sequences were gathered:
In[8]:=
Out[8]=
Visualizations
Most of the provided protein sequences come from the United States and Australia:
In[9]:=
Out[9]=
When we look at the geographic locations providing protein sequences with the most common title, “surface glycoprotein”, these proportions are largely maintained:
In[10]:=
Out[10]=
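The comparison described here can be sketched as computing each location's share of all records and then its share of the records with a given title. The Python below is an illustration only, not the code from this page: the records are invented, the helper name location_shares is hypothetical, and location values are written as plain strings for simplicity.

```python
from collections import Counter

def location_shares(rows):
    """Return each location's fraction of the given rows."""
    counts = Counter(row["GeographicLocation"] for row in rows)
    total = sum(counts.values())
    return {location: n / total for location, n in counts.items()}

# Hypothetical records using the "GeographicLocation" and "GenBankTitle" properties.
records = [
    {"GeographicLocation": "United States", "GenBankTitle": "surface glycoprotein"},
    {"GeographicLocation": "United States", "GenBankTitle": "surface glycoprotein"},
    {"GeographicLocation": "United States", "GenBankTitle": "nucleocapsid phosphoprotein"},
    {"GeographicLocation": "Australia", "GenBankTitle": "surface glycoprotein"},
    {"GeographicLocation": "Australia", "GenBankTitle": "ORF1ab polyprotein"},
    {"GeographicLocation": "India", "GenBankTitle": "membrane glycoprotein"},
]

# Shares over all records, then over one title only.
overall = location_shares(records)
spike_only = location_shares(
    [r for r in records if r["GenBankTitle"] == "surface glycoprotein"]
)

print(overall)
print(spike_only)
```

Comparing the two dictionaries side by side shows whether the filtered subset keeps roughly the same geographic mix as the full set.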
Most of the provided sequences come from regions in the United States and Australia:
In[11]:=
Out[11]=
When we look at the detailed geographic locations providing protein sequences with the most common title, “surface glycoprotein”, these proportions are again largely maintained:
In[12]:=
Out[12]=
Analysis
Grouping the titles by protein label shows that the same proteins are submitted under a wide variety of names:
In[13]:=
Out[14]=
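Grouping of this kind can be sketched as collecting the distinct "GenBankTitle" values seen for each "Protein" label. In the Python below the records are invented, and the short labels "S" and "N" are assumed purely for illustration; the actual label values in the data may differ.

```python
from collections import defaultdict

# Hypothetical records; label and title values are placeholders.
records = [
    {"Protein": "S", "GenBankTitle": "surface glycoprotein"},
    {"Protein": "S", "GenBankTitle": "spike protein"},
    {"Protein": "S", "GenBankTitle": "surface glycoprotein"},
    {"Protein": "N", "GenBankTitle": "nucleocapsid phosphoprotein"},
    {"Protein": "N", "GenBankTitle": "nucleocapsid protein"},
]

# Map each protein label to the set of distinct titles submitted for it.
titles_by_protein = defaultdict(set)
for record in records:
    titles_by_protein[record["Protein"]].add(record["GenBankTitle"])

for protein, titles in sorted(titles_by_protein.items()):
    print(protein, sorted(titles))
```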
We can plot where these proteins are found along the reference SARS-CoV-2 genome. To locate each protein, we align its reference sequence against the translation of the genome in each possible reading frame and keep the best-scoring alignment:
In[15]:=
Out[22]=
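The frame-by-frame search can be illustrated with a much simpler scoring scheme than a full sequence alignment. The Python sketch below translates a nucleotide string in each of the three forward reading frames, slides the protein along each translation without gaps, and keeps the position with the most matching residues. This ungapped identity count is a stand-in chosen for brevity, not the alignment method used on this page, and the toy genome, the toy protein and the function names translate and locate are all hypothetical.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(nucleotides):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(nucleotides[i:i + 3], "X")
        for i in range(0, len(nucleotides) - 2, 3)
    )

def locate(protein, genome):
    """Return (score, frame, nucleotide_start) of the best ungapped match."""
    best = (-1, None, None)
    for frame in range(3):
        translated = translate(genome[frame:])
        for offset in range(len(translated) - len(protein) + 1):
            # Score is the number of identical residues at this offset.
            score = sum(a == b for a, b in zip(protein, translated[offset:]))
            if score > best[0]:
                best = (score, frame, frame + 3 * offset)
    return best

# Toy genome: two padding bases, then codons for M K V L A, then a stop codon.
genome = "CCATGAAAGTTCTGGCTTAAGG"
score, frame, start = locate("MKVLA", genome)
print(score, frame, start)
```

For real sequences a gapped local alignment with a substitution matrix would tolerate insertions, deletions and conservative substitutions, which this identity count does not.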
Bibliographic Citation
Wolfram Research, "Protein Sequences for the SARS-CoV-2 Coronavirus" from the Wolfram Data Repository (2021)
License Information
Public Domain
Data Resource History
Date Created:
Updated: 25 February 2021
Source Metadata
Title: Severe acute respiratory syndrome coronavirus 2 data hub: Search, retrieve, and analyze SARS-CoV-2 GenBank data.
Creator: National Center for Biotechnology Information