Basic Examples
Get a Dataset containing rows for the most recent sequences:
Get a Dataset containing rows for all sequences (this can take considerable time to download and expand):
Return the latest date a sequence was released:
Count the sequences by length, which corresponds closely to the portion of the viral genome that was sequenced:
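As a language-neutral illustration of the length tally described above, here is a minimal Python sketch. The `records` list, its `"Accession"` and `"Sequence"` keys, and the toy values are hypothetical stand-ins; the real dataset's column names and contents may differ.

```python
from collections import Counter

# Hypothetical stand-in rows; not taken from the actual dataset.
records = [
    {"Accession": "toy-1", "Sequence": "ACGT" * 5},  # 20 bases
    {"Accession": "toy-2", "Sequence": "TTGA" * 5},  # 20 bases
    {"Accession": "toy-3", "Sequence": "ACG"},       # 3 bases
]


def count_sequence_lengths(rows):
    """Map each distinct sequence length to the number of rows with that length."""
    return Counter(len(row["Sequence"]) for row in rows)


length_counts = count_sequence_lengths(records)

# List lengths from most to least frequent.
for length, count in length_counts.most_common():
    print(f"length {length}: {count} sequence(s)")
```

Lengths that cluster near the full genome size would indicate complete genomes, while shorter, frequently repeated lengths would likely indicate commonly sequenced fragments.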
Most of these SARS-CoV-2 samples were collected from humans, but not all:
Some of these genetic sequences correspond to named variants designated by the World Health Organization (WHO):
Scope & Additional Elements
Get a date histogram of collection dates:
See a date histogram of release dates:
Show the locations where the sequences were gathered:
Obtain the available alignment differences with the reference sequence:
Show the authors with the accessions of the sequences they submitted:
Obtain the reference sequence as a biomolecular sequence:
Visualizations
A phylogenetic tree comparing the most common complete genomes by location shows clusters that are broadly distributed. Trailing runs of adenine are dropped before comparison to avoid arbitrary differences from varying poly(A) RNA tail lengths, which may be sequencing artifacts and shouldn’t affect viral adaptivity:
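The tail-trimming step can be sketched in Python as follows. This is only an illustration under simple assumptions: sequences are plain strings, and every trailing adenine is treated as part of the tail, so even a single genuine terminal `A` is removed. The function name and toy sequences are hypothetical.

```python
from collections import Counter


def strip_poly_a_tail(sequence: str) -> str:
    """Remove all trailing adenine characters (upper or lower case)."""
    return sequence.rstrip("Aa")


# Toy sequences that differ only in the length of their poly(A) tail.
raw_sequences = ["ACGTTGCAAAA", "ACGTTGCAAAAAAAA", "ACGTTGC", "ACGTTGG"]

# After trimming, sequences that differed only by tail length become identical.
trimmed_counts = Counter(strip_poly_a_tail(s) for s in raw_sequences)

for sequence, count in trimmed_counts.most_common():
    print(sequence, count)
```

Counting the trimmed strings is one way to pick out the most common sequences before building a tree from them.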
A similar visualization can be created for samples where more detailed geographic information is supplied. In this visualization of most-common sequences reported for US states, we see the emergence of clusters containing interesting regional blocks as shown in the map below:
When visualizing the similarity of the most common sequence by month of sequence collection, there are recurring overlaps (most significantly between December 2019 and February 2020), illustrating that the virus has shown not only evolution but also significant continuity. Since then, greater spread has led to further divergence:
Analysis
Using the provided alignment differences, we can see where along the viral genome changes have been detected over time. While mutations are distributed relatively uniformly, some changes are observed far more often than others:
It is also possible to treat these genetic differences as lists of features:
By doing so, it is possible to perform a fairly wide variety of analyses. Here, we determine all of the genetic differences that always occur together in the sampled sequences, taking advantage of the fact that when differences always occur together they must occur in the same number of sequences:
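One way to sketch the co-occurrence idea in Python is shown below. It is not necessarily the method used for the original result: instead of first bucketing by count, it groups differences by the exact set of sequences that carry them, which implies equal counts. The sample identifiers and difference labels (`"d1"` and so on) are hypothetical toy values.

```python
from collections import defaultdict


def co_occurring_groups(differences_by_sequence):
    """Return groups of differences that appear in exactly the same sequences."""
    # Map each difference to the set of sequences that carry it.
    carriers = defaultdict(set)
    for sequence_id, differences in differences_by_sequence.items():
        for difference in differences:
            carriers[difference].add(sequence_id)

    # Differences sharing an identical carrier set always occur together.
    groups = defaultdict(list)
    for difference, sequence_ids in carriers.items():
        groups[frozenset(sequence_ids)].append(difference)

    # Keep only groups with at least two members; sort for stable output.
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


# Toy feature lists: each sequence is described by its set of differences.
samples = {
    "s1": {"d1", "d2", "d3"},
    "s2": {"d1", "d2"},
    "s3": {"d1", "d2", "d4"},
    "s4": {"d4"},
}

linked = co_occurring_groups(samples)
print(linked)
```

In the toy data, `d1` and `d2` are carried by the same three sequences and so form a group, while `d3` and `d4` each have a carrier set of their own. Equal counts alone are necessary but not sufficient, which is why the sketch compares carrier sets rather than sizes.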