
Sample Data: Gene Sequences

Splice-junction Gene Sequences for Primate DNA

Splice junctions are points on a DNA sequence at which "superfluous" DNA is removed during the process of protein creation in higher organisms. The problem posed in this dataset is to recognize, given a sequence of DNA, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out). In the biological community, intron/exon borders are referred to as "acceptors", while exon/intron borders are referred to as "donors".

Examples

Basic Examples

Retrieve the resource:

In[1]:=
ResourceObject["Sample Data: Gene Sequences"]
Out[1]=

Retrieve the default content:

In[2]:=
ResourceData["Sample Data: Gene Sequences"]
Out[2]=

Analysis

Shuffle the dataset randomly:

In[3]:=
r = RandomSample[ResourceData["Sample Data: Gene Sequences"]]
Out[3]=
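
RandomSample draws a new permutation on every evaluation, so the split that follows changes from run to run. A minimal sketch of making the shuffle repeatable, assuming reproducibility is wanted; the seed value 1234 is an arbitrary choice and is not part of the original example:

```wolfram
(* Fix the pseudorandom seed so the shuffle is repeatable; 1234 is arbitrary *)
SeedRandom[1234];

(* Same shuffle as In[3], now deterministic for a given seed *)
r = RandomSample[ResourceData["Sample Data: Gene Sequences"]]
```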

Create a training dataset using 80% of the original dataset:

In[4]:=
training = r[[;; Round[0.8 Length[r]]]]
Out[4]=

Create a testing dataset using the remaining 20% of the original dataset:

In[5]:=
testing = r[[Round[0.8 Length[r]] + 1 ;;]]
Out[5]=
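
The index arithmetic behind the 80/20 split can be checked on a small list. The sketch below uses a hypothetical ten-element list, items, in place of the shuffled dataset, and TakeDrop to produce both parts in one step; whether TakeDrop applies directly to the dataset returned by ResourceData is not confirmed here, so the example stays with a plain list:

```wolfram
(* Hypothetical stand-in for the shuffled data: the integers 1 to 10 *)
items = Range[10];

(* Number of elements that go into the training part: 80% of the length *)
n = Round[0.8 Length[items]];

(* TakeDrop returns {first n elements, remaining elements} *)
{trainPart, testPart} = TakeDrop[items, n]
(* gives {{1, 2, 3, 4, 5, 6, 7, 8}, {9, 10}} *)
```

The first part matches items[[;; n]] and the second matches items[[n + 1 ;;]], which are the two spans used for training and testing above.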

Train a classifier:

In[6]:=
c = Classify[training -> "Boundary"]
Out[6]=

Obtain general information about the classifier:

In[7]:=
ClassifierInformation[c]
Out[7]=

Generate a ClassifierMeasurementsObject for the classifier using the test set:

In[8]:=
cm = ClassifierMeasurements[c, testing -> "Boundary"]
Out[8]=

Visualize the classifier's performance on the test set with a confusion matrix plot:

In[9]:=
cm["ConfusionMatrixPlot"]
Out[9]=
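
The confusion matrix plot can be complemented with numeric summaries. A short sketch, assuming cm is the ClassifierMeasurementsObject created in In[8]; the property names are standard ClassifierMeasurements properties, and the values depend on the random split, so none are shown:

```wolfram
(* Assumes cm was created in In[8] above *)

(* Fraction of test examples classified correctly *)
cm["Accuracy"]

(* Per-class precision, recall and F-score, keyed by class label *)
cm["Precision"]
cm["Recall"]
cm["FScore"]
```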

Wolfram Research, "Sample Data: Gene Sequences" from the Wolfram Data Repository (2018) 

License Information

Creative Commons Public Domain Mark
