Sample Data: Spam Email

Dataset of email statistics for the classification of spam email

The concept of "spam" is diverse: advertisements for products or websites, make-money-fast schemes, chain letters, pornography and other unsolicited commercial email can all be considered spam. This dataset contains variables computed from a collection of emails. The collection was analyzed to determine the frequency of certain words and characters and the lengths of continuous runs of capital letters. These attributes can be used to classify emails as spam or non-spam. The specific words and characters used in this analysis may not generalize to classifying arbitrary email: for example, the word "george" and the area code "650" were used as indicators of non-spam in this collection, which may or may not hold for another collection of emails.
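
As a rough illustration of the kind of attributes described above, the following sketch computes a word percentage, a character percentage and the lengths of capital-letter runs for a single message. The helper names (wordPercent, charPercent, capitalRunLengths), the percentage scaling, the tokenization rule and the sample text are all assumptions made for this example; they are not the procedure used to build the dataset:

In[ ]:=
(* Hypothetical helpers for illustration only *)
(* Percentage of alphanumeric tokens in text that equal word, ignoring case *)
wordPercent[text_String, word_String] := 
 With[{tokens = StringCases[ToLowerCase[text], RegularExpression["[a-z0-9]+"]]},
  If[tokens === {}, 0., 100. Count[tokens, ToLowerCase[word]]/Length[tokens]]]

(* Percentage of characters in text that equal the single character char *)
charPercent[text_String, char_String] := 
 If[text === "", 0., 100. StringCount[text, char]/StringLength[text]]

(* Lengths of maximal runs of consecutive capital letters A-Z *)
capitalRunLengths[text_String] := 
 StringLength /@ StringCases[text, RegularExpression["[A-Z]+"]]

(* Made-up sample message *)
sample = "FREE money!!! Reply NOW to claim your free prize";
{wordPercent[sample, "free"], charPercent[sample, "!"], capitalRunLengths[sample]}

Summaries such as Mean, Max or Total of capitalRunLengths[sample] would then give single numeric attributes per message.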

Examples

Basic Examples

Retrieve the resource:

In[1]:=
ResourceObject["Sample Data: Spam Email"]
Out[1]=

Retrieve the default content:

In[2]:=
ResourceData["Sample Data: Spam Email"]
Out[2]=

Analysis

Shuffle the dataset randomly and remove the unit labels:

In[3]:=
r = QuantityMagnitude /@ 
  RandomSample[ResourceData["Sample Data: Spam Email"]]
Out[3]=
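
Because RandomSample returns a different ordering on each evaluation, the split and the measurements that follow will vary from run to run. A minimal sketch of a reproducible shuffle, assuming an arbitrary seed (1234 is a placeholder chosen for this example, not a value from the original page):

In[ ]:=
(* Fix the random seed so the shuffle is the same on every evaluation *)
SeedRandom[1234];
r = QuantityMagnitude /@ 
  RandomSample[ResourceData["Sample Data: Spam Email"]]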

Create a training dataset using 80% of the original dataset:

In[4]:=
training = r[[;; Round[0.8 Length[r]]]]
Out[4]=

Create a testing dataset using the remaining 20% of the original dataset:

In[5]:=
testing = r[[Round[0.8 Length[r]] + 1 ;;]]
Out[5]=

Train a classifier:

In[6]:=
c = Classify[training -> "Spam"]
Out[6]=

Obtain general information about the classifier:

In[7]:=
ClassifierInformation[c]
Out[7]=

Generate a ClassifierMeasurementsObject for the classifier using the test set:

In[8]:=
cm = ClassifierMeasurements[c, testing -> "Spam"]
Out[8]=

Visualize the classifier's performance on the test set with a confusion matrix plot:

In[9]:=
cm["ConfusionMatrixPlot"]
Out[9]=
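
The confusion matrix can be complemented with numeric summaries. The following sketch queries a few standard properties of the measurements object; the choice of properties is an assumption about what is useful here, and the values depend on the random split:

In[ ]:=
(* Collect selected measurements from cm into an association *)
AssociationMap[cm, {"Accuracy", "Precision", "Recall"}]

"Accuracy" is a single number for the whole test set, while "Precision" and "Recall" are reported per class, which helps separate false positives (legitimate email flagged as spam) from false negatives.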

Wolfram Research, "Sample Data: Spam Email" from the Wolfram Data Repository (2017) 

License Information

Creative Commons Public Domain Mark
