Dataset of email statistics for the classification of spam email
The concept of "spam" is diverse: advertisements for products or web sites, make-money-fast schemes, chain letters, pornography, and so on (that is, unsolicited commercial email) can all be considered spam. This dataset contains variables computed from a collection of emails. The collection was analyzed to determine the frequency of certain words and characters, and the lengths of continuous strings of capital letters. These attributes can be used to classify emails as spam or non-spam. The specific words and characters used in this analysis may or may not generalize to classifying other email as spam. For example, the word "george" and the area code "650" were used to classify emails as non-spam in this collection, which may or may not carry over to another collection of emails.
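
As a rough illustration of the three kinds of attributes described above, the following Python sketch computes them for a single message. It is not the procedure used to build this dataset. The function names (`word_freq`, `char_freq`, `capital_runs`), the tokenization rule, and the choice to express frequencies as percentages are all assumptions made for the example.

```python
import re


def word_freq(text, word):
    """Percentage of words in `text` that equal `word` (case-insensitive).

    Assumption: a "word" is any run of ASCII letters or digits.
    """
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    if not words:
        return 0.0
    return 100.0 * words.count(word.lower()) / len(words)


def char_freq(text, char):
    """Percentage of characters in `text` that equal `char`."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)


def capital_runs(text):
    """Average, longest, and total length of runs of consecutive capitals."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return {"average": 0.0, "longest": 0, "total": 0}
    return {
        "average": sum(runs) / len(runs),
        "longest": max(runs),
        "total": sum(runs),
    }


# Hypothetical message, invented for this example only.
sample = "FREE money!!! Call George at 650 now"
features = {
    "word_george": word_freq(sample, "george"),
    "word_650": word_freq(sample, "650"),
    "char_exclamation": char_freq(sample, "!"),
    **{"capital_run_" + key: value for key, value in capital_runs(sample).items()},
}
```

A table of such values, one row per email with a spam or non-spam label attached, is the kind of input a classifier would be trained on. Because the word list is tied to the collection it came from, a feature such as `word_george` would likely need to be replaced when working with a different mailbox.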