---
date: "2006-11-17T04:38:12Z"
title: Over a Decade of Spam and I Still Haven't Killed Anyone (Yet)
---

<p>I've been using <a href="http://spamprobe.sf.net/" title="Command-line bayesian spam classification utility.">SpamProbe</a> to separate the wheat from the chaff for
the last four years.  That, along with the fact that I rarely delete
email, gives me a reasonable set of data to analyze the performance of a
spam filter.  So, how does SpamProbe stack up?  </p>

<p><strong>Graphs With Lines and Stuff</strong></p>

<p><img 
  src='http://pablotron.org/files/spam-200611/spam-2006-11-all_raw.png'
  width='557' height='374' 
  title='SpamProbe: Classifications per Month (Count)'
  alt='SpamProbe: Classifications per Month (Count)'
  style='border:1px solid black;'
  border='0'
/></p>

<p>The exponential increase has flattened the numbers we really care about,
and <a href="http://en.wikipedia.org/wiki/Logarithmic_scale" title="Wikipedia entry on Logarithmic Scale.">logarithmic scale</a> plotting in <a href="http://ploticus.sf.net/" title="Command-line business graphing and plotting application.">Ploticus</a> has failed
me, so here's the same graph with correct classifications omitted:</p>

<p><img 
  src='http://pablotron.org/files/spam-200611/spam-2006-11-fls_raw.png'
  width='557' height='357' 
  title='SpamProbe: False Classifications per Month, 2002-2006 (Count)'
  alt='SpamProbe: False Classifications per Month, 2002-2006 (Count)'
  style='border:1px solid black;'
  border='0'
/></p>

<p>That second graph is mildly depressing, but it reflects my day-to-day
experience. Namely, more and more spam messages seem to be sneaking by
SpamProbe and being incorrectly classified as legitimate messages.  But
how does the increase in false negatives stack up compared to the total
amount of spam I'm getting?  Let's take a look at the data again, but
this time as a percentage rather than a sum:</p>

<p><img 
  src='http://pablotron.org/files/spam-200611/spam-2006-11-all_pct.png'
  width='557' height='374' 
  title='SpamProbe: Classifications per Month, 2002-2006 (Percent)'
  alt='SpamProbe: Classifications per Month, 2002-2006 (Percent)'
  style='border:1px solid black;'
  border='0'
/></p>

<p>And the same data again, without the correctly classified spam:</p>

<p><img 
  src='http://pablotron.org/files/spam-200611/spam-2006-11-fls_pct.png'
  width='557' height='357' 
  title='SpamProbe: False Classifications per Month, 2002-2006 (Percent)'
  alt='SpamProbe: False Classifications per Month, 2002-2006 (Percent)'
  style='border:1px solid black;'
  border='0'
/></p>

<p>As you can see from the graphs, the percentage of false positives
(legitimate mail incorrectly classified as spam) sits pretty steady
around 0%, while the percentage of false negatives (spam incorrectly
classified as legitimate mail) has hovered below 5% for just over two
years.  Not too shabby for a lowly Bayesian classifier.  By the way, the
large peaks in the percentage graphs are mostly anomalous (see below).</p>
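
<p>For the curious, here's a minimal Python sketch of how per-month
percentages like these could be computed.  The function name and the sample
counts are made up for illustration; they are not taken from the actual data
or scripts.  Ham is left out of the total, matching the graphs above:</p>

```python
def classification_percentages(spam_ok, false_neg, false_pos):
    """Return (correct spam %, false negative %, false positive %).

    All three arguments are hypothetical message counts for one month.
    Ham (correctly classified legitimate mail) is deliberately excluded
    from the total.
    """
    total = spam_ok + false_neg + false_pos
    if total == 0:
        # A month with no classified messages has nothing to report.
        return (0.0, 0.0, 0.0)
    return tuple(100.0 * n / total for n in (spam_ok, false_neg, false_pos))


# Made-up month: 950 spam caught, 40 spam missed, 10 good messages flagged.
print(classification_percentages(950, 40, 10))
```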

<p><strong>Caveats</strong></p>

<p>Are aphorisms about liars and statistics bouncing around in your head
right now?  Good.  Here are some of the gotchas with this data:</p>

<ul>
<li>The graphs above do not include "ham".  Ham is correctly
classified, non-spam mail.  Including ham would flatten
the percentage graphs by increasing the percentage of correctly
classified messages and decreasing the percentage of falsely classified
messages.  If there's any interest, I can add additional graphs which
include correctly classified, non-spam messages.</li>
<li>The false negative peaks in months 24 and 28 weren't due to any
mistakes on the part of SpamProbe; I managed to break SpamProbe and/or
fill up the disk where my mail is stored on a couple of occasions.</li>
<li>I have catch-all addresses enabled for some of my domains (e.g.
foo@example.com, bar@example.com, and asdf200notarealname@example.com
are all routed to my inbox).  This doesn't necessarily affect the accuracy of
SpamProbe, but it certainly increases the amount of spam I receive.</li>
<li>I purchased a few additional domains between 2002 and 2006.
I haven't added any within the last year, though, so that doesn't
account for the exponential increase in spam in the last 12 months.</li>
<li>I upgraded SpamProbe a handful of times, and re-trained the classifier
once.</li>
</ul>
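
<p>To put a number on the ham caveat, here's a small Python sketch of how
the denominator changes the picture.  The function name and every count in
it are hypothetical, chosen only to make the arithmetic easy to follow:</p>

```python
def false_percentage(false_count, spam_ok, ham_ok=0):
    """Percentage of messages that were misclassified.

    ham_ok defaults to 0, which gives the ham-free view; passing a
    count of correctly classified ham adds it to the total.
    """
    total = false_count + spam_ok + ham_ok
    return 100.0 * false_count / total if total else 0.0


# Made-up month: 50 misclassified messages, 950 correctly caught spam.
without_ham = false_percentage(50, 950)
# Same month, now counting 1000 correctly delivered legitimate messages.
with_ham = false_percentage(50, 950, ham_ok=1000)
print(without_ham, with_ham)
```

<p>With those made-up counts, the same 50 mistakes drop from 5% of the total
to 2.5% once ham is counted.</p>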

<p><strong>Conclusions</strong></p>

<p>If I wanted to be scientific and objective and all that crap, or at
least methodical and thorough, I would take several competing spam
classifiers and feed them the same corpus, then compare the results.
I'm not really trying to be objective, though; SpamProbe seems to
be working pretty well, at least for now.  Oh yeah, if you're interested
in playing with the actual numbers, or if you're curious how I processed
the data and generated the graphs, feel free to <a href="http://pablotron.org/files/spam-200611/spam-200611.tar.gz" title="Download the scripts and raw data used to generate these graphs.">download the raw
data</a>.</p>