diff --git a/content/posts/2006-11-17-over-a-decade-of-spam-and-i-still-haven-t-killed-anyone-yet.html b/content/posts/2006-11-17-over-a-decade-of-spam-and-i-still-haven-t-killed-anyone-yet.html
new file mode 100644
index 0000000..3d48934
--- /dev/null
+++ b/content/posts/2006-11-17-over-a-decade-of-spam-and-i-still-haven-t-killed-anyone-yet.html
@@ -0,0 +1,106 @@
+---
+date: "2006-11-17T04:38:12Z"
+title: Over a Decade of Spam and I Still Haven't Killed Anyone (Yet)
+---
+
+<p>I've been using <a href="http://spamprobe.sf.net/" title="Command-line Bayesian spam classification utility.">SpamProbe</a> to separate the wheat from the chaff for
+the last four years. That, along with the fact that I rarely delete
+email, gives me a reasonable set of data to analyze the performance of a
+spam filter. So, how does SpamProbe stack up? </p>
+
+<p><strong>Graphs With Lines and Stuff</strong></p>
+
+<p><img
+ src='http://pablotron.org/files/spam-200611/spam-2006-11-all_raw.png'
+ width='557' height='374'
+ title='SpamProbe: Classifications per Month (Count)'
+ alt='SpamProbe: Classifications per Month (Count)'
+ style='border:1px solid black;'
+ border='0'
+/></p>
+
+<p>The exponential increase has flattened the numbers we really care about,
+and <a href="http://en.wikipedia.org/wiki/Logarithmic_scale" title="Wikipedia entry on Logarithmic Scale.">logarithmic scale</a> plotting in <a href="http://ploticus.sf.net/" title="Command-line business graphing and plotting application.">Ploticus</a> has failed
+me, so here's the same graph with the correct classifications omitted:</p>
+
+<p><img
+ src='http://pablotron.org/files/spam-200611/spam-2006-11-fls_raw.png'
+ width='557' height='357'
+ title='SpamProbe: False Classifications per Month, 2002-2006 (Count)'
+ alt='SpamProbe: False Classifications per Month, 2002-2006 (Count)'
+ style='border:1px solid black;'
+ border='0'
+/></p>
+
+<p>That second graph is mildly depressing, but it reflects my day-to-day
+experience. Namely, more and more spam messages seem to be sneaking by
+SpamProbe and being incorrectly classified as legitimate messages. But
+how does the increase in false negatives stack up compared to the total
+amount of spam I'm getting? Let's take a look at the data again, but
+this time as percentages rather than counts:</p>
+
+<p><img
+ src='http://pablotron.org/files/spam-200611/spam-2006-11-all_pct.png'
+ width='557' height='374'
+ title='SpamProbe: Classifications per Month, 2002-2006 (Percent)'
+ alt='SpamProbe: Classifications per Month, 2002-2006 (Percent)'
+ style='border:1px solid black;'
+ border='0'
+/></p>
+
+<p>And the same data again, without the correctly classified spam:</p>
+
+<p><img
+ src='http://pablotron.org/files/spam-200611/spam-2006-11-fls_pct.png'
+ width='557' height='357'
+ title='SpamProbe: False Classifications per Month, 2002-2006 (Percent)'
+ alt='SpamProbe: False Classifications per Month, 2002-2006 (Percent)'
+ style='border:1px solid black;'
+ border='0'
+/></p>
+
+<p>As you can see from the graphs, the percentage of false positives, or
+legitimate mail incorrectly classified as spam, sits pretty steady
+around 0%, while the percentage of false negatives, or spam incorrectly
+classified as legitimate mail, has hovered below 5% for just over two
+years. Not too shabby for a lowly Bayesian classifier. By the way, the
+large peaks in the percentage graphs are mostly anomalous (see below).</p>
+
+<p><strong>Caveats</strong></p>
+
+<p>Are aphorisms about liars and statistics bouncing around in your head
+right now? Good. Here are some of the gotchas with this data:</p>
+
+<ul>
+<li>The graphs above do not include "ham", meaning correctly
+classified, non-spam messages. Including ham would flatten
+the percentage graphs by increasing the percentage of correctly
+classified messages and decreasing the percentage of falsely classified
+messages. If there's any interest, I can add graphs which
+include correctly classified, non-spam messages.</li>
+<li>The false negative peaks in months 24 and 28 weren't due to any
+mistakes on the part of SpamProbe; I managed to break SpamProbe and/or
+fill up the disk where my mail is stored on a couple of occasions.</li>
+<li>I have catch-all addresses enabled for some of my domains (e.g.
+foo@example.com, bar@example.com, and asdf200notarealname@example.com
+are all routed to my inbox). This doesn't necessarily affect the accuracy of
+SpamProbe, but it certainly increases the amount of spam I receive.</li>
+<li>I purchased a few additional domains between 2002 and 2006.
+I haven't added any within the last year, though, so that doesn't
+account for the exponential increase in spam in the last 12 months.</li>
+<li>I upgraded SpamProbe a handful of times, and re-trained the classifier
+once.</li>
+</ul>
+
+<p><strong>Conclusions</strong></p>
+
+<p>If I wanted to be scientific and objective and all that crap, or at
+least methodical and thorough, I would take several competing spam
+classifiers and feed them the same corpus, then compare the results.
+I'm not really trying to be objective, though; SpamProbe seems to
+be working pretty well, at least for now. Oh yeah, if you're interested
+in playing with the actual numbers, or if you're curious how I processed
+the data and generated the graphs, feel free to <a href="http://pablotron.org/files/spam-200611/spam-200611.tar.gz" title="Download the scripts and raw data used to generate these graphs.">download the raw
+data</a>.</p>
+
+