Diffstat (limited to 'content/posts/2006-11-17-over-a-decade-of-spam-and-i-still-haven-t-killed-anyone-yet.html')
-rw-r--r-- | content/posts/2006-11-17-over-a-decade-of-spam-and-i-still-haven-t-killed-anyone-yet.html | 106 |
1 files changed, 106 insertions, 0 deletions
diff --git a/content/posts/2006-11-17-over-a-decade-of-spam-and-i-still-haven-t-killed-anyone-yet.html b/content/posts/2006-11-17-over-a-decade-of-spam-and-i-still-haven-t-killed-anyone-yet.html new file mode 100644 index 0000000..3d48934 --- /dev/null +++ b/content/posts/2006-11-17-over-a-decade-of-spam-and-i-still-haven-t-killed-anyone-yet.html @@ -0,0 +1,106 @@ +--- +date: "2006-11-17T04:38:12Z" +title: Over a Decade of Spam and I Still Haven't Killed Anyone (Yet) +--- + +<p>I've been using <a href="http://spamprobe.sf.net/" title="Command-line Bayesian spam classification utility.">SpamProbe</a> to separate the wheat from the chaff for +the last four years. That, along with the fact that I rarely delete +email, gives me a reasonable set of data to analyze the performance of a +spam filter. So, how does SpamProbe stack up? </p> + +<p><strong>Graphs With Lines and Stuff</strong></p> + +<p><img + src='http://pablotron.org/files/spam-200611/spam-2006-11-all_raw.png' + width='557' height='374' + title='SpamProbe: Classifications per Month (Count)' + alt='SpamProbe: Classifications per Month (Count)' + style='border:1px solid black;' + border='0' +/></p> + +<p>The exponential increase has flattened the numbers we really care about, +and the <a href="http://en.wikipedia.org/wiki/Logarithmic_scale" title="Wikipedia entry on Logarithmic Scale.">logarithmic scale</a> plotting in <a href="http://ploticus.sf.net/" title="Command-line business graphing and plotting application.">Ploticus</a> has failed +me, so here's the same graph with correct classifications omitted:</p> + +<p><img + src='http://pablotron.org/files/spam-200611/spam-2006-11-fls_raw.png' + width='557' height='357' + title='SpamProbe: False Classifications per Month, 2002-2006 (Count)' + alt='SpamProbe: False Classifications per Month, 2002-2006 (Count)' + style='border:1px solid black;' + border='0' +/></p> + +<p>That second graph is mildly depressing, but it reflects my day-to-day +experience. 
Namely, more and more spam messages seem to be sneaking by +SpamProbe and being incorrectly classified as legitimate messages. But +how does the increase in false negatives stack up compared to the total +amount of spam I'm getting? Let's take a look at the data again, but +this time as a percentage rather than a sum:</p> + +<p><img + src='http://pablotron.org/files/spam-200611/spam-2006-11-all_pct.png' + width='557' height='374' + title='SpamProbe: Classifications per Month, 2002-2006 (Percent)' + alt='SpamProbe: Classifications per Month, 2002-2006 (Percent)' + style='border:1px solid black;' + border='0' +/></p> + +<p>And the same data again, without the correctly classified spam:</p> + +<p><img + src='http://pablotron.org/files/spam-200611/spam-2006-11-fls_pct.png' + width='557' height='357' + title='SpamProbe: Classifications per Month, 2002-2006 (Percent)' + alt='SpamProbe: Classifications per Month, 2002-2006 (Percent)' + style='border:1px solid black;' + border='0' +/></p> + +<p>As you can see from the graphs, the percent of false positives, or +legitimate mail incorrectly classified as spam, sits pretty steady +around 0%, while the percent of false negatives, or spam incorrectly +classified as legitimate mail, has hovered below 5% for just over two +years. Not too shabby for a lowly Bayesian classifier. By the way, the +large peaks in the percentage graphs are mostly anomalous (see below).</p> + +<p><strong>Caveats</strong></p> + +<p>Are aphorisms about liars and statistics bouncing around in your head +right now? Good. Here are some of the gotchas with this data:</p> + +<ul> +<li>The graphs above do not include "ham". Ham refers to +correctly classified, non-spam messages. Including ham would flatten +the percentage graphs by increasing the percent of correctly +classified messages and decreasing the percent of falsely classified +messages. 
If there's any interest, I can add additional graphs which +include correctly classified, non-spam messages.</li> +<li>The false negative peaks in months 24 and 28 weren't due to any +mistakes on the part of SpamProbe; I managed to break SpamProbe and/or +fill up the disk where my mail is stored on a couple of occasions.</li> +<li>I have catch-all addresses enabled for some of my domains (e.g. +foo@example.com, bar@example.com, and asdf200notarealname@example.com +are all routed to my inbox). This doesn't necessarily affect the accuracy of +SpamProbe, but it certainly increases the amount of spam I receive.</li> +<li>I purchased a few additional domains between 2002 and 2006. +I haven't added any within the last year, though, so that doesn't +account for the exponential increase in spam in the last 12 months.</li> +<li>I upgraded SpamProbe a handful of times, and re-trained the classifier +once.</li> +</ul> + +<p><strong>Conclusions</strong></p> + +<p>If I wanted to be scientific and objective and all that crap, or at +least methodical and thorough, I would take several competing spam +classifiers and feed them the same corpus, then compare the results. +I'm not really trying to be objective, though; SpamProbe seems to +be working pretty well, at least for now. Oh yeah, if you're interested +in playing with the actual numbers, or if you're curious how I processed +the data and generated the graphs, feel free to <a href="http://pablotron.org/files/spam-200611/spam-200611.tar.gz" title="Download the scripts and raw data used to generate these graphs.">download the raw +data</a>.</p> + +