From 4b6c0e31385f5f27a151088c0a2b614495c4e589 Mon Sep 17 00:00:00 2001
From: Paul Duncan
Date: Thu, 14 Oct 2021 12:47:50 -0400
Subject: initial commit, including theme

---
 ...spam-and-i-still-haven-t-killed-anyone-yet.html | 106 +++++++++++++++++++++
 1 file changed, 106 insertions(+)
 create mode 100644 content/posts/2006-11-17-over-a-decade-of-spam-and-i-still-haven-t-killed-anyone-yet.html

diff --git a/content/posts/2006-11-17-over-a-decade-of-spam-and-i-still-haven-t-killed-anyone-yet.html b/content/posts/2006-11-17-over-a-decade-of-spam-and-i-still-haven-t-killed-anyone-yet.html
new file mode 100644
index 0000000..3d48934
--- /dev/null
+++ b/content/posts/2006-11-17-over-a-decade-of-spam-and-i-still-haven-t-killed-anyone-yet.html
@@ -0,0 +1,106 @@
---
date: "2006-11-17T04:38:12Z"
title: Over a Decade of Spam and I Still Haven't Killed Anyone (Yet)
---

I've been using SpamProbe to separate the wheat from the chaff for the last four years. That, along with the fact that I rarely delete email, gives me a reasonable set of data to analyze the performance of a spam filter. So, how does SpamProbe stack up?

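(If you're wondering where the numbers below come from: a tally roughly along these lines, run over my mail archive. Treat it as a sketch rather than the actual script; the X-SpamProbe header, the mbox layout, and the verdict format are assumptions about the setup, not gospel.)

```python
#!/usr/bin/env python3
# Sketch of a per-month verdict tally over an mbox archive.  Assumes each
# message carries an "X-SpamProbe" header (added by a procmail recipe or
# similar) whose value starts with the verdict, e.g. "SPAM 0.99 <digest>".
# The header name and format are assumptions, not gospel.
import mailbox
import sys
from collections import Counter
from email.utils import parsedate_tz

def tally(path):
    counts = Counter()
    for msg in mailbox.mbox(path):
        verdict = (msg.get("X-SpamProbe") or "").split()
        date = parsedate_tz(msg.get("Date") or "")
        if not verdict or not date:
            continue  # no verdict or unparseable date; skip it
        month = "%04d-%02d" % (date[0], date[1])
        counts[month, verdict[0].upper()] += 1
    return counts

if __name__ == "__main__":
    for (month, verdict), n in sorted(tally(sys.argv[1]).items()):
        print(month, verdict, n)
```

That only gets you the raw counts; figuring out which verdicts were wrong depends on where you file the corrected messages, which I'll spare you here.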

Graphs With Lines and Stuff

[graph: SpamProbe: Classifications per Month (Count)]

The exponential increase in spam has flattened out the numbers we really care about, and logarithmic scaling in Ploticus has failed me, so here's the same graph with correct classifications omitted:

[graph: SpamProbe: False Classifications per Month, 2002-2006 (Count)]

That second graph is mildly depressing, but it reflects my day-to-day experience: more and more spam messages seem to be sneaking by SpamProbe and being incorrectly classified as legitimate messages. But how does the increase in false negatives stack up against the total amount of spam I'm getting? Let's take a look at the data again, but this time as a percentage rather than a sum:

[graph: SpamProbe: Classifications per Month, 2002-2006 (Percent)]

And the same data again, without the correctly classified spam:

[graph: SpamProbe: Classifications per Month, 2002-2006 (Percent)]

As you can see from the graphs, the percentage of false positives, or legitimate mail incorrectly classified as spam, sits pretty steady around 0%, while the percentage of false negatives, or spam incorrectly classified as legitimate mail, has hovered below 5% for just over two years. Not too shabby for a lowly Bayesian classifier. By the way, the large peaks in the percentage graphs are mostly anomalous (see below).

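In case the arithmetic behind the percentage graphs isn't obvious, it's nothing fancier than this; the per-month counts below are made up for illustration (the real ones are in the raw data linked at the end):

```python
# Count-to-percent conversion behind the last two graphs.  The counts here
# are invented for illustration; each category is reported as a percentage
# of that month's total mail.
counts = {
    # month: (good, spam, false positives, false negatives)
    "2005-01": (640, 2210, 1, 38),
    "2005-02": (598, 2554, 0, 61),
}

for month, (good, spam, false_pos, false_neg) in sorted(counts.items()):
    total = good + spam + false_pos + false_neg
    print("%s  false positives: %.1f%%  false negatives: %.1f%%" %
          (month, 100.0 * false_pos / total, 100.0 * false_neg / total))
```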

Caveats


Are aphorisms about liars and statistics bouncing around in your head right now? Good. Here are some of the gotchas with this data:


Conclusions


If I wanted to be scientific and objective and all that crap, or at least methodical and thorough, I would take several competing spam classifiers, feed them the same corpus, and compare the results. I'm not really trying to be objective, though; SpamProbe seems to be working pretty well, at least for now. Oh yeah, if you're interested in playing with the actual numbers, or if you're curious how I processed the data and generated the graphs, feel free to download the raw data.
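If you do feel like running that bake-off yourself, the harness doesn't need to be much more than the sketch below. The commands, the "SPAM" output marker, and the corpus/spam and corpus/ham directories are all stand-ins for however your filters and corpus are actually laid out; it's an illustration, not a tested tool.

```python
#!/usr/bin/env python3
# Toy harness for comparing spam filters on the same hand-labeled corpus.
# Everything in CLASSIFIERS is a placeholder: swap in the real command for
# each filter and whatever string marks a "spam" verdict in its output.
import subprocess
from pathlib import Path

CLASSIFIERS = {
    # name: (command that reads one message on stdin, marker that appears
    # in its output when it calls the message spam)
    "spamprobe": (["spamprobe", "score"], b"SPAM"),
    # "somefilter": (["somefilter", "--check"], b"spam"),
}

def is_spam(cmd, marker, raw_msg):
    out = subprocess.run(cmd, input=raw_msg, capture_output=True).stdout
    return marker in out

def compare(spam_dir, ham_dir):
    for name, (cmd, marker) in CLASSIFIERS.items():
        missed = sum(not is_spam(cmd, marker, p.read_bytes())
                     for p in Path(spam_dir).iterdir() if p.is_file())
        flagged = sum(is_spam(cmd, marker, p.read_bytes())
                      for p in Path(ham_dir).iterdir() if p.is_file())
        print("%s: %d spam missed, %d legit messages flagged" %
              (name, missed, flagged))

if __name__ == "__main__":
    compare("corpus/spam", "corpus/ham")
```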
