My POPFile Stats

Hello.

I've been using POPFile since 2004. I use two buckets, despite knowing that having multiple buckets leads to more accurate filtering. My logic was that I use POPFile for one reason - to differentiate between legitimate and illegitimate messages - and therefore wanted to train it to walk a fine line. I use Pegasus Mail with a domain that is 10+ years old, a personal e-mail address that gets way too much spam and a "spam" e-mail address (for newsletters and crap) that I am positive gets sold to just about everybody. For about a third of the past three years I have also had the "catch-all" mail from a dozen or so domains sent to my "spam" e-mail address.

In May of 2007 I reset my statistics because I had reached what I considered a fully functional filtering state. That is to say, I hit 99% accuracy.

I meant to post my stats this past May but moved cross-country. Here they are. (Ignore the "last reset" date; I just reinstalled POPFile and this database on a new computer.)

http://www.redlinewhiteline.com/images/popfile.png

I find that my database has no trouble keeping up with the continual evolution of spam tactics. Every couple of weeks they try a new one that sneaks past POPFile's filtering and all I need do is flag one message as spam and I am "in the clear" again. One advantage of using POPFile with Pegasus is that by utilizing Pegasus's built-in whitelist it is extremely easy to catch false positives - it has been a good long while since a 'personal' e-mail skated by as 'spam'.

I realize most of this info is probably utterly useless, but I thought I would post about my setup.

Thanks for making such an awesome program.

Bryce

  • Message #1353

    I realize most of this info is probably utterly useless, but I thought I would post about my setup.

    It is not useless, it makes a nice change from getting complaints!

    On my system I have one spam bucket plus six others for good mail and my accuracy is similar to yours:
    104,119 msgs, 460 errors, 99.56% accuracy

    Some users have configured POPFile to send some basic statistics (the number of buckets, the number of messages that POPFile has classified and the number of classification errors) and these reports are summarised on the POPFile Real Time Statistics page.

    Brian

  • Message #1355

    I have been using POPFile since 8/2004 and love it. I reset my stats at the end of February in 2006. Since then POPFile has processed 921,453 emails into 80-100 buckets with 99.62% accuracy. 97.75% of those emails were spam.
    POPFile started on a Windows XP box and then switched to a Linux box w/Thunderbird sometime in late '04 or early '05. Right now it resides on an Ubuntu 9.04 box with Thunderbird as the email client.

    Thanks for the outstanding application

    • Message #1938

      I have been using POPFile since 8/2004 and love it. I reset my stats at the end of February in 2006. Since then POPFile has processed 921,453 emails into 80-100 buckets with 99.62% accuracy. 97.75% of those emails were spam.
      POPFile started on a Windows XP box and then switched to a Linux box w/Thunderbird sometime in late '04 or early '05. Right now it resides on an Ubuntu 9.04 box with Thunderbird as the email client.

      Thanks for the outstanding application

      I just wanted to update my POPFile stats after another 4 years of use.

      Ongoing stats:
      POPFile has classified 1,412,403 emails into 80-100 buckets with 99.62% accuracy. It is now running on Ubuntu 11.04.

      • Message #1948

        I just wanted to update my POPfile stats after another 4 years of use.

        Thank you for the update. It is good to hear how well POPFile performs with a large number of buckets.

        I use 7 buckets to classify my mail and achieve just over 99.6% accuracy.

        Your installation uses at least 10 times as many buckets as mine and has processed about 10 times more mail than mine.

        How big is your POPFile database file (popfile.db) ?

        Do you use any magnets ?

        Do you ever use the VACUUM command to "defragment" your database ? I'm not suggesting you try this; I'm just curious if you have used this command.
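For readers wondering what VACUUM does: it is an SQLite command that rebuilds the database file and releases pages left empty by deleted rows. The following is only a minimal sketch. It assumes the file is in SQLite 3 format and readable by Python's `sqlite3` module, and that you would run it against a copy of popfile.db while POPFile is stopped. The demo builds a throwaway database instead of touching a real one, and its `words` table is a made-up stand-in, not POPFile's actual schema.

```python
import os
import sqlite3
import tempfile


def vacuum_database(db_path):
    """Run VACUUM on an SQLite 3 file; return (size_before, size_after) in bytes."""
    size_before = os.path.getsize(db_path)
    # isolation_level=None puts the connection in autocommit mode;
    # VACUUM cannot run inside an open transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("VACUUM")
    finally:
        conn.close()
    return size_before, os.path.getsize(db_path)


def build_demo_database(db_path):
    """Create a throwaway database containing free pages (hypothetical schema)."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT)")
    conn.executemany(
        "INSERT INTO words (word) VALUES (?)",
        (("word%06d" % i,) for i in range(50000)),
    )
    conn.commit()
    # Deleting rows frees pages inside the file but does not shrink it.
    conn.execute("DELETE FROM words WHERE id % 10 != 0")
    conn.commit()
    conn.close()


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.db")
    build_demo_database(demo_path)
    before, after = vacuum_database(demo_path)
    print(f"before: {before} bytes, after: {after} bytes")
```

How much space this would reclaim on a real POPFile database depends on how much data has been deleted from it, so the demo numbers say nothing about popfile.db itself.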

        • Message #2107

          How big is your POPFile database file (popfile.db) ?

          Do you use any magnets ?

          Do you ever use the VACUUM command to "defragment" your database ? I'm not suggesting you try this; I'm just curious if you have used this command.

          Sorry for the extremely delayed response.
          My popfile.db is 32.7MB
          I do not use magnets.
          I have not used the VACUUM command.

          After another 14 months I am at 1,548,225 messages classified at 99.64% accuracy. I currently have about 150 buckets. In reality only about 80-100 are in use. It looks like I need to do some housekeeping.

          • Message #2171

            I wanted to update my stats again since I moved everything to a new machine.

            After another 14 months I am at 1,687,926 messages classified with 99.66% accuracy. I just transferred everything to a box running Linux Mint Mate 18.1. I currently have 215 buckets; I'm not sure how many are currently active. I am still using Thunderbird as my email client.

            Edit - I cleaned up my buckets and got it down to 134 active ones.

  • Message #1365

    I have 3 buckets (spam, inbox and dm) and I've received 7,251 messages since Feb 27, 2010.
    Meanwhile, POPFile has sorted 53 messages into the wrong buckets and its accuracy is 99.26%.

    Naoki

  • Message #1381

    I'd like to contribute my stats. I have 7 buckets, and since 4th March (when I reset the stats after a month of training on 547 emails with 32 errors) I have received 4565 emails and got 111 errors, for an accuracy of 97.56%. The overall accuracy, including the initial training, is 97.20% (from 2 February 2010). In the last 30 days I received 929 emails with 24 errors, a rate of 97.42%.
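As a worked example of how the figures in this post fit together, the sketch below recomputes accuracy as (messages - errors) / messages and combines the training month with the period since the reset. The function name `accuracy` is just an illustrative helper, and the last decimal can differ from what POPFile displays depending on whether a figure is rounded or cut off.

```python
def accuracy(messages, errors):
    """Percentage of messages that were classified correctly."""
    return 100.0 * (messages - errors) / messages


# (messages, errors) pairs quoted in the post above
training = (547, 32)        # the first month of training
since_reset = (4565, 111)   # after the stats reset on 4 March

# The overall figure simply pools both periods.
overall = (training[0] + since_reset[0], training[1] + since_reset[1])

print(f"training:    {accuracy(*training):.2f}%")
print(f"since reset: {accuracy(*since_reset):.2f}%")
print(f"overall:     {accuracy(*overall):.2f}%")
```

Pooling the two periods gives 5112 messages and 143 errors, which works out to roughly the 97.20% quoted.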

    Ciao

    Paolo

    • Message #2071

      I see it is a while since anyone added to this thread, so to encourage the developers I will add my contribution.

      I've been using POPFile forever on our home Linux server. I recently migrated to new hardware and updated to 1.1.3 under Debian Jessie. As usual, I had to scratch my head slightly over the changes on the Debian side, but after a couple of hours, everything was working again.

      My stats are
      Messages classified: 377,349
      Classification errors: 1,900
      Accuracy: 99.49%
      Last Reset: Sat Jan 12 23:57:20 2008

      (2008 was not my first use of POPFile, just the last reset of the statistics)

      In detail:

      Bucket        Classification Count  False Positives  False Negatives
      ars           1,599 (0.42%)         0                10
      dovecot       25,102 (6.65%)        0                23
      john-forums   946 (0.25%)           221              45
      john-work     0 (0.00%)             0                0
      normal        196,858 (52.16%)      743              853
      radiant       3,321 (0.88%)         4                9
      spam          149,389 (39.58%)      738              948
      system        0 (0.00%)             0                0
      unclassified  134 (0.03%)           169
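A quick cross-check of the table above, as a sketch: the bucket counts add up to the 377,349 messages reported, and the percentages match if they are cut to two decimals rather than rounded (for example, 52.1687% appears as 52.16%). That is an observation from the posted figures only; how POPFile formats these numbers internally is not stated in this thread. The helper name `truncated_percent` is made up for this example.

```python
def truncated_percent(part, total):
    """Percentage of total, cut (not rounded) to two decimal places."""
    hundredths = part * 10000 // total  # integer arithmetic avoids float error
    return f"{hundredths // 100}.{hundredths % 100:02d}%"


# Classification counts copied from the table in the post
counts = {
    "ars": 1599,
    "dovecot": 25102,
    "john-forums": 946,
    "john-work": 0,
    "normal": 196858,
    "radiant": 3321,
    "spam": 149389,
    "system": 0,
    "unclassified": 134,
}
errors = 1900  # "Classification errors" from the post

total = sum(counts.values())
print(f"total messages: {total:,}")

for bucket, count in counts.items():
    print(f"{bucket:<13} {count:>8,} ({truncated_percent(count, total)})")

print(f"accuracy: {truncated_percent(total - errors, total)}")
```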
      

      In a nutshell: POPFile is a fantastic piece of software which runs smoothly in the background and does its job very well.

      By the way, I found some useful information including a systemd service file in this blog post:
      http://blog.binchen.org/posts/use-popfile-at-linux.html

      Running POPFile from systemd under Jessie was straightforward. If anybody wants more hints, I can post more information in the Help forum.

  • Message #2091

    Well, hello again. Hope time has been treating all popfile users/developers well.

    Here are my current stats:

    http://www.redlinewhiteline.com/images/popfile2.png

    Never mind what it says the last reset was; it hasn't been reset since, well, ever. Like last time, the "last reset" is simply the date of the last reinstall.

    A few considerations, as my approach to email and filtering has evolved:

    I've had to create a third bucket. I had been extremely hesitant to do so, but the evolution of spam techniques has necessitated it. Namely, good spammers have largely moved on from incoherent gibberish to mimicking bona fide e-mails for spam and phishing purposes.

    The creation of the "automatic" bucket was due to needing to filter legitimate, automated business email from spam copies. My "personal" bucket has largely done a wonderful job in maintaining a high level of quality, with few false classifications. When popfile is (very rarely) unsure about the "personal" bucket it largely defaults to sending mail to the "unclassified" bucket, although I have encountered a handful of emails which were erroneously sent to spam. This is very, very rare, and happens once every year or two.

    Back to "automatic". I have what is likely a unique setup for my email. I own multiple domains, and I have a catchall setup so that I personally receive all mail that is sent to nonexistent addresses - to a total tune of around 100,000 emails a year. I did this way back in 2004, due to having such a long default domain name (redlinewhiteline.com), as friends and businesses would often manually input the incorrect email. It was a necessary evil that I receive all email, as I was missing out on innumerable personal emails. Solving the issue of receiving voluminous amounts of spam became paramount. Missing personal emails was unacceptable, but my solution involved inundating me with so much spam that I was surely missing the very same personal emails with my visual scanning.

    Enter popfile. At the time I had no need for more than two buckets - I didn't subscribe to anything that I didn't consider urgent, nor did I consider my personal emails anything but urgent, either. So, "personal" - a catchall for everything I wanted to read - and "spam" - the obvious folder. This largely worked up until a year or so ago.

    Around 2012 I began subscribing to more and more completely automated services. Wish lists being among the biggest, but city services have largely abandoned phone calls for email. Bill reminders, service notifications, et cetera. Everything that once came via the postal service now comes via email. Spammers have altered their tactics to match this. (As an aside, from 2012-2013 the largest pain in my back was foreign language + local language (in my case, English) spam, as popfile had never encountered the combination and had no idea how to proceed.) One of the most painful experiences I've ever had with email was staring at what was a carbon copy of an American Express security notice, complete with delivery from an American Express lookalike domain, and seriously questioning whether it was legit or not. If you haven't encountered such spam/phishing you should count yourself lucky. A side thought on this is that companies should really move beyond using incoherent mailer domains (e.g. cc.mailer.amex.com or some other useless shit) and move all urgent automated email to the domain they conduct all of their other business on.

    "automatic" has largely taken care of that style of spam, but I still, occasionally, see carbon copied phishing emails delivered to "unclassified" or "automatic". I also do see legitimate emails being bumped to spam or "unclassified". Looking at the legitimate email and the carbon copy reveals that popfile was struggling to properly classify the phishing email as the only difference is in the reception/delivery headers. That's smart. I don't consider it a failing of popfile that it had some difficulty coping - consulting spam blacklists isn't its purpose, and those blacklists have issues themselves.

    It's easy to see from the stats how popfile has suffered the past few years. From 2013 on I was losing around a fifth of a percent on my accuracy per month. It eventually dropped from a record high of 99.90% to around 99.00% (iirc) before rebounding to 99.26% recently. That stung. Worse yet, before I classified a large amount of foreign language spam, and subsequently created "automatic" to deal with carbon copy spam/phishing, I was suffering a large number of false classifications, of both personal and automatic mail. I am comfortable with the current accuracy, and it should be a testament to popfile that it can easily cope with large and sweeping changes both in how it is being utilized and in how spammers adapt.

    One should also definitely take a look at the word count. Despite receiving less than 1% of overall email, "automatic" now accounts for nearly 22% of the overall word count.

    Lastly, since a large amount of my email usage is now from mobile devices, I have created a reference Gmail account that accesses all of the exact same mailboxes. This will allow for a unique case study in how corporate spam solutions compare to popfile, as both will receive the exact same set of emails to parse. I'll likely report back in a year with my findings, in a likely equally long-winded and boring post.

    Thanks as always. And, as always, apologies for my lack of brevity and coherency. :)

    Bryce