Training POPFile

Out of the box, POPFile is dumb. It doesn't know what spam is, what email is, or what any of the buckets you specified mean. It takes a little time to train it. Why can't POPFile come “pre-trained” to know what spam is? Because email classification is a subjective exercise. If you are a medical doctor, you might well receive important email containing words that, for other people, would indicate a high likelihood of spam. POPFile is effective because it is personalized for each user: it learns from you.

When POPFile Makes A Mistake (...and it will!)

POPFile's classification system needs to be trained for a while before it becomes effective - the more it's trained, the more effective it becomes. In fact, it won't even classify mail the first time you use it - it will leave it as unclassified. As of POPFile v0.20, by default POPFile marks a message as 'unclassified' if it isn't 100 times more certain it's in bucket A than bucket B. This is to reduce the false positive rate. If you wish to adjust this threshold, find the bayes_unclassified_weight setting on the Advanced page.
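The "100 times more certain" rule can be illustrated with a short Python sketch. This is not POPFile's own code: the function name pick_bucket, the constant UNCLASSIFIED_WEIGHT, and the choice to represent each bucket's score as a natural logarithm are assumptions made for the example.

```python
import math

# Hypothetical default mirroring the "100 times" rule described above.
UNCLASSIFIED_WEIGHT = 100


def pick_bucket(log_scores, weight=UNCLASSIFIED_WEIGHT):
    """Return the winning bucket, or 'unclassified' if its lead is too small.

    log_scores maps a bucket name to a natural-log score (higher = more likely).
    """
    # Assumption for this sketch: with fewer than two buckets there is
    # nothing to compare against, so the message stays unclassified.
    if len(log_scores) < 2:
        return "unclassified"

    ranked = sorted(log_scores, key=log_scores.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]

    # "weight times more certain" is a gap of log(weight) in log space.
    if log_scores[best] - log_scores[runner_up] > math.log(weight):
        return best
    return "unclassified"
```

With scores of 0.9 and 0.1 the ratio is only 9, so the message stays unclassified. Under this model, a smaller weight makes the classifier commit to a bucket more readily, at the cost of more false positives.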

Whenever POPFile misclassifies an email, or doesn't classify it, head to the web interface and take a look at the 'History' tab (it loads by default). There, you'll see the last twenty or so emails you received, along with how POPFile classified them. (If you want to see why POPFile classified an email the way it did, click on the subject line.) For each email that was classified incorrectly, select the correct classification in the right-hand column (you can do this for one or more messages at a time), then click the Reclassify button.

The emails are already stored by POPFile, so you can happily move them around or delete them in your email program without affecting what POPFile thinks about them! POPFile only learns when you reclassify an email - it works under the theory of 'if it ain't broke, don't fix it'.
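The "learns only on reclassify" behaviour can be pictured with a toy model. The class ToyCorpus and its methods are hypothetical, and the scoring (a plain sum of word counts) is a crude stand-in rather than POPFile's actual Bayesian calculation. The point is only that classifying reads the counts, while reclassifying is the one place they change.

```python
from collections import Counter, defaultdict


class ToyCorpus:
    """Hypothetical stand-in for per-bucket word counts."""

    def __init__(self):
        self.words = defaultdict(Counter)  # bucket -> word -> count

    def classify(self, text):
        # Read-only: looks up counts but never changes them.
        tokens = text.lower().split()
        scores = {
            bucket: sum(counts[token] for token in tokens)
            for bucket, counts in self.words.items()
        }
        if not scores or max(scores.values()) == 0:
            return "unclassified"
        return max(scores, key=scores.get)

    def reclassify(self, text, bucket):
        # The only place learning happens: the user's correction
        # adds the message's words to the chosen bucket.
        self.words[bucket].update(text.lower().split())
```

In this sketch a fresh corpus returns 'unclassified' for everything, which matches the first-use behaviour described earlier. Only after a few reclassify calls do messages start landing in buckets.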

Quick Guidelines

  1. You must set up at least two buckets and train messages for each bucket.
  2. Give POPFile time. The more you train it, the better it gets.
  3. Correct POPFile when it gets things wrong. If you find two similar emails misclassified, correct them both.
  4. Use magnets sparingly. POPFile doesn't learn from email that matches a magnet so you are losing an opportunity for improved training.
  5. Try using POPFile for more than just Spam and NonSpam classification buckets. There is only a minor difference in accuracy between having the minimum of two buckets and having 3, 7, or 12! More buckets will take a little more time to train, but will save you time by doing more classifying for you.
  6. When you define a bucket, have a firm idea of what goes in it. If you constantly struggle to decide which bucket something should go into, POPFile will struggle too. This can happen if you set up buckets that sometimes overlap. In that case, one more general bucket is better than two overlapping ones.

howtos/training.txt · Last modified: 2012/08/27 12:41 by xuesheng


The content of this wiki is protected by the GNU Free Documentation License