What variables affect 'good' statistical ratings?

There are several ways to achieve high statistical ratings, but results will vary from person to person because everyone receives different types of email. Here are a few of the factors involved:

Bucket Distinction. POPFile works best when your buckets are configured to sort distinctly different types of email. Some categories are hard to tell apart: for example, pure unrequested email (spam) versus an unwanted advertisement from a retailer you frequent and whose mailing list you are on. However, given enough time and training, POPFile will get better at telling similar types of mail apart.
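
To illustrate why bucket distinction matters, here is a minimal sketch of a toy word-frequency (naive Bayes style) classifier. It is not POPFile's actual code: the function names, the sample messages, and the add-one smoothing are assumptions made for illustration, and bucket priors are left out. It compares how confidently a message can be assigned when the buckets share few words and when they share many.

```python
import math
from collections import Counter

def train(examples):
    """examples: dict mapping bucket name -> list of message strings."""
    return {bucket: Counter(w for msg in msgs for w in msg.lower().split())
            for bucket, msgs in examples.items()}

def log_scores(model, message):
    """Log-likelihood of the message under each bucket (add-one smoothing)."""
    vocab = set().union(*model.values())
    scores = {}
    for bucket, counts in model.items():
        total = sum(counts.values())
        scores[bucket] = sum(
            math.log((counts[w] + 1) / (total + len(vocab)))
            for w in message.lower().split()
        )
    return scores

def margin(model, message):
    """Gap between the best and second-best bucket; a small gap means a hard call."""
    ranked = sorted(log_scores(model, message).values(), reverse=True)
    return ranked[0] - ranked[1]

# Hypothetical training data: two clearly different buckets...
distinct = train({
    "spam": ["cheap pills online now", "win money now"],
    "work": ["meeting agenda for monday", "project status report"],
})
# ...and two buckets whose mail uses nearly the same vocabulary.
similar = train({
    "spam": ["big sale discount offer today", "discount offer ends today"],
    "newsletter": ["store sale discount offer this week", "member discount offer today"],
})

print(margin(distinct, "win cheap pills"))
print(margin(similar, "discount offer today"))
```

In this toy setup the margin for the distinct buckets is noticeably larger than for the similar ones, which is the effect described above: overlapping vocabulary leaves the classifier with little evidence to separate the buckets until more training data accumulates.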

Training Consistency. One of the biggest keys to good results is consistent training. This means always training on errors and always putting the mail in the right bucket. People are often unsure which bucket to reclassify a message into after initially setting up their buckets, which comes back to clear bucket distinction.

Magnet Use. One way of getting artificially higher statistics is to use a magnet when you know for a fact that a message from a particular sender, or with a particular subject, will always indicate a specific type of mail. Using a magnet to send such mail into a specific bucket bypasses the automatic classifier altogether. While this can raise your accuracy statistics, POPFile does not learn from those messages, so overusing magnets can hurt POPFile's overall accuracy. Many people get high statistics without using magnets.
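
The effect of magnets on the statistics can be sketched as follows. This is a hypothetical model, not POPFile's implementation: `reported_accuracy`, the sender addresses, and the stub classifier are all invented for illustration. A magnet is modelled as a sender-to-bucket lookup that is consulted before the classifier, so magnet-matched mail always counts as correct and never reaches the classifier.

```python
def reported_accuracy(messages, classify, magnets):
    """messages: list of (sender, true_bucket) pairs.

    Returns (overall accuracy, accuracy on classifier-handled mail only).
    """
    total_ok = clf_ok = clf_seen = 0
    for sender, truth in messages:
        if sender in magnets:
            guess = magnets[sender]      # magnet wins; classifier never sees it
        else:
            guess = classify(sender)
            clf_seen += 1
            clf_ok += guess == truth
        total_ok += guess == truth
    overall = total_ok / len(messages)
    classifier_only = clf_ok / clf_seen if clf_seen else None
    return overall, classifier_only

# Hypothetical mailbox: six retailer mails plus four others.
messages = [("news@shop.example", "shopping")] * 6 + [
    ("a@x.example", "inbox"), ("b@x.example", "inbox"),
    ("c@x.example", "spam"), ("d@x.example", "spam"),
]

def classify(sender):
    """Stub standing in for a weakly trained classifier."""
    return "inbox"

magnets = {"news@shop.example": "shopping"}

print(reported_accuracy(messages, classify, magnets))  # with the magnet
print(reported_accuracy(messages, classify, {}))       # without it
```

With the magnet in place the overall figure looks good even though the classifier itself is right on only half of the mail it actually handles, which is why statistics raised this way are described above as artificial.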

Language. POPFile can have some trouble classifying messages written in different languages, as the grammar rules are often very different, and some people send and receive email in more than one language. Here again, given enough time, POPFile should be able to learn from the examples it is given, and the statistics will rise. This is also an area that is still being developed, so it should improve as POPFile evolves.

Message Length. Most email is long enough to give POPFile enough data to work with. When the body content is very light, POPFile can usually find enough classification data in the headers. There are also pseudoword indicators that help POPFile determine the classification in these cases, for example when the message contains only an image, as many spam messages do. Extremely short emails are particularly tough to deal with and may end up unclassified.
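
One way to picture why very short messages end up unclassified is a decision rule that requires a minimum amount of evidence before committing to a bucket. The sketch below is an assumption for illustration only: the per-word scores, the two bucket names, and the threshold (the winning bucket must be about 100 times more likely) are hypothetical and are not taken from POPFile's source.

```python
import math

def decide(word_log_odds, message, threshold=math.log(100)):
    """Sum per-word evidence; refuse to classify if the total is too weak.

    Positive scores favour "spam", negative scores favour "inbox".
    Unknown words contribute nothing.
    """
    evidence = sum(word_log_odds.get(w, 0.0) for w in message.lower().split())
    if abs(evidence) < threshold:
        return "unclassified"
    return "spam" if evidence > 0 else "inbox"

# Hypothetical per-word log-odds learned from earlier training.
odds = {"free": 1.5, "offer": 1.0, "prize": 2.0, "meeting": -2.0, "agenda": -1.5}

print(decide(odds, "free offer"))
print(decide(odds, "free offer free prize offer"))
```

Under these assumed scores the two-word message does not accumulate enough evidence to clear the threshold, while the longer message with the same kind of wording does. Header data and pseudowords help precisely because they add evidence when the body supplies very little.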
