Is it possible to 'unbalance' your corpus?

Research on this has not been fully conclusive, but buckets do not appear to become unbalanced as long as you keep training on errors, which seems to be the key. Even users who receive an inordinate amount of one type of mail usually find that POPFile sorts messages with high accuracy, even though their word counts are not evenly distributed among buckets.

Three situations can lead to an unbalanced corpus:

  1. Deleting a mature bucket that contained a significant portion of the overall corpus's words. Removing those words increases the probabilities of the shared words in the remaining buckets that had them in common with the deleted bucket. This may bias classifications toward those buckets until enough reclassifications occur to make up for the impact.
  2. Erasing all the words from a mature bucket, which can have the same impact.
  3. Adding a new bucket that is unlikely to receive many messages to a mature corpus. Once a few messages have been reclassified into it, standard message headers can quickly become weighted heavily toward the new bucket. This mostly affects very short messages, where most of the classification comes from the headers. The problem goes away as the bucket gets more training on different words, which makes its few words and headers less significant. If it becomes a problem, you can always delete the new bucket and return to your previous setup.
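
The first and third situations can be illustrated with a toy model. This is only a sketch and not POPFile's actual implementation: it assumes each bucket scores a word by its relative frequency within that bucket and that the scores are then normalized across buckets. The bucket names, word counts, and the `word_share` helper are all made up for the example.

```python
def word_share(corpus, word):
    """Return each bucket's normalized share of the evidence for `word`.

    `corpus` maps bucket name -> {word: count}. A bucket's score is
    count(word) / total words in that bucket; scores are then normalized
    so they sum to 1. (Hypothetical helper, not part of POPFile.)
    """
    score = {}
    for bucket, counts in corpus.items():
        total = sum(counts.values())
        score[bucket] = counts.get(word, 0) / total if total else 0.0
    norm = sum(score.values())
    return {b: (s / norm if norm else 0.0) for b, s in score.items()}


# A made-up mature corpus: three buckets with 1000 words each.
corpus = {
    "work": {"meeting": 300, "subject": 200, "report": 500},
    "personal": {"meeting": 100, "subject": 200, "dinner": 700},
    "spam": {"meeting": 600, "subject": 200, "offer": 200},
}

# Situation 1: delete a mature bucket that shared words with the others.
before_delete = word_share(corpus, "meeting")
without_spam = {b: c for b, c in corpus.items() if b != "spam"}
after_delete = word_share(without_spam, "meeting")

# Situation 3: add a tiny bucket that has only seen a few header words.
with_new = dict(corpus, newsletters={"subject": 5, "weekly": 5})
after_add = word_share(with_new, "subject")

print("meeting, before delete:", before_delete)
print("meeting, after delete: ", after_delete)
print("subject, after adding small bucket:", after_add)
```

In this toy model, the share of "meeting" held by the remaining buckets rises once "spam" is gone, and the small "newsletters" bucket claims a larger share of the header word "subject" than any mature bucket because that word makes up half of its few training words. Further training on varied words would dilute that effect.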

As a rule of thumb, if you are performing a major reorganization of your buckets by deleting buckets, erasing words from them, or adding a lot of buckets, you will get the best results by simply erasing all buckets and restarting the training.

It is also a good idea to reset your training after you have used POPFile for a month or two. By that time you will probably have a clearer idea of what goes in each of your buckets and of how POPFile works overall, so your accuracy should improve noticeably.

 
faq/corpusunbalance.txt · Last modified: 2008/02/08 19:49 (external edit)

The content of this wiki is protected by the GNU Free Documentation License