The research is not fully conclusive, but buckets do not appear to become unbalanced as long as you keep training on errors, which seems to be the key. Even users who receive an inordinate amount of one type of mail usually find that POPFile sorts messages with high accuracy, even though their word counts are not evenly distributed among buckets.
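To make the "train on errors" idea concrete, here is a minimal Python sketch. It is not POPFile's actual code: the TinyBayes class, the train_on_errors helper, the add-one smoothing, the word-share prior, and the sample messages are all assumptions chosen for illustration. It shows a toy naive Bayes classifier that only learns from a message when it has misclassified it.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy multinomial naive Bayes classifier (illustrative, not POPFile's code)."""

    def __init__(self, buckets):
        # One word-count table per bucket, kept in the order given.
        self.words = {bucket: Counter() for bucket in buckets}

    def train(self, bucket, text):
        # Add every word of the message to the chosen bucket.
        self.words[bucket].update(text.lower().split())

    def totals(self):
        # Total number of trained words per bucket.
        return {bucket: sum(counts.values()) for bucket, counts in self.words.items()}

    def classify(self, text):
        totals = self.totals()
        grand_total = sum(totals.values())
        vocab_size = max(len(set().union(*self.words.values())), 1)
        best_bucket, best_score = None, -math.inf
        for bucket, counts in self.words.items():
            # Prior: the bucket's smoothed share of all trained words (an assumption).
            score = math.log((totals[bucket] + 1) / (grand_total + len(self.words)))
            for word in text.lower().split():
                # Add-one smoothing so unseen words never zero out a bucket.
                score += math.log((counts[word] + 1) / (totals[bucket] + vocab_size))
            if score > best_score:  # ties go to the earlier bucket
                best_bucket, best_score = bucket, score
        return best_bucket


def train_on_errors(classifier, stream):
    """Train only on messages the classifier gets wrong; return the correction count."""
    corrections = 0
    for correct_bucket, text in stream:
        if classifier.classify(text) != correct_bucket:
            classifier.train(correct_bucket, text)
            corrections += 1
    return corrections


# A made-up mail stream that is heavily skewed towards spam.
stream = (
    [("personal", "lunch with mom tomorrow"), ("spam", "cheap lunch deals tomorrow")]
    + [("spam", "cheap pills deals now"), ("spam", "cheap pills buy now")] * 10
    + [("personal", "lunch with mom tomorrow")]
)
classifier = TinyBayes(["spam", "personal"])
corrections = train_on_errors(classifier, stream)
print(corrections, classifier.totals())
```

In this toy stream, 21 of the 23 messages are spam, yet only two corrections are made and each bucket ends up with four words. Bucket sizes follow the mistakes made rather than the volume of mail, which is one plausible reading of why training on errors keeps buckets from drifting out of balance.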
Three situations that can lead to an unbalanced corpus include:
As a rule of thumb, if you are performing a major reorganization of your buckets (deleting buckets, erasing words from them, or adding many new buckets), you will get the best results by erasing all buckets and restarting the training.
It is also a good idea to reset your training after you have used POPFile for a month or two. By then you will probably have a clearer idea of what belongs in each bucket and of how POPFile works overall, so your accuracy should improve noticeably.
Should you find anything in the documentation that is incomplete, unclear, outdated or just plain wrong, please let us know by leaving a note in the Documentation Forum.