POPFile makes use of Bayes Theorem in the form:

P(E|Bi) x P(Bi) P(Bi|E) = --------------- P(E)

Where there are n buckets B1 to Bn and there are m words in total W1 to Wm. We want to know for a specific email E which bucket it is most likely to belong to.

Here P(Bi|E) is the probability that email E is in bucket Bi; that is the probability that given a set of words E they appear in bucket Bi.

P(E|Bi) is the probability that for a given bucket Bi the words in E appear in that bucket.

P(Bi) is the probability of a given bucket; that is the probability of any email being in bucket Bi.

P(E) is the probability of that specific email occuring.

To calculate which bucket E should go in we need to calculate P(Bi|E) for each of the buckets and find the largest. Since each of those calculations involves the value P(E) we just ignore it and pretend that we need to calculate

P(Bi|E) = P(E|Bi) x P(Bi)

First E is split into the set of words in E, call them E1 through Eo. To calculate P(E|Bi) we calculate the product of the probabilities for each word. That is the likelihood that each word appears in Bi. Here's the “naive” step; we assume that words appear independent from other words which is clearly not true for most languages!

P(E|Bi) = P(E1|Bi) x P(E2|Bi) x ... x P(Eo|Bi)

POPFile ignores the term P(E) since it is the same for each calculation and calculates:

(1) P(Bi|E) = P(E1|Bi) x P(E2|Bi) x ... x P(Eo|Bi) x P(Bi)

And picks the largest to determine which bucket the email belongs in. But equation (1) works well in theory and badly in practice. That chain of multiplications quickly leads to underflow in many languages because the numbers involved are tiny fractions and multiplying many together results in smaller and smaller numbers until the floating point system cannot cope any more.

One way round this is to use logarithms. Remembering the rule of logs that:

log A + log B = log AB

POPFile calculates the following:

(2) log P(Bi|E) = log P(E1|Bi) + log P(E2|Bi) + ... + P(Eo|Bi) + log P(Bi)

and then chooses the largest value of log P(Bi|E) in determining the bucket. By precalculating the log values for each word and bucket and by the fact that the multiplication has been replaced by addition the calculation is more accurate and faster.

Should you find anything in the documentation that is incomplete, unclear, outdated or just plain wrong, please let us know and leave a note in the Documentation Forum.