How POPFile does email classification

POPFile uses a technique called Naive Bayes to calculate, from the words in an email, the probability that the email belongs in a specific bucket.

A bucket is represented by a collection of words and their frequencies. The set of buckets is called the corpus. It determines the different buckets that an email can be placed in, the probability of an individual word appearing in an email for a specific bucket, and the probability of an email being in a bucket to start with.
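
As a rough illustration of that structure, the sketch below models a corpus as a mapping from bucket name to word counts. The bucket names, words and counts are invented for the example, and this is not POPFile's actual storage format.

```python
from collections import Counter

# Hypothetical toy corpus: each bucket maps a word to how often it was seen.
corpus = {
    "spam": Counter({"cheap": 4, "offer": 3, "meeting": 1}),
    "work": Counter({"meeting": 5, "report": 4, "offer": 1}),
}

# The buckets an email can be placed in.
buckets = sorted(corpus)

# Total number of words stored in each bucket.
bucket_totals = {name: sum(words.values()) for name, words in corpus.items()}

# Total number of words in all the buckets put together.
corpus_total = sum(bucket_totals.values())

print(buckets)
print(bucket_totals)
print(corpus_total)
```

Everything the classifier needs (the list of buckets, the per-word counts and the totals) can be derived from this one structure.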

Suppose there are n buckets, B1 to Bn, and m words in total, W1 to Wm. For a specific email E, we want to know which bucket it is most likely to belong to.

We want to calculate P(Bi|E) for each bucket Bi. That calculation can be performed using Bayes' rule as follows:

		   P(E|Bi) x P(Bi)
	P(Bi|E) =  ---------------
		        P(E)

Here P(Bi|E) is the probability that email E is in bucket Bi; that is, given the set of words in E, the probability that they belong in bucket Bi.

P(E|Bi) is the probability that for a given bucket Bi the words in E appear in that bucket.

P(Bi) is the probability of a given bucket; that is the probability of any email being in bucket Bi.

P(E) is the probability of that specific email occurring.

To decide which bucket E should go in, we need to calculate P(Bi|E) for each of the buckets and find the largest. Since every one of those calculations divides by the same value P(E), it has no effect on which bucket comes out largest, so we just ignore it and pretend that we need to calculate

	P(Bi|E) = P(E|Bi) x P(Bi)
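
A small numeric check can make this shortcut more convincing. The sketch below uses made-up values for P(E|Bi) and P(Bi), and assumes every email belongs to exactly one bucket so that P(E) can be obtained by summing over the buckets. It shows that the winning bucket is the same with or without the division.

```python
# Illustrative values only, for two hypothetical buckets.
likelihood = {"spam": 0.02, "work": 0.005}  # P(E|Bi)
prior = {"spam": 0.4, "work": 0.6}          # P(Bi)

# Simplified score: P(E|Bi) x P(Bi), with P(E) left out.
score = {b: likelihood[b] * prior[b] for b in prior}

# P(E) as the sum of the scores over all buckets (assumes the buckets
# are mutually exclusive and cover every email).
p_email = sum(score.values())

# Full Bayes' rule: every score is divided by the same P(E).
posterior = {b: score[b] / p_email for b in score}

best_by_score = max(score, key=score.get)
best_by_posterior = max(posterior, key=posterior.get)

print(best_by_score, best_by_posterior)
```

Dividing by P(E) rescales the scores so that they sum to 1, but it does not change their order.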

First E is split into the set of words in E; call them E1 through Eo. To calculate P(E|Bi) we calculate the product of the probabilities for each word, that is, the likelihood that each word appears in Bi. Here's the “naive” step: we assume that words appear independently of one another, which is clearly not true for most languages!

	P(E|Bi) = P(E1|Bi) x P(E2|Bi) x ... x P(Eo|Bi)
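
The product can be sketched in a few lines. The per-word probabilities below are invented for a single hypothetical bucket, and the function name `email_likelihood` is only an illustration, not something taken from POPFile.

```python
import math

# Hypothetical values of P(Ej|Bi) for one bucket Bi.
word_prob = {"cheap": 0.5, "offer": 0.25, "meeting": 0.125}

def email_likelihood(words, word_prob):
    """Return P(E|Bi) as the product of P(Ej|Bi) over the words of E."""
    # Each word contributes one independent factor: the "naive" assumption.
    return math.prod(word_prob[w] for w in words)

email_words = ["cheap", "offer"]
print(email_likelihood(email_words, word_prob))  # 0.5 x 0.25 = 0.125
```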

For any bucket Bi, P(Ej|Bi) is calculated as the number of times Ej appears in Bi divided by the total number of words in Bi.

P(Bi) is calculated as the total number of words in Bi divided by the total number of words in all the buckets put together.
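
Both estimates are simple ratios of counts. The sketch below computes them over a hypothetical two-bucket corpus; the data and the function names are assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy corpus: bucket name -> word counts.
corpus = {
    "spam": Counter({"cheap": 4, "offer": 3, "meeting": 1}),
    "work": Counter({"meeting": 5, "report": 4, "offer": 1}),
}

def word_probability(word, bucket):
    """P(Ej|Bi): times the word appears in the bucket / total words in the bucket."""
    counts = corpus[bucket]
    return counts[word] / sum(counts.values())

def bucket_probability(bucket):
    """P(Bi): total words in the bucket / total words in all buckets."""
    total = sum(sum(counts.values()) for counts in corpus.values())
    return sum(corpus[bucket].values()) / total

print(word_probability("offer", "spam"))   # 3 / 8
print(bucket_probability("spam"))          # 8 / 18
print(word_probability("report", "spam"))  # 0 / 8
```

Note that, taken literally, the formula gives a probability of zero to any word that has never been seen in a bucket.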

Finally we calculate P(Bi|E) as

	P(Bi|E) = P(E1|Bi) x P(E2|Bi) x ... x P(Eo|Bi) x P(Bi)

for each bucket and pick the largest.
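
Putting the steps together, a minimal sketch of the whole procedure might look like the following. The corpus is invented, splitting the text on whitespace is an assumed tokenisation, and the code follows the formulas on this page literally rather than reproducing POPFile's own implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus: bucket name -> word counts.
corpus = {
    "spam": Counter({"cheap": 4, "offer": 3, "meeting": 1}),
    "work": Counter({"meeting": 5, "report": 4, "offer": 1}),
}

def score(words, bucket):
    """Return P(E1|Bi) x ... x P(Eo|Bi) x P(Bi) for one bucket."""
    counts = corpus[bucket]
    bucket_total = sum(counts.values())
    corpus_total = sum(sum(c.values()) for c in corpus.values())
    prior = bucket_total / corpus_total
    likelihood = math.prod(counts[w] / bucket_total for w in words)
    return likelihood * prior

def classify(text):
    """Pick the bucket with the largest score for the given email text."""
    words = text.lower().split()  # assumed tokenisation: split on whitespace
    return max(corpus, key=lambda bucket: score(words, bucket))

print(classify("cheap offer"))
print(classify("meeting report"))
print(classify("offer meeting"))
```

A practical implementation would probably also need to deal with words that have never been seen in a bucket and with products that become extremely small, neither of which is covered by the description above.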

faq/howitworks.txt · Last modified: 2009/04/07 13:51 by xuesheng

The content of this wiki is protected by the GNU Free Documentation License