POPFile - Automatic Email Classification

Why are common header names marked as a specific bucket?

You may have noticed that the names of many email headers (Received, Content-Type, Date, Return-Path, Message-Id) that are common to almost all email are most likely found in a certain bucket. This is not a problem, even if most of them are considered spam words. In a mature corpus they are not significantly weighted toward any single bucket. A word being colored for a certain bucket does not mean it is ignored when classifying for other buckets. In combination with a few words that more strongly indicate the actual classification, the correct bucket is still chosen.
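To make this concrete, here is a minimal sketch of a naive Bayes style score, assuming equal bucket priors. It is not POPFile's actual code: the bucket names, the word counts, and the helper functions probability, color, and classify are all made up for illustration.

```python
import math

# Hypothetical corpus: made-up word counts per bucket, for illustration only.
COUNTS = {
    "spam":  {"header:Received": 520, "viagra": 40, "meeting": 2},
    "inbox": {"header:Received": 480, "viagra": 1, "meeting": 45},
}
# Assumed total number of words seen in each bucket.
TOTAL_WORDS = {"spam": 10000, "inbox": 10000}


def probability(word, bucket):
    """Estimate P(word | bucket); unseen words get a small floor count."""
    return COUNTS[bucket].get(word, 0.5) / TOTAL_WORDS[bucket]


def color(word):
    """Return the bucket in which the word is most likely (its 'color')."""
    return max(COUNTS, key=lambda bucket: probability(word, bucket))


def classify(words):
    """Pick the bucket with the highest summed log-probability (equal priors assumed)."""
    scores = {
        bucket: sum(math.log(probability(w, bucket)) for w in words)
        for bucket in COUNTS
    }
    return max(scores, key=scores.get)
```

With these numbers, color("header:Received") is "spam", yet classify(["header:Received", "meeting"]) returns "inbox": the header name is only slightly weighted toward spam, so a single strongly indicative word outweighs it.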

For some headers, the capitalization of the header name may indicate something useful. For example, header:Message-ID and header:Message-Id, or header:MIME-Version and header:Mime-Version, may give you different results. The To, From, and Subject headers are on the ignore list because they are always present and always in the same form, so they aren't useful in classification.
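The idea of case-sensitive header tokens can be sketched as follows. This is an illustration using Python's standard email module, not POPFile's own parser; the function name header_tokens and the IGNORED set are assumptions made for the example.

```python
from email import message_from_string

# Header names assumed to be on the ignore list (compared case-insensitively).
IGNORED = {"to", "from", "subject"}


def header_tokens(raw_message):
    """Return one 'header:<Name>' token per header, keeping the original case."""
    msg = message_from_string(raw_message)
    # msg.keys() yields header names exactly as they were written in the message.
    return ["header:" + name for name in msg.keys() if name.lower() not in IGNORED]


raw = (
    "Message-Id: <1@example.com>\n"
    "Mime-Version: 1.0\n"
    "To: someone@example.com\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)
tokens = header_tokens(raw)
```

Here tokens contains header:Message-Id and header:Mime-Version. A message written with Message-ID and MIME-Version would produce different tokens, which is why the two spellings can be weighted differently.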

In the case of the Received header, the number of Received headers is important. Depending on how your email is set up, a lot of Received lines may indicate spam, a newsletter, or a certain email account. Each Received header records one hop the message took to reach your email server. So if spam tends to take more hops to reach you, the more Received headers a message has, the more likely it is to be spam.
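Counting the Received headers of a message can be sketched like this, again with Python's standard email module rather than POPFile itself. The function name count_received and the sample message are hypothetical.

```python
from email import message_from_string


def count_received(raw_message):
    """Return how many Received headers the message carries."""
    msg = message_from_string(raw_message)
    # get_all() matches the header name case-insensitively; [] is returned if none exist.
    return len(msg.get_all("Received", []))


sample = (
    "Received: from relay2.example.net by mx.example.org; Mon, 1 Jan 2024 10:00:02 +0000\n"
    "Received: from relay1.example.net by relay2.example.net;\n"
    "\tMon, 1 Jan 2024 10:00:01 +0000\n"
    "Received: from sender.example.com by relay1.example.net; Mon, 1 Jan 2024 10:00:00 +0000\n"
    "Date: Mon, 1 Jan 2024 10:00:00 +0000\n"
    "\n"
    "body text\n"
)
hops = count_received(sample)
```

The folded second header (continued on an indented line) still counts once, so hops is 3 for this sample. Whether a high count points to spam, a newsletter, or a particular account depends on your own mail setup.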