@@ Line 1: / Line 1: @@
+===== Why are common header names marked as a specific bucket? =====
+You may have noticed that the names of many email headers (//Received//, //Content-Type//, //Date//, //Return-Path//, //Message-Id//) that are common to almost all email tend to be colored for one particular bucket.  This is not a problem, even if most of them are considered spam words.  In a mature corpus they are not significantly weighted toward any single bucket.  A word that is colored for a certain bucket is still used in classification for the other buckets.  In combination with a few words that more strongly indicate the actual classification, the correct bucket is still chosen.
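The effect described above can be sketched with a small naive Bayes style calculation. This is a minimal illustration, not POPFile's actual code: the bucket names, the word counts, and the helper names (`word_prob`, `colored_bucket`, `classify`) are all invented for the example, and equal bucket priors are assumed.

```python
import math

# Hypothetical per-bucket word counts (invented numbers, not real corpus data).
counts = {
    "spam":  {"header:Received": 1000, "header:Date": 1000, "viagra": 300, "meeting": 5},
    "inbox": {"header:Received": 900,  "header:Date": 1000, "viagra": 1,   "meeting": 404},
}

# Every word seen in any bucket.
vocab = {word for bucket in counts.values() for word in bucket}


def word_prob(word, bucket):
    """Probability of a word within one bucket, with add-one smoothing."""
    total = sum(counts[bucket].values())
    return (counts[bucket].get(word, 0) + 1) / (total + len(vocab))


def colored_bucket(word):
    """The bucket in which the word is most probable (the bucket it is 'colored' for)."""
    return max(counts, key=lambda b: word_prob(word, b))


def classify(words):
    """Pick the bucket with the highest summed log probability (equal priors assumed)."""
    scores = {b: sum(math.log(word_prob(w, b)) for w in words) for b in counts}
    return max(scores, key=scores.get)


print(colored_bucket("header:Received"))
print(classify(["header:Received", "header:Date", "meeting"]))
print(classify(["header:Received", "header:Date", "viagra"]))
```

With these made-up counts, //header:Received// is slightly more probable in the spam bucket, so it would be colored as spam, yet its probabilities in the two buckets are close enough that a single strongly weighted word such as "meeting" or "viagra" decides the result.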
+For some headers, the case of the header name may indicate something useful.  For example, //header:Message-ID// and //header:Message-Id//, or //header:MIME-Version// and //header:Mime-Version//, may give you different results.  The //To//, //From//, and //Subject// headers are on the ignore list because they are always present and always in the same form, so they are not useful in classification.
+In the case of the //Received// header, the number of //Received// headers is important.  Each mail server that handles a message before it reaches your email server adds one //Received// header.  Depending on how your email is set up, a large number of //Received// lines may indicate spam, a newsletter, or a certain email account.  In many setups, the more //Received// headers a message has, the more likely it is to be spam.
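To see which header pseudo-words a message would produce and how many //Received// headers it carries, something like the following sketch can help. It uses Python's standard `email` module rather than POPFile itself. The `header:Name` token form follows the examples above, but the function names, the `IGNORED` set, and the sample message are assumptions made for illustration.

```python
from email import message_from_string

# Header names treated as ignored, per the text above (compared case-insensitively).
IGNORED = {"to", "from", "subject"}

# Hypothetical sample message; hosts and addresses are placeholders.
SAMPLE = (
    "Received: from relay2.example.net by mx.example.org; Mon, 1 Jan 2007 10:00:02 +0000\n"
    "Received: from relay1.example.com by relay2.example.net; Mon, 1 Jan 2007 10:00:01 +0000\n"
    "Message-Id: <1@example.com>\n"
    "Mime-Version: 1.0\n"
    "From: a@example.com\n"
    "To: b@example.org\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)


def header_tokens(raw_message):
    """Return one 'header:Name' token per header line, keeping the original case."""
    msg = message_from_string(raw_message)
    # keys() preserves the case used in the message and repeats duplicated headers.
    return [f"header:{name}" for name in msg.keys() if name.lower() not in IGNORED]


def received_count(raw_message):
    """Count the Received headers, one per mail server that handled the message."""
    msg = message_from_string(raw_message)
    return len(msg.get_all("Received", []))


print(header_tokens(SAMPLE))
print(received_count(SAMPLE))
```

Because the original case is kept, a message that uses //Message-Id// yields a different token from one that uses //Message-ID//, and the repeated //header:Received// tokens reflect how many servers the message passed through.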

POPFile - Automatic Email Classification