| |
— | faq:commonheaders [2008/02/08 19:49] (current) – created - external edit 127.0.0.1 |
---|
| ===== Why are common header names marked as a specific bucket? ===== |
| |
| You may have noticed that the names of many email headers (//Received//, //Content-Type//, //Date//, //Return-Path//, //Message-Id//) that appear in almost all email are usually shown as belonging to one particular bucket. This is not a problem, even if most of them end up colored as spam words. In a mature corpus they are not weighted significantly toward any single bucket. A word being colored for one bucket does not mean it is ignored when classifying for other buckets; combined with a few words that indicate the actual classification more strongly, the correct bucket is still chosen. |
| |
| For some headers, the capitalization of the header name may indicate something useful. For example, //header:Message-ID// and //header:Message-Id//, or //header:MIME-Version// and //header:Mime-Version//, may give you different results. The //To//, //From//, and //Subject// headers are on the ignore list because they are always present and always in the same form, so they are not useful in classification. |
| |
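The case-sensitive header pseudowords described above can be illustrated with a minimal sketch. It uses Python's standard `email` module and is not the classifier's actual implementation; the function name `header_tokens` and the `IGNORED_HEADERS` set are hypothetical names chosen for this example.

```python
from email import message_from_string

# Assumption for this sketch: headers present in nearly every message
# in the same form are skipped, as the FAQ describes for To/From/Subject.
IGNORED_HEADERS = {"to", "from", "subject"}


def header_tokens(raw_message):
    """Return 'header:Name' pseudowords, keeping the sender's capitalization."""
    msg = message_from_string(raw_message)
    return [
        f"header:{name}"
        for name in msg.keys()  # keys() preserves the original case
        if name.lower() not in IGNORED_HEADERS
    ]


raw = (
    "Message-Id: <1@example.com>\n"
    "MIME-Version: 1.0\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
print(header_tokens(raw))
```

Because the capitalization is kept, a message sent with `Message-ID` would yield a different pseudoword than one sent with `Message-Id`, which is what lets the two spellings carry different weights.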
| In the case of the //Received// header, the number of //Received// headers is important. Depending on how your email is set up, a lot of //Received// lines may indicate spam, a newsletter, or a certain email account. Each server that handles a message before it reaches your email server adds a //Received// header, so in some setups the more //Received// headers a message has, the more likely it is to be spam. |
| |
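To see how many //Received// headers a message carries, a short sketch along these lines may help. It again relies on Python's standard `email` module; `count_received` is a hypothetical helper name, and how the count is used in classification is not shown here.

```python
from email import message_from_string


def count_received(raw_message):
    """Count the Received headers in a raw message (name match ignores case)."""
    msg = message_from_string(raw_message)
    # get_all returns every value for the header, or the fallback list if none
    return len(msg.get_all("Received", []))


sample = (
    "Received: from relay.example.net by mx.example.org; Fri, 8 Feb 2008 19:49:00 +0000\n"
    "Received: from sender.example.com by relay.example.net; Fri, 8 Feb 2008 19:48:00 +0000\n"
    "Date: Fri, 8 Feb 2008 19:47:00 +0000\n"
    "\n"
    "body\n"
)
print(count_received(sample))
```

Comparing this count across messages you already know to be spam, newsletters, or ordinary mail is one way to check whether the number of hops is a meaningful signal for your own setup.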