POPFile - Automatic Email Classification

Why doesn't word salad work?

Spammers have started using “word salad” to get past spam filters. They either scatter a lot of unspammy dictionary words through the message or load it up with made-up words. Sometimes the filler is a news article or a passage from a book.

Using lots of uncommon words might be effective against other filters that treat unknown words as non-spammy. POPFile does not: it treats an unknown word as unlikely to be in any bucket, assigning it a very small value that depends on the size of the bucket. The key point is that unseen words are roughly equally weighted (based on bucket size) across the various buckets, so their effect is neutral.
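To make the neutral effect concrete, here is a minimal sketch of a naive Bayes style scorer. It is not POPFile's actual code: the word counts, the bucket names, and the formula in `unseen_log_prob` (one tenth of the rarest possible word frequency for that bucket) are all assumptions made for illustration, and bucket priors are left out.

```python
import math

# Hypothetical word counts per bucket (illustrative, not real POPFile data).
BUCKETS = {
    "spam": {"pills": 40, "click": 30, "simple": 20, "offer": 10},
    "inbox": {"meeting": 40, "report": 30, "lunch": 20, "offer": 10},
}

def bucket_size(bucket):
    """Total number of word occurrences recorded for a bucket."""
    return sum(BUCKETS[bucket].values())

def unseen_log_prob(bucket):
    # Assumed formula: a very small value that depends on bucket size.
    return -math.log(10 * bucket_size(bucket))

def score(words, bucket):
    """Sum of log-probabilities of the words under one bucket."""
    counts = BUCKETS[bucket]
    size = bucket_size(bucket)
    total = 0.0
    for word in words:
        if word in counts:
            total += math.log(counts[word] / size)
        else:
            total += unseen_log_prob(bucket)
    return total

def classify(words):
    """Return the bucket with the highest score."""
    return max(BUCKETS, key=lambda b: score(words, b))

def margin(words):
    """How far ahead the spam bucket is of the inbox bucket."""
    return score(words, "spam") - score(words, "inbox")

message = ["pills", "click", "offer"]
# Words that appear in no bucket at all: the "salad".
salad = ["zyzzyva", "quixotic", "perambulate", "obelisk"] * 25

print(classify(message), round(margin(message), 3))
print(classify(message + salad), round(margin(message + salad), 3))
```

Each salad word lowers every bucket's score by the same amount here, so the winning bucket and its lead are unchanged. With unequal bucket sizes, under this assumed formula, each unseen word would shift the margin only by the log of the size ratio.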

In POPFile, everyone's spam and nonspam words are specific to their own email, so loading messages with word salad is not effective. Spammers aren't able to find words that are going to be non-spammy for everyone. Often simple words that seem non-spammy are actually spammy. By coincidence, our simple example word is “simple.” It was brought up in a discussion of word salad and turned out to have widely varying spamminess. For four out of seven users who checked, it was a good spam indicator.

User	Status of the word 'simple'
Brian	very spammy, 0.82 probability
James	low probability
Jeremiah	spammy
Jim	spammy
Joseph	far higher probability in school mail, 0.64
Robbie	spammy with 0.81 probability
Troy	didn't appear at all in any bucket

In his presentation at the 2004 MIT Spam Conference, John indicated that a _random_ (essentially worst-case or “brute force”) word-salad attack worked in some small percentage of cases.

The main point is that it may be possible to get a small percentage of messages through to a small percentage of people by using lots of word salad. But then, how are they going to advertise their enlargement pills? It's not very effective spam if it doesn't include a URL, so that will still be there. And don't forget that email headers also contribute heavily to classification in POPFile.

Also See: NewWords