What are the Pseudowords built into POPFile and how do they work?

POPFile uses pseudowords to record HTML display elements and various spammer tricks so that it can handle them as special cases, which helps improve accuracy. Pseudowords are also used to keep track of which headers appear in a message and which words appear in some of those headers, such as From, To, CC, and Subject. POPFile ignores the case of words and does not make any generalizations when a word isn't found in the specific source.

People have asked whether writing one of these pseudowords, such as html:comment, into a message can fool POPFile. It cannot: when such text appears in a message, POPFile sees and stores it as two separate words, html and comment, rather than as the pseudoword.
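
The behaviour described above can be illustrated with a minimal sketch. It is not POPFile's actual parser (POPFile is written in Perl); the function name `tokenize_body` and the splitting rule are assumptions chosen only to show why literal pseudoword text in a message body never becomes a pseudoword.

```python
import re

def tokenize_body(text):
    """Split message text into lowercase words.

    Hypothetical stand-in for a mail tokenizer. Assumption: every run of
    non-alphanumeric characters, including ':', separates words.
    """
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", text) if w]

# A spammer types the pseudoword literally into the message body.
body_words = tokenize_body("Buy now html:comment cheap")

# The colon acts as a separator, so the text yields two ordinary words.
print(body_words)                    # ['buy', 'now', 'html', 'comment', 'cheap']
print("html:comment" in body_words)  # False
```

In this sketch a token such as html:comment could only be produced by the parser itself, for example after it has seen a real HTML comment, never by splitting body text.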

These are the current pseudowords built into POPFile:

cc:<address>
names and addresses in the CC header
charset:<various>
the character set listed in the message
encoding:XXX
This one notes the content transfer encoding of a message. XXX might be, e.g., 8BIT or BASE-64.
from:<address>
names and addresses in the From header
header:<various>
headers present in a message
header:XXX
XXX can be any kind of header that might or should be used in an email, e.g. Date: or Subject:, but also less common ones such as Precedence: or List-Unsubscribe:.
html:authorization
another way to conceal URLs is with the authorization syntax, e.g. http://microsoft.com@somewhereelse.com will not take you to one of Microsoft's sites, but to somewhereelse.com.
html:backcolorXXX
marks the background color of an HTML message.
html:colordistanceXX
gets set when a message uses low contrast between foreground and background, possibly to hide certain words.
html:comment
HTML comments can be used to hide words from dumb filters, while users get to see them just fine:
VIA<!--thisisacomment-->GRA.
html:cidsrc
image source referencing an attachment by its cid
html:css*color<color>
various ways of defining colors with CSS
html:cssdisplay<value>
display value defined with CSS
html:cssfontsize<size>
font size defined with CSS
html:cssvisibility<value>
visibility value defined with CSS
html:emptypair
Yet another variation of the hide-this-word-from-filters trickery. An empty pair of tags is slipped inside the word, e.g. VIA<b></b>GRA, which still displays as VIAGRA.
html:encodedurl
Spammers will sometimes try to conceal URLs they want you to click. They encode them with something similar to numeric HTML entities and hope that… What?
html:fontcolorXXX
XXX can be any color used for any html element.
html:fontsizeXXX
this one keeps track of the font sizes used in a message.
html:iframeremotesrc
an iframe in the mail has a remote source
html:imgheightXXX
XXX is the height of an embedded image.
html:imgremotesrc
is triggered when an email references a picture that is supposed to be loaded from the web.
html:imgwidthXXX
XXX is the width of an embedded image.
html:invalidtag
This one works just like the comment trick. Instead of placing a comment inside a word, spammers use invalid (made-up) HTML tags inside the word. The effect is the same.
html:numericentity
is set when a message contains an HTML numeric entity, like &#86;. Numeric entities can be used to display special characters, like the Euro symbol. But they can also be used for normal characters, when spammers are trying to hide give-away words from filters. E.g. this spells 'VIAGRA': &#86;&#73;&#65;&#71;&#82;&#65;
html:td
this one keeps track of the number of HTML table cells in an email. Usually, emails are just paragraphs of text. When they contain tables, and those tables contain many cells, something different may be going on.
mimeextension:XXX
Like mimename, this is about the file name of an attachment, but this pseudoword stores only the extension.
mimename:XXX
If a message has a mime-encoded attachment, POPFile stores the file name of the attachment in this pseudoword.
spamassassin:<various>
SpamAssassin tests
spamassassinlevel:spam
counted once for every full point of SpamAssassin level
subject:<various>
words found in the Subject header
to:<address>
names and addresses in the To header
trick:spacedout
this gets set when a string is broken up by spaces or other random characters between the characters of a word.
e.g. V I A G R A or V.I.A.G.R.A is easy to read, but breaks up the string “VIAGRA”.
trick:dottedwords
this is a fairly simple trick where words have dots in random places, e.g. Mort.gage
trick:invisibleink
to fool filters like POPFile, spammers try to insert harmless words into their messages. You are not supposed to see these words, but the spammers expect your filter to see them, so they simply set the color of the words to the color of the background.
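
Several of the HTML entries in the list above (html:comment, html:emptypair and html:numericentity) describe the same idea: markup that a mail client hides or decodes, so the reader sees a word that a naive filter does not. The following is a rough sketch of how such markup could be detected and undone. It is not POPFile's implementation; the function name `reveal` and the regular expressions are assumptions made for illustration, and real mail needs a proper HTML parser.

```python
import html
import re

# Simplified patterns, assumed for this sketch only.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
EMPTY_PAIR = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*></\1\s*>")
NUMERIC_ENTITY = re.compile(r"&#(?:\d+|[xX][0-9A-Fa-f]+);")

def reveal(fragment):
    """Return (visible_text, pseudowords) for a small HTML fragment."""
    found = set()
    if COMMENT.search(fragment):
        found.add("html:comment")
        fragment = COMMENT.sub("", fragment)      # comments are never displayed
    if EMPTY_PAIR.search(fragment):
        found.add("html:emptypair")
        fragment = EMPTY_PAIR.sub("", fragment)   # empty tag pairs display nothing
    if NUMERIC_ENTITY.search(fragment):
        found.add("html:numericentity")
        fragment = html.unescape(fragment)        # decode &#86; and friends
    return fragment, found

for sample in ("VIA<!--thisisacomment-->GRA",
               "VIA<b></b>GRA",
               "&#86;&#73;&#65;&#71;&#82;&#65;"):
    print(reveal(sample))
```

Each sample comes back as the visible text VIAGRA together with the name of the trick that was used, which is the kind of extra evidence the pseudowords give the classifier.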

More information about the spammer trickery that underlies some of these pseudowords can be found in The Spammers' Compendium.
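
As one concrete instance of such trickery, the authorization syntax behind html:authorization can be checked with Python's standard `urllib.parse` module. This is only a sketch of the idea, not how POPFile detects it; the helper name `split_authorization` is hypothetical.

```python
from urllib.parse import urlsplit

def split_authorization(url):
    """Return (real_host, decoy) for a URL.

    decoy is the text before '@' in the network location, or None when the
    URL does not use the authorization syntax.
    """
    parts = urlsplit(url)
    return parts.hostname, parts.username

host, decoy = split_authorization("http://microsoft.com@somewhereelse.com")
print(host)   # somewhereelse.com
print(decoy)  # microsoft.com
```

The part before the @ is treated as a user name, so the browser connects to the host after it.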

faq/pseudowords.txt · Last modified: 2013/11/05 13:58 by xuesheng


The content of this wiki is protected by the GNU Free Documentation License