What are the Pseudowords built into POPFile and how do they work?

POPFile uses pseudowords to look at HTML display elements and various spammer tricks, handling them as special cases to help improve accuracy. Pseudowords are also used to keep track of which headers appear in a message and which words appear in some of the headers, such as From, To, CC, and Subject. POPFile ignores the case of words and does not make any generalizations when a word isn't found in the specific source.
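As a rough illustration of the header pseudowords described above, here is a minimal Python sketch. It is not POPFile's own code (POPFile is written in Perl); the function name `header_pseudowords` and the exact splitting rules are assumptions made for the example.

```python
from email import message_from_string
from email.utils import getaddresses

def header_pseudowords(raw_message):
    """Build header-related pseudowords for a raw message (illustrative only)."""
    msg = message_from_string(raw_message)
    words = []
    for name, value in msg.items():
        lname = name.lower()
        # One pseudoword for every header that is present.
        words.append("header:" + lname)
        if lname in ("from", "to", "cc"):
            # Names and addresses found in the address headers.
            for display_name, address in getaddresses([value]):
                for part in display_name.split() + [address]:
                    if part:
                        words.append(lname + ":" + part.lower())
        elif lname == "subject":
            # Words found in the Subject header, case ignored.
            for part in value.split():
                words.append("subject:" + part.lower())
    return words

sample = "From: Alice Example <alice@example.com>\nSubject: Cheap Offer\n\nBody text\n"
print(header_pseudowords(sample))
```

Keeping the header name as a prefix means a word seen in the Subject header is counted separately from the same word seen in the body, which matches the "no generalizations" remark above.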

People have asked whether writing one of these pseudowords, such as html:comment, literally in a message can fool POPFile. It cannot: when this is done, POPFile sees and stores the text as two separate words, html and comment, not as the pseudoword.
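The reason can be sketched in a few lines of Python. This is a simplified model, not POPFile's actual parser: it assumes body words are split on every non-alphanumeric character, and that a pseudoword is only ever added by the parser itself when it notices the corresponding construct. The name `tokenize` is hypothetical.

```python
import re

# An HTML comment, possibly spanning several lines.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def tokenize(body):
    """Return pseudowords plus ordinary body words (illustrative only)."""
    tokens = []
    if COMMENT.search(body):
        # Generated by the parser; never read from the message text.
        tokens.append("html:comment")
    visible = COMMENT.sub("", body)
    # Body words are runs of letters and digits, so a colon typed in the
    # text always separates two words.
    tokens.extend(w.lower() for w in re.findall(r"[A-Za-z0-9]+", visible))
    return tokens

print(tokenize("Buy html:comment now"))
print(tokenize("VIA<!--thisisacomment-->GRA"))
```

In this model the first call yields the plain words html and comment, while only the second message, which really contains a comment, gets the html:comment pseudoword.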

These are the current pseudowords built into POPFile:

? cc:<address> ! names and addresses in the CC header
? charset:<various> ! the character set listed in the message
? encoding:XXX ! notes the content transfer encoding of a message. XXX might be, e.g., 8BIT or BASE-64.
? from:<address> ! names and addresses in the From header
? header:<various> ! headers present in a message
? header:XXX ! XXX can be any kind of header that might or should be used in an email, e.g. Date:, Subject:, but also less common ones such as Precedence: or List-Unsubscribe.
? html:authorization ! another way to conceal URLs is with the authorization syntax: a URL that puts a Microsoft-looking name before an @ sign and somewhereelse.com after it will not get you to one of Microsoft's sites, but to somewhereelse.com.
? html:backcolorXXX ! marks the background color of an HTML message.
? html:colordistanceXX ! gets set when a message uses low contrast between foreground and background, possibly to hide certain words.
? html:comment ! HTML comments can be used to hide words from dumb filters, while users get to see them just fine: VIA<!--thisisacomment-->GRA.
? html:cidsrc ! image source referencing an attachment by its cid
? html:css*color<color> ! various ways of defining colors with CSS
? html:cssdisplay<value> ! display value defined with CSS
? html:cssfontsize<size> ! font size defined with CSS
? html:cssvisibility<value> ! visibility value defined with CSS
? html:emptypair ! yet another variation of the hide-this-word-from-filters trickery. An empty pair of tags (an opening tag immediately followed by its closing tag) is slipped inside the word, which still displays as VIAGRA.
? html:encodedurl ! spammers will sometimes try to conceal URLs they want you to click. They encode them with something similar to numeric HTML entities and hope that… What?
? html:fontcolorXXX ! XXX can be any color used for any HTML element.
? html:fontsizeXXX ! keeps track of the font sizes used in a message.
? html:iframeremotesrc ! an iframe in the mail has a remote source
? html:imgheightXXX ! XXX is the height of an embedded image.
? html:imgremotesrc ! is triggered when an email references a picture that is supposed to be loaded from the web.
? html:imgwidthXXX ! XXX is the width of an embedded image.
? html:invalidtag ! works just like the comment trick. Instead of placing a comment inside a word, spammers use invalid (made-up) HTML tags inside the word. The effect is the same.
? html:numericentity ! is set when a message contains an HTML numeric entity, like &#86;. Numeric entities can be used to display special characters, like the Euro symbol, but they can also be used for normal characters when spammers are trying to hide give-away words from filters. E.g. this spells 'VIAGRA': &#86;&#73;&#65;&#71;&#82;&#65;
? html:td ! keeps track of the number of HTML table cells in an email. Usually, emails are just paragraphs of text. When they contain tables and those tables contain many cells, something different may be going on.
? mimeextension:XXX ! like mimename, this is about the file name of an attachment, but this pseudoword stores only the extension.
? mimename:XXX ! if a message has a MIME-encoded attachment, POPFile stores the file name of the attachment in this pseudoword.
? spamassassin:<various> ! SpamAssassin tests
? spamassassinlevel:spam ! counted once for every full point of SpamAssassin level
? subject:<various> ! words found in the Subject header
? to:<address> ! names and addresses in the To header
? trick:spacedout ! gets set when a string is broken up by spaces or other random characters between the characters of a word, e.g. V I A G R A or V.I.A.G.R.A, which are easy to read but break the string "VIAGRA".
? trick:dottedwords ! a fairly simple trick where words have dots in random places, e.g. Mort.gage
? trick:invisibleink ! to fool filters like POPFile, spammers try to insert harmless words in their messages. You are not supposed to see them, but they expect your filter to see them, so they simply set the color of the word to the color of the background.
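To make two of the entries above concrete, the following Python sketch flags numeric entities and spaced-out words in a piece of text. It only illustrates the idea under simple assumptions (the regular expressions and the function name `trick_pseudowords` are invented for this example) and is not how POPFile itself detects these tricks.

```python
import html
import re

# Decimal or hexadecimal numeric entity, e.g. &#86; or &#x56;
NUMERIC_ENTITY = re.compile(r"&#(?:\d+|[xX][0-9a-fA-F]+);")
# Four or more single letters separated by one space or dot,
# e.g. "V I A G R A" or "V.I.A.G.R.A"
SPACED_OUT = re.compile(r"\b(?:[A-Za-z][ .]){3,}[A-Za-z]\b")

def trick_pseudowords(text):
    """Return (pseudowords found, text with entities decoded)."""
    found = []
    if NUMERIC_ENTITY.search(text):
        found.append("html:numericentity")
    # Decode entities so the hidden word becomes visible to later steps.
    decoded = html.unescape(text)
    if SPACED_OUT.search(decoded):
        found.append("trick:spacedout")
    return found, decoded

print(trick_pseudowords("&#86;&#73;&#65;&#71;&#82;&#65;"))
print(trick_pseudowords("V I A G R A"))
print(trick_pseudowords("V.I.A.G.R.A"))
```

The point of recording a pseudoword rather than just undoing the trick is that the trick itself becomes evidence: a message that bothers to spell a word with numeric entities looks different from one that does not.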

More information about the spammer trickery that underlies some of these pseudowords can be found in The Spammers' Compendium.

