POPFile - Automatic Email Classification

Corpus Aging

The following changes to POPFile v0.21.0 enable tracking of the 'lastseen' date in the matrix table, which opens the door to corpus aging.

1. Add the following trigger to the matrix table using the SQLite command-line utility:

<code>
create trigger insert_matrix after insert on matrix
begin
  update matrix set lastseen = date('now') where id = new.id;
end;
</code>

2. Modify Bayes.pm version 1.289 by inserting the following code at line 1791:

<code>
   #
   # Mark words lastseen unless we are in message view mode
   #

   unless (defined($ui)) {
       $self->{db__}->begin_work;
       my $bucketid = $self->{db_bucketid__}{$userid}{$class}{id};
       my $seeninbucket = $self->{db__}->do( "update matrix
                                  set lastseen = date('now')
                                  where wordid in ( $ids ) and
                                  bucketid = $bucketid;" );
       $self->{db__}->commit;
   }
</code>

With the above changes, POPFile will start tracking the 'lastseen' date for any word inserted into the matrix or seen in an email being classified. Words are marked for the bucket that the message is classified into during the classification process.
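The combined effect of the trigger and the classification-time update can be tried outside POPFile. The following is a minimal sketch in Python using an in-memory SQLite database; the table layout is a simplified stand-in assumed for illustration, and `mark_seen` is a hypothetical helper that imitates the Bayes.pm patch, not part of POPFile.

```python
import sqlite3

# Simplified stand-in for POPFile's matrix table (column layout is an assumption).
SCHEMA = """
create table matrix (
    id       integer primary key,
    wordid   integer,
    bucketid integer,
    times    integer,
    lastseen date
);
create trigger insert_matrix after insert on matrix
begin
    update matrix set lastseen = date('now') where id = new.id;
end;
"""

def mark_seen(db, word_ids, bucket_id):
    """Stamp today's date on the given words in one bucket (hypothetical helper)."""
    placeholders = ",".join("?" for _ in word_ids)
    with db:  # single transaction, mirroring begin_work/commit in the patch
        db.execute(
            f"update matrix set lastseen = date('now') "
            f"where wordid in ({placeholders}) and bucketid = ?",
            (*word_ids, bucket_id),
        )

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)

# A newly inserted word is stamped by the trigger.
db.execute("insert into matrix (wordid, bucketid, times) values (1, 1, 1)")

# Simulate an old row by back-dating it, then mark it as seen again.
db.execute("insert into matrix (wordid, bucketid, times) values (2, 1, 5)")
db.execute("update matrix set lastseen = '2003-12-15' where wordid = 2")
mark_seen(db, [2], 1)

today = db.execute("select date('now')").fetchone()[0]
rows = db.execute("select wordid, lastseen from matrix order by wordid").fetchall()
print(rows)  # both rows carry today's date
```

Both rows end up with the current date: the first through the trigger, the second through the update issued at classification time.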

These changes were made to my installation back on December 15, 2003, so my POPFile has been tracking the lastseen date for 90+ days now. The current 'aging' of the words in my corpus is as follows:

 
Corpus Aging Prepared Sat 20-Mar-2004

<code>
              Under                                      Over       No
Bucket           15  15-29  30-44  45-59  60-74  75-89     90     Date
------------  -----  -----  -----  -----  -----  -----  -----  -------
magnet            0      0      0      0      0      0      0        0
normal         7837    943    530    241    186    292    335     2104
spam           7567   1107    429    240    147    284    349     4486
unclassified      0      0      0      0      0      0      0        0
------------  -----  -----  -----  -----  -----  -----  -----  -------
Totals        15404   2050    959    481    333    576    684     6590
               56.9    7.6    3.5    1.8    1.2    2.1    2.5     24.3
</code>

The above report was produced with the CorpusAge script.
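A report of this kind could be computed along the following lines. This is not the CorpusAge script itself, only a rough Python sketch of the same idea: the table layout, the `band` and `age_report` helpers, and the sample rows are all assumptions made for illustration, and 90 days is treated as belonging to the last dated band.

```python
import sqlite3

LABELS = ["<15", "15-29", "30-44", "45-59", "60-74", "75-89", "90+", "no date"]

def band(age_days):
    """Map an age in days (or None for a missing date) to a report column."""
    if age_days is None:
        return "no date"
    if age_days >= 90:
        return "90+"
    if age_days < 15:
        return "<15"
    low = (age_days // 15) * 15
    return f"{low}-{low + 14}"

def age_report(db, today):
    """Count matrix rows per bucket and age band, relative to the date 'today'."""
    query = ("select bucketid, "
             "cast(julianday(?) - julianday(lastseen) as integer) from matrix")
    report = {}
    for bucketid, age in db.execute(query, (today,)):
        counts = report.setdefault(bucketid, dict.fromkeys(LABELS, 0))
        counts[band(age)] += 1
    return report

db = sqlite3.connect(":memory:")
# Assumed, simplified layout; no trigger here so NULL dates stay NULL.
db.execute("create table matrix (id integer primary key, wordid integer, "
           "bucketid integer, times integer, lastseen date)")
db.executemany(
    "insert into matrix (wordid, bucketid, times, lastseen) values (?, ?, ?, ?)",
    [(1, 1, 3, "2004-03-10"),   # 10 days old
     (2, 1, 1, "2004-02-20"),   # 29 days old
     (3, 2, 2, "2003-12-15"),   # 96 days old
     (4, 2, 1, None)],          # never stamped
)

report = age_report(db, "2004-03-20")
for bucketid, counts in sorted(report.items()):
    print(bucketid, counts)
```

Passing the report date in as a parameter, rather than using `date('now')`, keeps the example reproducible.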

These are the updated stats after 184 days of running POPFile with 'lastseen' date tracking. No words were deleted during that time, so the 'No Date' column represents words last seen more than 184 days ago, before tracking began.

 
Corpus Aging Prepared Wed 16-Jun-2004

<code>
              Under                                      Over       No
Bucket           15  15-29  30-44  45-59  60-74  75-89     90     Date
------------  -----  -----  -----  -----  -----  -----  -----  -------
magnet            0      0      0      0      0      0      0        0
normal         8842    935    453    270    165    201    959     1611
spam           7605   1080    917    989    508    350   1319     3590
unclassified      0      0      0      0      0      0      0        0
------------  -----  -----  -----  -----  -----  -----  -----  -------
Totals        16447   2015   1370   1259    673    551   2278     5201
               55.2    6.8    4.6    4.2    2.3    1.8    7.6     17.5
</code>

Next Step, Deletion

The next step will be to delete words over a certain number of days old.

With hapaxes it's a no-brainer: simply delete the word. But some thought needs to be given to words that appear in more than one bucket, since deleting the entries that have not been seen would change the probabilities for those that have.

This might not be bad: if a word isn't seen in bucket A but is seen in bucket B, and aging deletes the word from bucket A, the end result is to strengthen the probability for bucket B. That's fine if the word was already weighted towards bucket B, but what about words that were essentially neutral (50/50 probability for A or B)? Deleting one entry seriously upsets the balance. Or worse, take the case of a word weighted 80/20 for bucket A/B that has gone unseen in bucket A: delete it from A, and a word that previously had a low probability for bucket B is suddenly 100% bucket B.
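The arithmetic behind these cases can be made concrete with a small Python sketch. It uses a word's share of occurrences per bucket as a rough proxy for its weighting; POPFile's actual classifier is more involved, so this is only an illustration of the shift, and `bucket_shares` and `age_out` are hypothetical helpers.

```python
def bucket_shares(counts):
    """Map each bucket to its share of the word's total occurrences."""
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}

def age_out(counts, bucket):
    """Return the counts with the word's entry for one bucket deleted."""
    return {b: n for b, n in counts.items() if b != bucket}

skewed = {"A": 80, "B": 20}    # word weighted 80/20 towards bucket A
neutral = {"A": 50, "B": 50}   # essentially neutral word

print(bucket_shares(skewed))                 # {'A': 0.8, 'B': 0.2}
print(bucket_shares(age_out(skewed, "A")))   # {'B': 1.0}
print(bucket_shares(age_out(neutral, "A")))  # {'B': 1.0}
```

In both cases the word ends up entirely in bucket B once its entry in A is aged out, regardless of how it was weighted before.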

Maybe deletion should only be performed on hapaxes to ensure that we do not mess up the probabilities?
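A hapax-only deletion might look something like the sketch below. It assumes a simplified matrix layout and treats a hapax as a word with a single entry whose count is 1; `delete_stale_hapaxes` is a hypothetical helper, and entries with no date are left alone here, although they could equally be treated as the oldest.

```python
import sqlite3

def delete_stale_hapaxes(db, today, max_age_days):
    """Delete words seen once, in a single bucket, and older than the cutoff."""
    with db:  # commit on success, roll back on error
        cur = db.execute(
            """
            delete from matrix
            where times = 1
              and lastseen is not null
              and julianday(?) - julianday(lastseen) > ?
              and wordid in (select wordid from matrix
                             group by wordid having count(*) = 1)
            """,
            (today, max_age_days),
        )
    return cur.rowcount

db = sqlite3.connect(":memory:")
# Assumed, simplified layout for illustration.
db.execute("create table matrix (id integer primary key, wordid integer, "
           "bucketid integer, times integer, lastseen date)")
db.executemany(
    "insert into matrix (wordid, bucketid, times, lastseen) values (?, ?, ?, ?)",
    [(1, 1, 1, "2003-12-15"),   # stale hapax: deleted
     (2, 1, 1, "2004-03-10"),   # recent hapax: kept
     (3, 1, 1, "2003-12-15"),   # stale, but the word is also in bucket 2: kept
     (3, 2, 4, "2004-03-10"),
     (4, 2, 7, "2003-12-15")],  # stale, but seen 7 times: kept
)

deleted = delete_stale_hapaxes(db, "2004-03-20", 90)
remaining = [row[0] for row in
             db.execute("select wordid from matrix order by wordid, bucketid")]
print(deleted, remaining)
```

Because only single-bucket words are touched, no surviving word has its balance between buckets changed by the deletion.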