Corpus Aging

The following changes to POPFile v0.21.0 enable tracking of the 'lastseen' date in the matrix table, which opens the door to corpus aging.

1. Add the following trigger to the matrix table using the SQLite command-line utility:

CREATE TRIGGER insert_matrix AFTER INSERT ON matrix
BEGIN
  UPDATE matrix SET lastseen = DATE('now') WHERE id = NEW.id;
END;
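
As a quick check of what the trigger does, here is a minimal Python sketch using the standard sqlite3 module and an in-memory database. The table layout is an assumed, simplified stand-in for POPFile's matrix table (only the id and lastseen columns matter to the trigger). The trigger itself is the one given above.

```python
import sqlite3

# Assumed, simplified schema: the real POPFile matrix table may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE matrix (
    id       INTEGER PRIMARY KEY,
    wordid   INTEGER,
    bucketid INTEGER,
    times    INTEGER,
    lastseen DATE
);
CREATE TRIGGER insert_matrix AFTER INSERT ON matrix
BEGIN
    UPDATE matrix SET lastseen = DATE('now') WHERE id = NEW.id;
END;
""")

# Insert a row without a lastseen value; the trigger should fill it in.
conn.execute("INSERT INTO matrix (wordid, bucketid, times) VALUES (1, 1, 1)")
lastseen, today = conn.execute(
    "SELECT lastseen, DATE('now') FROM matrix WHERE wordid = 1"
).fetchone()
print(lastseen, today)
```

Both printed values should be the same YYYY-MM-DD string, since SQLite's DATE('now') returns the current UTC date.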

2. Modify Bayes.pm version 1.289 by inserting the following code at line 1791:

   #
   # Mark words lastseen unless we are in message view mode
   #

   unless ( defined($ui) ) {
       my $bucketid = $self->{db_bucketid__}{$userid}{$class}{id};

       # $ids holds the list of word ids for this message; stamp them
       # all for the winning bucket in a single transaction
       $self->{db__}->begin_work;
       $self->{db__}->do( "update matrix
                              set lastseen = date('now')
                            where wordid in ( $ids )
                              and bucketid = $bucketid;" );
       $self->{db__}->commit;
   }
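
The effect of this update can be tried outside POPFile with a small Python sketch. Everything here is illustrative: mark_last_seen is a hypothetical helper, and the table is an assumed, simplified version of the matrix table. Unlike the Perl patch, the sketch binds the ids through placeholders instead of interpolating them into the SQL string.

```python
import sqlite3

def mark_last_seen(conn, word_ids, bucket_id):
    """Stamp today's date on the given words in one bucket (hypothetical helper)."""
    if not word_ids:
        return 0
    placeholders = ",".join("?" for _ in word_ids)
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            "UPDATE matrix SET lastseen = DATE('now') "
            f"WHERE wordid IN ({placeholders}) AND bucketid = ?",
            [*word_ids, bucket_id],
        )
    return cur.rowcount

# Assumed, simplified schema for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matrix (id INTEGER PRIMARY KEY, wordid INTEGER, "
    "bucketid INTEGER, times INTEGER, lastseen DATE)"
)
conn.executemany(
    "INSERT INTO matrix (wordid, bucketid, times) VALUES (?, ?, 1)",
    [(1, 1), (2, 1), (1, 2)],
)

updated = mark_last_seen(conn, [1, 2], 1)
undated = conn.execute(
    "SELECT COUNT(*) FROM matrix WHERE lastseen IS NULL"
).fetchone()[0]
print(updated, undated)
```

Only the rows for the bucket the message was classified into are stamped. The row for word 1 in bucket 2 keeps its empty date. Very long id lists may exceed SQLite's limit on bound parameters and would need to be split into batches.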

With the above changes, POPFile will start tracking the 'lastseen' date for any word inserted into the matrix or seen in an email being classified. Words are marked for the bucket that the message is classified into during the classification process.

I made these changes to my installation on December 15, 2003, so my POPFile has been tracking the 'lastseen' date for 90+ days now. The current 'aging' of the words in my corpus is as follows:

 
Corpus Aging Prepared Sat 20-Mar-2004
             Under                                Over      No
Bucket          15 15-29 30-44 45-59 60-74 75-89    90    Date
------------ ----- ----- ----- ----- ----- ----- ----- -------
magnet           0     0     0     0     0     0     0       0
normal        7837   943   530   241   186   292   335    2104
spam          7567  1107   429   240   147   284   349    4486
unclassified     0     0     0     0     0     0     0       0
------------ ----- ----- ----- ----- ----- ----- ----- -------
Totals       15404  2050   959   481   333   576   684    6590
Percent       56.9   7.6   3.5   1.8   1.2   2.1   2.5    24.3

The above report was produced with the CorpusAge script.
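
The CorpusAge script is not reproduced here. As a rough idea of how a report like this could be computed, the following Python sketch groups rows into the same age bands with a single SQL query. It is not the CorpusAge script: age_report is a hypothetical function, and the table is an assumed, simplified version of the matrix table that reports bucket ids rather than bucket names.

```python
import sqlite3

# Age bands in days, matching the report columns up to 75-89.
BANDS = [(0, 14), (15, 29), (30, 44), (45, 59), (60, 74), (75, 89)]

def age_report(conn):
    """Return one row per bucket: id, a count per band, over 90, no date."""
    cols = [
        f"SUM(CASE WHEN age BETWEEN {lo} AND {hi} THEN 1 ELSE 0 END)"
        for lo, hi in BANDS
    ]
    cols.append("SUM(CASE WHEN age >= 90 THEN 1 ELSE 0 END)")
    cols.append("SUM(CASE WHEN age IS NULL THEN 1 ELSE 0 END)")
    sql = f"""
        SELECT bucketid, {", ".join(cols)}
        FROM (SELECT bucketid,
                     CAST(julianday(DATE('now')) - julianday(lastseen)
                          AS INTEGER) AS age
              FROM matrix)
        GROUP BY bucketid
        ORDER BY bucketid
    """
    return conn.execute(sql).fetchall()

# Assumed, simplified schema and sample data for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matrix (id INTEGER PRIMARY KEY, wordid INTEGER, "
    "bucketid INTEGER, times INTEGER, lastseen DATE)"
)
conn.execute("INSERT INTO matrix (wordid, bucketid, times, lastseen) "
             "VALUES (1, 1, 1, DATE('now'))")
conn.execute("INSERT INTO matrix (wordid, bucketid, times, lastseen) "
             "VALUES (2, 1, 1, DATE('now', '-20 days'))")
conn.execute("INSERT INTO matrix (wordid, bucketid, times, lastseen) "
             "VALUES (3, 1, 1, NULL)")
conn.execute("INSERT INTO matrix (wordid, bucketid, times, lastseen) "
             "VALUES (4, 2, 1, DATE('now', '-100 days'))")

report = age_report(conn)
for row in report:
    print(row)
```

Rows with no lastseen value get a NULL age and are counted only in the last column, which corresponds to the 'No Date' column of the report.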

The stats below were updated after 184 days of running POPFile with 'lastseen' date tracking. No words were deleted during that period, so the 'No Date' column represents words last seen more than 184 days ago, that is, before tracking began.

 
Corpus Aging Prepared Wed 16-Jun-2004
             Under                                Over      No
Bucket          15 15-29 30-44 45-59 60-74 75-89    90    Date
------------ ----- ----- ----- ----- ----- ----- ----- -------
magnet           0     0     0     0     0     0     0       0
normal        8842   935   453   270   165   201   959    1611
spam          7605  1080   917   989   508   350  1319    3590
unclassified     0     0     0     0     0     0     0       0
------------ ----- ----- ----- ----- ----- ----- ----- -------
Totals       16447  2015  1370  1259   673   551  2278    5201
Percent       55.2   6.8   4.6   4.2   2.3   1.8   7.6    17.5

Next Step: Deletion

The next step will be to delete words over a certain number of days old.

With hapaxes (words that appear only once in the corpus) it's a no-brainer: simply delete the word. Words that appear in more than one bucket need more thought, since deleting the entries not seen recently would change the probability for the buckets where the word is still seen.

This might not be bad: if a word isn't seen in bucket A but is seen in bucket B, and aging deletes the word from bucket A, the end result is to strengthen the probability for bucket B. That's fine if the word was already weighted towards bucket B, but what about words that were essentially neutral (50/50 probability for A or B)? Deleting one entry now seriously upsets the balance. Worse, take a word weighted 80/20 for buckets A/B that is no longer seen in A: delete it from A, and a word that previously had a low probability for bucket B is suddenly 100% bucket B.
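
The 80/20 case can be worked through with a few lines of Python. This is a deliberately simplified model that treats a word's weighting as its share of occurrences per bucket. It is not POPFile's actual scoring, and bucket_shares and the counts are made up for illustration.

```python
def bucket_shares(counts):
    """Return each bucket's share of a word's total occurrences."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bucket: count / total for bucket, count in counts.items()}

# A word seen 8 times in bucket A and 2 times in bucket B.
before = bucket_shares({"A": 8, "B": 2})

# Aging removes the stale bucket A entry; only bucket B remains.
after = bucket_shares({"B": 2})

print(before)
print(after)
```

In this model the word goes from an 80/20 split favouring A to counting entirely for B, even though nothing new was learned about B.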

Maybe deletion should only be performed on hapaxes to ensure that we do not mess up the probabilities?
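
A hapax-only deletion could look something like the following Python sketch. It rests on several assumptions: the simplified table layout, a times column holding the per-bucket count, and a definition of hapax as a word whose counts across all buckets sum to 1. delete_stale_hapaxes is a hypothetical helper, not part of POPFile.

```python
import sqlite3

def delete_stale_hapaxes(conn, max_age_days):
    """Delete hapaxes last seen more than max_age_days ago (hypothetical)."""
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            """DELETE FROM matrix
                WHERE lastseen < DATE('now', ?)
                  AND wordid IN (SELECT wordid FROM matrix
                                 GROUP BY wordid
                                 HAVING SUM(times) = 1)""",
            (f"-{int(max_age_days)} days",),
        )
    return cur.rowcount

# Assumed, simplified schema and sample data for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matrix (id INTEGER PRIMARY KEY, wordid INTEGER, "
    "bucketid INTEGER, times INTEGER, lastseen DATE)"
)
rows = [
    (1, 1, 1, "-120 days"),  # stale hapax: should be deleted
    (2, 1, 1, "-120 days"),  # stale in bucket 1, but also seen in bucket 2
    (2, 2, 3, "-1 days"),
    (3, 1, 1, "-5 days"),    # hapax, but seen recently
]
conn.executemany(
    "INSERT INTO matrix (wordid, bucketid, times, lastseen) "
    "VALUES (?, ?, ?, DATE('now', ?))",
    rows,
)
# Hapax with no date at all: the comparison with NULL never matches.
conn.execute("INSERT INTO matrix (wordid, bucketid, times) VALUES (4, 1, 1)")

deleted = delete_stale_hapaxes(conn, 90)
remaining = [r[0] for r in conn.execute(
    "SELECT wordid FROM matrix ORDER BY wordid, bucketid")]
print(deleted, remaining)
```

Words that occur in more than one bucket are left alone, so the balance between buckets is untouched. Rows with no lastseen date are also kept, which means words from before tracking began would need a separate decision.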

 
corpusaging.txt · Last modified: 2007/03/02 13:45 by 127.0.0.1
