— | devel:corpusaging [2008/02/08 19:49] (current) – created - external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ===== Corpus Aging ===== | ||
+ | |||
+ | The following changes to POPFile v 0.21.0 enable tracking of the ' | ||
+ | |||
+ | 1. Add the following trigger to the matrix table using the SQLite command-line utility: | ||
+ | <code sql> | ||
+ | create trigger insert_matrix after insert on matrix | ||
+ | begin | ||
+ | update matrix set lastseen = date(' | ||
+ | end; | ||
+ | </code> | ||
+ | |||
+ | 2. Modify Bayes.pm version 1.289 by inserting the following code at line 1791: | ||
+ | <code perl> | ||
+ | # | ||
+ | # Mark words lastseen unless we are in message view mode | ||
+ | # | ||
+ | |||
+ | | ||
+ | | ||
+ | my $bucketid = $self-> | ||
+ | my $seeninbucket = $self-> | ||
+ | set lastseen = date(' | ||
+ | where wordid in ( $ids) and | ||
+ | bucketid = $bucketid ;" ); | ||
+ | | ||
+ | } | ||
+ | </code> | ||
+ | |||
+ | With the above changes, POPFile will start tracking the ' | ||
+ | |||
+ | These changes were made to my installation back on December 15, 2003, so my POPFile has been tracking the lastseen date for 90+ days now. The current ' | ||
+ | |||
+ | < | ||
+ | Corpus Aging Prepared Sat 20-Mar-2004 | ||
+ | Under | ||
+ | Bucket | ||
+ | ------------ ----- ----- ----- ----- ----- ----- ----- ------- | ||
+ | magnet | ||
+ | normal | ||
+ | spam 7567 1107 | ||
+ | unclassified | ||
+ | ------------ ----- ----- ----- ----- ----- ----- ----- ------- | ||
+ | Totals | ||
+ | | ||
+ | </ | ||
+ | |||
+ | The above report was produced with the script CorpusAge. | ||
+ | |||
+ | Updated stats after 184 days of running POPFile with ' | ||
+ | < | ||
+ | Corpus Aging Prepared Wed 16-Jun-2004 | ||
+ | Under | ||
+ | Bucket | ||
+ | ------------ ----- ----- ----- ----- ----- ----- ----- ------- | ||
+ | magnet | ||
+ | normal | ||
+ | spam 7605 1080 | ||
+ | unclassified | ||
+ | ------------ ----- ----- ----- ----- ----- ----- ----- ------- | ||
+ | Totals | ||
+ | | ||
+ | </ | ||
+ | |||
+ | ==== Next Step, Deletion ==== | ||
+ | |||
+ | The next step will be to delete words over a certain number of days old. | ||
+ | |||
+ | With hapaxes it's a no-brainer: simply delete the word. Words that appear in more than one bucket need more thought, because deleting the entries that have not been seen recently would change the probabilities for the buckets where the word is still seen. | ||
+ | |||
+ | This might not be bad: if the word isn't seen in bucket A but is seen in bucket B, and aging deletes the word from bucket A, the end result is to strengthen the probability for bucket B. That's fine if the word was already weighted towards bucket B, but what about words that were essentially neutral (50/50 probability for A or B)? Deleting one entry now seriously upsets the balance. Worse, take a word weighted 80/20 for buckets A/B that is no longer seen in A: delete it from A, and a word that previously had a low probability for bucket B is suddenly 100% bucket B. | ||
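The 80/20 case can be checked with a little arithmetic. The sketch below is in Python and uses a deliberately simplified model in which a word's weight for a bucket is just that bucket's share of the word's total count. This is an assumption for illustration, not POPFile's actual classifier math, and the counts and helper names (`bucket_shares`, `age_out`) are hypothetical.

```python
def bucket_shares(counts):
    """Return each bucket's share of a word's total occurrences."""
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}


def age_out(counts, bucket):
    """Return a copy of counts with the word's entry for one bucket deleted."""
    return {b: n for b, n in counts.items() if b != bucket}


# Hypothetical counts: the word was seen 8 times in A and 2 times in B.
skewed = {"A": 8, "B": 2}

before = bucket_shares(skewed)
after = bucket_shares(age_out(skewed, "A"))

print(before)  # shares while both entries exist
print(after)   # shares once the entry for bucket A has been aged out
```

Under this simplified model the word moves from a 20% share for bucket B to a 100% share, which is the flip described above. A neutral word with equal counts in both buckets ends up in the same place.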
+ | |||
+ | Maybe deletion should only be performed on hapaxes to ensure that we do not mess up the probabilities? | ||
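A hapax-only deletion might look roughly like the following sketch, written in Python against an in-memory SQLite database. The table layout is an assumption: `wordid`, `bucketid` and `lastseen` appear in the code above, but the `times` count column, the 90-day cutoff and the sample rows are hypothetical and would need to be checked against the real POPFile schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified schema: only the columns this sketch needs.
conn.execute(
    "create table matrix (wordid integer, bucketid integer, "
    "times integer, lastseen text)"
)
conn.executemany(
    "insert into matrix values (?, ?, ?, ?)",
    [
        (1, 1, 1, "2003-12-20"),  # stale hapax: should be deleted
        (2, 1, 1, "2004-06-10"),  # recent hapax: kept
        (3, 1, 8, "2003-12-20"),  # stale entry, but not a hapax: kept
        (3, 2, 2, "2004-06-10"),
    ],
)

MAX_AGE_DAYS = 90      # assumed cutoff
TODAY = "2004-06-16"   # fixed date so the example is reproducible

# Delete a row only if the word was counted exactly once across all
# buckets and was last seen before the cutoff date.
conn.execute(
    """
    delete from matrix
    where lastseen < date(?, ?)
      and wordid in (
          select wordid from matrix
          group by wordid
          having sum(times) = 1
      )
    """,
    (TODAY, f"-{MAX_AGE_DAYS} days"),
)

remaining = conn.execute(
    "select wordid, bucketid from matrix order by wordid, bucketid"
).fetchall()
print(remaining)
```

Because the stale entry for word 3 is left alone, the balance between its two buckets is unchanged. Only the word that was seen once and never again is removed.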