Corpus Corruption 0.20.x

This applies to POPFile 0.20.x only. It may be helpful if you are having trouble upgrading from 0.20.x. The problems described no longer affect the current version of POPFile.

POPFile's corpus is stored in BerkeleyDB table files, one table.db file for each bucket in the corpus. A table.db file can become corrupt when, for example, the system crashes or POPFile is forcibly killed before critical table information has been written to disk. The table.db files contain:

  • The word list for a bucket along with the corresponding word counts
  • The total count of unique words in the bucket
  • The total word count for the bucket
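
Because each table.db stores the two totals alongside the word list they summarize, the totals can be recomputed from the word list and compared. The sketch below illustrates that consistency check on a plain Python dictionary. It does not read BerkeleyDB files, the bucket contents are made up, and the names bucket_totals and totals_match are hypothetical.

```python
def bucket_totals(word_counts):
    """Recompute both totals from a bucket's word list."""
    unique_words = len(word_counts)
    total_words = sum(word_counts.values())
    return unique_words, total_words


def totals_match(word_counts, stored_unique, stored_total):
    """Return True when the stored totals agree with the word list."""
    return bucket_totals(word_counts) == (stored_unique, stored_total)


# Hypothetical bucket contents, for illustration only.
spam_words = {"winner": 12, "offer": 7, "free": 30}
print(bucket_totals(spam_words))        # (3, 49)
print(totals_match(spam_words, 3, 49))  # True
print(totals_match(spam_words, 3, 60))  # False: totals and word list disagree
```

A disagreement of this kind is what the "word count is ... versus ..." lines in the cunload output further down this page appear to report.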

What Causes Corruption?

We believe corpus corruption can happen in two different ways:

  • Killing (terminating) the POPFile program abnormally may leave the corpus structure on disk in an incomplete state.
  • Reclassifying messages via the UI on systems that have the "allow concurrent POP3 connections" option on the configuration page set to YES may cause corruption if:
    1. Mail is being retrieved by your mail client while you are reclassifying, and,
    2. The reclassification results in the corpus database overflowing, causing the database to expand, and,
    3. The reclassification finishes before the child that was forked off to handle the POP3 mail retrieval finishes.

Avoiding Corruption

  • Always shut down POPFile via the shutdown link on the UI. If you must terminate it in some other manner, make sure your mail client is shut down completely first.
  • Do not reclassify messages while your mail client is in the process of retrieving mail.

Symptoms of a Corrupt Corpus

Possible symptoms of a corrupt corpus include:

  • The word count or unique word count for a bucket is blank on the Buckets page of the UI.
  • POPFile fails to start: it dies during the startup process when it encounters the corrupt corpus.
  • A sudden drop in the number of unique words in a bucket.
  • A sudden increase in classification errors due to loss of information in one or more buckets of the corpus.
  • If you are running in foreground mode, you see an error message like the following, or a similar one, indicating a situation that can only happen if the corpus is corrupt:
    POPFile Engine v0.20.1 running
    Illegal division by zero at C:\Program Files\POPFile/Classifier/Bayes.pm line 374, <GEN3> line 642.
    
  • Your mail client repeatedly times out when trying to retrieve mail. This happens because the above error is interfering with the classification, so the POP3 session dies prematurely.
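
The division-by-zero error above is consistent with a bucket whose total word count has been lost. The sketch below is a generic illustration of that failure mode, not the code from Bayes.pm. It assumes, as is common in naive Bayes classifiers, that a word's probability within a bucket is its count divided by the bucket's total word count, and that a total lost to corruption reads back as zero. The function name word_probability is hypothetical.

```python
def word_probability(word_count, bucket_total):
    """Probability of a word within one bucket: count / total."""
    return word_count / bucket_total


print(word_probability(12, 48))  # 0.25

# Assumption: a total that was never written to disk reads back as zero.
try:
    word_probability(12, 0)
except ZeroDivisionError as err:
    print("bucket total is zero, corpus is probably corrupt:", err)
```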

Confirming That You Really Have a Corrupt Corpus

The dbverify utility (see "dbverify - checking for a corrupt corpus") can be used to check your corpus and identify any corrupt buckets. If the utility does not report corruption, your corpus is OK.

Dealing with the Corruption

Once a table.db is corrupted, your choices to rectify the situation are:

  • Shut down POPFile and delete the corrupt table.db file(s). When you restart POPFile, new empty table.db files will be built. This choice means you lose your current corpus and must re-train POPFile.
  • Fall back to a backup copy of the corpus file(s). You do have backups, right?
  • If you upgraded to v0.20.x from a prior version of POPFile, fall back to the backup corpus that was made by the Windows installer. That backup will be located in the backup directory. Each bucket will have a subdirectory that contains a 'table' file.
    1. Delete the table.db file in your corpus folder (it will be in a subdirectory named after each bucket).
    2. Copy the 'table' file from the backup location to the corresponding subdirectory of the corpus folder.
    3. Restart POPFile. POPFile will automatically convert the pre-v0.20.x corpus to a new table.db file.
  • If you are technically inclined, you can attempt to recover the contents of the corrupt corpus using the unsupported utility cunload.
    1. Download the cunload utility to your POPFile installation directory by right-clicking on this link and picking "save target as": http://www.geocities.com/helphand1/popfile/0_20/cunload.pl
    2. Shut down POPFile.
    3. Open a DOS box and run the utility:
      cd "\program files\popfile"
      perl cunload.pl
      
    4. If you have a corrupt corpus, the utility will advise you of the bad bucket with an error message similar to this:
      C:\program files\popfile>perl cunload.pl
      Checking corpus/magnet/table.db
      Checking corpus/normal/table.db
      Checking corpus/spam/table.db
          *ERROR** bucket corpus/spam has a corrupt corpus,
      db_verify returns: DB_VERIFY_BAD: Database verification failed
      Bucket corpus/spam is likely corrupt, word count is 10882 versus 12687
      Bucket corpus/spam is likely corrupt, unique count is 3148 versus 3912
      
    5. At this point, you have a choice: either delete the bad bucket entirely, or keep whatever words the cunload utility was able to recover from it. There is no right answer here. Neither choice is attractive, because either way you lose the missing words. Use your own judgement in deciding what is best for you. Bear in mind that if a lot of data was lost from a bucket and you retain what was recoverable instead of starting over and re-training POPFile, you may create an unbalanced corpus that hurts classification accuracy. In many cases, it is easier and quicker to re-train than to recover from an unbalanced corpus.
      1. If you choose to completely delete the offending bucket, issue the following commands (the example assumes the bucket name is spam; use your actual bucket name):
        del corpus\spam\table
        del corpus\spam\table.db
        
      2. If you choose to retain what the cunload utility was able to recover, issue the following command (again, the example assumes the bucket name is spam; use your actual bucket name):
        del corpus\spam\table.db
        
    6. Exit the DOS box and restart POPFile. IMPORTANT NOTE: POPFile will re-convert your unloaded flat-file buckets to BerkeleyDB format when it starts up. This may take some time depending on your corpus size. Be patient and do NOT abort or kill POPFile, or you will simply create another corrupt corpus.
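
When weighing the choice in step 5, it can help to see how large the discrepancy is for each bucket. The sketch below parses report lines of the form shown in step 4 and prints how far apart the two counts are. It is an illustration only: the line format is taken from the single sample on this page, and the report does not say which number is the recovered one, so the code measures only the relative difference. The names REPORT_LINE and shortfalls are hypothetical.

```python
import re

# Pattern inferred from the sample cunload output on this page (an assumption).
REPORT_LINE = re.compile(
    r"Bucket (?P<bucket>\S+) is likely corrupt, "
    r"(?P<kind>word|unique) count is (?P<a>\d+) versus (?P<b>\d+)"
)


def shortfalls(report_text):
    """Yield (bucket, kind, fraction) for each mismatch line in the report."""
    for match in REPORT_LINE.finditer(report_text):
        a, b = int(match["a"]), int(match["b"])
        largest = max(a, b)
        # Relative difference between the two counts; 0.0 if both are zero.
        fraction = abs(a - b) / largest if largest else 0.0
        yield match["bucket"], match["kind"], fraction


sample = """\
Bucket corpus/spam is likely corrupt, word count is 10882 versus 12687
Bucket corpus/spam is likely corrupt, unique count is 3148 versus 3912
"""

for bucket, kind, fraction in shortfalls(sample):
    print(f"{bucket} {kind} count differs by {fraction:.1%}")
# corpus/spam word count differs by 14.2%
# corpus/spam unique count differs by 19.5%
```

A large difference favours deleting the bucket and re-training, for the reasons given in step 5.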
 
troubleshooting/corruptcorpus020.txt · Last modified: 2008/02/08 19:49 (external edit)

The content of this wiki is protected by the GNU Free Documentation License