Corpus Corruption

Ideally, corpus corruption should never happen, and we do whatever we can to prevent it.

What Causes Corruption?

Corpus corruption can be caused by killing (terminating) the POPFile program abnormally, because this may leave the corpus structure on disk in an incomplete state. This can happen when your computer is turned off or restarted without being shut down properly.

Symptoms of a Corrupt Corpus

Possible symptoms of a corrupt corpus include:

  • The word counts for a bucket are blank on the Buckets page of the UI.
  • POPFile fails to start: it dies during the startup process when it encounters the corrupt corpus.
  • A sudden drop in the number of unique words in a bucket.
  • A sudden large increase in classification errors due to loss of information in one or more buckets of the corpus.
  • Your mail client times out all the time when trying to retrieve mail. This happens because the corruption is interfering with the classification, so the POP3 session dies prematurely.

Confirming that you really have a corrupt corpus

Shut down POPFile, then use the SQLite command-line utility to check the database (popfile.db).

Be careful when using the SQLite command-line utility: it is a powerful tool that can alter the contents of the database, so a mistake made when using it could corrupt the database and stop POPFile from working!

Windows users have several options:

A. Use the POPFile SQLite Database Status Check (a small Windows program) to check the database and display the results. The installer for 0.22.3 (or later) installs this program and creates a Start Menu entry for it:

Start –> Programs –> POPFile –> Support –> Check database status

If you are using an earlier version of POPFile, you can download the POPFile SQLite Database Status Check separately (175 KB zip file). This program can be run from any folder and it should be able to find and check the POPFile database automatically (all you have to do is run the program, no user input is required).

B. Use the “Check database status” shortcut which the 0.22.3 (or later) Windows installer creates in the 'User Data' folder. This shortcut uses the POPFile SQLite Database Status Check program to check the database in the 'User Data' folder.

C. Use the “Run SQLite utility” shortcut which the Windows installer creates in the 'User Data' folder. This shortcut runs the SQLite utility and tells it where to find the POPFile database. When the DOS-box appears, follow the cross-platform instructions below from step 4.

D. Run the SQLite command-line utility manually by changing to the 'User Data' folder (the folder where the POPFile data is stored, i.e. the directory containing the popfile.db file) and following the cross-platform instructions below from step 3. If the 'User Data' folder is not the same folder as the POPFile program folder, you will need to specify the path to the sqlite.exe program. If you are not using Win9x, %POPFILE_ROOT%\sqlite can be used.

The Windows installer creates a Start Menu shortcut which can be used to display the location of the 'User Data' folder (Start –> Programs –> POPFile –> Support –> PFI Diagnostic utility (simple)). Starting with the POPFile 0.22.1 release there is another Start Menu shortcut which can be used to make it easy to access the 'User Data' folder (Start –> Programs –> POPFile –> Support –> Create 'User Data' shortcut).

Cross-platform users can check the SQLite database as follows:

1. Open a command prompt (DOS box) or shell.

2. Switch to the directory where your POPFile data is stored, i.e. the directory containing the popfile.db file.

3. Open the database with the SQLite utility:

  • sqlite popfile.db
  • 'SQLite version 2.8.12'
  • Enter “.help” for instructions
  • sqlite>

4. Now run an integrity check (don't forget the semicolon at the end of the command)

  • sqlite> pragma integrity_check;
  • ok
  • sqlite>
  • If there is a problem, you will not see the 'ok' shown in the example above. If the database is very big, the check may take more than a few seconds, so the 'ok' will not appear immediately.

5. Enter the command .q to exit from the utility.

  • WARNING: Be sure to specify the location of the POPFile database correctly in step 3. If the popfile.db file is not in the current directory, the utility will simply create a new database called popfile.db, check this new (empty) database and report “ok”. The SQLite utility will not warn you that it has created a new popfile.db file in the current directory!
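
The same check can also be scripted. The following is a minimal sketch in Python, assuming the database is in SQLite 3 format (Python's standard sqlite3 module cannot read SQLite 2.x files such as the one behind the version banner in step 3, so those still need the command-line utility). The function name check_popfile_db is hypothetical and not part of POPFile. The sketch refuses to run if the file does not exist and opens the database read-only, so it cannot create an empty popfile.db by mistake.

```python
import sqlite3
from pathlib import Path


def check_popfile_db(db_path):
    """Return the rows reported by PRAGMA integrity_check for db_path.

    A healthy database yields the single row "ok".
    """
    path = Path(db_path).resolve()
    if not path.is_file():
        # Fail loudly instead of silently creating (and then checking)
        # a new empty database, as the command-line utility would.
        raise FileNotFoundError(f"no database file at {path}")

    # mode=ro opens the file read-only, so the check cannot alter it.
    conn = sqlite3.connect(f"{path.as_uri()}?mode=ro", uri=True)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError as exc:
        # Severe corruption can make the pragma itself fail.
        return [str(exc)]
    finally:
        conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    # Assumes popfile.db is in the current directory.
    print(check_popfile_db("popfile.db"))
```

As with the manual procedure, shut down POPFile before running the check. Any result other than a single "ok" should be treated as a sign of a problem.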

Dealing with the Corruption

Once a popfile.db is corrupted, your choices to rectify the situation are:

  • Shut down POPFile and delete the corrupt popfile.db file. When you restart POPFile, a new database will be built. This choice means you lose your current corpus and you must re-train POPFile.
  • Fall back to a backup copy of the corpus file(s). You do have backups, right?
  • If you upgraded from a prior version of POPFile, fall back to the backup corpus that was made by the Windows installer. That backup will be located in the backup\oldsql directory.
    1. Delete the popfile.db file in your user folder.
    2. Copy the backup popfile.db to your user folder.
    3. Restart POPFile.
  • If you are technically inclined, you can attempt to recover the contents of the corrupt corpus by using the SQLite utility to dump the database to a text file and then import it back:
  • sqlite popfile.db .dump >popback.sql
  • rename popfile.db corrupt.db
  • sqlite popfile.db < popback.sql

This may take some time, depending on your corpus size. Be patient and do NOT abort or kill POPFile, or you will simply create another corrupt corpus.
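
For readers who prefer a script, the dump-and-reload steps above can be sketched in Python. This is only an illustration under stated assumptions: the database must be in SQLite 3 format (Python's sqlite3 module cannot open SQLite 2.x files), and the names rebuild_database and rebuilt.db are hypothetical, not part of POPFile. Unlike the commands above, the sketch writes the recovered copy to a separate file and leaves the renaming to you, so the corrupt original is never touched.

```python
import sqlite3
from pathlib import Path


def rebuild_database(corrupt_path, rebuilt_path):
    """Dump whatever SQL can be read from corrupt_path into rebuilt_path.

    Returns True if the dump ran to completion, False if SQLite reported
    an error part-way through (the rebuilt file then holds only the
    statements that were dumped before the error).
    """
    source_file = Path(corrupt_path).resolve()
    if not source_file.is_file():
        raise FileNotFoundError(str(source_file))
    if Path(rebuilt_path).exists():
        raise FileExistsError(str(rebuilt_path))

    # Open the damaged file read-only so the attempt cannot make it worse.
    source = sqlite3.connect(f"{source_file.as_uri()}?mode=ro", uri=True)
    target = sqlite3.connect(rebuilt_path)
    complete = True
    statements = []
    try:
        try:
            # iterdump() yields the same kind of SQL text as ".dump".
            for statement in source.iterdump():
                statements.append(statement)
        except sqlite3.DatabaseError:
            complete = False
            # The dump opens with BEGIN TRANSACTION; close it so the
            # statements recovered so far are actually kept.
            statements.append("COMMIT;")
        target.executescript("\n".join(statements))
    finally:
        source.close()
        target.close()
    return complete


if __name__ == "__main__":
    # Assumes POPFile is shut down and popfile.db is in the current directory.
    finished = rebuild_database("popfile.db", "rebuilt.db")
    print("dump completed" if finished else "dump stopped early")
```

If the result looks usable, rename popfile.db to corrupt.db and rebuilt.db to popfile.db by hand, mirroring the rename step above.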

  • Now that you have seen how much corruption was in the database, you have to make a choice: either delete the bad buckets entirely, or go with whatever words from the buckets the utility was able to recover. There is no right answer here. Neither choice is very attractive, because whichever you choose, you will lose the missing words. Use your own judgement in deciding what is best for you. In making your decision, consider that if a lot of data was lost from a bucket and you decide to retain what was recoverable instead of starting over and re-training POPFile, you may create an unbalanced corpus that negatively impacts classification accuracy. In many cases, it is easier and quicker to start over than to recover from an unbalanced corpus.
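
To get a rough idea of how much data survived, one option is to compare row counts per table between two database files, for example the recovered database and an older backup. The sketch below is an assumption-laden illustration, not a POPFile tool: it requires SQLite 3 format files, the function names are hypothetical, and it reads table names from sqlite_master rather than assuming anything about POPFile's schema. Row counts are only a coarse indicator and say nothing about which words were lost.

```python
import sqlite3
from pathlib import Path


def table_row_counts(db_path):
    """Map each table name in db_path to its row count.

    A table that cannot be read (for example because of corruption)
    maps to None.
    """
    if not Path(db_path).is_file():
        # Avoid creating a new empty database by accident.
        raise FileNotFoundError(str(db_path))
    conn = sqlite3.connect(db_path)
    counts = {}
    try:
        tables = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
        for (name,) in tables:
            # Quote the identifier so unusual table names are handled.
            quoted = '"' + name.replace('"', '""') + '"'
            try:
                row = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()
                counts[name] = row[0]
            except sqlite3.DatabaseError:
                counts[name] = None
    finally:
        conn.close()
    return counts


def compare_databases(first_path, second_path):
    """Return {table: (count_in_first, count_in_second)}.

    None means the table is missing from, or unreadable in, that file.
    """
    first = table_row_counts(first_path)
    second = table_row_counts(second_path)
    names = sorted(set(first) | set(second))
    return {name: (first.get(name), second.get(name)) for name in names}
```

A large drop in a table's count between the backup and the recovered file suggests that re-training from scratch may be the safer choice.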

Old Versions

For trouble upgrading from the older 0.20.x versions only, see Corpus Corruption 0.20.x.

 
troubleshooting/corruptcorpus.txt · Last modified: 2012/08/26 00:16 by 127.0.0.1

The content of this wiki is protected by the GNU Free Documentation License