insert.pl utility script

The insert.pl script provides a way to train your corpus by feeding it sample emails for a particular bucket. Those emails are parsed and internally reclassified to the bucket you specify.

About Sample Size

If you use this script to train POPFile with email samples, be careful about sample size. This is not a recommended way to train POPFile; it is a utility designed for testing. If you do use it for training, do not submit thousands of emails to the script: you will end up with a huge corpus that adds little to classification accuracy. Stick to small, representative samples of at most 100 emails per bucket. POPFile learns quickly, so this script is unnecessary for normal training and will give less accurate classification than the recommended TOE (train only errors) method. You may want to look into TrainTest.py, which can simulate TOE.
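One way to keep a sample within the 100-message limit is to copy a random subset of a larger message folder into a separate directory and feed only that directory to insert.pl. The following is a minimal sketch, not part of POPFile; the function name pick_sample and its parameters are hypothetical, and it assumes one message per file.

```python
import random
import shutil
from pathlib import Path


def pick_sample(source_dir, dest_dir, limit=100, seed=None):
    """Copy at most `limit` randomly chosen files from source_dir to dest_dir.

    Hypothetical helper: assumes each file in source_dir holds one message.
    Returns the sorted names of the files that were copied.
    """
    files = sorted(p for p in Path(source_dir).iterdir() if p.is_file())
    rng = random.Random(seed)
    # Take everything when the folder is already small enough.
    chosen = files if len(files) <= limit else rng.sample(files, limit)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for path in chosen:
        shutil.copy2(path, dest / path.name)
    return sorted(p.name for p in chosen)
```

Calling pick_sample("\\mail\\spam", "\\poptemp", limit=100) would leave at most 100 files in \poptemp (both paths are placeholders). A random draw is only one way to get a representative sample; choosing messages by hand works just as well.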

Usage

Shut Down POPFile Before Using

Shut down any running instance of POPFile before you use this script. insert.pl modifies the corpus by adding words to it, so running it concurrently with POPFile can damage the corpus databases.
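A quick, rough check that POPFile is no longer running is to see whether anything still accepts connections on the port its user interface uses. The sketch below assumes that port is 8080 on the local machine; this is an assumption, so use the port from your own configuration. The function name port_in_use is hypothetical, and a free port does not prove that no POPFile process exists.

```python
import socket


def port_in_use(host="127.0.0.1", port=8080, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    The default port is an assumption about the POPFile UI port;
    pass the value from your own setup.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused or timed out: nothing is listening there.
        return False
```

If port_in_use() returns True, shut POPFile down and check again before running insert.pl.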

The script must be run from the POPFile installation directory. Windows users should open a DOS box and switch to the popfile directory (normally c:\program files\popfile\ but it can be different on your system).

   cd "\program files\popfile\"

Once in the popfile installation directory, issue one of the following commands to run the script.

Feeding a directory of messages

   perl insert.pl bucketname \path\to\messages\*.*

Feeding a single message

   perl insert.pl bucketname messagefilename
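When several messages have to go into one bucket, the single-message form above can be repeated from a small wrapper. This is a sketch under assumptions: build_commands and feed are hypothetical names, perl is assumed to be on the PATH, and the wrapper calls insert.pl once per file rather than relying on any wildcard handling.

```python
import subprocess
from pathlib import Path


def build_commands(bucket, message_dir, pattern="*"):
    """Return one 'perl insert.pl bucket file' command per matching file."""
    # Resolve to absolute paths so they stay valid when perl runs elsewhere.
    base = Path(message_dir).resolve()
    files = sorted(p for p in base.glob(pattern) if p.is_file())
    return [["perl", "insert.pl", bucket, str(p)] for p in files]


def feed(bucket, message_dir, popfile_dir, pattern="*", dry_run=True):
    """Print the commands (dry_run=True) or run them from popfile_dir."""
    commands = build_commands(bucket, message_dir, pattern)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # insert.pl must be run from the POPFile installation directory.
            subprocess.run(cmd, cwd=popfile_dir, check=True)
    return len(commands)
```

With dry_run left at True the wrapper only prints what it would run, which is a convenient way to confirm the bucket name and file list before anything touches the corpus.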

Tips on Obtaining the Sample Emails

Outlook/Outlook Express Users

  1. Create a temporary folder on your hard drive and name it poptemp.
  2. Open the folder.
  3. Open your mail client and resize it so you can see the poptemp folder you created.
  4. Select the email messages in your mail client that you want to include in the sample.
  5. Drag and drop the selected messages onto the poptemp folder.

The messages will be placed in the poptemp folder as .eml files. You can feed that folder directly to insert.pl as follows:

   perl insert.pl bucketname \poptemp\*.eml

Eudora (and clients with MBOX or MBX style mail)

  1. insert.pl works directly with the mbx file created by Eudora and similar mail clients.
  2. Make sure the folder contains only the messages you want to include in the sample.
  3. Note the folder name used in Eudora; it determines the file name you will use. For example, a folder named Newsletters in Eudora will have the file name newsletters.mbx.
  4. Feed the mbx file to insert.pl as follows:

   perl insert.pl bucketname \path\to\eudora\newsletters.mbx
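Before feeding a mailbox file, it can help to confirm that it holds only the messages you intend to include. The sketch below lists the subject lines with Python's standard mailbox module. It assumes the file follows the usual mbox layout, where each message starts with a "From " line; list_subjects is a hypothetical helper and not part of POPFile.

```python
import mailbox


def list_subjects(mbox_path):
    """Return the Subject header of every message in an mbox-style file."""
    # create=False raises an error instead of silently creating a new file.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [str(msg.get("Subject", "(no subject)")) for msg in box]
    finally:
        box.close()
```

The length of the returned list is the number of messages insert.pl would be given, so it is easy to compare against the 100-message guideline.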
 
utilityscripts/insert.txt · Last modified: 2008/02/08 19:49 by 127.0.0.1

The content of this wiki is protected by the GNU Free Documentation License.