clean_corpus.pl

There's a script named clean_corpus.pl at http://getpopfile.org/docs/utilityscripts. I understand the stated purpose of the script per the documentation, but I'm wondering about its applicability to, and compatibility with, the current 1.1.0 version of POPFile. Is the current thinking that there is no need to clean the corpus, or is that sort of culling (based on date, frequency of occurrence, or some other criterion) now built into POPFile itself? In other words, should I be considering running this script for any reason at all?

  • Message #778

    Thank you for asking.

    The only reason I can think of for running clean_corpus today is purely scientific: would it run at all? Is it still compatible with POPFile? Is there a measurable performance benefit?

    I doubt that each of those questions can be answered with "yes". clean_corpus was written when POPFile's corpus was not yet stored in a database but in plain text files that were read when POPFile started up. Thus cleaning dirt from the corpus was beneficial for POPFile start-up time (which was really bad back then when your corpus was large) and for POPFile's memory consumption. Both of these are no longer an issue, as far as I know.

  • Message #779

    About 20 months ago I used clean_corpus.pl's "probability" mode to remove 56 words which had similar probabilities in all 7 of my buckets ... and found that POPFile's accuracy went down!

    After about three weeks I got fed up with having to reclassify messages that used to be handled correctly so I gave up and restored the corpus from the backup made before clean_corpus.pl modified my corpus. This made an immediate improvement in POPFile's accuracy (i.e. the number of reclassifications went down).

    Brian
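To make "similar probabilities in all buckets" concrete, here is a minimal sketch of how such words could be identified. It is a hypothetical illustration only: the function name, the `max_ratio` threshold, and the in-memory data layout are assumptions, and it does not reproduce the actual criterion used by clean_corpus.pl's "probability" mode.

```python
def similar_probability_words(bucket_counts, max_ratio=1.5):
    """Return words whose per-bucket probabilities are all close to each other.

    bucket_counts maps a bucket name to a {word: count} dict (hypothetical layout).
    A word is flagged when it occurs in every bucket and the ratio between its
    highest and lowest per-bucket probability is at most max_ratio (assumed rule).
    """
    totals = {bucket: sum(counts.values()) for bucket, counts in bucket_counts.items()}
    if any(total == 0 for total in totals.values()):
        raise ValueError("every bucket must contain at least one word")

    # Every word seen in any bucket
    vocabulary = set().union(*(counts.keys() for counts in bucket_counts.values()))

    flagged = []
    for word in sorted(vocabulary):
        # P(word | bucket) = occurrences of word in bucket / total words in bucket
        probs = [bucket_counts[b].get(word, 0) / totals[b] for b in bucket_counts]
        if min(probs) > 0 and max(probs) / min(probs) <= max_ratio:
            flagged.append(word)
    return flagged


# Made-up counts for two buckets, purely for illustration
example = {
    "spam": {"the": 50, "viagra": 40, "meeting": 10},
    "work": {"the": 45, "viagra": 5, "meeting": 50},
}
print(similar_probability_words(example))
```

In this made-up example only "the" is flagged, since its probability is 0.50 in one bucket and 0.45 in the other. Brian's experience above suggests that removing such words can still hurt accuracy, so a list like this is better treated as something to inspect than something to delete.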

    • Message #780

      Thank you both Brian and Manni,

      It certainly sounds as if the implicitly recommended approach is:

      1) Leave the corpus alone,
      2) Don't try to clean or prune it, and
      3) Don't worry about how large the corpus may grow.

      Correct?

      • Message #781

        2) Don't try to clean or prune it

        I cannot remember the last time I tried cleaning the corpus. The only maintenance I remember doing in the last few years is to defragment the corpus (about 2 or 3 times in the last four years).

        Some POPFile version upgrades do an automatic defrag of the corpus anyway when they are installed, so I stopped doing it manually. I have noticed that when I do check for fragmentation the corpus no longer seems to get fragmented as much as it used to. This is one reason why I no longer defragment it manually.

        However one thing I do frequently is check the integrity of my corpus and make a backup copy of it.

        3) Don't worry about how large the corpus may grow.

        How big is your corpus? The Windows installer creates a "Check database status" shortcut that will check the integrity of the corpus and report its size (you'll need to shut down POPFile before you use this shortcut).

        Some ad hoc tests I ran a long time ago suggested that POPFile may start up slowly when the corpus is very large (i.e. over 100 or 150 MB?). However I've not run any tests since POPFile switched to using SQLite 3.x format databases.

        Brian
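For those not using the Windows shortcut, the maintenance steps mentioned above (integrity check, size report, backup, defragmentation) can be approximated with generic SQLite commands. The following is a minimal sketch using Python's standard sqlite3 module. It assumes the corpus is an ordinary SQLite 3 file and that POPFile has been shut down first; the function names and the scratch table in the demo are hypothetical, and this is not the code behind the "Check database status" shortcut.

```python
import os
import sqlite3
import tempfile


def corpus_status(db_path):
    """Return (integrity_result, size_in_bytes) for an SQLite 3 database file."""
    conn = sqlite3.connect(db_path)
    try:
        # SQLite reports the single value "ok" when it finds no corruption
        integrity = conn.execute("PRAGMA integrity_check").fetchone()[0]
        page_count = conn.execute("PRAGMA page_count").fetchone()[0]
        page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    finally:
        conn.close()
    return integrity, page_count * page_size


def backup_corpus(db_path, backup_path):
    """Copy the database using SQLite's backup API (Python 3.7 or later)."""
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(backup_path)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()


def defragment_corpus(db_path):
    """Rebuild the database file with VACUUM, which reclaims free pages."""
    # isolation_level=None gives autocommit mode; VACUUM cannot run inside a transaction
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("VACUUM")
    finally:
        conn.close()


if __name__ == "__main__":
    # Demo on a throwaway database, not on a real POPFile corpus
    with tempfile.TemporaryDirectory() as tmp:
        db = os.path.join(tmp, "scratch.db")
        conn = sqlite3.connect(db)
        conn.execute("CREATE TABLE scratch (word TEXT)")  # hypothetical table
        conn.executemany("INSERT INTO scratch VALUES (?)",
                         [("word%d" % i,) for i in range(1000)])
        conn.commit()
        conn.close()

        print(corpus_status(db))
        backup_corpus(db, os.path.join(tmp, "scratch-backup.db"))
        defragment_corpus(db)
```

To try this on a real corpus, make a plain file copy first and point `corpus_status` at the copy.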

        • Message #782

          Brian,

          My SQLite 3.x corpus is minuscule. I'd read an earlier post also about the degraded performance seen when it got to be large. I'm probably being overly concerned, but I won't know until I've experienced a significant increase in the DB size. Logically, it just seemed like there should be some mechanism in place that would provide a means to reduce the DB size should it become a factor performance wise.

          I'll certainly utilize the DB tool you pointed out.

          Thanks again for all your attention to this inquiry.

          • Message #783

            My SQLite 3.x corpus is minuscule.

            Same here: 2,440 KB at present (this corpus is just over 4 years old now). I've always been careful to keep the size down by only training on errors and ignoring those messages where Single Message View reports the message would now be classified correctly.

            One of the side-effects of the change from SQLite 2.x to 3.x (introduced in the 1.1.0 release) is that the database size was reduced to about a third of its size.

            it just seemed like there should be some mechanism in place that would provide a means to reduce the DB size should it become a factor performance wise.

            One of the reasons for upgrading to SQLite 3.x is that the new file format results in databases that are 25% smaller (depending on content). The new database format also offers better scalability and the [SQLite] code is a little faster too.

            Brian