Ticket #108 (closed defect: fixed)

Opened 3 years ago

Last modified 3 months ago

Reclassify is much slower in 1.1.0 than it is in 0.22.5

Reported by: brian Assigned to: brian
Priority: high Milestone: 1.1.1
Component: Database Version: 1.1.0
Severity: normal Keywords: reclassify database
Cc:

Description (Last modified by brian)

The Slower reclassifications topic in the HELP OPEN DISCUSSION forum complains that reclassify operations which used to take around 6 seconds with 0.22.5 now take around 47 seconds after upgrading to 1.1.0.

The SQLite database involved is large (0.22.5 size was 102,807 Kb, after the upgrade to 1.1.0 the size is 50,735 Kb)

I've confirmed this behaviour using some test data:

0.22.5 with 150 MB database performs reclassifications in around 9 seconds per message (tested using message sizes up to 99.9 KB)

After upgrading to 1.1.0 RC5 the reclassify times were around 19 seconds (similar size messages used)

After upgrading to current SVN (getting ready for 1.1.1 RC2) the time increased to over 26 seconds or even over 30 seconds. Changing from "delete" to "memory" for the SQLite journal mode reduced the times by only a few seconds (e.g. 24 instead of 26 seconds).

I have just noticed that the ADVANCED page in 1.1.1's UI still shows bayes_sqlite_tweaks even though the Option Reference page no longer lists this parameter. If I shutdown POPFile and delete this line from popfile.cfg it gets reinstated the next time I start POPFile. I thought these tweaks were all to do with tweaking SQLite 2.x - are any of them still relevant now that we are using SQLite 3.x?

[Edited to correct the name of the forum]

Change History

06/21/09 01:47:52 changed by brian

  • description changed.

06/21/09 16:39:48 changed by brian

  • status changed from new to assigned.
  • owner set to brian.

I've run some tests using ActivePerl? 5.8.9 build 826 to run popfile.pl (instead of using the normal minimal Perl). Here are some extracts form my log with logger_level set to 2:

POPFile 0.22.5 code:

2009/6/21 14:55:24 7396: bayes: 472: Reclassification from unclassified to spam
2009/6/21 14:55:25 7396: history: 404: Change slot classification of 2772 to spam
2009/6/21 14:55:31 7396: history: 1003: Base query is select XXX from ...

POPFile 1.1.1 code before Naoki's patches:

2009/6/21 14:59:31 7172: bayes: 540: Reclassification from unclassified to spam
2009/6/21 14:59:31 7172: history: 411: Change slot classification of 2774 to spam
2009/6/21 14:59:57 7172: history: 1057: Base query is select XXX from ...

POPFile 1.1.1 code with patches [3535], [3536] and [3537] from Naoki:

2009/6/21 15:19:47 5328: bayes: 539: Reclassification from unclassified to spam
2009/6/21 15:19:47 5328: history: 411: Change slot classification of 2769 to spam
2009/6/21 15:19:56 5328: history: 1051: Base query is select XXX from ...

So with these patches 1.1.1 is now almost as fast as 0.22.5.

06/21/09 18:04:57 changed by brian

Here are the results from a more controlled test; the database and message history were restored and then the same two messages were reclassified:

0.22.5 log extract:

2009/6/21 15:53:50 2620: bayes: 472: Reclassification from unclassified to regular
2009/6/21 15:53:51 2620: history: 404: Change slot classification of 2757 to regular
2009/6/21 15:54:00 2620: history: 1003: Base query is select XXX from ...
...
2009/6/21 15:54:29 2620: bayes: 472: Reclassification from unclassified to spam
2009/6/21 15:54:29 2620: history: 404: Change slot classification of 2767 to spam
2009/6/21 15:54:36 2620: history: 1003: Base query is select XXX from ...

1.1.1 with [3535], [3536] and [3537] patches:

2009/6/21 16:01:25 7992: bayes: 539: Reclassification from unclassified to regular
2009/6/21 16:01:25 7992: history: 411: Change slot classification of 2757 to regular
2009/6/21 16:01:34 7992: history: 1051: Base query is select XXX from ...
...
2009/6/21 16:02:11 7992: bayes: 539: Reclassification from unclassified to spam
2009/6/21 16:02:11 7992: history: 411: Change slot classification of 2767 to spam
2009/6/21 16:02:19 7992: history: 1051: Base query is select XXX from ...

Time to reclassify a 99.8 KB message (slot 2757):

0.22.5 = 9 seconds

1.1.1 = 9 seconds

Time to reclassify a 939 byte message (slot 2767):

0.22.5 = 7 seconds

1.1.1 = 8 seconds

So there is not much difference between 0.22.5 and the current 1.1.1 SVN code now.

06/21/09 18:52:54 changed by brian

Naoki's latest patches make 1.1.1 much faster than 0.22.5 now:

0.22.5 log extract:

2009/6/21 15:53:50 2620: bayes: 472: Reclassification from unclassified to regular
2009/6/21 15:53:51 2620: history: 404: Change slot classification of 2757 to regular
2009/6/21 15:54:00 2620: history: 1003: Base query is select XXX from ...
...
2009/6/21 15:54:29 2620: bayes: 472: Reclassification from unclassified to spam
2009/6/21 15:54:29 2620: history: 404: Change slot classification of 2767 to spam
2009/6/21 15:54:36 2620: history: 1003: Base query is select XXX from ...

1.1.1 with the [3538] & [3539] patches:

2009/6/21 17:46:18 2320: bayes: 540: Reclassification from unclassified to regular
2009/6/21 17:46:18 2320: history: 411: Change slot classification of 2757 to regular
2009/6/21 17:46:21 2320: history: 1051: Base query is select XXX from ...
...
2009/6/21 17:46:33 2320: bayes: 540: Reclassification from unclassified to spam
2009/6/21 17:46:33 2320: history: 411: Change slot classification of 2767 to spam
2009/6/21 17:46:34 2320: history: 1051: Base query is select XXX from ...

Time to reclassify a 99.8 KB message (slot 2757):

0.22.5 = 9 seconds

1.1.1 = 3 seconds

Time to reclassify a 939 byte message (slot 2767):

0.22.5 = 7 seconds

1.1.1 = 1 second

06/21/09 23:56:10 changed by brian

POPFile 1.1.1 RC2 uses patches [3535], [3536], [3537], [3538] and [3539] to improve performance.

06/23/09 14:57:11 changed by brian

  • status changed from assigned to closed.
  • resolution set to fixed.

Naoki's new code in POPFile 1.1.1 RC2 has made a dramatic improvement in POPFile's performance, as reported in messages 866 and 870 in the Open Discussion forum. This ticket can now be closed.