Statistics clarification

It is not clear if the "Classification Count" on the "Buckets" tab in the Control Centre is the count before or after reclassification. It would be useful to mention this in the title.

As for the Classification Accuracy, it presently shows 95% after a few days of learning. Not too bad, one would think, but I have had 24 false positives in the spam basket against 134 in the non-spam classification count. If I calculate 24/134, I find that 18% of non-spam e-mails have been wrongly classified as spam and only 82% of them correctly classified. The non-spam e-mails are what I care about, not whether 99% or 95% of spam has been weeded out. I therefore think that the Classification Accuracy - if not accompanied by a percentage like this - is somewhat misleading, as it could make the casual user think that only 5% of non-spam e-mails are wrongly classified.

  • Message #368

    It is not clear if the "Classification Count" on the "Buckets" tab in the Control Centre is the count before or after reclassification

    This is explained in the wiki page describing the UI's BUCKETS page ("Classification Accuracy box. This lists the number of messages classified, the number of classification errors, and the resulting accuracy (in percent)").

    The Classification Accuracy is calculated using the "Classification errors" and the "Messages classified" values.

    For example my current statistics are:

    Messages classified: 11,704
    Classification errors: 41
    Accuracy: 99.64%

    So far 11,704 messages have been classified and I had to correct 41 of those classifications so only 11,663 messages were correctly classified by POPFile. Therefore the accuracy is ((11,663/11,704)*100) percent.
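
The calculation above can be reproduced with a short Python sketch. The function name `accuracy_percent` is illustrative and not part of POPFile. The comparison of truncation and rounding is an assumption about how the displayed value might be produced, not a statement about POPFile's code.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP


def accuracy_percent(classified: int, errors: int) -> Decimal:
    """Percentage of classified messages that did not need correcting."""
    if classified <= 0:
        raise ValueError("classified must be positive")
    return Decimal(classified - errors) / Decimal(classified) * 100


# Figures quoted in the message above.
accuracy = accuracy_percent(classified=11704, errors=41)

# Two ways the value could be cut to two decimal places (assumption:
# which one POPFile uses is not stated in this thread).
truncated = accuracy.quantize(Decimal("0.01"), rounding=ROUND_DOWN)
rounded = accuracy.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(f"truncated: {truncated}%  rounded: {rounded}%")
```

With these figures truncation gives 99.64, which matches the value quoted, while rounding gives 99.65.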

    Brian

    (edited to fix the missing " character)

    • Message #370

      Brian,

      The wiki doesn't say if the classification count shows numbers before or after reclassification.

      I understand how the accuracy is calculated. What I say is that the way it is calculated seems less relevant than looking at how many good e-mails went into the spam box by mistake. Let's say I hadn't looked into the spam basket at all. I'd have lost 18% of my good e-mails then, which is only an accuracy of 82%. The way it's calculated now means that the more spam you receive, the more accurate it will look, as it's not too difficult to identify the majority of spam, even if a not insignificant amount of good e-mails go into the spam basket. The present accuracy percentage is actually quite useless to evaluate if you get all the e-mails you want. The only thing a high accuracy in the present form really tells you is that you receive vast amounts of spam, but you knew that already.
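
The effect described above, where the overall figure improves as spam volume grows, can be illustrated with a small Python sketch. It takes the 24 and 134 figures from the first post at face value (134 good e-mails, 24 of them misfiled as spam) and assumes, purely for illustration, that every spam message is classified correctly. The spam volumes and the function names are hypothetical.

```python
def overall_accuracy(good_total, good_misfiled, spam_total, spam_misfiled=0):
    """Accuracy over all messages, wanted and unwanted alike (percent)."""
    classified = good_total + spam_total
    errors = good_misfiled + spam_misfiled
    return 100.0 * (classified - errors) / classified


def good_mail_accuracy(good_total, good_misfiled):
    """Accuracy over the wanted messages only (percent)."""
    return 100.0 * (good_total - good_misfiled) / good_total


GOOD_TOTAL = 134     # from the first post
GOOD_MISFILED = 24   # from the first post

# Hypothetical spam volumes; the good-mail figures stay fixed.
for spam_total in (0, 346, 5000):
    overall = overall_accuracy(GOOD_TOTAL, GOOD_MISFILED, spam_total)
    good_only = good_mail_accuracy(GOOD_TOTAL, GOOD_MISFILED)
    print(f"spam={spam_total:5d}  overall={overall:6.2f}%  good mail={good_only:6.2f}%")
```

Under these assumptions the good-mail figure stays at about 82% while the overall figure reaches 95% with 346 spam messages and keeps climbing as more spam arrives.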

      • Message #372

        The wiki doesn't say if the classification count shows numbers before or after reclassification.

        I thought the wiki's explanation was good enough but if you care to suggest a better description then it'll be considered - that's why we have this forum.

        What I say is that the way it is calculated seems less relevant than looking at how many good e-mails went into the spam box by mistake.

        I think you have misunderstood what the "Classification Accuracy" value represents. POPFile is a general purpose email classifier, so this value only gives an indication of how well POPFile is performing overall. It does not indicate how good POPFile is at classifying spam - you need to look at the "Messages Classified" table to get that information.

        My POPFile installation uses seven buckets in addition to the special "unclassified" bucket. I only have one bucket for spam but some users have several buckets so they can keep track of the different types of spam they receive.

        A few years ago John proposed using Hit Rate/Strike Rate data to give a better indication of how well POPFile is performing and some patches were produced to implement this idea. I was interested in this so I customised my installation to use this idea, as shown in this screenshot.

        The present accuracy percentage is actually quite useless to evaluate if you get all the e-mails you want.

        I don't agree - I find this statistic very useful as a quick check on how well POPFile is handling my mail.

        Brian

        (edited to remove duplicated link)

        • Message #375

          Well, before I can suggest an improved description, I would need to know if the figures are before or after manual reclassification, and since the documentation doesn't say that...

          I am not so stupid as to have misunderstood what Classification Accuracy means. I repeat that I simply find that the present calculation is not very useful. POPFile may be a general purpose classification tool, but it is hardly any secret that its main purpose is to act as a spam filter. Strangely enough, "spam" is one of the default baskets, and that one uses red colour.

          It would not seem unreasonable to take into account that the vast majority of e-mails are spam and therefore useless. If all e-mails were equally useful, the present accuracy calculation would be just fine, but as spam constitutes the bulk of e-mail and as spam is not useful, the accuracy statistics that concern the few e-mails one wants to receive are drowning in the spam statistics. You are unable to tell from the present accuracy percentage how much of the 'good' e-mail is going to the spam basket. From a purely mathematical-statistical view, you can defend calculating global statistics regardless of whether the e-mail is useful/good or not useful/bad, but that is to overlook the fact that a wrongly classified "good" e-mail can have much more important consequences than a badly classified spam. You could lose a $2000 business order in the former case whereas you can just delete the one spam manually in the latter case.

          It would enable POPFile to provide more useful statistics if a concept of "wanted" and "unwanted" e-mail were introduced when defining a bucket. The user could then define that (s)he only wants statistics on "wanted", for example, or - if the user prefers the egalitarian approach - stay with global stats. I'm aware that this would mean a programming change and I can't tell how much effort would be needed. The easy part is pointing out what could be improved.

          This is not intended as criticism but as inspiration for improvement. The present behaviour is not wrong. You just can't use the accuracy stats to find out how much wanted e-mail you are losing.

          • Message #378

            Hi

            You mean you want to use 'false positives in the unwanted bucket' per
            'message count in the wanted bucket' as classification accuracy?

            I think we can implement that but I don't know if it's worth doing.

            In my case (I've been using POPFile for years and reset the statistics on
            June 14th), the statistics are:

            Messages classified: 19,353
            Classification errors: 136
            Accuracy: 99.29%

            And the accuracy calculated by your method:

            False positives in the unwanted bucket (spam): 32
            Message count in the wanted bucket (inbox): 16,109
            Accuracy: 99.80%

            Naoki

            • Message #381

              Hi Naoki,

              Yes, that's what I suggest (although I'd put buckets in plural so that it becomes "false positives in any unwanted bucket per message count in all the wanted buckets"). It would mean introducing a new field for each bucket, classifying the buckets as either goodies or baddies.
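
The plural form of the proposal could be sketched in Python as follows. This is only an illustration: the `Bucket` class, its `wanted` flag and the function name are hypothetical and do not correspond to anything in POPFile. The sample figures are the two quoted in the previous message (inbox count 16,109 and 32 false positives in spam); counts not given in the thread are left at zero.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    name: str
    wanted: bool               # hypothetical "goodies"/"baddies" flag
    message_count: int = 0     # messages counted for this bucket
    false_positives: int = 0   # messages filed here that belonged elsewhere


def wanted_mail_accuracy(buckets):
    """Percent of wanted mail that did not end up in an unwanted bucket.

    Simplification: every false positive in an unwanted bucket is treated
    as a wanted message, which is only true if no message is ever moved
    between two unwanted buckets.
    """
    wanted_total = sum(b.message_count for b in buckets if b.wanted)
    lost = sum(b.false_positives for b in buckets if not b.wanted)
    if wanted_total == 0:
        raise ValueError("no messages in wanted buckets")
    return 100.0 * (1 - lost / wanted_total)


buckets = [
    Bucket("inbox", wanted=True, message_count=16109),
    Bucket("spam", wanted=False, false_positives=32),
]
result = wanted_mail_accuracy(buckets)
print(f"{result:.2f}%")  # prints 99.80%
```

Whether `message_count` should be taken before or after manual reclassification is the open question from the start of this thread, so the sketch leaves that to the caller.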

              I would not suggest that it replace the present accuracy percentage but that it supplements it.

              Whether it's worth doing is not for me to say.

              I note that there is no significant difference in your case. After a few days of learning, the difference is very significant here. I suppose the difference will be reduced with time and learning (if not, I can just as well delete spam manually in my e-mail client). However, I regularly receive e-mail in three different languages and I don't know if the efficiency is identical for all languages.

        • Message #376

          A few years ago John proposed using Hit Rate/Strike Rate data to give a better indication of how well POPFile is performing and some patches were produced to implement this idea. I was interested in this so I customised my installation to use this idea, as shown in this screenshot.

          I like that statistics table. Do you remember why we didn't go for it?

          Naoki

          • Message #379

            I like that statistics table. Do you remember why we didn't go for it?

            Yes.

            There were two main problems:

            (1) POPFile does not know which bucket is the 'spam' bucket because the user is not forced to have a bucket called 'spam' (e.g. a user might use 'junk' for the bucket name)

            (2) The user might have more than one bucket for spam (e.g. spam-drugs, spam-419, spam-casino, spam-software) and this would make it a bit harder to compute the Hit Rate and Strike Rate.

            A long time ago I suggested using some extra entries in popfile.cfg to identify the name of the spam bucket (or buckets, if there is more than one) but I've not had time to do any work on this idea.

            Brian

            • Message #388

              There were two main problems:

              (1) POPFile does not know which bucket is the 'spam' bucket because the user is not forced to have a bucket called 'spam' (e.g. a user might use 'junk' for the bucket name)

              (2) The user might have more than one bucket for spam (e.g. spam-drugs, spam-419, spam-casino, spam-software) and this would make it a bit harder to compute the Hit Rate and Strike Rate.

              Thanks.
              I thought that the 'Hit rate' and 'Strike rate' were calculated per bucket.

              A long time ago I suggested using some extra entries in popfile.cfg to identify the name of the spam bucket (or buckets, if there is more than one) but I've not had time to do any work on this idea.

              POPFile v2 will have multi-user mode.
              So if we are to implement the feature, we should add several new fields to the database.

              POPFile currently collects the false positives and the false negatives per bucket.
              This means that the false positives and the false negatives will increase even if POPFile
              mis-classifies a message into another bucket in the same group.
              I think we need to add at least three fields such as 'bucket group', 'false positives per bucket group'
              and 'false negatives per bucket group'.

              Naoki

              • Message #389

                I thought that the 'Hit rate' and 'Strike rate' were calculated per bucket.

                Yes, that is what the current code does. I cannot claim the credit, Jim Lang made the original patches for POPFile and I just made a few adjustments to suit my needs.

                I only have one 'spam' bucket so it is easy for me to compare the performance of my POPFile installation with the Hit Rate/Strike Rate statistics John provided in the league tables.

                However, if more than one bucket is used for spam-type messages then it becomes a bit harder to generate some Hit Rate/Strike Rate data that can be used to compare performance against the data in the league tables.

                So if we are to implement the feature, we should add several new fields to the database.

                That is why I suggested using popfile.cfg - it would not involve changing the database. I thought that approach would make it easier for users to see the new Hit Rate/Strike Rate in action. Then if people were interested we could do it properly by adding extra information to the database.

                Before my old PC suddenly stopped working I had patched POPFile so it could either display the current "Messages Classified" statistics or the new Hit Rate/Strike Rate statistics using simpler code that avoids doing the statistics calculations twice.

                I still have the hard disk from that PC so I hope I can retrieve my modified code and update it. All I need to do is find the time to do this!

                Brian

                • Message #553

                  I still have the hard disk from that PC so I hope I can retrieve my modified code and update it. All I need to do is find the time to do this!

                  Progress report:

                  I've retrieved my old Hit Rate/Strike Rate files and updated them for use with the POPFile 1.1.0 files.

                  POPFile's skins have changed a bit since I made my original skins but I have not yet spent any time trying to bring my modified skin files up to the new standard. So there is still some work to be done :-)

                  Here is a screenshot showing the Hit Rate/Strike Rate table and another screenshot showing my revised version of POPFile's default simplyblue skin.

                  Brian

                  • Message #554

                    Hi, Brian

                    Here is a screenshot showing the Hit Rate/Strike Rate table and another screenshot showing my revised version of POPFile's default simplyblue skin.

                    Looks good.
                    I'd like to include them in POPFile v1.1.0.

                    Naoki

                    • Message #555

                      I'd like to include them in POPFile v1.1.0.

                      I'm not sure they are good enough to be included. The skin templates and style sheets need more work but I've not had time to do this (e.g. the "ocean" skin has not been updated to work with the new table layouts).

                      One of the problems I have is that I do not know much about making skins. Instead of putting my changes into SVN I can provide a zip file or Windows installer which uses my new code. This may take me a couple of days - my internet connection is not very reliable at the moment.

                      Brian

                      • Message #557

                        Instead of putting my changes into SVN I can provide a zip file or Windows installer which uses my new code.

                        I've uploaded the files now: cross-platform version (524 KB) and Windows installer version (5.18 MB)

                        Please note that this is not a release candidate - it is an experimental version demonstrating the Hit Rate/Strike Rate and some other changes I've made to my POPFile installation.

                        I have not had time yet to check all of the skins or even check everything has been ported correctly from 0.22.5 (I've only just updated my main installation from 0.22.5!)

                        Brian

                • Message #404

                  Yes, that is what the current code does. I cannot claim the credit, Jim Lang made the original patches for POPFile and I just made a few adjustments to suit my needs.

                  I see. I've looked through the old forum posts.

                  I only have one 'spam' bucket so it is easy for me to compare the performance of my POPFile installation with the Hit Rate/Strike Rate statistics John provided in the league tables.

                  However, if more than one bucket is used for spam-type messages then it becomes a bit harder to generate some Hit Rate/Strike Rate data that can be used to compare performance against the data in the league tables.

                  Yes, but I think most users have only one spam folder.
                  Hit Rate/Strike Rate per bucket may be useful for typical users.
                  I like this approach.

                  That is why I suggested using popfile.cfg - it would not involve changing the database. I thought that approach would make it easier for users to see the new Hit Rate/Strike Rate in action. Then if people were interested we could do it properly by adding extra information to the database.

                  I agree that using popfile.cfg is easier than changing the database.
                  But POPFile v2 will support multi-user mode.
                  In v2, the configurations for users are stored in the database.
                  So I think we should change the database if we are to implement
                  this feature.

                  In addition, we have to collect false positives/false negatives
                  per 'bucket group' to calculate Hit Rate/Strike Rate per bucket
                  group.
                  When you reclassify a message from 'spam-a' to 'spam-b', the
                  Hit Rate/Strike Rate should not change.
                  The false positives/false negatives per bucket change even if you
                  reclassify messages between buckets in the same group.
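
The difference between per-bucket and per-group counting described above could be sketched like this in Python. Everything here is hypothetical: the class, the bucket-to-group mapping and the counter names are not taken from POPFile's code or database schema. The sketch also assumes that a reclassification counts as a false positive for the bucket the message was wrongly placed in and as a false negative for the bucket it is moved to.

```python
from collections import Counter


class ReclassificationStats:
    """Tracks false positives/negatives per bucket and per bucket group."""

    def __init__(self, bucket_group):
        self.bucket_group = bucket_group   # bucket name -> group name
        self.bucket_fp = Counter()
        self.bucket_fn = Counter()
        self.group_fp = Counter()
        self.group_fn = Counter()

    def reclassify(self, from_bucket, to_bucket):
        if from_bucket == to_bucket:
            return
        # Per-bucket counters change on every reclassification.
        self.bucket_fp[from_bucket] += 1
        self.bucket_fn[to_bucket] += 1
        # Per-group counters change only when the message crosses groups.
        from_group = self.bucket_group[from_bucket]
        to_group = self.bucket_group[to_bucket]
        if from_group != to_group:
            self.group_fp[from_group] += 1
            self.group_fn[to_group] += 1


# Hypothetical grouping of buckets.
stats = ReclassificationStats({
    "inbox": "wanted",
    "spam-a": "unwanted",
    "spam-b": "unwanted",
})

stats.reclassify("spam-a", "spam-b")   # same group: group counters unchanged
stats.reclassify("spam-a", "inbox")    # crosses groups: group counters change

print(stats.bucket_fp["spam-a"], stats.group_fp["unwanted"])  # prints 2 1
```

Because the group counters ignore moves inside a group, a Hit Rate/Strike Rate computed from them would not change when a message goes from 'spam-a' to 'spam-b'.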

                  Naoki

            • Message #382

              Adding the category "good" or "bad" for all buckets would solve the problem of not knowing which bucket or buckets are for spam. Should there be a "neutral"? I don't think so. Its use is likely to be marginal and it would just clutter up the image. It would also respect the principle of POPFile being a general tool in that it doesn't specifically deal with the concept of spam but introduces the general concept of an e-mail being desired or not.

              In the longer term, if someone should wish to implement it, it would allow the introduction of a user setting (1-10 for example) for how aggressive POPFile should be when eliminating baddies. Some people may prefer to get rid of nearly all baddies even if a few goodies are wrongly classified, while others may prefer to be sure they get all the goodies even if some baddies sneak in too, i.e. 'if in doubt, put it in a goodies basket'.

        • Message #377

          Sorry, didn't comment on your screenshot. The hit rate per bucket seems to be what I'm looking for if it means e-mails automatically placed in that bucket and not manually reclassified. But I guess this requires programming, not just configuration. What is the strike rate? As hit rate and strike rate don't add up to 1 for all buckets, it logically can't be e-mails for that bucket that were wrongly placed in another bucket.

          • Message #380

            What is the strike rate?

            John described this in one of his Anti-Spam newsletters. You can get the PDF version from his website and also see data for other spam filters: http://www.jgc.org/astlt.html

            The percentage values are displayed to 4 decimal places in the UI so rounding or truncation will have an effect (I'd need to look at the code to see how these values are output).

            Brian

            (edit - I meant to say "The percentage values are displayed to x decimal places in the UI so rounding or truncation will have an effect (I'd need to look at the code to see how these values are output)." rather than specifically refer to the format of the Hit Rate and Strike Rate values)