Noticeable increase in unclassified tags for one user

I have been running POPFile for years and highly recommend it. I am posting because I have recently seen a noticeable increase in the number of unclassified messages. I POP mail from 12 users' hosted mailboxes, and when a very similar message is sent to multiple users, the copy for one user is tagged unclassified while all of the others are tagged as spam. This occurs often, but only recently. Does anyone have an explanation for this behavior?

FWIW, the database status checks out ok. The popfile.db file size is 13,651KB. The user experiencing the unclassified problem has the highest volume of mail.

  • Message #1943

    There is no simple, straightforward answer to your question.

    How many POPFile buckets have you defined ?

    How are you handling the mail for each user (one group of buckets per user or all users share the same buckets or some other scheme) ?

    Have you used "Single Message View" to get more information about why the messages were marked as "unclassified" ?

    Have you made any recent changes to POPFile's configuration ? Some changes can have an adverse effect upon POPFile's accuracy - see the "Is it possible to 'unbalance' your corpus?" page for some examples.

    I've also found that deleting the stopwords list can have a dramatic effect upon accuracy (I had to delete my corpus and start training from scratch to get back to my normal accuracy).

    Other factors that affect accuracy include the message length and excessive use of magnets (see "What variables affect 'good' statistical ratings?" for further details).

    • Message #1995

      Sorry for the delay in responding...

      I have 2 buckets, 7 magnets.

      All users share the same buckets.

      The only thing I can determine from the single message view is that the messages are being classified as "unclassified" because the scores for the two buckets are very close to each other. Part of the puzzle that I don't understand is why the same message can come in to multiple users and be correctly tagged as spam for all but one of the users.

      How would I go about rebuilding the corpus?

      Finally, how does the "stopwords" list fit into all this? A review of the list contents does not give me any insights but I see numerous non-word entries. I also see that the stopwords file is the same as the stopwords.default file so it appears to have never been modified.

      • Message #1996

        The only thing I can determine from the single message view is that the messages are being classified as "unclassified" because the scores for the two buckets are very close to each other.

        POPFile marks messages as "unclassified" when the difference between the bucket probabilities is "not very big". This limit is one of the configuration parameters you can change using the ADVANCED page in the POPFile UI.

        If you find that the Single Message View for "unclassified" messages usually shows that the correct bucket has a higher probability than the other bucket then you can try changing the bayes_unclassified_weight setting to reduce the number of "unclassified" messages.
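The rule described above can be sketched roughly as follows. This is an illustration only, not POPFile's actual code: it assumes the bucket scores are natural-log probabilities and that bayes_unclassified_weight acts as a minimum ratio between the top two buckets. The function name, the default of 100 and the sample scores are made up for the example.

```python
import math


def classify(log_scores, unclassified_weight=100):
    """Return the winning bucket, or "unclassified" if its lead is too small.

    log_scores maps bucket name -> natural log of that bucket's probability.
    Assumption: the winner must be more than `unclassified_weight` times as
    probable as the runner-up, otherwise the message stays unclassified.
    """
    ranked = sorted(log_scores, key=log_scores.get, reverse=True)
    if len(ranked) < 2:
        return ranked[0] if ranked else "unclassified"
    best, runner_up = ranked[0], ranked[1]
    margin = log_scores[best] - log_scores[runner_up]
    if margin > math.log(unclassified_weight):
        return best
    return "unclassified"


# Two buckets whose scores are close together (hypothetical values).
scores = {"spam": -10.0, "newmail": -11.0}
print(classify(scores, unclassified_weight=100))  # unclassified
print(classify(scores, unclassified_weight=2))    # spam
```

In this sketch a smaller weight lets close calls through to the leading bucket, which is only helpful if the leading bucket is usually the correct one.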

        Part of the puzzle that I don't understand is why the same message can come in to multiple users and be correctly tagged as spam for all but one of the users.

        This is not surprising. For example, when POPFile analyses a message it takes into account the email headers which are normally hidden by email clients. You can see these extra headers in Single Message View - there can be 40 or more lines of data before the actual text of the message appears. Some of these lines mention the email servers which handled the message and will therefore vary according to the account which received the message.
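To see which headers differ between two copies of the "same" message outside the POPFile UI, something like the following could be used. It is a minimal sketch built on Python's standard email module; the function name and the two sample messages are invented for illustration and have nothing to do with POPFile's own parser.

```python
from email import message_from_string


def differing_headers(raw_a, raw_b):
    """Return sorted, lower-cased names of headers whose values differ."""

    def header_map(raw):
        # Collect every value of every header (some, like Received, repeat).
        headers = {}
        for name, value in message_from_string(raw).items():
            headers.setdefault(name.lower(), []).append(value)
        return headers

    a, b = header_map(raw_a), header_map(raw_b)
    return sorted(n for n in a.keys() | b.keys() if a.get(n) != b.get(n))


# Hypothetical copies of one spam message delivered to two accounts.
COPY_A = (
    "Received: from mx1.example.com by mail-a.example.net\n"
    "From: sender@example.com\n"
    "To: alice@example.org\n"
    "Subject: Cheap offer\n"
    "\n"
    "Buy now\n"
)
COPY_B = (
    "Received: from mx1.example.com by mail-b.example.net\n"
    "From: sender@example.com\n"
    "To: bob@example.org\n"
    "Subject: Cheap offer\n"
    "\n"
    "Buy now\n"
)

print(differing_headers(COPY_A, COPY_B))  # ['received', 'to']
```

Even with identical body text, the server-related lines differ, so the two copies do not present the same set of words to the classifier.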

        • Message #1998

          Thanks Brian. I am thinking it might be time to rebuild the corpus as there are too many obvious spam messages getting through. Word counts look awfully high as well.

          newmail 446,111 (22.30%)
          spam 1,553,966 (77.69%)

          How do I go about rebuilding it?

          • Message #1999

            POPFile uses a database to hold the corpus and magnet data. By default this is a SQLite database and everything is stored in a single file called popfile.db.

            As you only have 2 buckets and 7 magnets I think the easiest way to rebuild the corpus is to

            (1) Make a note of the settings for all of the magnets (so you can easily re-create them).

            (2) Shut down POPFile.

            (3) Rename the database file (e.g. to old-popfile.db) or move it to another folder. This will make it easy to revert to this database.

            (4) Re-start POPFile.

            POPFile will create an empty database with no magnets and only the special "unclassified" bucket. You can then create the buckets and any magnets you need.
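For anyone who prefers to script step (3), here is a minimal sketch in Python. The data directory is an assumption (it depends on where POPFile keeps popfile.db on your system), the function name is made up, and the script does not stop or start POPFile or save magnets for you, so steps (1), (2) and (4) still have to be done by hand.

```python
from pathlib import Path


def set_aside_database(data_dir, db_name="popfile.db",
                       backup_name="old-popfile.db"):
    """Rename the corpus database so a fresh one is created on restart.

    Run this only while POPFile is shut down. Returns the backup path.
    """
    source = Path(data_dir) / db_name
    backup = Path(data_dir) / backup_name
    if not source.is_file():
        raise FileNotFoundError(f"no database found at {source}")
    if backup.exists():
        # Never silently destroy an earlier backup.
        raise FileExistsError(f"refusing to overwrite {backup}")
    source.rename(backup)
    return backup
```

To revert, shut down POPFile again, remove the newly created popfile.db and rename old-popfile.db back to popfile.db.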

            Be careful with magnets as they can have an adverse effect upon POPFile's accuracy. I've never used any magnets. See "What is a magnet?" and "Setting up magnets".

            You will need to classify some mail to each of the buckets you created in order to teach POPFile the type of email that belongs to each bucket.

            You may be surprised at how quickly POPFile learns how to classify your incoming mail.

            The "How long will it take until POPFile will reach a decent accuracy?" page in the online manual includes some statistics compiled while I was training POPFile to handle my mail.

            There is no need to reclassify every message that POPFile has failed to classify correctly.

            For example if several similar spam messages were not marked as 'spam' it is often only necessary to reclassify one of these as 'spam'.

            When you reclassify a message POPFile updates the data it uses to classify messages. This will not change the current classification of the other similar messages because they've already been passed along to the mail client, but will affect future messages.

            You can see if a message would now be classified differently by using "Single Message View". Simply click on the message's "Subject" entry in the HISTORY page of the POPFile UI and POPFile will display the message together with the name of the bucket used to classify the message. If POPFile reports that the message would now be classified correctly then there is no need to reclassify this message.

            • Message #2000

              Great information. Thank you.

              One thing remains confusing. My original post was about similar messages sent to multiple recipients being tagged as spam for all recipients except one. The explanation was that the headers are taken into consideration during classification. It is difficult to imagine that a different To: header affects the classification, but it obviously does, which leaves me looking for additional clarification regarding the following statement in your last response.


              "There is no need to reclassify every message that POPFile has failed to classify correctly.

              For example if several similar spam messages were not marked as 'spam' it is often only necessary to reclassify one of these as 'spam'."

              Is this statement applicable when POPFile is processing for multiple users?

              • Message #2003

                It is difficult to imagine that a different To: header affects the classification

                POPFile does much more than just look at the "From:", "To:" and "Subject:" lines in the message headers. Have another look at Single Message View and compare the other header lines. You can also scroll down the page and find various links that will display much more information about how POPFile reached its decision.

                Is this statement applicable when POPFile is processing for multiple users?

                You have not given much information about how you are handling the email (I did ask earlier...).

                If you use Single Message View and a previously classified message would now be classified differently (as a result of some reclassifying performed since the message was originally classified) then POPFile will inform you.

                • Message #2004

                  I am using POPFile in conjunction with Mercury/32 and the POPFileD daemon. Mercury POPs from 12 domain-hosted mailboxes. I am wondering what the best way is to reclassify messages that appear identical except for the To: field. Is it enough to reclassify just one of them or should each one be reclassified? This question is specific to retraining (a new db).

                  • Message #2005

                    what the best way is to reclassify messages that appear identical except for the To: field. Is it enough to reclassify just one of them or should each one be reclassified?

                    When I was faced with a situation like this I used the following procedure:

                    (1) Use a filter to display only the "similar" messages on the HISTORY page (as this makes things much easier)

                    (2) Reclassify one of these filtered messages (if their sizes vary a lot pick one of the larger messages)

                    (3) Use "Single Message View" to examine the remaining filtered messages one at a time and if POPFile does not report that the message would now be classified correctly then reclassify it.

                    This is the procedure I used to obtain the statistics quoted on the "How long will it take until POPFile will reach a decent accuracy?" page.

                    • Message #2006

                      Will give rebuilding the db a go first chance I get. Thank you so much for the assistance.
