Training

In the topic "When will POPFile be trained?" it says that training only happens when the user reclassifies a message.
And in "How long will it take until POPFile will reach a decent accuracy?" I saw that POPFile needs about 500 messages to reach very good accuracy.

I have the following question:

Does POPFile need these messages to be classified or reclassified? I mean, it's pretty difficult to manually choose and click all these messages. (I'm not considering the experimental IMAP module, where it supposedly happens when messages are moved between IMAP folders.)

  • Message #1390

    When you first start using POPFile it does not know how to classify your mail. You need to reclassify some mail to each of the buckets you have created.

    For example if you create one bucket for spam and another for good mail then you must reclassify some mail to each of these buckets in order to teach POPFile the difference between spam and good mail. Once you have done this POPFile will start to learn how to classify your mail.

    The reference to achieving very good accuracy after about 500 messages does not mean you need to reclassify every one of the first 500 messages. It just means that after POPFile has processed about 500 messages many users find the accuracy is over 95%.

    The "How long will it take until POPFile will reach a decent accuracy?" page has further details, including a footnote showing how quickly POPFile can learn (out of 4,000 messages I only had to reclassify a total of 36 messages to achieve 99.1% accuracy).

    Some users have configured POPFile to report some simple statistics and these are summarised on the Real-Time Statistics page (updated every two hours).

    Brian

    • Message #1391

      Oh, now I see the footnote. I hadn't noticed it, my bad.
      But I'm curious now - what does the classification model get from all the messages that are only classified (not reclassified)? Why do I need such a large number of messages?

      (out of 4,000 messages I only had to reclassify a total of 36 messages to achieve 99.1% accuracy).

      For example, if you receive and reclassify only these 36 messages, will the other 3,964 messages then be classified the same way with 99.1% accuracy? (assuming they arrive later)

      PS: Sorry for dumb questions and bad English.
      Thanks for the fast reply.
      I really appreciate it.

      • Message #1392

        PS: Sorry for dumb questions and bad English.

        There is no need to apologise. Some things are not explained clearly in the documentation. If you have trouble understanding my replies let me know.

        But I'm curious now - what does the classification model get from all the messages that are only classified (not reclassified)? Why do I need such a large number of messages?

        Sorry, I am not sure what you mean here.

        POPFile can give good results after only a few messages have been processed. After about 100 messages had been processed my POPFile accuracy was around 75%. I did not have to reclassify all of those messages.

        For example, if you receive and reclassify only these 36 messages, will the other 3,964 messages then be classified the same way with 99.1% accuracy? (assuming they arrive later)

        I think I have managed to confuse you. I will try to explain it in more detail.

        (1) I installed POPFile and created some buckets.

        (2) When I checked for new email POPFile marked the messages as "unclassified". This is because POPFile has no idea what type of message belongs to each of my buckets.

        (3) Using the HISTORY page I started to reclassify these messages. This step teaches POPFile the kind of message that belongs to each bucket. If step (2) downloaded 100 messages there is normally no need to reclassify every one of these messages.

        For example once you have reclassified a message to a particular bucket there is often no need to reclassify a similar message to the same bucket.

        POPFile's "Single Message View" can be used to check if POPFile has learned enough to be able to correctly classify the message. If an "unclassified" message would now be correctly classified then there is no need to reclassify it.

        (4) The next time you check for new mail POPFile should make fewer mistakes because it is starting to learn how to classify your mail. You need to reclassify some mail to every bucket to teach POPFile how to classify your mail.

        (5) As POPFile learns how to classify your mail, the number of times you need to reclassify messages should reduce.

        The footnote on the wiki page tries to demonstrate this. Here is another way to look at the data in the footnote:

        | Total Number of messages | Reclassification Total | Accuracy |
        +--------------------------+------------------------+----------+
        |        1,000             |          21            |   97.9%  |
        |        2,000             |          28            |   98.6%  |
        |        4,000             |          36            |   99.1%  |
        

        I did not wait until I had received 1,000 messages before I started to reclassify them. Each time I received mail I checked if any messages needed to be reclassified.

        In the first 2,000 messages received I had to reclassify a total of 28 messages but in the next 2,000 messages only 8 had to be reclassified.

        Brian

        • Message #1393

          Thank you very much for such a detailed answer.
          Now I completely understand.

          • Message #1394

            Glad I was able to help. Hope you find POPFile useful.
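
A note on the figures quoted in message #1392: the accuracy column appears to be the share of messages that did not need to be reclassified. That formula is an assumption (the thread does not state it), but it reproduces all three rows of the table. The short Python sketch below checks the arithmetic and also shows how many corrections fell into each batch of mail. The function and variable names are made up for this illustration and are not part of POPFile.

```python
# Cumulative figures quoted from the footnote table in message #1392:
# (total messages received, total messages reclassified so far)
FOOTNOTE_DATA = [(1000, 21), (2000, 28), (4000, 36)]


def accuracy_percent(total, reclassified):
    """Assumed formula: percentage of messages that needed no correction."""
    return 100.0 * (total - reclassified) / total


def interval_corrections(rows):
    """Turn cumulative rows into (messages in batch, corrections in batch)."""
    result = []
    prev_total, prev_reclassified = 0, 0
    for total, reclassified in rows:
        result.append((total - prev_total, reclassified - prev_reclassified))
        prev_total, prev_reclassified = total, reclassified
    return result


if __name__ == "__main__":
    for total, reclassified in FOOTNOTE_DATA:
        print(f"{total:>5} messages, {reclassified:>2} reclassified: "
              f"{accuracy_percent(total, reclassified):.1f}% accurate")
    for received, corrected in interval_corrections(FOOTNOTE_DATA):
        print(f"{corrected} corrections in a batch of {received} messages")
```

Read this way, 21 of the 36 corrections happened in the first 1,000 messages, which matches the point made in the thread that the need to reclassify drops as POPFile learns.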