popfile creating TAGS

Hi,
I am using POPFile as a server to classify my PDF (and other) documents. This works like a charm.
I am now looking into the guts of the Bayes classifier to see what would be needed to use it to auto-generate tags for my documents.
The idea is to pick the n most probable buckets and use them as my tags.
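
To make the idea concrete, here is a minimal Python sketch of the selection step. It is not POPFile code: the function name `top_n_tags` and the dictionary of per-bucket probabilities are assumptions for illustration, standing in for whatever the classifier returns.

```python
import heapq

def top_n_tags(bucket_probs, n):
    """Return the n bucket names with the highest probability, best first.

    bucket_probs is a hypothetical mapping {bucket_name: probability}.
    """
    return heapq.nlargest(n, bucket_probs, key=bucket_probs.get)

if __name__ == "__main__":
    # Made-up probabilities for one document.
    probs = {"invoices": 0.62, "travel": 0.25, "manuals": 0.10, "misc": 0.03}
    print(top_n_tags(probs, 2))
```

If n is larger than the number of buckets, `heapq.nlargest` simply returns all of them in descending order.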

I see the following issues coming up:

Would the probability index work, assuming I set a minimum probability and return every bucket that is above it?
What happens if I then (or when the user selects/deselects tags) add/delete those word-counts in the matrix?
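
Roughly what I have in mind, as a Python sketch rather than anything taken from POPFile: the "matrix" is assumed to be one word counter per bucket, and `tags_above`, `add_tag` and `remove_tag` are hypothetical names. Note that if the probabilities are normalised across buckets, a document that fits several buckets shares the probability between them, so the threshold may need to be fairly low.

```python
from collections import Counter, defaultdict

def tags_above(bucket_probs, min_prob):
    """Return every bucket whose probability is at least min_prob, best first."""
    hits = [b for b, p in bucket_probs.items() if p >= min_prob]
    return sorted(hits, key=lambda b: bucket_probs[b], reverse=True)

def add_tag(matrix, bucket, words):
    """User selected a tag: add the document's word counts to that bucket."""
    matrix[bucket].update(words)

def remove_tag(matrix, bucket, words):
    """User deselected a tag: subtract the document's word counts again."""
    matrix[bucket].subtract(words)
    # Unary plus drops words whose count fell to zero or below.
    matrix[bucket] = +matrix[bucket]

if __name__ == "__main__":
    matrix = defaultdict(Counter)  # assumed layout: bucket -> word counts
    doc_words = ["flight", "hotel", "flight"]
    add_tag(matrix, "travel", doc_words)
    remove_tag(matrix, "travel", ["hotel"])
    print(dict(matrix["travel"]))
    print(tags_above({"travel": 0.5, "misc": 0.05, "invoices": 0.3}, 0.1))
```

The open question is what repeated adding and deleting of this kind does to the classifier's accuracy; the sketch only shows the bookkeeping.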

Would the (db & perl) lookup scale for 10's to 100's of buckets (and a word-index with three or more languages)?

Does anyone have any experience with (or even a better solution for) creating tags for documents?

Would that work be usable by anyone else?

Thanks for any response.
thilo

  • Message #1905

    What happens if I then (or when the user selects/deselects tags) add/delete those word-counts in the matrix?

    I am not sure what you mean. The "Is it possible to 'unbalance' your corpus?" page in the manual may be relevant.

    Some changes can adversely affect POPFile's accuracy. For example, I started using POPFile with its default 'stopwords' list to classify my mail into about 6 or 7 buckets, and after a few years decided to use an empty 'stopwords' list. This caused so many classification errors on types of mail that had been correctly classified for years that I had to delete the corpus and start again with an empty corpus and an empty 'stopwords' list. It did not take long for POPFile to achieve its usual high accuracy (see "How long will it take until POPFile will reach a decent accuracy?").

    Would the (db & perl) lookup scale for 10's to 100's of buckets

    There is, effectively, no maximum number of buckets. Some users have configured over 200 buckets - we know this because these users have enabled the option to send a daily statistics report. By default POPFile does NOT report any statistics.

    Each report consists of three values (the total number of buckets, the total number of messages that POPFile has classified and the total number of classification errors) and is sent once per day.

    These statistics are summarised on the POPFile Real-Time Statistics page.

    and a word-index with three or more languages

    POPFile can handle different languages; for example it can handle Japanese text in addition to Western languages.