Word table of top 100 distinct words

Is there a way to get a list of the top 100 distinct words in a particular bucket, sorted by the number of occurrences for each word?

I'm assuming there's no function for this currently in the UI. I do know SQL, so that may be the only possible way. Any suggestions on how to do this? Thanks.

  • Message #1726

    I think Scott Leighton's "Top Ten" script (written in Perl) will do what you want, or at least point you in the right direction.

    Creates an HTML report showing the Top X words in each corpus bucket ranked by Simple Word Count and by Probability. Accepts a command-line option to configure the Top number of words to display (the default value is 10).

    POPFile topten Utility (Enhanced Version)

    Sample report (showing the top 50)
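Since the original question mentions SQL, here is a rough sketch of the "ranked by Simple Word Count" idea as a query against an SQLite database. The table and column names (`buckets`, `words`, `matrix`, `times`) and the helpers `top_words` and `build_demo_db` are assumptions made up for this illustration, not taken from the script or confirmed against the POPFile schema, so check your own database before relying on it. The demo runs against a small in-memory database with invented rows.

```python
import sqlite3

# Hypothetical minimal schema for illustration only.
# The real POPFile database may use different tables and columns.
SCHEMA = """
CREATE TABLE buckets (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE words   (id INTEGER PRIMARY KEY, word TEXT);
CREATE TABLE matrix  (wordid INTEGER, bucketid INTEGER, times INTEGER);
"""

# Top N words in one bucket, most frequent first.
# Ties are broken alphabetically so the order is deterministic.
TOP_WORDS_SQL = """
SELECT words.word, matrix.times
FROM matrix
JOIN words   ON words.id = matrix.wordid
JOIN buckets ON buckets.id = matrix.bucketid
WHERE buckets.name = ?
ORDER BY matrix.times DESC, words.word ASC
LIMIT ?
"""


def top_words(conn, bucket, limit=100):
    """Return a list of (word, count) tuples for the named bucket."""
    return conn.execute(TOP_WORDS_SQL, (bucket, limit)).fetchall()


def build_demo_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO buckets VALUES (?, ?)",
                     [(1, "spam"), (2, "inbox")])
    conn.executemany("INSERT INTO words VALUES (?, ?)",
                     [(1, "viagra"), (2, "meeting"), (3, "free")])
    conn.executemany("INSERT INTO matrix VALUES (?, ?, ?)",
                     [(1, 1, 40), (3, 1, 75), (3, 2, 5), (2, 2, 30)])
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    for word, times in top_words(conn, "spam", limit=100):
        print(f"{times:6d}  {word}")
```

To try this on a real database, replace `build_demo_db()` with `sqlite3.connect(...)` pointing at a copy of your own database file, and adjust the query to match the actual schema.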

    Although the script was written over 9 years ago, it needs only a tiny change to work properly with the current release of POPFile. Here is the revised script I use:

    and here is a complete list of the changes I made:

    > # Revised by Brian Smith:
    > #         Feb 19, 2011 - Updated to work with v 0.22.0 or later ('skins' structure changed)
    <     if ( open FILE, "<$root" . 'skins/' . $config{'html_skin'} . '.css' ) {
    >     if ( open FILE, "<$root" . 'skins/' . $config{'html_skin'} . '/style.css' ) {

    [Edited to fix the copy-and-paste error that turned my initial reply into garbage!]

    [Edited (again) to fix the garbled attempt at fixing things]

    • Message #1728

      Thank you, Brian! I'll take a look at it.