Ticket #141 (new defect)

Opened 7 years ago

Last modified 4 years ago

Accented characters not handled properly by Windows version

Reported by: brian
Assigned to:
Priority: normal
Milestone:
Component: Database
Version: 1.1.1
Severity: normal
Keywords: stopwords foreign accents
Cc:

Description

This ticket has been raised in response to the HELP forum topic "(encoding) french accent".

The Windows version of POPFile is happy to add (and remove) accented STOPWORDS via the ADVANCED page in the UI but the equivalent Ubuntu version refuses to accept these words and displays this error message:

Ignored words can only contain alphanumeric, ., _, -, or @ characters

Tested this using the word 'école' with POPFile 1.0.1 on Ubuntu 9.04 (32-bit) and Windows 7 (64-bit). The Windows version of POPFile 1.1.1 also accepts accented STOPWORDS (on Windows 7 64-bit).

The Windows and Ubuntu versions use the same code to check whether a stopword is valid, but they seem to get different results (see WordMangle.pm for the 1.0.1 release; the 1.1.1 release uses similar code).

When I examined the word lists for my 'spam' bucket using the Windows version of POPFile 1.1.1, the index included several accented characters, such as á à â ä ã and å. However, when I displayed the word list for one of these accented letters, the UI showed words which did not start with the selected letter:

Word Table for spam
é   ìçá    16   ëçá     5   ñãè     4   óçð     4   ýèïö    4   ààààüà  3
    ààÿûð   3   ðóáëåé  3   ïßàƒ    3   õëø     3   þïµí    3   ÿâa     3
    ÿÿÿ     3   ÿÿÿÿÿ   3   âèá     2   æèá     2   àupx    2   ðãëÿo   2

(My corpus is about 5 years old now and is the result of updates from many versions of POPFile so it is not really surprising that the wordlists are in a mess. Perhaps I need to start from scratch again?)

Change History

10/26/13 21:53:40 changed by amatubu

The cause of the difference between Windows and Ubuntu may be the 'locale'.

The current version of POPFile uses the system's locale to check whether a character is valid in a word.
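
To illustrate how a locale-dependent check can accept 'école' on one system and reject it on another, here is a rough Python simulation. It is not POPFile's code (POPFile is written in Perl). The helper names and the charsets chosen to stand in for each system's locale are assumptions made for this sketch; the allowed punctuation is taken from the error message quoted in the description.

```python
# Simulation only: each "locale" is represented by the charset it would use
# to interpret single bytes. Both function names are hypothetical.

def byte_is_word_char(byte: int, locale_charset: str) -> bool:
    """Emulate a byte-wise, locale-dependent test for a valid word character."""
    try:
        char = bytes([byte]).decode(locale_charset)
    except UnicodeDecodeError:
        # The byte is not a complete character in this charset.
        return False
    return char.isalnum() or char in "._-@"


def accepted(word_bytes: bytes, locale_charset: str) -> bool:
    """A word is accepted only if it is non-empty and every byte passes."""
    return len(word_bytes) > 0 and all(
        byte_is_word_char(b, locale_charset) for b in word_bytes
    )


latin1_word = "école".encode("latin-1")  # b"\xe9cole"
utf8_word = "école".encode("utf-8")      # b"\xc3\xa9cole"

print(accepted(latin1_word, "cp1252"))  # True: 0xE9 is the letter é
print(accepted(latin1_word, "ascii"))   # False: 0xE9 is not an ASCII character
print(accepted(utf8_word, "utf-8"))     # False: 0xC3 alone is half a character
```

The same validation logic gives three different answers depending only on how the environment interprets the bytes, which is consistent with the locale explanation above but does not prove that this is what happens inside POPFile.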

A future version (maybe v2?) of POPFile should support Unicode by converting character encodings to UTF-8. The characters usable in a word would then be the same in any environment (OS and language settings).

Naoki
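
As a minimal sketch of the Unicode approach suggested above (in Python rather than POPFile's Perl, and not taken from any POPFile release): decode the submitted bytes to text first, then validate the decoded characters with a Unicode-aware pattern. The function name `is_valid_stopword` and its `encoding` parameter are hypothetical; the allowed character set follows the error message quoted in the description.

```python
import re

# On str patterns, Python's \w is Unicode-aware and does not depend on the
# operating system's locale. The pattern mirrors the rule in the error
# message: alphanumeric, ., _, -, or @ characters.
UNICODE_WORD = re.compile(r"[\w.@-]+")

# For comparison: re.ASCII limits \w to [A-Za-z0-9_].
ASCII_WORD = re.compile(r"[\w.@-]+", re.ASCII)


def is_valid_stopword(raw: bytes, encoding: str = "utf-8") -> bool:
    """Decode the input to text, then validate the decoded characters."""
    try:
        word = raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return UNICODE_WORD.fullmatch(word) is not None


print(is_valid_stopword("école".encode("utf-8")))  # True
print(ASCII_WORD.fullmatch("école") is None)       # True: ASCII-only check rejects it
```

Because validation runs on decoded text, the result depends on the declared input encoding rather than on the OS or its language settings.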