This ticket has been raised in response to the HELP forum topic "(encoding) french accent"
The Windows version of POPFile is happy to add (and remove) accented STOPWORDS via the ADVANCED page in the UI, but the equivalent Ubuntu version refuses to accept these words and displays this error message:
Ignored words can only contain alphanumeric, ., _, -, or @ characters
Tested this using the word 'école' with POPFile 1.0.1 on Ubuntu 9.04 (32-bit) and Windows 7 (64-bit). The Windows version of POPFile 1.1.1 also accepts accented STOPWORDS (on Windows 7 64-bit).
The Windows and Ubuntu versions use the same code to check whether a stopword is valid, yet they seem to get different results (see WordMangle.pm for the 1.0.1 release; the 1.1.1 release uses similar code).
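One possible explanation, which I have not confirmed, is that the set of characters treated as "alphanumeric" depends on the platform's locale or encoding settings rather than on the code itself. The Python sketch below is only an illustration of that idea and is not POPFile's Perl code: the function `is_valid_stopword` and its `ascii_only` switch are hypothetical names, and the switch merely stands in for whatever platform setting changes the meaning of the character class.

```python
import re

# Hypothetical stand-in for the stopword check; this is NOT POPFile's code.
# The allowed set mirrors the error message: alphanumeric, ., _, -, or @.
ALLOWED_UNICODE = re.compile(r"[\w.@-]+")          # \w includes accented letters
ALLOWED_ASCII = re.compile(r"[\w.@-]+", re.ASCII)  # \w limited to A-Z, a-z, 0-9, _


def is_valid_stopword(word: str, ascii_only: bool) -> bool:
    """Return True if every character of `word` is in the allowed set.

    `ascii_only` is an assumption used to imitate a platform on which
    accented letters are not classified as alphanumeric.
    """
    pattern = ALLOWED_ASCII if ascii_only else ALLOWED_UNICODE
    return pattern.fullmatch(word) is not None


if __name__ == "__main__":
    for word in ("ecole", "école", "user@example.com"):
        print(word, is_valid_stopword(word, False), is_valid_stopword(word, True))
```

The same pattern accepts 'école' in one mode and rejects it in the other, which is the kind of difference seen between the Windows and Ubuntu installations.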
When I examined the wordlists for my 'spam' bucket using the Windows version of POPFile 1.1.1, the index included several accented characters, such as á à â ä ã and å. However, when I displayed the word list for one of these accented letters, the UI showed words that did not start with the selected letter:
Word Table for spam
é ìçá 16 ëçá 5 ñãè 4 óçð 4 ýèïö 4 ààààüà 3
ààÿûð 3 ðóáëåé 3 ïßàƒ 3 õëø 3 þïµí 3 ÿâa 3
ÿÿÿ 3 ÿÿÿÿÿ 3 âèá 2 æèá 2 àupx 2 ðãëÿo 2
(My corpus is about 5 years old now and is the result of updates from many versions of POPFile so it is not really surprising that the wordlists are in a mess. Perhaps I need to start from scratch again?)
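A possible clue about the state of the corpus: at least one of the entries above looks like text from a different single-byte encoding being displayed as Western European characters. The Python sketch below reinterprets the bytes behind a displayed word; the helper name `reinterpret` and the choice of code pages (cp1252 for display, cp1251 for the original text) are my assumptions, not something taken from POPFile. It does not explain why the words are filed under the wrong index letter.

```python
def reinterpret(text: str, shown_as: str = "cp1252", stored_as: str = "cp1251") -> str:
    """Recover the bytes behind `text` and decode them with another code page.

    Both encodings are assumptions: `shown_as` is the code page the UI
    appears to have used, `stored_as` is a guess at the original one.
    """
    return text.encode(shown_as).decode(stored_as)


if __name__ == "__main__":
    # 'ðóáëåé' is one of the words from the spam word table above.
    print(reinterpret("ðóáëåé"))  # prints: рублей
```

Read this way, 'ðóáëåé' becomes the Russian word for "rubles", which would be unsurprising in a spam bucket. Other entries may or may not decode as cleanly.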