Opened 13 years ago
Last modified 10 years ago
#141 new defect
Accented characters not handled properly by Windows version
Reported by: | Brian Smith | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | Database | Version: | 1.1.1 |
Severity: | normal | Keywords: | stopwords foreign accents |
Cc: | | | |
Description
This ticket has been raised in response to the HELP forum topic "(encoding) french accent".
The Windows version of POPFile is happy to add (and remove) accented STOPWORDS via the ADVANCED page in the UI, but the equivalent Ubuntu version refuses to accept these words and displays this error message:
Ignored words can only contain alphanumeric, ., _, -, or @ characters
Tested this using the word 'école' with POPFile 1.0.1 on Ubuntu 9.04 (32-bit) and Windows 7 (64-bit). The Windows version of POPFile 1.1.1 also accepts accented STOPWORDS (on Windows 7 64-bit).
The Windows and Ubuntu versions use the same code to check whether a stopword is valid but seem to get different results (see WordMangle.pm for the 1.0.1 release; the 1.1.1 release uses similar code).
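One way identical validation code can give different answers is if the meaning of "alphanumeric" depends on the runtime environment. The following is a minimal sketch in Python, not POPFile's actual Perl code: the pattern and the `is_valid_stopword` helper are hypothetical stand-ins, and the `ascii_only` switch merely imitates an environment in which only ASCII letters count as word characters.

```python
import re

# Hypothetical stand-in for the stopword check; NOT taken from WordMangle.pm.
# Allowed: word characters plus '.', '@' and '-'.
ALLOWED = r"[\w.@-]+"


def is_valid_stopword(word: str, ascii_only: bool) -> bool:
    """Return True if every character of `word` is allowed.

    ascii_only=True  imitates an environment where only ASCII letters,
                     digits and '_' count as word characters.
    ascii_only=False imitates an environment where accented letters
                     also count as word characters.
    """
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(ALLOWED, word, flags) is not None


print(is_valid_stopword("école", ascii_only=True))   # False
print(is_valid_stopword("école", ascii_only=False))  # True
```

The same pattern and the same input are used in both calls; only the interpretation of the word-character class changes.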
When I examined the wordlists for my 'spam' bucket using the Windows version of POPFile 1.1.1, the index included several accented characters, such as á à â ä ã and å. However, when I displayed the word list for one of these accented letters, the UI showed words which did not start with the selected letter:
Word Table for spam: é
Word | Count |
---|---|
ìçá | 16 |
ëçá | 5 |
ñãè | 4 |
óçð | 4 |
ýèïö | 4 |
ààààüà | 3 |
ààÿûð | 3 |
ðóáëåé | 3 |
ïßàƒ | 3 |
õëø | 3 |
þïµí | 3 |
ÿâa | 3 |
ÿÿÿ | 3 |
ÿÿÿÿÿ | 3 |
âèá | 2 |
æèá | 2 |
àupx | 2 |
ðãëÿo | 2 |
(My corpus is about 5 years old now and is the result of updates from many versions of POPFile so it is not really surprising that the wordlists are in a mess. Perhaps I need to start from scratch again?)
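Some entries in the word table above look like text from another single-byte encoding being displayed as Latin-1. As a small sketch of that idea (an assumption about the corpus, not something confirmed in this ticket), the hypothetical helper `reinterpret` below takes a displayed word, recovers its bytes as Latin-1, and decodes them as Windows-1251 (Cyrillic) instead:

```python
def reinterpret(word: str, shown_as: str = "latin-1", actual: str = "cp1251") -> str:
    """Re-decode a word that was displayed with the wrong single-byte charset.

    shown_as: the charset the UI is assumed to have used for display.
    actual:   the charset the bytes are assumed to have been written in.
    Both defaults are guesses made for illustration only.
    """
    return word.encode(shown_as).decode(actual)


print(reinterpret("ðóáëåé"))  # рублей
```

Under that assumption, 'ðóáëåé' is the Russian word for "rubles", which would be unsurprising in a spam bucket. Not every entry in the table need follow this pattern.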
The cause of the difference between Windows and Ubuntu may be the 'locale'.
The current version of POPFile uses the system's locale to check whether a character is valid in a word.
A future version (maybe v2?) of POPFile should support Unicode (converting character encodings to UTF-8), so that the set of characters usable in words will be the same in any environment (OS and language settings).
Naoki
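The Unicode approach suggested above could look roughly like the following sketch. It is written in Python for illustration and is not POPFile code: the `to_unicode` and `is_valid_word` helpers and the allowed-character pattern are hypothetical, and the source charset is assumed to be known (for example from the message headers).

```python
import re
import unicodedata

# Hypothetical allowed-character pattern: Unicode word characters plus . @ -
WORD_CHARS = re.compile(r"[\w.@-]+")


def to_unicode(raw: bytes, charset: str) -> str:
    """Decode bytes from a known charset and normalise to NFC, so that the
    same word has one representation regardless of where it came from."""
    return unicodedata.normalize("NFC", raw.decode(charset))


def is_valid_word(raw: bytes, charset: str) -> bool:
    """Validate after decoding, so the result does not depend on OS locale."""
    return WORD_CHARS.fullmatch(to_unicode(raw, charset)) is not None


latin1_bytes = "école".encode("latin-1")  # 5 bytes
utf8_bytes = "école".encode("utf-8")      # 6 bytes

# Different byte sequences, same word once decoded.
print(to_unicode(latin1_bytes, "latin-1") == to_unicode(utf8_bytes, "utf-8"))  # True
print(is_valid_word(latin1_bytes, "latin-1"), is_valid_word(utf8_bytes, "utf-8"))  # True True
```

Because validation happens on the decoded text rather than on raw bytes, the answer is the same on any platform, provided the charset of the input is identified correctly.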