Ticket #139 (new defect)

Opened 3 years ago

Last modified 3 years ago

Email character encoding not respected in POPFile UI (cyrillic doesn't show)

Reported by: valexiev
Assigned to: admin
Priority: high
Milestone:
Component: User Interface
Version: 1.1.0
Severity: normal
Keywords:
Cc:

Description

In the "History" tab of the POPFile UI, messages are displayed with the wrong character encoding. This happens with the KOI8-R and UTF-8 charsets, at least when the message uses quoted-printable encoding. I imagine this also affects corpus construction. Two examples follow:

1. Subject: =?koi8-r?B?89LF3cEgMiAgzcHS1CDyxcfMwc3FztQg2sEg0NLPxsXTyc/OwczOzyDSwdo=?= =?koi8-r?B?18nUycU=?=
Content-Type: multipart/alternative; boundary="----=_NextPart_000_0107_01CAB9B3.129DB390"


Content-Type: text/plain; charset="koi8-r"
Content-Transfer-Encoding: quoted-printable

úÄÒÁ×ÅÊÔÅ ËÏÌÅÇÉ,

I see Subject: óÒÅÝÁ 2 ÍÁÒÔ òÅÇÌÁÍÅÎÔ ÚÁ ÐÒÏÆÅÓÉÏÎÁÌÎÏ ÒÁÚ×ÉÔÉÅ úÄÒÁ×ÅÊÔÅ ËÏÌÅÇÉ, ðÏ ÐÒÅÄÌÏÖÅÎÉÅ

Instead I should see: Subject: Среща 2 март Здравейте колеги, По предложение

2. Subject: =?UTF-8?B?W0pJUkFdIEFzc2lnbmVkOiAoUE9CRFctMQ==?= =?UTF-8?B?MDQpINCh0YrQt9C00LDQstCw0L3QtSDQvtC/0LjRgdCw0L3QuNC1INC30LAg?= =?UTF-8?B?0LjQt9Cy0LvQuNGH0LDQvdC10YLQviA=?= =?UTF-8?B?0L3QsCDQv9GA0LXQv9C40YHQutC4INGB?= =?UTF-8?B?INGA0LDQt9C70LjRh9C90LjRgtC1INGB?= =?UTF-8?B?0YLQsNGC0YPRgdC4INC+0YIg0KPQmNCh?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Създаване описание за извличането на преписки с различните

статуси от УИС

I see Subject: [JIRA] Assigned: (POBDW-104) Създаване описание за извличането на преписки с различните статуси от УИÐ

Създаване описание за извличането на преписки с различните

статуси от УИС

Instead I should see: [JIRA] Assigned: (POBDW-104) Създаване описание за извличането на преписки с различните статуси от УИС

Създаване описание за извличането на преписки с различните статуси от УИС
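For reference, the Subject headers above are RFC 2047 encoded-words, which name their own charset. The following is a minimal Python sketch (not POPFile code, which is written in Perl) showing that the header from example 1 decodes to the expected Cyrillic text once the declared charset is honoured. The helper name decode_subject is hypothetical.

```python
from email.header import decode_header, make_header


def decode_subject(raw_subject: str) -> str:
    """Decode RFC 2047 encoded-words into a Unicode string,
    using the charset declared inside each encoded-word."""
    return str(make_header(decode_header(raw_subject)))


# Subject header copied from example 1 in this ticket.
raw = (
    "=?koi8-r?B?89LF3cEgMiAgzcHS1CDyxcfMwc3FztQg2sEg0NLPxsXTyc/OwczOzyDSwdo=?= "
    "=?koi8-r?B?18nUycU=?="
)

subject = decode_subject(raw)
print(subject)
```

The whitespace between the two adjacent encoded-words is dropped during decoding, so the word split across them is joined back together.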

Change History

03/03/10 13:51:26 changed by brian

Although the default language is (American) English, the POPFile UI supports over 30 other languages, including Russian and Ukrainian. Some translations are incomplete, so some text may still appear in English; we rely on users to supply corrections and updates.

The "Configuration" page in the UI lets you select the UI language. This language selection can also affect how POPFile analyses email messages (but I am not sure if this applies to Russian).

Are you using the latest version of POPFile (1.1.1)?

Which language have you set the UI to use?

Brian

03/03/10 17:45:50 changed by valexiev

The UI appears ok in Cyrillic languages. But email encoding is still not respected.

I tried two other UI languages and checked the page encoding (shown in parens). Example 2 above looks like this:

* Russian (KOI8-R): [JIRA] Assigned: (POBDW-104) п║я┼п╥п╢п╟п╡п╟п╫п╣ п╬п©п╦я│п╟п╫п╦п╣ п╥п╟ п╦п╥п╡п╩п╦я┤п╟п╫п╣я┌п╬ п╫п╟ п©я─п╣п©п╦я│п╨п╦ я│ я─п╟п╥п╩п╦я┤п╫п╦я┌п╣ я│я┌п╟я┌я┐я│п╦ п╬я┌ пёп≤п║

* Bulgarian (windows-1251): [JIRA] Assigned: (POBDW-104) Създаване описание за извличането на преписки с различните статуси от У�С
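The Russian (KOI8-R) output above looks like UTF-8 bytes being rendered with the page's KOI8-R charset. A small Python sketch, assuming that is what happens, reproduces the start of the garbled text from the first word of the subject:

```python
# First word of the subject in example 2.
title = "Създаване"

# The email carries the text as UTF-8 bytes.
utf8_bytes = title.encode("utf-8")

# A page declared as KOI8-R makes the browser read those bytes as KOI8-R.
as_koi8r = utf8_bytes.decode("koi8-r")
print(as_koi8r)

# The damage is reversible as long as the bytes are untouched:
recovered = as_koi8r.encode("koi8-r").decode("utf-8")
print(recovered == title)
```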

Question: does POPFile use Unicode internally? If not, how can it hope to properly handle emails and email parts in different encodings? At the very least, it won't be able to train on the same word written in two different encodings.

If the answer is YES, I think the right approach would be:

* Email parts should be transcoded to Unicode for uniformity

* POPFile UI should always use UTF-8 instead of language-specific encodings.

Then text from different emails embedded in the UI will display correctly.
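A rough Python sketch of the two steps proposed above, under the assumption that each part's charset and Content-Transfer-Encoding have already been read from its headers. The function names part_to_unicode and render_for_ui are hypothetical and are not part of POPFile.

```python
import base64
import quopri


def part_to_unicode(payload: bytes, charset: str,
                    transfer_encoding: str = "7bit") -> str:
    """Undo the transfer encoding, then decode the bytes with the
    charset declared in the part's Content-Type header."""
    cte = transfer_encoding.lower()
    if cte == "quoted-printable":
        payload = quopri.decodestring(payload)
    elif cte == "base64":
        payload = base64.b64decode(payload)
    # Undecodable bytes become U+FFFD instead of raising an error.
    return payload.decode(charset, errors="replace")


def render_for_ui(text: str) -> bytes:
    """Always emit UTF-8, whatever the UI language is."""
    return text.encode("utf-8")


greeting = "Здравейте колеги,"

# The same text as it would arrive in two differently encoded emails.
koi8_part = quopri.encodestring(greeting.encode("koi8-r"))
utf8_part = quopri.encodestring(greeting.encode("utf-8"))

a = part_to_unicode(koi8_part, "koi8-r", "quoted-printable")
b = part_to_unicode(utf8_part, "utf-8", "quoted-printable")
print(a == b == greeting)
print(render_for_ui(a) == render_for_ui(b))
```

With this approach both emails produce identical bytes in the UI page, so a single page charset (UTF-8) is enough.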

03/03/10 17:47:30 changed by valexiev

language selection can also affect how POPFile analyses email messages

This seems wrong. Many people get emails in several languages, and certainly spam is not unilingual. UI language preference should not affect the internal working of POPFile.

03/03/10 19:12:47 changed by brian

language selection can also affect how POPFile analyses email messages

This seems wrong. Many people get emails in several languages, and certainly spam is not unilingual. UI language preference should not affect the internal working of POPFile.

POPFile uses a naïve Bayes algorithm to classify email. In other words, POPFile uses statistics to track which words are likely to appear in which messages. Japanese words are not separated by spaces (the English equivalent would look like "Japanesewordsarenotseparatedbyspaces"). This makes it harder for POPFile to classify messages.

Therefore POPFile uses a special parser to split Japanese text into words to allow the text to be analysed properly. POPFile currently offers a choice of three Nihongo parsers. If the UI language is set to 'Nihongo' (i.e. Japanese) then POPFile will use the selected parser to split the text into 'words' before it is analysed.
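To illustrate why word splitting matters, here is a toy word-count naive Bayes classifier in Python. It is only a sketch of the general technique with add-one smoothing, not POPFile's classifier or any of its Nihongo parsers; the class name, bucket names and training texts are made up.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split on whitespace, which only works for space-separated languages."""
    return text.lower().split()


class TinyNaiveBayes:
    def __init__(self) -> None:
        self.word_counts: dict[str, Counter] = {}  # bucket -> word frequencies
        self.doc_counts: Counter = Counter()       # bucket -> messages seen

    def train(self, bucket: str, text: str) -> None:
        self.word_counts.setdefault(bucket, Counter()).update(tokenize(text))
        self.doc_counts[bucket] += 1

    def classify(self, text: str) -> str:
        vocab = set().union(*self.word_counts.values())
        total_docs = sum(self.doc_counts.values())
        best_bucket, best_score = "", -math.inf
        for bucket, counts in self.word_counts.items():
            # log prior + sum of log likelihoods with add-one smoothing
            score = math.log(self.doc_counts[bucket] / total_docs)
            denom = sum(counts.values()) + len(vocab)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket


nb = TinyNaiveBayes()
nb.train("spam", "cheap pills buy now")
nb.train("inbox", "meeting agenda for monday")
print(nb.classify("buy cheap pills"))

# Without spaces the whole sentence collapses into a single "word".
print(len(tokenize("Japanesewordsarenotseparatedbyspaces")))
```

A single giant token almost never repeats between messages, so the word statistics carry no signal until the text is segmented first.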

I believe some extra work is done if 'Chinese' or 'Korean' is selected for the UI language.

The Windows installer installs some extra Perl packages when 'Nihongo' is selected, including some which handle character encoding. I suppose it is possible that some of them may also be required in your case.

I don't know much about this and I am not a Perl programmer so looking at the code does not tell me much.

Brian

03/04/10 08:48:27 changed by valexiev

If the UI language is set to 'Nihongo' (i.e. Japanese) then POPFile will use the selected parser

The Windows installer installs some extra Perl packages when 'Nihongo' is selected

I checked the source and you are right on both counts. Which means it won't quite work if one sets 'Nihongo' in the UI, but had not selected it in the installer. (One has to rerun the installer to Add the 'Nihongo' parser.)

Another problem:

* many modules do "use locale", which makes them use the locale taken from environment variables (e.g. LANG). This affects sorting, case conversion and regexps (e.g. the [[:alpha:]] character class).

* but the locale is not updated if the language is set in the UI. The only place setlocale is used is this stopgap in Bayes.pm:

# In Japanese or Korean or Chinese mode, explicitly set LC_COLLATE and LC_CTYPE to C. This is to avoid Perl crash on Windows

The really big problem is that POPFile doesn't use Unicode and doesn't interpret character-encoding headers in email. This means that it can't interpret mixed-language emails quite right. Luckily the classifier still works, even though it interprets data as bytes instead of characters. But the following cannot be fixed without a major overhaul:

* email char encoding is lost, so the UI displays wrong chars in History and message display

* words that differ only by case are not recognized as the same word by the classifier

* the same word written in different encodings is not recognized as the same word by the classifier
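The last two points can be demonstrated with a few lines of Python. This is a sketch of the byte-versus-character problem in general, not of POPFile's tokenizer; the normalize helper is hypothetical.

```python
word = "Колеги"

# The same word produces different byte strings in different encodings,
# so a byte-based classifier counts them as unrelated tokens.
koi8 = word.encode("koi8-r")
utf8 = word.encode("utf-8")
print(koi8 == utf8)           # False

# Byte-level lowercasing only touches ASCII letters, so Cyrillic
# case differences survive untouched.
print(utf8.lower() == utf8)   # True


def normalize(raw: bytes, charset: str) -> str:
    """Decode with the declared charset, then case-fold as characters."""
    return raw.decode(charset).casefold()


# After decoding and case folding, every variant maps to one token.
tokens = {
    normalize(koi8, "koi8-r"),
    normalize(utf8, "utf-8"),
    normalize("колеги".encode("cp1251"), "cp1251"),
}
print(tokens)                 # {'колеги'}
```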

03/04/10 14:06:39 changed by brian

it won't quite work if one sets 'Nihongo' in the UI, but had not selected it in the installer. (One has to rerun the installer to Add the 'Nihongo' parser.)

This is a compromise. The 'internal' parser is small and fast but not very accurate; 'Kakasi' is better and only a few MB in size while 'MeCab' is more accurate than 'Kakasi' but adds about 50 MB to the installation folder.

Brian