[imp] Message-Display ignores MIME charset

Otto Stolz Otto.Stolz@uni-konstanz.de
Fri, 16 Nov 2001 12:04:47 +0100


Hello,

I had written:

> IMP [...] apparently does ignore the MIME charset label: for the
> display, IMP undoes the content-transfer-encoding, but it tells the
> browser that the text be in ISO 8859-1 encoding. So IMP effectively
> transfers bits rather than characters, cf. example [stripped].


Chuck Hagenbuch wrote:
> The main problem is having the message in a charset different from
> the rest of the UI.


In the general case, the incoming message (of arbitrary provenance!)
will not match the UI encoding.
- My customers may well receive messages in a language for which there
   is no UI translation available -- and not even desirable (consider
   the Ancient Greek example).
- Many languages use various encodings (cf. infra), and it is not
   feasible to provide a UI translation in various encodings and let
   the user choose the correct one for every single message.

Chuck Hagenbuch wrote:
 > If you use IMP in zh_TW, any messages you get in a traditional chinese
 > charset will show up appropriately.

This will probably not work with a message you get in a Traditional
Chinese charset, encoded as EUC-TW (aka ISO-2022-TW or CNS 11643).
(I say "probably", as I cannot read Chinese, so I do not want to take
the pains to verify my conjecture.)

Like many other languages, zh_TW can come in various mutually
incompatible encodings, viz. EUC-TW (aka ISO-2022-TW or CNS 11643)
or Big5, cf. <http://czyborra.com/charsets/cjk.html> and
<http://www.ifcss.org:8001/www/pub/software/info/cjk-codes/>, or,
of course, Unicode (in the UTF-8, UTF-16, or UTF-32 encoding).
More examples:
- Russian uses KOI8-R, ISO 8859-5, CP 1251, Unicode, or occasionally
   CP866, cf. <http://czyborra.com/charsets/cyrillic.html>;
- German uses ISO 8859-1, CP 1252, Unicode, or occasionally CP 437,
   CP 850, ISO 8859-2 or CP 1250, cf.
   <http://czyborra.com/charsets/codepages.htm>.
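
To make the incompatibility concrete, here is a minimal sketch in
Python (chosen only for illustration; it is not IMP code). The codec
names are those of Python's standard library, and the sample word is
an assumption of this example. It encodes one Russian word in several
of the charsets listed above and then decodes KOI8-R bytes as
ISO 8859-1, which is the kind of mislabelling described at the top of
this message.

```python
# One Russian word, encoded in several of the charsets named above.
word = "Привет"
charsets = ("koi8_r", "iso8859_5", "cp1251", "cp866", "utf_8")

# Every charset yields a different byte sequence for the same characters.
encoded = {name: word.encode(name) for name in charsets}
for name, data in encoded.items():
    print(name, data.hex())

# Bytes in KOI8-R, but presented to the reader as ISO 8859-1:
raw = encoded["koi8_r"]
garbled = raw.decode("iso8859_1")   # wrong label: unreadable Latin-1 letters
restored = raw.decode("koi8_r")     # right label: the original word
print(garbled, restored)
```

The bytes are identical in both decodings; only the charset label
decides which characters the user sees.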

Clearly, the only way forward is to let Imp act sensibly, in accordance
with RFC 2047 <http://sunsite.dk/RFC/rfc/rfc2047.html>, i.e.
- display incoming messages according to their respective MIME charset
   labels;
- offer the user the chance to override erroneous MIME charset labels.
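
The two requirements above can be sketched as follows. This is a
Python illustration using the standard email package, not a patch for
IMP; the function name decode_body, its override_charset parameter,
and the sample message are assumptions of this example. It handles
only a single-part text message and falls back to US-ASCII when no
charset label is present.

```python
import base64
from email import message_from_bytes


def decode_body(raw_message, override_charset=None):
    """Decode a single-part text message into a Unicode string.

    The MIME charset label is honoured unless the caller (i.e. the
    user) supplies an override for a mislabelled message.
    """
    msg = message_from_bytes(raw_message)
    # decode=True undoes the content-transfer-encoding (base64, QP).
    payload = msg.get_payload(decode=True)
    charset = override_charset or msg.get_content_charset() or "us-ascii"
    # An unknown charset name raises LookupError; callers may catch it.
    return payload.decode(charset, errors="replace")


# Hypothetical sample: a KOI8-R body, base64 transfer-encoded.
sample = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=koi8-r\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    + base64.b64encode("Привет".encode("koi8_r"))
    + b"\r\n"
)

# The same body with an erroneous label, as a broken mailer might send it.
mislabeled = sample.replace(b"koi8-r", b"iso-8859-1")

print(decode_body(sample))                # label is correct
print(decode_body(mislabeled))            # label is wrong: garbled
print(decode_body(mislabeled, "koi8-r"))  # user override repairs it
```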

> Any suggestions on how to display charsets other than that of the
> rest of the UI would be welcome.

Two possibilities come to mind:

- Split the current message frame, imp/message.php3, into two frames, one
   for the GUI info and the other for the message proper. Specify the
   appropriate encoding for each, i.e. the encoding from
   imp/config/lang.php3 for the GUI frame, and the MIME charset of the
   incoming message for the message-proper frame.

- Use Unicode (UTF-8 encoding) throughout; provide the GUI in UTF-8 encoding
   and convert all incoming messages to UTF-8, according to their respective
   MIME charset label.

Pros and Cons:

Using the individual message's MIME charset has the following
advantages:
- probably less work to do in Imp, as the browser's charset handling
   is exploited;
- user can easily override erroneous charset labels via the browser's
   charset setting.
The disadvantage:
- Does not work with MIME-encoded header fields, cf. RFC 2047,
   section 2.
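
To illustrate why the browser's charset setting cannot cover header
fields: each RFC 2047 encoded-word carries its own charset, so two
fields of one message may differ. A minimal Python sketch, using the
standard email.header module; the helper name decode_header_value,
the sample fields, and the address at example.org are assumptions of
this example.

```python
from email.header import decode_header


def decode_header_value(value):
    """Decode RFC 2047 encoded-words in one header field to Unicode."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Unencoded stretches come back without a charset.
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(chunk)
    return "".join(parts)


# Two fields of one hypothetical message, each in a different charset.
subject = decode_header_value("=?koi8-r?B?8NLJ18XU?=")
sender = decode_header_value("=?iso-8859-1?Q?J=FCrgen?= <j@example.org>")
print(subject)
print(sender)
```

No single page-wide charset displays both fields correctly, whereas
decoding each encoded-word separately does.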

Converting everything to UTF-8 has the following advantages:
- Handling the encoding is confined to two or three places,
   viz. import from message servers (IMAP|POP) and export of
   messages (SMTP); all other code can ignore character encoding
   issues;
- uniform encoding throughout, including message folders
   (so they can easily be viewed, or edited, with other tools);
- works even if various header fields are encoded differently
   (which is an exception, but still legal).
The disadvantages:
- User needs a UTF-8 capable browser, even to view messages
   coming in more traditional code pages. This is not a problem
   with Netscape Navigator 4+ or Internet Explorer 4+, but it is
   with, e.g., Opera 2 and older NN and IE versions. This problem
   is likely to go away in the near future. (Opera is going to
   include UTF-8 handling in the next release, if I am not
   mistaken.)
- Handling of mislabeled messages is not easy to implement:
   the original message has to be kept for further conversion
   attempts, as in pathological cases it cannot be derived from
   the UTF-8 version (a possible optimization is to keep a copy
   only for these pathological cases, which can be recognized
   during the initial conversion to UTF-8).
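
The optimization mentioned in the last item might look like this. It
is a Python sketch under stated assumptions, not a design for IMP:
the function name convert_to_utf8 is hypothetical, and a "pathological
case" is taken here to mean any conversion that fails or does not
round-trip back to the original bytes. Note that a clean round trip
does not prove the label was right; it only guarantees that the
original bytes can be recovered from the UTF-8 version.

```python
def convert_to_utf8(raw, charset):
    """Return (utf8_bytes, kept_original).

    kept_original is None when the original bytes can be rebuilt from
    the UTF-8 version, and the untouched original otherwise.
    """
    try:
        text = raw.decode(charset)
        reversible = text.encode(charset) == raw
    except LookupError:
        # Unknown charset name: show what is readable as ASCII.
        text, reversible = raw.decode("ascii", errors="replace"), False
    except UnicodeError:
        # Bytes that are invalid in the declared charset.
        text, reversible = raw.decode(charset, errors="replace"), False
    return text.encode("utf-8"), (None if reversible else raw)


# Correctly labelled: nothing needs to be kept.
good_utf8, good_kept = convert_to_utf8("Привет".encode("koi8_r"), "koi8-r")

# CP 1251 bytes mislabelled as UTF-8: the original is kept, so a later
# attempt with the charset chosen by the user can still succeed.
bad_raw = "Привет".encode("cp1251")
bad_utf8, bad_kept = convert_to_utf8(bad_raw, "utf-8")
retry_utf8, retry_kept = convert_to_utf8(bad_kept, "cp1251")
```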

Before rushing to a decision, please note that
there is also a problem with sending messages, which I will
present early next week.


I cannot contribute patches, for lack of knowledge of Horde
internals (including PHP); however, I am willing to contribute
my expertise in the character encoding realm towards
a satisfactory solution of this pressing problem.

Best wishes,
   Otto Stolz