[imp] Message-Display ignores MIME charset
Otto Stolz
Otto.Stolz@uni-konstanz.de
Fri, 16 Nov 2001 12:04:47 +0100
Hello,
I had written:
> IMP [...] apparently does ignore the MIME charset label: for the
> display, IMP undoes the content-transfer-encoding, but it tells the browser
> that the text be in ISO 8859-1 encoding. So IMP effectively transfers
> bits rather than characters, cf. example [stripped].
Chuck Hagenbuch wrote:
> The main problem is having the message in a charset different from
> the rest of the UI.
In the general case, the incoming message (of arbitrary provenience!)
will not match the UI encoding.
- My customers may well receive messages in a language for which there
is no UI translation available -- and not even desirable (consider
the Ancient Greek example).
- Many languages use various encodings (cf. infra), and it is not
feasible to provide a UI translation in various encodings and let
the user choose the correct one for every single message.
Chuck Hagenbuch wrote:
> If you use IMP in zh_TW, any messages you get in a traditional chinese
> charset will show up appropriately.
This will probably not work with a message you get in a traditional
Chinese charset, encoded as EUC-TW (aka ISO-2022-TW or CNS 11643).
(I say "probably", as I cannot read Chinese, so I do not want to take
the pains to verify my conjecture.)
Like many other languages, zh_TW can come in various, mutually
incompatible encodings, viz. EUC-TW (aka ISO-2022-TW or CNS 11643),
or Big5, cf. <http://czyborra.com/charsets/cjk.html> and
<http://www.ifcss.org:8001/www/pub/software/info/cjk-codes/>, or,
of course, Unicode (in UTF-8, UTF-16, or UTF-32 encoding).
More examples:
- Russian uses KOI8-R, ISO 8859-5, CP 1251, Unicode, or occasionally
CP 866, cf. <http://czyborra.com/charsets/cyrillic.html>;
- German uses ISO 8859-1, CP 1252, Unicode, or occasionally CP 437,
CP 850, ISO 8859-2 or CP 1250, cf.
<http://czyborra.com/charsets/codepages.htm>.
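To make the incompatibility concrete, here is a minimal Python sketch (Python is chosen only for illustration; IMP itself is PHP). It decodes one and the same byte string under several of the Russian charset labels listed above, plus ISO 8859-1, which is what IMP currently announces. The helper name decode_under is hypothetical.

```python
def decode_under(raw: bytes, charsets):
    """Decode the same bytes under each charset label (hypothetical helper)."""
    return {cs: raw.decode(cs, errors="replace") for cs in charsets}


# The word "Привет" as it would travel in a KOI8-R message body.
koi8_bytes = "Привет".encode("koi8_r")

views = decode_under(
    koi8_bytes, ["koi8_r", "iso8859_5", "cp1251", "cp866", "latin_1"]
)
for charset, text in views.items():
    print(f"{charset:10} {text}")
```

Only the koi8_r row is readable Russian; the latin_1 row shows what the user sees when the bytes are passed through with an ISO 8859-1 label.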
Clearly, the only way to go is to let IMP act sensibly, in accordance
with RFC 2047 <http://sunsite.dk/RFC/rfc/rfc2047.html>, i.e.
- display incoming messages according to their respective MIME charset
labels;
- offer the user the chance to override erroneous MIME charset labels.
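The two requirements above can be sketched in a few lines of Python, using the standard email package. This is an illustration under stated assumptions, not a proposal for the actual IMP code: the function name body_text, the override_charset parameter, and the ISO 8859-1 fallback are all hypothetical choices.

```python
import email


def body_text(raw_message: bytes, override_charset=None, fallback="iso-8859-1"):
    """Return the first text part, decoded by its MIME charset label.

    override_charset (hypothetical parameter) lets the user correct a
    wrong label; fallback is used when the label is missing or unknown.
    """
    msg = email.message_from_bytes(raw_message)
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            break
    else:
        return ""  # no text part at all
    # decode=True undoes the content-transfer-encoding (base64, QP, ...)
    payload = part.get_payload(decode=True) or b""
    charset = override_charset or part.get_content_charset() or fallback
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:  # label names a charset this runtime does not know
        return payload.decode(fallback, errors="replace")


sample = (
    b"Content-Type: text/plain; charset=koi8-r\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"=F0=D2=C9=D7=C5=D4"
)
print(body_text(sample))                              # label honoured
print(body_text(sample, override_charset="latin-1"))  # user override
```

The second call mimics the current behaviour: the transfer encoding is undone, but the bytes are then read as ISO 8859-1.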
> Any suggestions on how to display charsets other than that of the
> rest of the UI would be welcome.
Two possibilities come to mind:
- Split the current message frame, imp/message.php3, in two frames, one for
the GUI info and the other one for the message proper. Specify the
appropriate encoding for each, i. e. the encoding from imp/config/lang.php3
for the GUI frame, and the MIME charset of the incoming message for the
message-proper frame.
- Use Unicode (UTF-8 encoding) throughout; provide the GUI in UTF-8 encoding
and convert all incoming messages to UTF-8, according to their respective
MIME charset label.
Pros and Cons:
Using the individual message's MIME charset has the following
advantages:
- probably less work to do in IMP, as the browser's charset handling
is exploited;
- user can easily override erroneous charset labels via the browser's
charset setting.
The disadvantage:
- Does not work with MIME-encoded header fields, cf. RFC 2047,
section 2.
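To illustrate why a single per-frame charset cannot cover header fields: under RFC 2047 every encoded-word names its own charset, so two fields of the same message may need different decoders. A minimal Python sketch using the standard email.header module follows; the helper name header_text and the sample field values are made up for this illustration.

```python
from email.header import decode_header, make_header


def header_text(value: str) -> str:
    """Decode RFC 2047 encoded-words in one header field value."""
    return str(make_header(decode_header(value)))


# Two header fields of one message, each labelled with a different charset.
subject = header_text("=?koi8-r?Q?=F0=D2=C9=D7=C5=D4?=")
sender_name = header_text("=?iso-8859-1?Q?Gr=FC=DFe?=")
print(subject)
print(sender_name)
```

Both values end up as ordinary Unicode strings, which is what the UTF-8 approach relies on; a browser frame, by contrast, can be given only one charset for everything it shows.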
Converting everything to UTF-8 has the following advantages:
- Handling the encoding is confined to two or three places,
viz. import from message servers (IMAP|POP) and export of
messages (SMTP), so that all other code can ignore character
encoding issues;
- uniform encoding throughout, including message folders
(so they can easily be viewed, or edited, with other tools);
- works even if various header-fields are encoded distinctly
(which is an exception, but still legal).
The disadvantages:
- User needs a UTF-8-capable browser, even to view messages
coming in more traditional code pages. This is not a problem
with Netscape Navigator 4+ or Internet Explorer 4+, but it is one,
e.g., with Opera 2, and with older NN and IE versions. This problem
is likely to go away in the near future. (Opera is going to
include UTF-8 handling in the next release, if I am not
mistaken.)
- Handling of mislabeled messages is not easy to implement:
the original message has to be kept for alternative attempts
on code-conversions, as it cannot be derived from the UTF-8
version, in pathological cases (a possible optimization is
to keep a copy only for these pathological cases, which can
be recognized during the initial conversion to UTF-8).
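The optimization mentioned in the last item can be sketched as follows, again in Python and purely as an illustration. The function name to_utf8 and its return convention are hypothetical. The idea is to convert strictly and to flag a message as pathological when the labelled charset is unknown, when the bytes are not valid in it, or when re-encoding does not reproduce the original bytes.

```python
def to_utf8(raw: bytes, charset: str):
    """Convert raw bytes to UTF-8 according to their charset label.

    Returns (utf8_bytes, keep_original). keep_original is True when the
    original bytes cannot be recovered from the UTF-8 version, so a copy
    should be kept for later attempts with another charset.
    """
    try:
        text = raw.decode(charset)  # strict: raises on invalid bytes
    except LookupError:
        # Unknown label: show the bytes as ISO 8859-1, but keep the original.
        return raw.decode("latin-1").encode("utf-8"), True
    except UnicodeDecodeError:
        # Bytes invalid for the label: substitute U+FFFD, keep the original.
        return raw.decode(charset, errors="replace").encode("utf-8"), True
    try:
        round_trips = text.encode(charset) == raw
    except UnicodeEncodeError:
        round_trips = False
    return text.encode("utf-8"), not round_trips


print(to_utf8("Grüße".encode("latin-1"), "iso-8859-1"))
print(to_utf8(b"\xff\xfe", "utf-8"))
```

In the first call the flag is False, so nothing extra needs to be stored; in the second the bytes are not valid UTF-8, so the flag is True.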
Before rushing to a decision, please note that
there is also a problem with sending messages, which I will
present early next week.
I cannot contribute patches, for lack of knowledge of Horde
internals (including PHP); however, I am willing to contribute
my expertise in the character encoding realm towards
a satisfactory solution of this pressing problem.
Best wishes,
Otto Stolz