[imp] Subject line encoding problem
Tim Bannister
Tim.Bannister at manchester.ac.uk
Fri Dec 7 20:31:57 UTC 2007
Quoting Tim Bannister <Tim.Bannister at manchester.ac.uk>:
> On Fri, Dec 07, 2007 at 12:00:04PM +0000, imp-request at lists.horde.org wrote:
>>
>> > Also, why does Horde/Imp require the user to select the encoding in the
>> > first place? Desktop clients usually select the most appropriate
>> > encoding automatically. Kmail, for instance, usually uses either
>> > US-ASCII or ISO-8859-1. But if i type some Japanese characters into an
>> > e-mail, it automatically switches to ISO-2022-JP (the most common
>> > encoding for Japanese e-mail). I'm sending this message from Kmail as
>> > UTF-8.
>>
>> PHP doesn't support automatic detecting of charsets, and it's even
>> much more complicated to detect it client-side, i.e. when typing the
>> message. We already do choose the most appropriate encoding though,
>> because we choose the encoding that matches the currently selected
>> interface language.
>
> However, data outside ISO-8859-1 are usually sent as SGML entities.
> That's enough to infer an encoding (UCS-2). If there aren't any entities
> and no encoding was specified then it seems reasonable for Horde to
> infer ISO-8859-1.
I don't think this is right any more. The form is submitted as
multipart/form-data (RFC 2388), but IMP (PHP) often can't tell how the
parts are encoded. I'll attach a couple of sample submissions. The two
attachments were generated with WebKit (Safari 3.0.4) by varying the
Content-Type header sent by compose.php.
In my tests with Safari, Firefox and also Internet Explorer I found
that the character encoding is not indicated on submission but is
consistently derived from the encoding of the document in which the
form appears. Well, this varies depending on what the user agent asks
for (for example, I get UTF-8), but the key point is that IMP knows which
encoding it served. The
thing about entities is a bit of a red herring; it's how IE submits
characters it can't directly encode, and other browsers have copied
this. The entities get decoded before being used in a message body.
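
To illustrate the entity fallback described above, here is a rough Python sketch. It is not IMP or browser code: `browser_encode` is a made-up helper, and it assumes the browser behaves like Python's `xmlcharrefreplace` error handler, i.e. any character the page encoding cannot represent is sent as a decimal numeric character reference.

```python
def browser_encode(text: str, page_charset: str) -> bytes:
    """Approximate how a browser encodes one form field (hypothetical helper).

    Assumption: characters that do not exist in the page's charset are
    replaced by decimal numeric character references such as &#26085;.
    """
    return text.encode(page_charset, errors="xmlcharrefreplace")


# The same field submitted from a page served as ISO-8859-1 and as UTF-8.
latin1 = browser_encode("£ 日本語", "iso-8859-1")
utf8 = browser_encode("£ 日本語", "utf-8")

# In ISO-8859-1 the pound sign is a single byte, the Japanese text becomes
# entities. In UTF-8 everything is encoded directly and no entities appear.
assert latin1 == b"\xa3 &#26085;&#26412;&#35486;"
assert b"&#" not in utf8

# The ambiguity: text typed literally as an entity is indistinguishable
# from a character the browser had to escape.
assert browser_encode("&#26085;", "iso-8859-1") == browser_encode("日", "iso-8859-1")
```

The last assertion is why entities alone are a poor signal: the server cannot tell an escaped character from the same characters typed by hand.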
If IMP sets "charset" to the same value set in the HTTP headers for
the form, that encoding will be used for the submitted data by the
three popular browsers. It seems accurate enough that it could
perhaps become a hidden input.
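
A minimal sketch of the server side of that idea, again in Python rather than PHP and under stated assumptions: the form carries a hidden field (named `charset` here purely for illustration) that repeats the encoding sent in the HTTP headers, and the receiving code uses it to decode the other fields before expanding numeric character references. `decode_field` is a hypothetical helper, not an existing IMP or Horde function.

```python
import re

# Decimal numeric character references, e.g. &#26085;
_NUMERIC_REF = re.compile(r"&#(\d+);")


def decode_field(raw: bytes, declared_charset: str) -> str:
    """Decode one submitted field using the charset the form declared."""
    text = raw.decode(declared_charset)

    def _expand(match: "re.Match[str]") -> str:
        code_point = int(match.group(1))
        # Leave references outside the Unicode range untouched.
        return chr(code_point) if code_point <= 0x10FFFF else match.group(0)

    return _NUMERIC_REF.sub(_expand, text)


# Simulated submission from a page served as ISO-8859-1. The "charset"
# entry stands in for the hidden input; the field name is an assumption.
form = {"charset": b"iso-8859-1", "body": b"\xa3 &#26085;&#26412;&#35486;"}
body = decode_field(form["body"], form["charset"].decode("ascii"))
assert body == "£ 日本語"
```

Note that this expansion also turns a hand-typed `&#38;` into `&`, which matches the behaviour described for entities above.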
PS. There's some background information in Mozilla bugs 18643 and 228779:
https://bugzilla.mozilla.org/show_bug.cgi?id=18643
https://bugzilla.mozilla.org/show_bug.cgi?id=228779
PPS. Some internationalised text: €, £, $, русский, 日本語。Also, the
literal characters "ampersand hash three eight semicolon": &#38;
--
Tim Bannister
IT Services
e: Tim.Bannister at manchester.ac.uk
w: http://www.manchester.ac.uk/itservices
-------------- next part --------------
A non-text attachment was scrubbed...
Name: form-data.iso-8859-1
Type: application/octet-stream
Size: 6119 bytes
Desc: not available
Url : http://lists.horde.org/archives/imp/attachments/20071207/51c451a5/attachment-0002.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: form-data.utf-8
Type: application/octet-stream
Size: 6067 bytes
Desc: not available
Url : http://lists.horde.org/archives/imp/attachments/20071207/51c451a5/attachment-0003.obj