[imp] Subject line encoding problem

Fri Dec 7 20:31:57 UTC 2007

Quoting Tim Bannister <Tim.Bannister at manchester.ac.uk>:

> On Fri, Dec 07, 2007 at 12:00:04PM +0000, imp-request at lists.horde.org wrote:
>>
>> > Also, why does Horde/Imp require the user to select the encoding in the
>> > first place? Desktop clients usually select the most appropriate
>> > encoding automatically. Kmail, for instance, usually uses either
>> > US-ASCII or ISO-8859-1. But if i type some Japanese characters into an
>> > e-mail, it automatically switches to ISO-2022-JP (the most common
>> > encoding for Japanese e-mail). I'm sending this message from Kmail as
>> > UTF-8.
>>
>> PHP doesn't support automatic detecting of charsets, and it's even
>> much more complicated to detect it client-side, i.e. when typing the
>> message. We already do choose the most appropriate encoding though,
>> because we choose the encoding that matches the currently selected
>> interface language.
>
> However, data outside ISO-8859-1 are usually sent as SGML entities.
> That's enough to infer an encoding (UCS-2). If there aren't any entities
> and no encoding was specified then it seems reasonable for Horde to
> infer ISO-8859-1.

I don't think this is right any more. The form is submitted as  
multipart/form-data (RFC 2388), but IMP (PHP) often can't tell how the  
parts are encoded. I'll attach a couple of sample submissions. The two  
attachments were generated with WebKit (Safari 3.0.4) by varying the  
Content-Type header sent by compose.php.

In my tests with Safari, Firefox and also Internet Explorer I found
that the character encoding is not indicated on submission but is
consistently derived from the encoding of the document in which the
form appears. That document encoding varies depending on what the user
agent asks for (for example, I get UTF-8), but the key point is that
IMP knows which encoding it sent. The thing about entities is a bit of
a red herring; it's how IE submits characters it can't directly
encode, and other browsers have copied this. The entities get decoded
before being used in a message body.
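
To illustrate the entity behaviour described above, here is a minimal
Python sketch (not IMP or browser code). The function name
browser_style_encode is hypothetical, and Python's "xmlcharrefreplace"
error handler only approximates what the browsers do; real browsers
may also treat ISO-8859-1 as windows-1252, which this sketch ignores.

```python
def browser_style_encode(text: str, charset: str) -> bytes:
    """Approximate a form submission: characters the charset cannot
    represent are replaced by decimal numeric character references."""
    return text.encode(charset, errors="xmlcharrefreplace")


sample = "€ £ русский"

# In ISO-8859-1 the pound sign is a single byte, but the euro sign and
# the Cyrillic letters have no encoding and become "&#NNNN;" references.
latin1 = browser_style_encode(sample, "iso-8859-1")

# In UTF-8 every character is representable, so no references appear.
utf8 = browser_style_encode(sample, "utf-8")

# Text the user typed literally as "&#38;" is submitted unchanged, so
# on the wire it cannot be told apart from an escaped character.
literal = browser_style_encode("&#38;", "iso-8859-1")
```

The last case is why the entities alone are not a reliable signal: the
receiving side still needs to know the form's encoding to interpret
the bytes.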

If IMP sets "charset" to the same value set in the HTTP headers for  
the form, that encoding will be used for the submitted data by the  
three popular browsers.  It seems accurate enough that it could  
perhaps become a hidden input.
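
A rough sketch of that idea in Python rather than PHP, assuming the
charset that was sent with the form comes back alongside the data (for
example in a hidden input). The helper name decode_field and the rule
of expanding numeric references only for non-UTF-8 forms are my own
assumptions, not something IMP does today.

```python
import re

# Matches decimal numeric character references such as "&#8364;".
_NUMERIC_REF = re.compile(r"&#(\d+);")


def decode_field(raw: bytes, form_charset: str) -> str:
    """Decode one submitted field using the charset the form was served in."""
    text = raw.decode(form_charset, errors="replace")

    # Assumption: a UTF-8 form never needs references, so anything that
    # looks like one was typed literally by the user and is left alone.
    if form_charset.lower().replace("-", "") == "utf8":
        return text

    def _expand(match: re.Match) -> str:
        code_point = int(match.group(1))
        # Leave out-of-range values untouched instead of raising.
        return chr(code_point) if code_point <= 0x10FFFF else match.group(0)

    return _NUMERIC_REF.sub(_expand, text)


from_utf8_form = decode_field("日本語 &#38;".encode("utf-8"), "UTF-8")
from_latin1_form = decode_field(b"&#8364; \xa3", "iso-8859-1")
```

With a non-UTF-8 form a literally typed "&#38;" would still be
expanded, so this does not remove the ambiguity; it only shows how
knowing the form's charset makes the byte decoding itself
deterministic.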

PS. There's some background information in Mozilla bugs 18643 and 228779:
https://bugzilla.mozilla.org/show_bug.cgi?id=18643
https://bugzilla.mozilla.org/show_bug.cgi?id=228779

PPS. Some internationalised text: €, £, $, русский, 日本語。Also, the  
literal characters "ampersand hash three eight semicolon": &#38;

-- 
Tim Bannister
IT Services

e: Tim.Bannister at manchester.ac.uk
w: http://www.manchester.ac.uk/itservices
-------------- next part --------------
A non-text attachment was scrubbed...
Name: form-data.iso-8859-1
Type: application/octet-stream
Size: 6119 bytes
Desc: not available
Url : http://lists.horde.org/archives/imp/attachments/20071207/51c451a5/attachment-0002.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: form-data.utf-8
Type: application/octet-stream
Size: 6067 bytes
Desc: not available
Url : http://lists.horde.org/archives/imp/attachments/20071207/51c451a5/attachment-0003.obj