[imp] problem with attachments in unicode (UTF16)

Tue Mar 25 10:10:32 UTC 2008

I've tested the upload with all the browsers I have:
- IE6 (windows XP)
- IE7 (windows XP)
- Firefox 2 (windows XP)
- konqueror (Knoppix)

All gave the same wrong result.

Philip

Otto Stolz wrote:
> Hello,
> 
> Andrew Morgan wrote:
>> I tested this with Iceweasel (Firefox) 2.0.0.12 on Debian Unstable as 
>> the client, and the latest stable releases of Horde and IMP with PHP5 
>> on Debian Etch as the server.  The browser says the content-type is 
>> text/plain when it uploads the attachment.  Here is the exact 
>> attachment that was sent in the email:
>>
>> --=_3uemsho7ppkw
>> Content-Type: text/plain;
>>          charset=UTF-8;
>>          name="unicode.txt"
>> Content-Disposition: attachment;
>>          filename="unicode.txt"
>> Content-Transfer-Encoding: quoted-printable
>>
>> =FF=FEt=00h=00i=00s=00 =00i=00s=00 =00a=00 =00t=00e=00s=00t=00=0D=00
>> =00i=00n=00 =00U=00T=00F=001=006=00=0D=00
>> =00=0D=00
>> =00P=00h=00i=00l=00i=00p=00 =00S=00t=00e=00e=00m=00a=00n=00=0D=00
>> =00
>> --=_3uemsho7ppkw--
>>
>> It used quoted-printable encoding instead of Base64.  I'm not a 
>> quoted-printable whiz, but it appears that the high-order byte of 
>> each character gets encoded as =00 (NUL).  When I download this same 
>> attachment using IMP, it is identical to your original unicode.txt 
>> file.  However, I suspect Thunderbird and Outlook are not combining 
>> the two bytes of data back together (=FF=FE into FFFE) but are 
>> trying to render the NUL character.
> 
> The problem is the wrong charset specification: For a UTF-16 encoded text,
> it should, of course, read “UTF-16” rather than “UTF-8”. I guess that
> wrong specification stems from the browser used to upload that file.
> 
> Because of that wrong specification, the addressee will not interpret the
> text as intended. In particular:
> - The individual bytes will not be assembled into 16-bit units.
> - Any bytes above 127 will be interpreted according to UTF-8 rules;
>   in particular, the two leading bytes (meant as BOM) will be considered
>   as illegal input values, and most probably be replaced with Replacement
>   Characters U+FFFD.
> - In due course, the endianness of the UTF-16 text will be lost.
>   That particular text is little-endian; the UTF-8 bytes will be
>   interpreted in the opposite sequence. Hence, the two halves of each
>   16-bit unit will effectively be swapped, and even if you try
>   to read the attachment as a UTF-16 file, you’ll be out of luck.
> 
> The quoted-printable encoding is alright; the Content-Transfer-Encoding
> is totally irrelevant for the problems the two preceding posts in this
> thread have described.
> 
> Good luck,
>   Otto Stolz
>
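
The charset point in the quoted reply can be checked by hand. Below is a
minimal sketch in Python, assuming the quoted-printable body from the
attachment is copied with bare LF line breaks (the variable names are
illustrative only and have nothing to do with IMP or Horde). It undoes the
transfer encoding with the standard quopri module, then interprets the
same bytes once as UTF-16 and once as UTF-8, the charset the message
header claimed.

```python
import quopri

# Quoted-printable body copied from the attachment above.
# Assumption: the lines are separated by bare LF characters.
qp_body = b"\n".join([
    b"=FF=FEt=00h=00i=00s=00 =00i=00s=00 =00a=00 =00t=00e=00s=00t=00=0D=00",
    b"=00i=00n=00 =00U=00T=00F=001=006=00=0D=00",
    b"=00=0D=00",
    b"=00P=00h=00i=00l=00i=00p=00 =00S=00t=00e=00e=00m=00a=00n=00=0D=00",
    b"=00",
])

# Step 1: undo the Content-Transfer-Encoding. This only yields raw bytes.
raw = quopri.decodestring(qp_body)

# Step 2a: interpret the bytes with the correct charset. The "utf-16"
# codec reads the leading FF FE as a little-endian byte order mark.
text = raw.decode("utf-16")

# Step 2b: interpret the same bytes with the charset the header declared.
# FF and FE are not valid in UTF-8, so they become U+FFFD, and every
# second byte shows up as a NUL character.
mislabeled = raw.decode("utf-8", errors="replace")

print(repr(text))
print(repr(mislabeled[:12]))
```

The transfer decoding gives the same bytes in both cases; only the charset
used in the second step decides whether the text comes out readable.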