[imp] problem with attachments in unicode (UTF16)
Philip Steeman
philip.steeman at khbo.be
Tue Mar 25 10:10:32 UTC 2008
I've tested the upload with all browsers I have
- IE6 (Windows XP)
- IE7 (Windows XP)
- Firefox 2 (Windows XP)
- Konqueror (Knoppix)
All gave the same wrong result.
Philip
Otto Stolz wrote:
> Hello,
>
> Andrew Morgan wrote:
>> I tested this with Iceweasel (Firefox) 2.0.0.12 on Debian Unstable as
>> the client, and the latest stable releases of Horde and IMP with PHP5
>> on Debian Etch as the server. The browser says the content-type is
>> text/plain when it uploads the attachment. Here is the exact
>> attachment that was sent in the email:
>>
>> --=_3uemsho7ppkw
>> Content-Type: text/plain;
>> charset=UTF-8;
>> name="unicode.txt"
>> Content-Disposition: attachment;
>> filename="unicode.txt"
>> Content-Transfer-Encoding: quoted-printable
>>
>> =FF=FEt=00h=00i=00s=00 =00i=00s=00 =00a=00 =00t=00e=00s=00t=00=0D=00
>> =00i=00n=00 =00U=00T=00F=001=006=00=0D=00
>> =00=0D=00
>> =00P=00h=00i=00l=00i=00p=00 =00S=00t=00e=00e=00m=00a=00n=00=0D=00
>> =00
>> --=_3uemsho7ppkw--
>>
>> It used quoted-printable encoding instead of Base64. I'm not a
>> quoted-printable whiz, but it appears that the high-order byte of
>> each 16-bit character gets encoded as =00 (NUL). When I download this
>> same attachment using IMP, it is identical to your original
>> unicode.txt file. However, I suspect Thunderbird and Outlook are not
>> combining each pair of bytes back into one 16-bit unit (for example
>> =FF=FE into the byte-order mark U+FEFF) but are trying to render the
>> NUL characters.
>
> The problem is the wrong charset specification: for a UTF-16 encoded text,
> it should, of course, read “UTF-16” rather than “UTF-8”. I guess that
> wrong specification stems from the browser used to upload the file.
>
> Because of that wrong specification, the addressee will not interpret the
> text as intended. In particular:
> - The individual bytes will not be assembled into 16-bit units.
> - Any bytes above 127 will be interpreted according to UTF-8 rules;
> in particular, the two leading bytes (meant as BOM) will be considered
> illegal input values, and most probably be replaced with Replacement
> Characters U+FFFD.
> - As a consequence, the endianness of the UTF-16 text will be lost.
> That particular text is little-endian; the UTF-8 bytes will be
> interpreted in the opposite sequence. Hence, the two halves of each
> 16-bit unit will effectively be swapped, and even if you try
> to read the attachment as a UTF-16 file, you’ll be out of luck.
>
> The quoted-printable encoding is alright; the Content-Transfer-Encoding
> is totally irrelevant for the problems the two preceding posts in this
> thread have described.
>
> Good luck,
> Otto Stolz
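
The effect of the charset label described above can be reproduced with a
small Python sketch. It is only an illustration, not what IMP or any mail
client actually does: it decodes the quoted-printable body copied from the
quoted attachment with the standard quopri module, then interprets the
resulting bytes once as UTF-16 and once as UTF-8. The variable names are
made up for this example.

```python
import quopri

# Quoted-printable body copied from the attachment quoted above.
qp_body = (
    b"=FF=FEt=00h=00i=00s=00 =00i=00s=00 =00a=00 =00t=00e=00s=00t=00=0D=00\n"
    b"=00i=00n=00 =00U=00T=00F=001=006=00=0D=00\n"
    b"=00=0D=00\n"
    b"=00P=00h=00i=00l=00i=00p=00 =00S=00t=00e=00e=00m=00a=00n=00=0D=00\n"
    b"=00"
)

# Undo the Content-Transfer-Encoding; this yields the raw file bytes.
raw = quopri.decodestring(qp_body)

# Correct label: the codec reads the BOM (FF FE), picks little-endian,
# and assembles each byte pair into one 16-bit code unit.
as_utf16 = raw.decode("utf-16")

# Wrong label, as in the mail: every byte is treated on its own.
# FF and FE are invalid in UTF-8, and each 00 byte becomes a NUL character.
as_utf8 = raw.decode("utf-8", errors="replace")

# Reading the first byte pair after the BOM with the wrong byte order
# swaps the two halves of the code unit ("t" turns into U+7400).
swapped = raw[2:4].decode("utf-16-be")

print(repr(as_utf16))
print(repr(as_utf8[:8]))
print(hex(ord(swapped)))
```

The transfer encoding is decoded the same way in both cases; only the
charset used afterwards changes the result, which matches the point that
the Content-Transfer-Encoding is not the culprit.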