[imp] problem with attachments in unicode (UTF16)

Otto Stolz Otto.Stolz at uni-konstanz.de
Tue Mar 25 10:01:05 UTC 2008


Hello,

Andrew Morgan wrote:
> I tested this with Iceweasel (Firefox) 2.0.0.12 on Debian Unstable as the 
> client, and the latest stable releases of Horde and IMP with PHP5 on 
> Debian Etch as the server.  The browser says the content-type is 
> text/plain when it uploads the attachment.  Here is the exact attachment 
> that was sent in the email:
> 
> --=_3uemsho7ppkw
> Content-Type: text/plain;
>          charset=UTF-8;
>          name="unicode.txt"
> Content-Disposition: attachment;
>          filename="unicode.txt"
> Content-Transfer-Encoding: quoted-printable
> 
> =FF=FEt=00h=00i=00s=00 =00i=00s=00 =00a=00 =00t=00e=00s=00t=00=0D=00
> =00i=00n=00 =00U=00T=00F=001=006=00=0D=00
> =00=0D=00
> =00P=00h=00i=00l=00i=00p=00 =00S=00t=00e=00e=00m=00a=00n=00=0D=00
> =00
> --=_3uemsho7ppkw--
> 
> It used quoted-printable encoding instead of Base64.  I'm not a 
> quoted-printable whiz, but it appears that the high-order bits get encoded 
> as 00 (NUL) values.  When I download this same attachment using IMP, it is 
> identical to your original unicode.txt file.  However, I suspect 
> Thunderbird and Outlook are not combining the two bytes of data back 
> together (=FF=FE into FFEE) but are trying to render the NUL character.

The problem is the wrong charset specification: for a UTF-16 encoded text,
it should, of course, read “UTF-16” rather than “UTF-8”. I guess that the
wrong specification stems from the browser used to upload that file.

Because of that wrong specification, the addressee will not interpret the
text as intended. In particular:
- The individual bytes will not be assembled into 16-bit units.
- Any bytes above 127 will be interpreted according to UTF-8 rules;
   in particular, the two leading bytes (meant as a BOM) will be considered
   illegal input values, and will most probably be replaced with Replacement
   Characters U+FFFD.
- As a consequence, the endianness of the UTF-16 text will be lost.
   That particular text is little-endian; the UTF-8 bytes will be
   interpreted in the opposite sequence. Hence, the two halves of each
   16-bit unit will effectively be swapped, and even if you try
   to read the attachment as a UTF-16 file, you’ll be out of luck.
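
To illustrate the first two points, here is a minimal Python sketch (not
part of the original thread) that decodes the first line of the quoted
attachment body and then interprets the resulting bytes under both charset
labels. It assumes Python 3 and its standard quopri module; the variable
names are made up for illustration.

```python
import quopri

# First line of the attachment body from the quoted message,
# without its trailing CR/LF pair.
qp_line = b"=FF=FEt=00h=00i=00s=00 =00i=00s=00 =00a=00 =00t=00e=00s=00t=00"

# Undo the Content-Transfer-Encoding; this yields the raw bytes of the file.
raw = quopri.decodestring(qp_line)

# Interpreted as labelled (UTF-8): the bytes FF and FE are invalid in UTF-8
# and are replaced with U+FFFD; every 00 byte becomes a NUL character.
as_utf8 = raw.decode("utf-8", errors="replace")

# Interpreted as intended (UTF-16): the BOM FF FE selects little-endian
# and the bytes are assembled into 16-bit units.
as_utf16 = raw.decode("utf-16")

print(repr(as_utf8))
print(repr(as_utf16))
```

Both decode calls see exactly the same bytes; only the charset label
differs.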

The quoted-printable encoding is fine; the Content-Transfer-Encoding
is irrelevant to the problems described in the two preceding posts in
this thread.
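
A short sketch of why the transfer encoding does not matter, assuming
Python 3 and its standard quopri and base64 modules (the sample text and
variable names are illustrative only): both encodings hand back exactly
the bytes that were put in, so the charset label alone decides how those
bytes are read.

```python
import base64
import quopri

# A little-endian UTF-16 sample with an explicit BOM, like the attachment.
original = b"\xff\xfe" + "this is a test".encode("utf-16-le")

# Encode the same bytes with both Content-Transfer-Encodings.
via_qp = quopri.encodestring(original)
via_b64 = base64.b64encode(original)

# Decoding either form restores the original bytes unchanged.
from_qp = quopri.decodestring(via_qp)
from_b64 = base64.b64decode(via_b64)

print(from_qp == original, from_b64 == original)
```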

Good luck,
   Otto Stolz







