[imp] problem with attachments in unicode (UTF16)

Wed Mar 26 09:36:34 UTC 2008

Hello,

Michael M Slusarz wrote:
> No - this is incorrect.  The correct (and unfortunate) answer is that  
> we can not detect the charset of a text attachment if it is in a  
> different charset than the browser.  Browser upload information does  
> not contain the charset of the uploaded data, only the type - all we  
> have to go by is the charset the browser reports to us via the HTTP  
> headers.
...
> The greater issue is that PHP provides us no means to determine what  
> the charset of the given file is.

You could inspect the leading two or three bytes of the uploaded
text file:
- If they are EF BB BF, it is almost certainly UTF-8.
- If they are FE FF, it is most probably UTF-16BE.
- If they are FF FE, it is most probably UTF-16LE.

This would correctly identify the Unicode-encoded text files
that start with a BOM, as files saved by Windows tools typically
do (and Windows systems still constitute the majority of
end-user systems). Of course, this method does not detect every
encoding from every end-user system, but it would be a great
step toward correct tagging of text type attachments.
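
A minimal sketch of the BOM check described above, written in
Python purely for illustration (IMP itself is PHP). The function
name guess_charset_from_bom and the BOMS table are hypothetical,
not existing IMP or Horde code:

```python
import codecs

# Byte-order marks, in the order listed above.
BOMS = (
    (codecs.BOM_UTF8, "UTF-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),   # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),   # FF FE
)


def guess_charset_from_bom(data):
    """Return a charset name if data starts with a known BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

Only the first three bytes of the upload need to be read. Note
that a UTF-32LE file also starts with FF FE (followed by 00 00),
which is one reason the result is a guess, not a certainty.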

If the uploaded text file does not contain a BOM, you could
take the first entry from the Accept-Charset header as a guess
for the file’s encoding. This is, of course, less reliable,
but would be right for most files from out-of-the-box browser
installations.
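
As an illustration of that fallback, here is a small Python
sketch that picks the first entry of an Accept-Charset value.
The function name first_accept_charset is hypothetical, and the
sketch deliberately takes the first entry as suggested above,
without ranking the entries by their q-values:

```python
def first_accept_charset(header):
    """Return the first charset named in an Accept-Charset value, or None."""
    for item in header.split(","):
        # Drop any parameter such as ";q=0.7" and surrounding blanks.
        name = item.split(";", 1)[0].strip()
        # Skip empty entries and the "*" wildcard, which names no charset.
        if name and name != "*":
            return name
    return None
```

For a header value such as "ISO-8859-1,utf-8;q=0.7,*;q=0.7" this
would yield ISO-8859-1 as the guess for a BOM-less file.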

To be on the safe side, you could add a Charset field to the
Attachments line in the Message Composition form (similar to
the Charset field in the header zone of that form). That
attachment-charset field would be preset to the value resulting
from the procedure outlined above, but would provide an
opportunity to override the preset value.

> There is nothing wrong  
> with the way we Q-P - but if we Q-P using the wrong charset, the data  
> is going to be invalid.

To avoid a possible misunderstanding of this wording: The Q-P encoding
poses no problem, even if sailing under false colours, charset-wise.
Q-P simply encodes the byte 3D, and the bytes above 7F, by their
hexadecimal values, which will be decoded without any problem. When
tagged as UTF-8, as in the examples discussed so far, even the
byte order is sure to be preserved.
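
This can be checked with a short Python sketch using the standard
quopri module. The helper name qp_round_trip and the sample bytes
are made up for illustration; the point is only that Q-P operates
on bytes and never looks at the charset tag:

```python
import quopri


def qp_round_trip(data):
    """Q-P encode data, decode it again, and return (encoded, decoded)."""
    encoded = quopri.encodestring(data)
    decoded = quopri.decodestring(encoded)
    return encoded, decoded


# Hypothetical sample: UTF-16LE text "A=ä" preceded by its BOM (FF FE).
sample = b"\xff\xfeA\x00=\x00\xe4\x00"
encoded, decoded = qp_round_trip(sample)
```

The encoded form contains only ASCII, and decoding it gives back
exactly the original bytes, in the original order.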

The only problem is the wrong Charset tag, as it will cause particular
byte values (or sequences thereof) to be considered illegal and, in due
course, to be replaced with Replacement Characters (or, perhaps, even
dropped).
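
A small Python sketch of that effect, under the assumption that
the recipient's mail client decodes leniently and substitutes
U+FFFD for illegal sequences. The helper decode_as and the sample
text are hypothetical:

```python
import codecs


def decode_as(data, charset):
    """Decode bytes under the given charset tag, replacing illegal sequences."""
    return data.decode(charset, errors="replace")


# UTF-16LE bytes with BOM, as an uploaded text file might contain them.
sample = codecs.BOM_UTF16_LE + "Gr\u00fc\u00dfe".encode("utf-16-le")

right = decode_as(sample, "utf-16")   # correct tag: BOM is honoured
wrong = decode_as(sample, "utf-8")    # wrong tag: illegal bytes replaced
```

With the correct tag the text comes back intact; with the UTF-8
tag the BOM bytes and the non-ASCII letters turn into Replacement
Characters, although the bytes themselves arrived undamaged.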

Best wishes,
   Otto Stolz