[imp] problem with attachments in unicode (UTF16)

Andrew Morgan morgan at orst.edu
Wed Mar 26 17:43:37 UTC 2008


On Wed, 26 Mar 2008, Otto Stolz wrote:

> Hello,
>
> Michael M Slusarz wrote:
>> No - this is incorrect.  The correct (and unfortunate) answer is that
>> we can not detect the charset of a text attachment if it is in a
>> different charset than the browser.  Browser upload information does
>> not contain the charset of the uploaded data, only the type - all we
>> have to go by is the charset the browser reports to us via the HTTP
>> headers.
> ...
>> The greater issue is that PHP provides us no means to determine what
>> the charset of the given file is.
>
> You could inspect the leading two or three bytes of the uploaded
> text file:
> - If they are EF BB BF, it is almost certainly UTF-8.
> - If they are FE FF, it is most probably UTF-16BE.
> - If they are FF FE, it is most probably UTF-16LE.
>
> This would correctly identify every Unicode-encoded text file
> uploaded from a Windows system (which still constitutes the
> majority of the end-user systems). Of course, this method does
> not detect every encoding from every end-user system, but it
> would make a great step toward a correct tagging of text type
> attachments.

This seems like a reasonable method to detect UTF-16 encoded text files. 
I'm not sure about using it for UTF-8 though.  Wikipedia says:

   Although not part of the standard, many Windows programs (including
   Windows Notepad) use the byte sequence EF BB BF at the beginning of a
   file to indicate that the file is encoded using UTF-8. This is the Byte
   Order Mark U+FEFF encoded in UTF-8, which appears as the ISO-8859-1
   characters "ï»¿" in most text editors and web browsers not prepared
   to handle UTF-8.
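
For what it's worth, the check Otto describes is only a few lines. Here is
a rough sketch in Python (not PHP, and not anything taken from IMP); the
name sniff_bom is made up for illustration. Note that FF FE is also the
start of the UTF-32LE BOM (FF FE 00 00), which is one reason the result is
only "most probably" UTF-16LE.

```python
import codecs

# BOM signatures mentioned in this thread, mapped to Python codec names.
# Hypothetical helper for illustration only; IMP itself is written in PHP.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE (also begins the UTF-32LE BOM)
)


def sniff_bom(data: bytes):
    """Return (charset, bom_length) if data starts with a known BOM,
    otherwise (None, 0)."""
    for bom, charset in _BOMS:
        if data.startswith(bom):
            return charset, len(bom)
    return None, 0
```

Only the first three bytes of the upload are needed, so the caller can
read a small prefix of the file rather than the whole attachment.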

> If the uploaded text file does not contain a BOM, you could
> take the first entry from the Accept-Charset header as a guess
> for the file's encoding. This is, of course, less reliable,
> but would be right for most files from out-of-the-box browser
> installations.

If IMP is not able to use the Byte Order Mark to detect the encoding, then 
it should assume the file is encoded using the currently selected 
language/encoding in Horde.
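
To make the fallback concrete, here is a minimal sketch in Python (again,
not PHP and not IMP code). The function name charset_for_attachment and
the session_charset parameter are assumptions; session_charset stands in
for whatever charset Horde currently has selected for the user.

```python
def charset_for_attachment(data: bytes, session_charset: str) -> str:
    """Pick a charset label for an uploaded text attachment.

    Hypothetical helper: a BOM wins if one is present; otherwise fall
    back to the charset currently selected in the session.
    """
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    # No BOM found: assume the file matches the user's current charset.
    return session_charset
```

For example, charset_for_attachment(upload_bytes, "ISO-8859-1") would
label a BOM-less file as ISO-8859-1 and a Notepad "Unicode" file as
UTF-16LE.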

> To be on the safe side, you could add a Charset field to the
> Attachments line in the Message Composition form (similar to
> the Charset field in the header zone of that form). That
> attachment-charset field would be preset to the value resulting
> from the procedure outlined above, but would provide an
> opportunity to override the preset value.

This is probably overkill, and would certainly clutter the interface a 
lot.   :)

 	Andy

