[imp] problem with attachments in unicode (UTF16)
Andrew Morgan
morgan at orst.edu
Wed Mar 26 17:43:37 UTC 2008
On Wed, 26 Mar 2008, Otto Stolz wrote:
> Hello,
>
> Michael M Slusarz wrote:
>> No - this is incorrect. The correct (and unfortunate) answer is that
>> we can not detect the charset of a text attachment if it is in a
>> different charset than the browser. Browser upload information does
>> not contain the charset of the uploaded data, only the type - all we
>> have to go by is the charset the browser reports to us via the HTTP
>> headers.
> ...
>> The greater issue is that PHP provides us no means to determine what
>> the charset of the given file is.
>
> You could inspect the leading two or three bytes of the uploaded
> text file:
> - If they are EF BB BF, it is almost certainly UTF-8.
> - If they are FE FF, it is most probably UTF-16BE.
> - If they are FF FE, it is most probably UTF-16LE.
>
> This would correctly identify every Unicode-encoded text file
> uploaded from a Windows system (which still constitutes the
> majority of the end-user systems). Of course, this method does
> not detect every encoding from every end-user system, but it
> would make a great step toward a correct tagging of text type
> attachments.
This seems like a reasonable method to detect UTF-16 encoded text files.
I'm not sure about using it for UTF-8 though. Wikipedia says:
Although not part of the standard, many Windows programs (including
Windows Notepad) use the byte sequence EF BB BF at the beginning of a
file to indicate that the file is encoded using UTF-8. This is the Byte
Order Mark U+FEFF encoded in UTF-8, which appears as the ISO-8859-1
characters "ï»¿" in most text editors and web browsers not prepared
to handle UTF-8.
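
For illustration, the check on the leading bytes could look roughly like this. It is only a sketch, written in Python rather than IMP's PHP; the name sniff_bom is hypothetical and nothing here is existing IMP or Horde code. It covers only the three byte sequences listed above; a UTF-32LE file also begins with FF FE and would be reported as UTF-16LE.

```python
import codecs

# Hypothetical helper for illustration only; not part of IMP or Horde.
# Each entry pairs a byte order mark with the charset it suggests.
_BOMS = (
    (codecs.BOM_UTF8, "UTF-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),   # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),   # FF FE
)


def sniff_bom(data: bytes):
    """Return the charset suggested by a leading BOM, or None if there is none."""
    for bom, charset in _BOMS:
        if data.startswith(bom):
            return charset
    return None
```

A None result means no BOM was found and some other guess is needed.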
> If the uploaded text file does not contain a BOM, you could
> take the first entry from the Accept-Charset header as a guess
> for the file's encoding. This is, of course, less reliable,
> but would be right for most files from out-of-the-box browser
> installations.
If IMP is not able to use the Byte Order Mark to detect the encoding, then
it should assume the file is encoded using the currently selected
language/encoding in Horde.
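
The two fallbacks discussed here could be combined roughly as follows. Again this is only a Python sketch with hypothetical names (guess_charset, default_charset), not IMP code: it takes the first charset named in the Accept-Charset header value, ignoring q-values as Otto's suggestion does, and otherwise returns a default that stands in for the charset currently selected in Horde.

```python
def guess_charset(accept_charset, default_charset):
    """Return the first charset named in an Accept-Charset header value.

    Falls back to default_charset (assumed here to be the charset selected
    in the application) when the header is missing, empty, or only "*".
    """
    for item in (accept_charset or "").split(","):
        # Drop any parameter such as ";q=0.7" and surrounding whitespace.
        name = item.split(";", 1)[0].strip()
        if name and name != "*":
            return name
    return default_charset
```

For example, a header value of "ISO-8859-1,utf-8;q=0.7,*;q=0.7" would yield ISO-8859-1, and a missing header would yield the default.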
> To be on the safe side, you could add a Charset field to the
> Attachments line in the Message Composition form (similar to
> the Charset field in the header zone of that form). That
> attachment-charset field would be preset to the value resulting
> from the procedure outlined above, but would provide an
> opportunity to override the preset value.
This is probably overkill, and would certainly clutter the interface a
lot. :)
Andy