[imp] multibyte charset bugs and iconv support hack in imp (3.1)

Viljo Viitanen vviitane+mail.imp@mappi.helsinki.fi
Fri Oct 25 22:20:29 2002


First, compile PHP with iconv support.

Edit imp/lib/MIME/Viewer/text.php:

Change line

$text = htmlspecialchars($text, ENT_QUOTES, $charset);

to

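// '&' must come first, or the entities produced by the other
// replacements get escaped a second time (e.g. '<' ends up as '&amp;lt;').
$text = str_replace(array('&', '>', '<', '"', "'"),
                    array('&amp;', '&gt;', '&lt;', '&quot;', '&#039;'), $text);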

PHP's htmlspecialchars does _very_ nasty things when dealing with
multi-byte text, say UTF-8. It somehow automagically decodes the UTF-8
characters to 8-bit octets in the current locale (latin1, for most people).
Of course, this leaves IMP badly broken for multibyte (well, at least UTF-8)
encoded mails: they do not work even when opened in the new window where
the charset is set in the HTTP headers.

Jan Schneider has said on the list that nothing can be done about the
htmlspecialchars bugs, but I disagree. The str_replace may not be perfect,
but it works, at least for UTF-8 (every octet of a multibyte UTF-8 sequence
has its high bit set, so it can never collide with the ASCII characters
being replaced). It is broken for other, not-so-clever multibyte encodings
if they reuse the octets of the ASCII characters "'<>& inside other
characters. Oh well. One _could_ first convert from the offending charset to
UTF-8 with iconv, then replace the "'<>& and then convert back...
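
The round trip through UTF-8 could look roughly like the sketch below. It is
untested against IMP itself; the function names html_escape_bytes and
html_escape_charset are made up for illustration, and falling back to the
plain byte-wise replacement whenever iconv is missing or fails is my own
assumption about what would be acceptable.

```php
<?php
// Hypothetical helpers, not part of IMP.

// Byte-wise escaping; safe for ASCII-compatible charsets such as UTF-8.
// '&' is replaced first so the entities added afterwards stay intact.
function html_escape_bytes($text)
{
    return str_replace(
        array('&', '>', '<', '"', "'"),
        array('&amp;', '&gt;', '&lt;', '&quot;', '&#039;'),
        $text
    );
}

// Escape text in an arbitrary charset by converting it to UTF-8,
// escaping there, and converting the result back.
function html_escape_charset($text, $charset)
{
    if (!extension_loaded('iconv')) {
        return html_escape_bytes($text);
    }
    $utf8 = @iconv($charset, 'UTF-8', $text);
    if ($utf8 === false) {
        // Unknown charset or invalid input: fall back to byte-wise escaping.
        return html_escape_bytes($text);
    }
    $back = @iconv('UTF-8', $charset, html_escape_bytes($utf8));
    return ($back !== false) ? $back : html_escape_bytes($text);
}
```

For UTF-8 input this gives the same result as the plain str_replace; the
detour only matters for charsets where the octets of "'<>& can appear inside
a multibyte character.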


Then, the iconv hack for displaying messages: add

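// Silence iconv notices about unknown charsets or unconvertible input.
$old = ini_set('display_errors', 0);
$conv = iconv(strtoupper($mime->charset),
              strtoupper($GLOBALS['registry']->getCharset()) . '//TRANSLIT', $text);
ini_set('display_errors', $old);
// Strict comparison: an empty result string is not a failure.
if ($conv !== false) {
    $text = $conv;
}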

before the sprintf with "This message was written in a character set other
than your own...".

I believe something like this could be very useful to add to IMP. It can
be completely optional: the iconv extension can be checked for at runtime
with extension_loaded.
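
A minimal sketch of such an optional conversion, assuming a wrapper function
(convert_for_display is a name I made up, it does not exist in IMP) that
hands the text back untouched when iconv is missing or the conversion fails:

```php
<?php
// Hypothetical wrapper, not part of IMP: convert $text from charset $from
// to charset $to if the iconv extension is available.
function convert_for_display($text, $from, $to)
{
    if (!extension_loaded('iconv')) {
        // No iconv compiled in: behave exactly as before.
        return $text;
    }
    // '@' suppresses the notice iconv raises for unknown charsets or
    // unconvertible input; failure is detected via the return value.
    $conv = @iconv(strtoupper($from), strtoupper($to) . '//TRANSLIT', $text);
    return ($conv !== false) ? $conv : $text;
}
```

The message viewer would then call something like
convert_for_display($text, $mime->charset, $GLOBALS['registry']->getCharset())
and work the same with or without the extension.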

(The usefulness of the whole iconv thingy is left as an exercise for the
reader :))

Anyway, even after this iconv hack, support for different character sets has
problems. In mail headers, MIME encoded words (like this:
=?UTF-8?B?RW50w6RzIG7DpGluPw==?=) are just decoded to octets, and if that
encoding is not the encoding used in the page, tough. The above string is
then shown as "EntÃ¤s nÃ¤in?" (with latin1) when it should be "Entäs näin?".
That could probably be fixed with some simple iconv magic, but I haven't
checked that yet.
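
For what it's worth, the header fix might be sketched as follows. This is
not IMP code: decode_header_to is a hypothetical name, the regular
expression only handles the simple =?charset?B|Q?text?= form, and the
anonymous function assumes a PHP version that supports closures.

```php
<?php
// Hypothetical helper, not part of IMP: decode the encoded words in a
// header line and convert each from its declared charset to $target.
function decode_header_to($header, $target)
{
    return preg_replace_callback(
        '/=\?([^?]+)\?([BbQq])\?([^?]*)\?=/',
        function ($m) use ($target) {
            if (strtoupper($m[2]) === 'B') {
                $raw = base64_decode($m[3]);
            } else {
                // Q encoding: '_' stands for a space, the rest is
                // quoted-printable.
                $raw = quoted_printable_decode(str_replace('_', ' ', $m[3]));
            }
            $conv = @iconv(strtoupper($m[1]),
                           strtoupper($target) . '//TRANSLIT', $raw);
            // Keep the undecoded octets if iconv cannot convert them.
            return ($conv !== false) ? $conv : $raw;
        },
        $header
    );
}
```

Called with the example header above and the charset of the page as
$target, this should yield "Entäs näin?" in the right octets for that page
instead of the raw UTF-8 bytes.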

-- 
Viljo Viitanen