[i18n] Htmlspecialchars multibyte charset bug

Viljo Viitanen vviitane+mail.imp@mappi.helsinki.fi
Thu Nov 7 21:26:49 2002


Slow reply. I checked the archive just now, after joining this list today.

I wrote a few weeks ago:
>>The php htmlspecialchars does _very_ nasty things when dealing with
>>multi-byte text, say utf-8. It somehow automagically decodes the utf8
>>characters in current locale (latin1, for most people) to 8 bit octets.

To which Jan Schneider replied:
>Are you sure? I know that htmlspecialchars is broken for a lot of multibyte
>charsets but I always thought (but never verified) that it has support for
>utf-8.

I'm pretty sure, although I cannot test this with charsets other than utf-8,
which is the only multibyte charset I know anything about. See this simple
test:

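htmlspecialchars("ä",ENT_QUOTES,"utf-8"), given "ä" as utf-8 (the two octets
0xC3 0xA4), outputs "ä" as a single iso-8859-1 octet (0xE4).

htmlspecialchars("ä",ENT_QUOTES), without the charset argument, works as it
should, however: the utf-8 octets come back unchanged.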

(this is the case with PHP 4.2.3 compiled from source on Debian 3.0)
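
For anyone who wants to repeat the test, here is a minimal sketch. It writes
the "ä" as explicit utf-8 octets and prints hex dumps, so that neither the
encoding of the script file nor the terminal can hide what the function
returns. The variable names are only for illustration. A correct
implementation prints c3a4 on the "with charset" line; going by the behaviour
described above, PHP 4.2.3 would be expected to print e4 there.

```php
<?php
// "ä" as explicit utf-8 octets, independent of how this file is saved.
$utf8 = "\xC3\xA4";

// Same call as in the test above, with the charset argument...
$with = htmlspecialchars($utf8, ENT_QUOTES, "UTF-8");
// ...and without it, so the function falls back to its default charset.
$without = htmlspecialchars($utf8, ENT_QUOTES);

// Hex dumps show the octets actually returned.
echo "input:           ", bin2hex($utf8), "\n";
echo "with charset:    ", bin2hex($with), "\n";
echo "without charset: ", bin2hex($without), "\n";
```

If the "with charset" line differs from the "input" line, the function has
changed the encoding of a character it should have left alone.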

Anyway, the "funny" side-effect of the bug is that IMP 3.1 displays utf-8
encoded mails "correctly" by accident in locales whose default charset is
iso-8859-1. But either the function really is broken, or my understanding
of the PHP manual
(http://www.php.net/manual/en/function.htmlspecialchars.php) is broken...


-- 
Viljo Viitanen

(please use address Viljo.Viitanen@helsinki.fi for personal replies)

