[i18n] Htmlspecialchars multibyte charset bug

Jan Schneider jan@horde.org
Fri Nov 8 02:41:05 2002


Quoting Viljo Viitanen <vviitane+mail.imp@mappi.helsinki.fi>:

> Slow reply. I checked the archive just now, after joining this list
> today.
> 
> I wrote a few weeks ago:
> >>The php htmlspecialchars does _very_ nasty things when dealing with
> >>multi-byte text, say utf-8. It somehow automagically decodes the utf8
> >>characters in current locale (latin1, for most people) to 8 bit octets.
> 
> To which Jan Schneider replied:
> >Are you sure? I know that htmlspecialchars is broken for a lot of multibyte
> >charsets but I always thought (but never verified) that it has support for
> >utf-8.
> 
> I'm pretty sure (I cannot test this with charsets other than utf-8; it's the
> only multibyte charset I know something about). See this simple test:
> 
> htmlspecialchars("ä",ENT_QUOTES,"utf-8") output is "ä".
> 
> htmlspecialchars("ä",ENT_QUOTES) works as it should, however.
> 
> (this is the case with PHP 4.2.3 compiled from source on Debian 3.0)
> 
> Anyway, the "funny" side-effect of the bug is that imp 3.1 displays utf-8
> encoded mails "correctly" by accident when using locales using the default
> charset iso-8859-1. But the function really is broken, OR, my understanding
> of the php manual
> (http://www.php.net/manual/en/function.htmlspecialchars.php) is broken...

Hm, I guess this needs some further testing, but comparing the current code
of ext/standard/html.c with the version from the 4.2 tree, it seems that the
charset entity mapping has been much improved. Anyway, there's still a long
way to go before Horde has complete, working multibyte support.
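
A byte-level check along these lines could be a starting point for that
testing. It is only a sketch: the variable names are illustrative, and the
input is written as explicit bytes on the assumption that the encoding of
the script file itself should not influence the result.

```php
<?php
// UTF-8 encoding of "ä" (U+00E4), given as explicit bytes so the test
// does not depend on how this source file is saved.
$input = "\xC3\xA4";

// Same call as in the report quoted above.
$output = htmlspecialchars($input, ENT_QUOTES, "utf-8");

// Compare bytes rather than glyphs: a terminal or browser may render
// the UTF-8 and the iso-8859-1 form of the character identically.
echo "input:  ", bin2hex($input), "\n";
echo "output: ", bin2hex($output), "\n";

if ($output === $input) {
    echo "bytes unchanged\n";
} else {
    echo "bytes altered (the behaviour described in the report)\n";
}
```

If the output bytes differ from the input bytes, the function has re-encoded
the character; "ä" contains nothing that needs escaping, so a charset-aware
htmlspecialchars should hand it back untouched.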

Jan.

--
http://www.horde.org - The Horde Project
http://www.ammma.de - discover your knowledge
http://www.tip4all.de - Deine private Tippgemeinschaft

