[dev] splitting utf8 strings
Karsten Fourmont
fourmont at gmx.de
Fri Nov 17 00:02:58 PST 2006
Hi,
when creating vCard/vCalendar entries in the iCalendar package, long
lines are supposed to be split into multiple shorter lines (this is
called line folding).
Something like
$s = chunk_split( $s, 76, "\r\n ");
although the package currently uses some less elegant hand-written
code based on substr.
However, I found an issue with this approach when dealing with UTF-8
encoded strings: a two-byte UTF-8 character (such as the umlaut "ä")
can be split between its first and second byte, giving
first_byte_of_char newline second_byte_of_char
When the result is put into an XML document, some XML parsers choke on
it (notably the one built into Windows), because the resulting byte
sequence may not be valid UTF-8.
Strange, but it does happen.
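To make the problem concrete, here is a minimal sketch that forces the split. The chunk width of 2 is an assumption chosen only so that the boundary lands inside "ä"; with the real width of 76 the same thing happens whenever a multi-byte character straddles a chunk boundary. The validity check uses the idiom of matching an empty pattern with the "u" modifier, which only succeeds on well-formed UTF-8.

```php
<?php
// "ä" is encoded in UTF-8 as the two bytes 0xC3 0xA4.
$s = "a\xC3\xA4";                        // the string "aä", 3 bytes long

// Width 2 is illustrative only: it puts the chunk boundary inside "ä".
$folded = chunk_split($s, 2, "\r\n ");

// Show the raw bytes: 0xC3 and 0xA4 end up on different lines.
echo bin2hex($folded), "\n";

// An empty pattern with the "u" modifier matches only if the subject
// is well-formed UTF-8, so it doubles as a validity check.
$isValid = preg_match('//u', $folded) === 1;
echo $isValid ? "valid UTF-8\n" : "invalid UTF-8\n";
```

The original string passes the same check, so the damage really is introduced by the byte-oriented split.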
So the only way to do chunk splitting in a "UTF-8-safe" manner seems
to be using Perl-compatible regular expressions (the preg_* functions)
with the "u" pattern modifier.
Or does anybody have another suggestion?
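One way this could look is sketched below. The helper name fold_utf8 and its parameters are hypothetical, not part of the iCalendar package. It is a sketch under these assumptions: the limit is measured in bytes, the default of 76 simply mirrors the chunk_split call above, the leading space of a continuation line is not counted towards the limit, and, unlike chunk_split, no separator is appended after the last chunk.

```php
<?php
// Hypothetical helper, for illustration only.
// Folds $s into chunks of at most $maxBytes bytes without ever
// splitting a multi-byte UTF-8 character.
function fold_utf8($s, $maxBytes = 76, $sep = "\r\n ")
{
    // With the "u" modifier PCRE treats $s as UTF-8, so the empty
    // pattern splits between whole characters, not between bytes.
    // preg_split() returns false if $s is not valid UTF-8.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return $s;                       // leave invalid input untouched
    }

    $lines = array();
    $current = '';
    foreach ($chars as $ch) {
        // strlen() counts bytes, which is the unit of the limit here.
        if ($current !== '' && strlen($current) + strlen($ch) > $maxBytes) {
            $lines[] = $current;         // chunk is full: start a new one
            $current = '';
        }
        $current .= $ch;
    }
    if ($current !== '') {
        $lines[] = $current;
    }

    return implode($sep, $lines);
}

// Same illustrative width of 2 as before; "ä" now stays in one piece.
echo bin2hex(fold_utf8("a\xC3\xA4b", 2)), "\n";
```

Splitting into characters first and then counting bytes keeps the byte limit while guaranteeing that every chunk boundary falls between characters. A single pattern such as '/.{1,76}/us' would also avoid broken characters, but it counts characters rather than bytes, so chunks could exceed the byte limit.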
Problems the world doesn't need...
Cheers,
Karsten