[dev] splitting utf8 strings

Karsten Fourmont fourmont at gmx.de
Fri Nov 17 00:02:58 PST 2006


Hi,

when creating vCard/vCalendar entries in the iCalendar package, long
lines are supposed to be split into multiple shorter lines (called
line folding).

Something like
$s = chunk_split( $s, 76, "\r\n ");
though at the moment some less elegant manual code using substr does
the job.

However, I found an issue with this approach when dealing with UTF-8
encoded strings: both chunk_split and substr count bytes, not
characters, so a two-byte UTF-8 character (like an umlaut: "ä") can be
split between its first and second byte. So

first_byte_of_char newline second_byte_of_char

When the result is put into an XML document, some XML parsers choke on
that (notably the one built into Windows), because the resulting byte
sequence may not be valid UTF-8.
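
A minimal sketch of the problem, assuming a chunk length of 3 bytes
purely to force a break in the middle of a character. The helper name
is_valid_utf8 is made up for this example; it relies on preg_match
with the "u" modifier returning false for malformed input.

```php
<?php
// Hypothetical helper: true if $bytes is well-formed UTF-8.
// preg_match() with the "u" modifier returns false on malformed input.
function is_valid_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// "ä" is the two bytes 0xC3 0xA4 in UTF-8, so this string is six bytes.
$s = str_repeat("\xC3\xA4", 3);

// chunk_split() counts bytes, so a break lands inside the second "ä".
$folded = chunk_split($s, 3, "\r\n ");

var_dump(is_valid_utf8($s));      // bool(true)
var_dump(is_valid_utf8($folded)); // bool(false)
```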

Strange, but it does happen.

So the only way to do chunk-splitting in a "utf8-safe" manner seems to
be using Perl-like regular expressions (the preg_* functions) with the
"u" pattern modifier.
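
One way this could look, as a sketch only and not the iCalendar
package's actual code: split the string into characters with
preg_split and the "u" modifier, then build lines up to a byte limit.
The function name utf8_safe_fold and its parameters are made up for
this example, and the limit of 76 is simply taken from the chunk_split
call above.

```php
<?php
// Hypothetical helper, not taken from the iCalendar package.
// Folds $s so that no line holds more than $maxBytes bytes and no
// UTF-8 character is split across a line break.
function utf8_safe_fold(string $s, int $maxBytes = 76, string $break = "\r\n "): string
{
    // With the "u" modifier the empty pattern splits between characters,
    // not between bytes. preg_split() returns false on malformed UTF-8.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $lines = [];
    $line = '';
    foreach ($chars as $char) {
        // strlen() counts bytes, which is what the limit is measured in here.
        if ($line !== '' && strlen($line) + strlen($char) > $maxBytes) {
            $lines[] = $line;
            $line = '';
        }
        $line .= $char;
    }
    if ($line !== '') {
        $lines[] = $line;
    }

    return implode($break, $lines);
}

// Three two-byte characters with a 3-byte limit: one character per line.
$out = utf8_safe_fold(str_repeat("\xC3\xA4", 3), 3);
var_dump(preg_match('//u', $out) === 1); // bool(true)
```

Unlike chunk_split, this sketch does not append a trailing break after
the last line, and like the chunk_split call it does not count the
leading space of a continuation line towards the limit.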

Or does anybody have another suggestion?

Problems the world doesn't need...

Cheers,
  Karsten



