[dev] splitting utf8 strings

Jan Schneider jan at horde.org
Fri Nov 17 02:43:32 PST 2006


Quoting Karsten Fourmont <fourmont at gmx.de>:

> Hi,
>
> when creating vcard/vcalendar entries in the iCalendar package, long  
> lines are supposed to be split into multiple shorter lines (called  
> line folding).
>
> Something like
> $s = chunk_split( $s, 76, "\r\n ");
> though actually some less elegant manual code using substr is used  
> at the moment.
>
> However I found an issue with this approach when dealing with UTF-8  
> encoded strings: there's a chance that two-byte UTF-8 characters  
> (like umlauts: "ä") are split between the first and second byte. So
>
> first_byte_of_char newline second_byte_of_char
>
> When the result is put into an XML document, some XML parsers choke  
> on that (notably the one built into Windows), because the resulting  
> byte sequence may not be valid UTF-8.
>
> Strange, but it does happen.
>
> So the only way to do chunk splitting in a "UTF-8-safe" way seems to be  
> using Perl-compatible regular expressions with the "u" pattern modifier.
>
> Or does anybody have another suggestion?

The only other option would be to loop through the whole string char  
by char with multibyte-safe functions, which is probably much slower.
If you go the regex route, you probably want to create a  
regexReplace() or regexSplit() method in String:: to utilize the  
mbstring functions where we don't deal with UTF-8 strings.
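
As a rough illustration of the two routes (a sketch only, not existing
Horde code): foldUtf8() uses a PCRE pattern with the "u" modifier and
folds after a fixed number of characters, while foldUtf8Bytes() walks
the string character by character and folds by byte count. Both
function names are made up for this example, and the default width of
75 is an assumption taken from the iCalendar recommendation of at most
75 octets per line; adjust it to whatever the package actually uses.

```php
<?php
// Hypothetical helpers for illustration; neither exists in Horde's String:: class.

/**
 * Regex route: fold after every $width characters (not bytes).
 * The "u" modifier makes PCRE treat pattern and subject as UTF-8, so a
 * multi-byte character is never cut in half. The "s" modifier lets "."
 * match newlines as well.
 */
function foldUtf8($s, $width = 75)
{
    $pattern = '/(.{' . (int)$width . '})(?=.)/us';
    $folded = preg_replace($pattern, '$1' . "\r\n ", $s);
    // preg_replace() returns null when $s is not valid UTF-8;
    // in that case the input is returned unchanged.
    return $folded === null ? $s : $folded;
}

/**
 * Char-by-char route: keep each physical line at or below $maxBytes
 * bytes, counting the leading space of continuation lines.
 * Assumes strlen() counts bytes (no mbstring function overloading).
 */
function foldUtf8Bytes($s, $maxBytes = 75)
{
    // Splitting on the empty pattern with "u" yields one entry per character.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return $s; // invalid UTF-8, leave untouched
    }
    $out = '';
    $lineBytes = 0;
    foreach ($chars as $char) {
        $len = strlen($char);
        if ($lineBytes + $len > $maxBytes) {
            $out .= "\r\n ";
            $lineBytes = 1; // the leading space belongs to the new line
        }
        $out .= $char;
        $lineBytes += $len;
    }
    return $out;
}

// Usage: 100 two-byte characters, folded both ways.
$s = str_repeat("ä", 100);
$byChars = foldUtf8($s);
$byBytes = foldUtf8Bytes($s);
```

The regex version is shorter, but since it counts characters, a line
of multi-byte characters can end up longer than 75 bytes. The loop
version respects a byte limit at the cost of touching every character.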

Jan.

-- 
Do you need professional PHP or Horde consulting?
http://horde.org/consulting/


