[horde] switching to UTF8 database

Andrew Morgan morgan at orst.edu
Wed Feb 8 02:38:13 UTC 2012


On Tue, 7 Feb 2012, Jan Schneider wrote:

>
> Zitat von Andrew Morgan <morgan at orst.edu>:
>
>> On Tue, 7 Feb 2012, Jan Schneider wrote:
>> 
>>> 
>>> Zitat von Andrew Morgan <morgan at orst.edu>:
>>> 
>>>> Is this really a problem for latin1 to utf8?  If there are no conversions 
>>>> that result in a double-byte character, then I think it would be okay.
>>> 
>>> That has nothing to do with the charset though. If you data only contains 
>>> ascii data, it doesn't matter to which charset you convert, indeed. As 
>>> soon it latin1-specific characters, it will break.
>> 
>> Can you explain why?  How does the string length change?  Perhaps an 
>> example would help me understand.  :)
>
> latin1, windows-1251, iso-8859-1, etc. are single-byte charsets, each 
> character is a byte.
> UTF-8 is multi-byte charset with variable character lengths. ASCII characters 
> in UTF-8 are still single bytes with the same code like in US-ASCII, but any 
> non-ASCII characters like äöuáé etc take at least two bytes.

I made a test table and PHP script to play with this.  Oh god this problem 
is ugly!

I don't know if this will come out right, but here is what I found.

Using these starting values:

 	val=äöuáé
 	serval=s:9:"äöuáé";

1. Client: latin1   Table: latin1
 	stored in MySQL: s:9:"äöuáé";
 	PHP can correctly deserialize previously stored values

2. Client: latin1   Table: utf8
 	stored in MySQL: s:9:"äöuáé";
 	PHP can correctly deserialize previously stored values

3. Client: utf8     Table: utf8
 	stored in MySQL: s:9:"äöuáé";
 	PHP cannot deserialize latin1 stored values

4. Client: utf8     Table: latin1
 	stored in MySQL: s:9:"äöuáé";
 	PHP cannot deserialize latin1 stored values


So if I am understanding this right, once I change the CLIENT charset from 
latin1 to utf8, everything that is not ASCII will be wrong.

Is there any tool to scan all my tables for non-ASCII values?  Maybe I'll 
be lucky and my users will have only used ASCII....

 	Andy


More information about the horde mailing list