[dev] migration script for data with bad charset in SQL database

Vilius Šumskas vilius at lnk.lt
Fri Sep 9 19:14:58 UTC 2011


Sveiki,

Thursday, September 8, 2011, 3:43:27 PM, you wrote:

>> Hello,
>>
>> I have upgraded Whups recently and stumbled on a charset bug which I  
>> never realized still exist in most of the applications.
>>
>> When correct MySQL charset handling was implemented in Horde_Db it  
>> automatically enabled mysql <-> php charset conversion (which is  
>> good). But all old data was written into the database like latin1  
>> (which is the default in most older MySQL installations, check that  
>> by connecting with console mysql client to your server, running  
>> "show variables like '%char%';" and looking for  
>> "character_set_connection"). This was the case even if you set UTF-8  
>> in Horde 3.
>>
>> So now, when for example I have UTF-8 as my database charset in  
>> Horde 4, I get plain UTF-8 characters appearing in Whups, Agora and  
>> probably other ex-MDB2 applications I didn't try. I've never noticed  
>> this with ex-PEAR::DB applications as I was running modified  
>> PEAR::DB version which took care of UTF-8 handling in Horde 3 the  
>> correct way. But I assume that for others with standard PEAR::DB  
>> this still could be the problem when upgrading to H4.
>>
>> What needs to be done is a script, which takes all the data, dumps  
>> it at default *MySQL* charset which was used with H3. And later  
>> imported again through Horde_Db which should take care of data  
>> automatically. I can volunteer to do so. However there is couple of  
>> issues must be taken care of first.
>>
>> 1. Script must be dependent on default MySQL connection charset used  
>> with H3. Do we assume that Horde administrator knows what he is  
>> doing and he must input the charset manually? Or do we assume that  
>> H3 was installed on the same server the script will be run on and  
>> get it automatically from MySQL?

> I'd say let's make this like in the existing convert-to-utf8 scripts  
> and prompt for both the input and output charset, but default to  
> iso-8859-1 for input and utf-8 for output.

>> 2. Database/table charsets for horde tables needs to be changed to  
>> match current Horde 4 charset setting in conf.php. Do you think the

> Only if the target charset is indeed changed. Or does this problem  
> only show up when switching to a different charset? In that case, the
> existing convert-to-utf8 script would do, no?

I'll   rephrase;   database/table   charsets  needs  to  be changed if
they doesn't match the setting in conf.php of Horde 4.

General  rule  for  PHP/MySQL  client  application  <->  MySQL  Server
communication  is that _decription_ charset in the database should match
connection charset. Otherwise internal MySQL conversion could run into
problems   when   converting  to  the  charset with less characters or
wise versa.

By  default,  when  creating  a  table  or a database the default server
charset is used. Most MySQL installations default to latin1 AFAIK.
As    we   were   not  using  DEFAULT  CHARSET=something  in  Horde  3
/scripts/,  I  suspect that there  will  be  enough  H3  installations
with  anything but charset specified in conf.php.

>> following order for this is OK?
>>     a) script dump all the data.
>>    b) renames old database for backup purposes.
>>    c) creates current H3 *schema* from backed-up old database with  
>> correct charsets (like CREATE (....) ENGINE=.... DEFAULT  
>> CHARSET=utf8).

> That's not supported by Horde_Db's migrations.

>>    d) imports data.
>>
>> 3. Should this script be the part of the upgrade process in every  
>> application? But giving the above I'm not sure it is really  
>> possible, no?

> I'm still don't understand if this is necessary in any case, or only  
> when switching to utf8. If the latter, this should be separated  
> conversion script so that the conversion could be done at any time.

*Data*  conversion  (which  is  not  the same as schema charset update
above)   is   necessary    if   MySQL   for   H3   wasn't  configured
with init-connect='SET  NAMES  correctcharset'  or similar. Because Horde 3
didn't  set  the charset  itself.  Basically this means that conversion is
necessary with anything other than latin1 in conf.php of Horde 3.

Thinking  about  this  more  I  also  now think that we should provide
a    separate      convert_to_yourcharset    script    with   separate
instructions on how to check data consistency and convert correctly.

The script needs to be mentioned in INSTALL/UPDATE docs.

This also needs  to be mentioned on Configuration -> Database charset
page  even  for  new installs. There should be a clear warning that if
administrator  changes  the  setting, he also must take care about the data
manually.   For   example   if he decide to switch from latin1 to utf8
after  couple months of using brand new Horde 4 install.

>> As said, I imagine this bug would affect a fare number of users
>> running MySQL with standard configuration files. For example if you  
>> had init-connect='SET NAMES utf8' in [mysql] section of your my.cnf  
>> file it would not affect you.

> I've only seen one such issue locally with Wicked recently, but this  
> could as well have been some botched conversion while playing with  
> migrations and charset conversions. I could probably test this again  
> though.

I'm  still  scraching  my  head  why  we don't see any users with this
problem  on  the  mailing  lists. Maybe most of them are running Horde
with latin1 setting?

At  least  here  I  have  upgraded  three installations which had this
problem.  But all of them were somewhat custom my modified versions so
I cannot test or say anything being 100% sure.

Anyway,  I  will get to this after I finish Agora conversion which I'm
currently working on.

-- 
Best regards,
 Vilius



More information about the dev mailing list