[dev] migration script for data with bad charset in SQL database

Vilius Šumskas vilius at lnk.lt
Fri Jul 22 13:13:50 UTC 2011


Hello,

I have upgraded Whups recently and stumbled on a charset bug which I never realized still exist in most of the applications.

When correct MySQL charset handling was implemented in Horde_Db it automatically enabled mysql <-> php charset conversion (which is good). But all old data was written into the database like latin1 (which is the default in most older MySQL installations, check that by connecting with console mysql client to your server, running "show variables like '%char%';" and looking for "character_set_connection"). This was the case even if you set UTF-8 in Horde 3.

So now, when for example I have UTF-8 as my database charset in Horde 4, I get plain UTF-8 characters appearing in Whups, Agora and probably other ex-MDB2 applications I didn't try. I've never noticed this with ex-PEAR::DB applications as I was running modified PEAR::DB version which took care of UTF-8 handling in Horde 3 the correct way. But I assume that for others with standard PEAR::DB this still could be the problem when upgrading to H4.

What needs to be done is a script, which takes all the data, dumps it at default *MySQL* charset which was used with H3. And later imported again through Horde_Db which should take care of data automatically. I can volunteer to do so. However there is couple of issues must be taken care of first.

1. Script must be dependent on default MySQL connection charset used with H3. Do we assume that Horde administrator knows what he is doing and he must input the charset manually? Or do we assume that H3 was installed on the same server the script will be run on and get it automatically from MySQL? 

2. Database/table charsets for horde tables needs to be changed to match current Horde 4 charset setting in conf.php. Do you think the following order for this is OK?
    a) script dump all the data.
   b) renames old database for backup purposes.
   c) creates current H3 *schema* from backed-up old database with correct charsets (like CREATE (....) ENGINE=.... DEFAULT CHARSET=utf8).
   d) imports data.

3. Should this script be the part of the upgrade process in every application? But giving the above I'm not sure it is really possible, no? 

As said, I imagine this bug would affect a fare number of users running MySQL with standard configuration files. For example if you had init-connect='SET NAMES utf8' in [mysql] section of your my.cnf file it would not affect you.

Thoughts?

-- 
  Best Regards,

  Vilius Šumskas
  LNK TV IT manager
  mob.: +370 614 75713
  http://www.lnk.lt




More information about the dev mailing list