[dev] migration script for data with bad charset in SQL database
Vilius Šumskas
vilius at lnk.lt
Fri Sep 9 19:14:58 UTC 2011
Sveiki,
Thursday, September 8, 2011, 3:43:27 PM, you wrote:
>> Hello,
>>
>> I have upgraded Whups recently and stumbled on a charset bug which I
>> never realized still exist in most of the applications.
>>
>> When correct MySQL charset handling was implemented in Horde_Db it
>> automatically enabled mysql <-> php charset conversion (which is
>> good). But all old data was written into the database like latin1
>> (which is the default in most older MySQL installations, check that
>> by connecting with console mysql client to your server, running
>> "show variables like '%char%';" and looking for
>> "character_set_connection"). This was the case even if you set UTF-8
>> in Horde 3.
>>
>> So now, when for example I have UTF-8 as my database charset in
>> Horde 4, I get plain UTF-8 characters appearing in Whups, Agora and
>> probably other ex-MDB2 applications I didn't try. I've never noticed
>> this with ex-PEAR::DB applications as I was running modified
>> PEAR::DB version which took care of UTF-8 handling in Horde 3 the
>> correct way. But I assume that for others with standard PEAR::DB
>> this still could be the problem when upgrading to H4.
>>
>> What needs to be done is a script, which takes all the data, dumps
>> it at default *MySQL* charset which was used with H3. And later
>> imported again through Horde_Db which should take care of data
>> automatically. I can volunteer to do so. However there is couple of
>> issues must be taken care of first.
>>
>> 1. Script must be dependent on default MySQL connection charset used
>> with H3. Do we assume that Horde administrator knows what he is
>> doing and he must input the charset manually? Or do we assume that
>> H3 was installed on the same server the script will be run on and
>> get it automatically from MySQL?
> I'd say let's make this like in the existing convert-to-utf8 scripts
> and prompt for both the input and output charset, but default to
> iso-8859-1 for input and utf-8 for output.
>> 2. Database/table charsets for horde tables needs to be changed to
>> match current Horde 4 charset setting in conf.php. Do you think the
> Only if the target charset is indeed changed. Or does this problem
> only show up when switching to a different charset? In that case, the
> existing convert-to-utf8 script would do, no?
I'll rephrase; database/table charsets needs to be changed if
they doesn't match the setting in conf.php of Horde 4.
General rule for PHP/MySQL client application <-> MySQL Server
communication is that _decription_ charset in the database should match
connection charset. Otherwise internal MySQL conversion could run into
problems when converting to the charset with less characters or
wise versa.
By default, when creating a table or a database the default server
charset is used. Most MySQL installations default to latin1 AFAIK.
As we were not using DEFAULT CHARSET=something in Horde 3
/scripts/, I suspect that there will be enough H3 installations
with anything but charset specified in conf.php.
>> following order for this is OK?
>> a) script dump all the data.
>> b) renames old database for backup purposes.
>> c) creates current H3 *schema* from backed-up old database with
>> correct charsets (like CREATE (....) ENGINE=.... DEFAULT
>> CHARSET=utf8).
> That's not supported by Horde_Db's migrations.
>> d) imports data.
>>
>> 3. Should this script be the part of the upgrade process in every
>> application? But giving the above I'm not sure it is really
>> possible, no?
> I'm still don't understand if this is necessary in any case, or only
> when switching to utf8. If the latter, this should be separated
> conversion script so that the conversion could be done at any time.
*Data* conversion (which is not the same as schema charset update
above) is necessary if MySQL for H3 wasn't configured
with init-connect='SET NAMES correctcharset' or similar. Because Horde 3
didn't set the charset itself. Basically this means that conversion is
necessary with anything other than latin1 in conf.php of Horde 3.
Thinking about this more I also now think that we should provide
a separate convert_to_yourcharset script with separate
instructions on how to check data consistency and convert correctly.
The script needs to be mentioned in INSTALL/UPDATE docs.
This also needs to be mentioned on Configuration -> Database charset
page even for new installs. There should be a clear warning that if
administrator changes the setting, he also must take care about the data
manually. For example if he decide to switch from latin1 to utf8
after couple months of using brand new Horde 4 install.
>> As said, I imagine this bug would affect a fare number of users
>> running MySQL with standard configuration files. For example if you
>> had init-connect='SET NAMES utf8' in [mysql] section of your my.cnf
>> file it would not affect you.
> I've only seen one such issue locally with Wicked recently, but this
> could as well have been some botched conversion while playing with
> migrations and charset conversions. I could probably test this again
> though.
I'm still scraching my head why we don't see any users with this
problem on the mailing lists. Maybe most of them are running Horde
with latin1 setting?
At least here I have upgraded three installations which had this
problem. But all of them were somewhat custom my modified versions so
I cannot test or say anything being 100% sure.
Anyway, I will get to this after I finish Agora conversion which I'm
currently working on.
--
Best regards,
Vilius
More information about the dev
mailing list