[dev] [commits] Horde branch master updated. fa8d644f8ca59a43056e5eae668ff1699d36be24
Michael J Rubinsky
mrubinsk at horde.org
Sat Dec 18 01:22:37 UTC 2010
On Dec 17, 2010, at 6:25 PM, Vilius Šumskas <vilius at lnk.lt> wrote:
> Sveiki,
>
> Monday, December 13, 2010, 6:45:55 PM, you wrote:
>
>> Zitat von Vilius Šumskas <vilius at lnk.lt>:
>
>>>>>> commit fa8d644f8ca59a43056e5eae668ff1699d36be24
>>>>>> Author: Jan Schneider <jan at horde.org>
>>>>>> Date: Tue Nov 30 15:41:59 2010 +0100
>>>>>>
>>>>>> Remove charset methods.
>>>>>>
>>>>>> They are only implemented in the MySQL drivers anyway and using SET
>>>>>> NAMES causes more problems than it solves. Actually it breaks the
>>>>>> current share driver.
>>>>>
>>>>> Does this mean that it will not be implemented and we are stuck with
>>>>> ISO-8859-1 connection charsets forever? Horde is the only
>>>>> professional application which doesn't use charsets correctly on DB
>>>>> level.
>>>>
>>>> DB charset handling has always worked fine in Horde. I removed this
>>>
>>> True if you don't manage Horde's databases with external tools or if
>>> you don't integrate it with other applications.
>>>
>>> As far as I remember there were dozen of questions in the mailling
>>> list "why phpmyadmin displays my Horde database with wrong
>>> characters".
>
>> I have never experienced any issues with several generations of
>> phpMyAdmin, whether the data was UTF-8 or Latin1, whether it was data
>> from H3 or H4.
>
> I'm looking at my "utf8" data from H3 right now with phpMyAdmin
> 2.11.11 and it shows plain Unicode characters instead of normal text.
> Maybe you were using autorecode feature in phpMyAdmin?
>
>>>> In the end, after we removed the SET NAMES call, everything started to
>>>> work properly again, with old data, with new data, with UTF-8 data,
>>>> with Latin1 data.
>>>
>>> By 'old' and 'new' you mean, you've changed db charset in Horde config?
>
>> No, this means with older existing data from Horde 3 as well as new
>> data created with Horde 4.
>
> I have read the irc log and I'm missing some critical information
> like server's configuration and such. But I remember that H3 had some
> missing convertCharset methods in Share code. I saw it couple of
> times. So if you added them later this is expected behaviour.
>
> Also you cannot just *change* character set with SET NAMES without
> converting the data itself and changing database/table/column
> charsets. Note that I've wrote "change" here. That's because all versions
> of MySQL above 4.1 will use any default character set set in configuration
> anyway. Usually it's latin1 (which is actually Windows-1252 in MySQL
> http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html).
> It doesn't matter if you use SET NAMES or not the conversion will be
> done.
>
> All the problems and confusion you are seeing is because old data was
> incorrectly written into database. H3 data needs a migration for H4.
>
> <../>
>
>>> mysql> select length(tekstas) from test;
>>> +-----------------+
>>> | length(tekstas) |
>>> +-----------------+
>>> | 68 |
>>> | 63 |
>>> +-----------------+
>>> 2 rows in set (0.00 sec)
>>>
>>> You can imagine how numbers differ if you have russian or other non
>>> latin text.
>
>> That's really a lame reason. Especially since no one forces you to use
>> a multi-byte charset for the backend storage. If shortness of strings
>> is more important for you than interoperability, you are still free to
>> choose a latin charset or any other legacy charset instead of UTF-8.
>
> As noted above this is not possible with MySQL 4.1 and later anymore.
> Because MySQL does the translation in any case, but without correct
> SET NAMES call it usually sets connection to latin1.
>
> Michael correctly found that is will convert FROM
> character_set_client TO character_set_connection. But also manual
> mentions that conversion is done FROM character_set_connection TO
> character_set_database (or table, or column).
>
> So real conversion works this way for the input:
>
> character_set_client -> character_set_connection -> character_set_database
>
> And this way for the output (result):
>
> character_set_database -> character_set_connection -> character_set_results
>
> Let's take Lithuanian character set as an example. I have
> created database "test" with table "test" and column in cp1257
> charset. Also have set my PuTTY for 'Window-1257 Baltic'.
>
> mysql> show variables like 'char%';
> +--------------------------+----------------------------+
> | Variable_name | Value |
> +--------------------------+----------------------------+
> | character_set_client | latin1 |
> | character_set_connection | latin1 |
> | character_set_database | cp1257 |
> | character_set_filesystem | binary |
> | character_set_results | latin1 |
> | character_set_server | utf8 |
> | character_set_system | utf8 |
> | character_sets_dir | /usr/share/mysql/charsets/ |
> +--------------------------+----------------------------+
> 8 rows in set (0.00 sec)
>
> mysql> insert into test set tekstas='Premjera bus gruodžio 3 dieną, 19 val. "Forum Cinemas Vingis"';
> Query OK, 1 row affected (0.00 sec)
>
> mysql> select * from test;
> +---------------------------------------------------------------+
> | tekstas |
> +---------------------------------------------------------------+
> | Premjera bus gruod?io 3 dien?, 19 val. "Forum Cinemas Vingis" |
> +---------------------------------------------------------------+
> 1 row in set (0.00 sec)
>
> That's because PuTTY sends everything in Windows-1257 but mysql client thinks
> that the text is really latin1.
>
> Now let's tell mysql client that text really comes in Windows-1257.
>
> mysql> SET character_set_client = 'cp1257';
> Query OK, 0 rows affected (0.00 sec)
> mysql> SET character_set_results = 'cp1257';
> Query OK, 0 rows affected (0.00 sec)
>
> mysql> show variables like 'char%';
> +--------------------------+----------------------------+
> | Variable_name | Value |
> +--------------------------+----------------------------+
> | character_set_client | cp1257 |
> | character_set_connection | latin1 |
> | character_set_database | cp1257 |
> | character_set_filesystem | binary |
> | character_set_results | cp1257 |
> | character_set_server | utf8 |
> | character_set_system | utf8 |
> | character_sets_dir | /usr/share/mysql/charsets/ |
> +--------------------------+----------------------------+
> 8 rows in set (0.00 sec)
>
> mysql> insert into test set tekstas='Premjera bus gruodžio 3 dieną, 19 val. "Forum Cinemas Vingis"';
> Query OK, 1 row affected (0.00 sec)
>
> mysql> select * from test;
> +---------------------------------------------------------------+
> | tekstas |
> +---------------------------------------------------------------+
> | Premjera bus gruodžio 3 dien?, 19 val. "Forum Cinemas Vingis" |
> +---------------------------------------------------------------+
> 1 row in set (0.00 sec)
>
> You can see that now the conversion is done CORRECTLY, but only ž is
> converted. That's because code point 0x017E (ž) exist in Windows-1257 and
> latin1 (Windows-1252) but code point 0x0105 (ą) does not.
>
> *Conslusion: you have to set mysql connection (and use SET NAMES)
> if you are using characters that are not in Windows-1252.*
>
> You could ask why conversion of UTF-8 to latin1 works then?
> Well, that's because every UTF-8 byte taken separately is in latin1 table :)
>
>>> 2. It confuses and irritates administrators as they spent hours
>>> debuging why the same text is shown ok in one application but badly
>>> in the second. And I must remind you that administrators are these
>>> people that bring Horde to users... At least most of the time.
>>>
>>> 3. Sorting of data is inefficient and in bad order. For example
>>> Turba contacts with special characters in their surnames.
>
>> Sorting is done by collations, not by charsets.
>
>> Jan.
>
> Yes, but Horde_Db doesn't set it either.
>
> One more thing to note when testing. ALTER TABLE statement will try
> to convert already inserted data. So if you want old data first
> convert it to binary, and only then to needed charset.
>
> Hope this helps some :)
Some other useful food for thought...
http://www.blueboxgrp.com/news/2009/07/mysql_encoding
...I think I'm actually starting to understand this.
--
Mike
More information about the dev
mailing list