[dev] [commits] Horde branch master updated. fa8d644f8ca59a43056e5eae668ff1699d36be24
Vilius Šumskas
vilius at lnk.lt
Fri Dec 17 23:25:52 UTC 2010
Sveiki,
Monday, December 13, 2010, 6:45:55 PM, you wrote:
> Zitat von Vilius Šumskas <vilius at lnk.lt>:
>>> >> commit fa8d644f8ca59a43056e5eae668ff1699d36be24
>>> >> Author: Jan Schneider <jan at horde.org>
>>> >> Date: Tue Nov 30 15:41:59 2010 +0100
>>> >>
>>> >> Remove charset methods.
>>> >>
>>> >> They are only implemented in the MySQL drivers anyway and using SET
>>> >> NAMES causes more problems than it solves. Actually it breaks the
>>> >> current share driver.
>>> >
>>> > Does this mean that it will not be implemented and we are stuck with
>>> > ISO-8859-1 connection charsets forever? Horde is the only
>>> > professional application which doesn't use charsets correctly on DB
>>> > level.
>>>
>>> DB charset handling has always worked fine in Horde. I removed this
>>
>> True if you don't manage Horde's databases with external tools or if
>> you don't integrate it with other applications.
>>
>> As far as I remember there were dozen of questions in the mailling
>> list "why phpmyadmin displays my Horde database with wrong
>> characters".
> I have never experienced any issues with several generations of
> phpMyAdmin, whether the data was UTF-8 or Latin1, whether it was data
> from H3 or H4.
I'm looking at my "utf8" data from H3 right now with phpMyAdmin
2.11.11 and it shows plain Unicode characters instead of normal text.
Maybe you were using autorecode feature in phpMyAdmin?
>>> In the end, after we removed the SET NAMES call, everything started to
>>> work properly again, with old data, with new data, with UTF-8 data,
>>> with Latin1 data.
>>
>> By 'old' and 'new' you mean, you've changed db charset in Horde config?
> No, this means with older existing data from Horde 3 as well as new
> data created with Horde 4.
I have read the irc log and I'm missing some critical information
like server's configuration and such. But I remember that H3 had some
missing convertCharset methods in Share code. I saw it couple of
times. So if you added them later this is expected behaviour.
Also you cannot just *change* character set with SET NAMES without
converting the data itself and changing database/table/column
charsets. Note that I've wrote "change" here. That's because all versions
of MySQL above 4.1 will use any default character set set in configuration
anyway. Usually it's latin1 (which is actually Windows-1252 in MySQL
http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html).
It doesn't matter if you use SET NAMES or not the conversion will be
done.
All the problems and confusion you are seeing is because old data was
incorrectly written into database. H3 data needs a migration for H4.
<../>
>> mysql> select length(tekstas) from test;
>> +-----------------+
>> | length(tekstas) |
>> +-----------------+
>> | 68 |
>> | 63 |
>> +-----------------+
>> 2 rows in set (0.00 sec)
>>
>> You can imagine how numbers differ if you have russian or other non
>> latin text.
> That's really a lame reason. Especially since no one forces you to use
> a multi-byte charset for the backend storage. If shortness of strings
> is more important for you than interoperability, you are still free to
> choose a latin charset or any other legacy charset instead of UTF-8.
As noted above this is not possible with MySQL 4.1 and later anymore.
Because MySQL does the translation in any case, but without correct
SET NAMES call it usually sets connection to latin1.
Michael correctly found that is will convert FROM
character_set_client TO character_set_connection. But also manual
mentions that conversion is done FROM character_set_connection TO
character_set_database (or table, or column).
So real conversion works this way for the input:
character_set_client -> character_set_connection -> character_set_database
And this way for the output (result):
character_set_database -> character_set_connection -> character_set_results
Let's take Lithuanian character set as an example. I have
created database "test" with table "test" and column in cp1257
charset. Also have set my PuTTY for 'Window-1257 Baltic'.
mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | cp1257 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
mysql> insert into test set tekstas='Premjera bus gruodžio 3 dieną, 19 val. "Forum Cinemas Vingis"';
Query OK, 1 row affected (0.00 sec)
mysql> select * from test;
+---------------------------------------------------------------+
| tekstas |
+---------------------------------------------------------------+
| Premjera bus gruod?io 3 dien?, 19 val. "Forum Cinemas Vingis" |
+---------------------------------------------------------------+
1 row in set (0.00 sec)
That's because PuTTY sends everything in Windows-1257 but mysql client thinks
that the text is really latin1.
Now let's tell mysql client that text really comes in Windows-1257.
mysql> SET character_set_client = 'cp1257';
Query OK, 0 rows affected (0.00 sec)
mysql> SET character_set_results = 'cp1257';
Query OK, 0 rows affected (0.00 sec)
mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | cp1257 |
| character_set_connection | latin1 |
| character_set_database | cp1257 |
| character_set_filesystem | binary |
| character_set_results | cp1257 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
mysql> insert into test set tekstas='Premjera bus gruodžio 3 dieną, 19 val. "Forum Cinemas Vingis"';
Query OK, 1 row affected (0.00 sec)
mysql> select * from test;
+---------------------------------------------------------------+
| tekstas |
+---------------------------------------------------------------+
| Premjera bus gruodžio 3 dien?, 19 val. "Forum Cinemas Vingis" |
+---------------------------------------------------------------+
1 row in set (0.00 sec)
You can see that now the conversion is done CORRECTLY, but only ž is
converted. That's because code point 0x017E (ž) exist in Windows-1257 and
latin1 (Windows-1252) but code point 0x0105 (ą) does not.
*Conslusion: you have to set mysql connection (and use SET NAMES)
if you are using characters that are not in Windows-1252.*
You could ask why conversion of UTF-8 to latin1 works then?
Well, that's because every UTF-8 byte taken separately is in latin1 table :)
>> 2. It confuses and irritates administrators as they spent hours
>> debuging why the same text is shown ok in one application but badly
>> in the second. And I must remind you that administrators are these
>> people that bring Horde to users... At least most of the time.
>>
>> 3. Sorting of data is inefficient and in bad order. For example
>> Turba contacts with special characters in their surnames.
> Sorting is done by collations, not by charsets.
> Jan.
Yes, but Horde_Db doesn't set it either.
One more thing to note when testing. ALTER TABLE statement will try
to convert already inserted data. So if you want old data first
convert it to binary, and only then to needed charset.
Hope this helps some :)
--
Best regards,
Vilius
More information about the dev
mailing list