[dev] [commits] Horde branch master updated. fa8d644f8ca59a43056e5eae668ff1699d36be24

Fri Dec 17 23:25:52 UTC 2010

Sveiki,

Monday, December 13, 2010, 6:45:55 PM, you wrote:

> Zitat von Vilius Šumskas <vilius at lnk.lt>:

>>> >> commit fa8d644f8ca59a43056e5eae668ff1699d36be24
>>> >> Author: Jan Schneider <jan at horde.org>
>>> >> Date:   Tue Nov 30 15:41:59 2010 +0100
>>> >>
>>> >>     Remove charset methods.
>>> >>
>>> >>     They are only implemented in the MySQL drivers anyway and using SET
>>> >>     NAMES causes more problems than it solves. Actually it breaks the
>>> >>     current share driver.
>>> >
>>> > Does this mean that it will not be implemented and we are stuck with
>>> > ISO-8859-1 connection charsets forever? Horde is the only
>>> > professional application which doesn't use charsets correctly on DB
>>> > level.
>>>
>>> DB charset handling has always worked fine in Horde. I removed this
>>
>> True if you don't manage Horde's databases with external tools or if  
>> you don't integrate it with other applications.
>>
>> As far as I remember there were dozen of questions in the mailling  
>> list "why phpmyadmin displays my Horde database with wrong  
>> characters".

> I have never experienced any issues with several generations of  
> phpMyAdmin, whether the data was UTF-8 or Latin1, whether it was data
> from H3 or H4.

I'm  looking  at  my  "utf8"  data  from H3 right now with phpMyAdmin
2.11.11  and it shows plain Unicode characters instead of normal text.
Maybe you were using autorecode feature in phpMyAdmin?

>>> In the end, after we removed the SET NAMES call, everything started to
>>> work properly again, with old data, with new data, with UTF-8 data,
>>> with Latin1 data.
>>
>> By 'old' and 'new' you mean, you've changed db charset in Horde config?

> No, this means with older existing data from Horde 3 as well as new  
> data created with Horde 4.

I  have  read  the  irc log and I'm missing some critical information
like server's configuration and such. But I remember that H3 had some
missing  convertCharset  methods  in  Share  code.  I saw it couple of
times. So if you added them later this is expected behaviour.

Also   you  cannot  just *change* character set with SET NAMES without
converting   the   data   itself  and  changing  database/table/column
charsets.  Note that I've wrote "change" here. That's because all versions
of MySQL above 4.1 will use any default character set set in configuration
anyway. Usually it's latin1 (which is actually Windows-1252 in MySQL
http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html).
It  doesn't  matter if you use SET NAMES or not the conversion will be
done.

All  the  problems and confusion you are seeing is because old data was
incorrectly written into database. H3 data needs a migration for H4.

<../>

>> mysql> select length(tekstas) from test;
>> +-----------------+
>> | length(tekstas) |
>> +-----------------+
>> |              68 |
>> |              63 |
>> +-----------------+
>> 2 rows in set (0.00 sec)
>>
>> You can imagine how numbers differ if you have russian or other non  
>> latin text.

> That's really a lame reason. Especially since no one forces you to use
> a multi-byte charset for the backend storage. If shortness of strings
> is more important for you than interoperability, you are still free to
> choose a latin charset or any other legacy charset instead of UTF-8.

As noted above this is not possible with  MySQL 4.1 and later anymore.
Because  MySQL  does  the translation in any case, but without correct
SET  NAMES call it usually sets connection to latin1.

Michael    correctly    found    that   is   will   convert   FROM
character_set_client  TO  character_set_connection.  But  also  manual
mentions  that  conversion  is  done  FROM character_set_connection TO
character_set_database   (or  table,  or column).

So real conversion works this way for the input:

character_set_client  ->  character_set_connection -> character_set_database

And this way for the output (result):

character_set_database -> character_set_connection -> character_set_results

Let's take Lithuanian character set as an example. I have
created  database  "test"  with  table "test" and column in cp1257
charset. Also have set my PuTTY for 'Window-1257 Baltic'.

mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | cp1257                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

mysql> insert into test set tekstas='Premjera bus gruodžio 3 dieną, 19 val. "Forum Cinemas Vingis"';
Query OK, 1 row affected (0.00 sec)

mysql> select * from test;
+---------------------------------------------------------------+
| tekstas                                                       |
+---------------------------------------------------------------+
| Premjera bus gruod?io 3 dien?, 19 val. "Forum Cinemas Vingis" |
+---------------------------------------------------------------+
1 row in set (0.00 sec)

That's because PuTTY sends everything in Windows-1257 but mysql client thinks
that the text is really latin1.

Now let's tell mysql client that text really comes in Windows-1257.

mysql> SET character_set_client = 'cp1257';
Query OK, 0 rows affected (0.00 sec)
mysql> SET character_set_results = 'cp1257';
Query OK, 0 rows affected (0.00 sec)

mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | cp1257                     |
| character_set_connection | latin1                     |
| character_set_database   | cp1257                     |
| character_set_filesystem | binary                     |
| character_set_results    | cp1257                     |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

mysql> insert into test set tekstas='Premjera bus gruodžio 3 dieną, 19 val. "Forum Cinemas Vingis"';
Query OK, 1 row affected (0.00 sec)

mysql> select * from test;
+---------------------------------------------------------------+
| tekstas                                                       |
+---------------------------------------------------------------+
| Premjera bus gruodžio 3 dien?, 19 val. "Forum Cinemas Vingis" |
+---------------------------------------------------------------+
1 row in set (0.00 sec)

You  can  see that now the conversion is done CORRECTLY, but only ž is
converted.  That's because code point 0x017E (ž) exist in Windows-1257 and
latin1 (Windows-1252) but code point 0x0105 (ą) does not.

*Conslusion:  you  have  to  set  mysql  connection (and use SET NAMES)
if you are using characters that are not in Windows-1252.*

You  could  ask  why  conversion    of   UTF-8 to latin1 works then?
Well, that's because every UTF-8 byte taken separately is in latin1 table :)

>> 2. It confuses and irritates administrators as they spent hours
>> debuging why the same text is shown ok in one application but badly  
>> in the second. And I must remind you that administrators are these  
>> people that bring Horde to users... At least most of the time.
>>
>> 3. Sorting of data is inefficient and in bad order. For example  
>> Turba contacts with special characters in their surnames.

> Sorting is done by collations, not by charsets.

> Jan.

Yes,  but  Horde_Db doesn't set it either.

One  more thing to note when testing. ALTER  TABLE  statement will try
to  convert  already  inserted  data.  So  if  you want old data first
convert it to binary, and only then to needed charset.

Hope this helps some :)

-- 
Best regards,
 Vilius