[dev] [commits] Horde branch master updated. fa8d644f8ca59a43056e5eae668ff1699d36be24

Sat Dec 18 01:22:37 UTC 2010

On Dec 17, 2010, at 6:25 PM, Vilius Šumskas <vilius at lnk.lt> wrote:

> Sveiki,
> 
> Monday, December 13, 2010, 6:45:55 PM, you wrote:
> 
>> Zitat von Vilius Šumskas <vilius at lnk.lt>:
> 
>>>>>> commit fa8d644f8ca59a43056e5eae668ff1699d36be24
>>>>>> Author: Jan Schneider <jan at horde.org>
>>>>>> Date:   Tue Nov 30 15:41:59 2010 +0100
>>>>>> 
>>>>>>    Remove charset methods.
>>>>>> 
>>>>>>    They are only implemented in the MySQL drivers anyway and using SET
>>>>>>    NAMES causes more problems than it solves. Actually it breaks the
>>>>>>    current share driver.
>>>>> 
>>>>> Does this mean that it will not be implemented and we are stuck with
>>>>> ISO-8859-1 connection charsets forever? Horde is the only
>>>>> professional application which doesn't use charsets correctly on DB
>>>>> level.
>>>> 
>>>> DB charset handling has always worked fine in Horde. I removed this
>>> 
>>> True if you don't manage Horde's databases with external tools or if  
>>> you don't integrate it with other applications.
>>> 
>>> As far as I remember there were dozen of questions in the mailling  
>>> list "why phpmyadmin displays my Horde database with wrong  
>>> characters".
> 
>> I have never experienced any issues with several generations of  
>> phpMyAdmin, whether the data was UTF-8 or Latin1, whether it was data
>> from H3 or H4.
> 
> I'm  looking  at  my  "utf8"  data  from H3 right now with phpMyAdmin
> 2.11.11  and it shows plain Unicode characters instead of normal text.
> Maybe you were using autorecode feature in phpMyAdmin?
> 
>>>> In the end, after we removed the SET NAMES call, everything started to
>>>> work properly again, with old data, with new data, with UTF-8 data,
>>>> with Latin1 data.
>>> 
>>> By 'old' and 'new' you mean, you've changed db charset in Horde config?
> 
>> No, this means with older existing data from Horde 3 as well as new  
>> data created with Horde 4.
> 
> I  have  read  the  irc log and I'm missing some critical information
> like server's configuration and such. But I remember that H3 had some
> missing  convertCharset  methods  in  Share  code.  I saw it couple of
> times. So if you added them later this is expected behaviour.
> 
> Also   you  cannot  just *change* character set with SET NAMES without
> converting   the   data   itself  and  changing  database/table/column
> charsets.  Note that I've wrote "change" here. That's because all versions
> of MySQL above 4.1 will use any default character set set in configuration
> anyway. Usually it's latin1 (which is actually Windows-1252 in MySQL
> http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html).
> It  doesn't  matter if you use SET NAMES or not the conversion will be
> done.
> 
> All  the  problems and confusion you are seeing is because old data was
> incorrectly written into database. H3 data needs a migration for H4.
> 
> <../>
> 
>>> mysql> select length(tekstas) from test;
>>> +-----------------+
>>> | length(tekstas) |
>>> +-----------------+
>>> |              68 |
>>> |              63 |
>>> +-----------------+
>>> 2 rows in set (0.00 sec)
>>> 
>>> You can imagine how numbers differ if you have russian or other non  
>>> latin text.
> 
>> That's really a lame reason. Especially since no one forces you to use
>> a multi-byte charset for the backend storage. If shortness of strings
>> is more important for you than interoperability, you are still free to
>> choose a latin charset or any other legacy charset instead of UTF-8.
> 
> As noted above this is not possible with  MySQL 4.1 and later anymore.
> Because  MySQL  does  the translation in any case, but without correct
> SET  NAMES call it usually sets connection to latin1.
> 
> Michael    correctly    found    that   is   will   convert   FROM
> character_set_client  TO  character_set_connection.  But  also  manual
> mentions  that  conversion  is  done  FROM character_set_connection TO
> character_set_database   (or  table,  or column).
> 
> So real conversion works this way for the input:
> 
> character_set_client  ->  character_set_connection -> character_set_database
> 
> And this way for the output (result):
> 
> character_set_database -> character_set_connection -> character_set_results
> 
> Let's take Lithuanian character set as an example. I have
> created  database  "test"  with  table "test" and column in cp1257
> charset. Also have set my PuTTY for 'Window-1257 Baltic'.
> 
> mysql> show variables like 'char%';
> +--------------------------+----------------------------+
> | Variable_name            | Value                      |
> +--------------------------+----------------------------+
> | character_set_client     | latin1                     |
> | character_set_connection | latin1                     |
> | character_set_database   | cp1257                     |
> | character_set_filesystem | binary                     |
> | character_set_results    | latin1                     |
> | character_set_server     | utf8                       |
> | character_set_system     | utf8                       |
> | character_sets_dir       | /usr/share/mysql/charsets/ |
> +--------------------------+----------------------------+
> 8 rows in set (0.00 sec)
> 
> mysql> insert into test set tekstas='Premjera bus gruodžio 3 dieną, 19 val. "Forum Cinemas Vingis"';
> Query OK, 1 row affected (0.00 sec)
> 
> mysql> select * from test;
> +---------------------------------------------------------------+
> | tekstas                                                       |
> +---------------------------------------------------------------+
> | Premjera bus gruod?io 3 dien?, 19 val. "Forum Cinemas Vingis" |
> +---------------------------------------------------------------+
> 1 row in set (0.00 sec)
> 
> That's because PuTTY sends everything in Windows-1257 but mysql client thinks
> that the text is really latin1.
> 
> Now let's tell mysql client that text really comes in Windows-1257.
> 
> mysql> SET character_set_client = 'cp1257';
> Query OK, 0 rows affected (0.00 sec)
> mysql> SET character_set_results = 'cp1257';
> Query OK, 0 rows affected (0.00 sec)
> 
> mysql> show variables like 'char%';
> +--------------------------+----------------------------+
> | Variable_name            | Value                      |
> +--------------------------+----------------------------+
> | character_set_client     | cp1257                     |
> | character_set_connection | latin1                     |
> | character_set_database   | cp1257                     |
> | character_set_filesystem | binary                     |
> | character_set_results    | cp1257                     |
> | character_set_server     | utf8                       |
> | character_set_system     | utf8                       |
> | character_sets_dir       | /usr/share/mysql/charsets/ |
> +--------------------------+----------------------------+
> 8 rows in set (0.00 sec)
> 
> mysql> insert into test set tekstas='Premjera bus gruodžio 3 dieną, 19 val. "Forum Cinemas Vingis"';
> Query OK, 1 row affected (0.00 sec)
> 
> mysql> select * from test;
> +---------------------------------------------------------------+
> | tekstas                                                       |
> +---------------------------------------------------------------+
> | Premjera bus gruodžio 3 dien?, 19 val. "Forum Cinemas Vingis" |
> +---------------------------------------------------------------+
> 1 row in set (0.00 sec)
> 
> You  can  see that now the conversion is done CORRECTLY, but only ž is
> converted.  That's because code point 0x017E (ž) exist in Windows-1257 and
> latin1 (Windows-1252) but code point 0x0105 (ą) does not.
> 
> *Conslusion:  you  have  to  set  mysql  connection (and use SET NAMES)
> if you are using characters that are not in Windows-1252.*
> 
> You  could  ask  why  conversion    of   UTF-8 to latin1 works then?
> Well, that's because every UTF-8 byte taken separately is in latin1 table :)
> 
>>> 2. It confuses and irritates administrators as they spent hours
>>> debuging why the same text is shown ok in one application but badly  
>>> in the second. And I must remind you that administrators are these  
>>> people that bring Horde to users... At least most of the time.
>>> 
>>> 3. Sorting of data is inefficient and in bad order. For example  
>>> Turba contacts with special characters in their surnames.
> 
>> Sorting is done by collations, not by charsets.
> 
>> Jan.
> 
> Yes,  but  Horde_Db doesn't set it either.
> 
> One  more thing to note when testing. ALTER  TABLE  statement will try
> to  convert  already  inserted  data.  So  if  you want old data first
> convert it to binary, and only then to needed charset.
> 
> Hope this helps some :)

Some other useful food for thought...

http://www.blueboxgrp.com/news/2009/07/mysql_encoding

...I think I'm actually starting to understand this.

--
Mike