[Tickets #1621] non-ASCII 7-bit message headers not RFC2047-encoded

bugs at bugs.horde.org bugs at bugs.horde.org
Wed Mar 30 10:53:28 PST 2005


DO NOT REPLY TO THIS MESSAGE. THIS EMAIL ADDRESS IS NOT MONITORED.

Ticket URL: http://bugs.horde.org/ticket/?id=1621
-----------------------------------------------------------------------
 Ticket             | 1621
 Updated By         | Michael Slusarz <slusarz at mail.curecanti.org>
 Summary            | non-ASCII 7-bit message headers not RFC2047-encoded
 Queue              | IMP
 Version            | HEAD
 State              | Feedback
 Priority           | 2. Medium
 Type               | Bug
 Owners             | Michael Slusarz
-----------------------------------------------------------------------


Michael Slusarz <slusarz at mail.curecanti.org> (2005-03-30 10:53) wrote:

So my reply, which will attempt to battle yours for ignorance :)

I do understand that ISO-2022-JP is a 7-bit charset in that any individual
byte is in the range 00-7f (hex).  However, obviously, the charset uses the
presence of an escape character to indicate that consecutive bytes need to
be combined to properly form the character.

Therefore, it is my understanding that the mb_ereg_*() functions _should_
somehow be able to return a multibyte character where the charset-unaware
preg_*() functions will not.  Example:

String: ESCAPE_CHARACTER MB_CHAR_1 MB_CHAR_2

This string has three bytes.  All three bytes are in the range 00-7f.
Therefore, a preg_*() match will treat this string as three 7-bit
characters, and is8bit() will return false.

However, mb_ereg() should interpret this string as a single-character,
two-byte string.  Therefore a search for 00-7f *should* fail, since the
character is actually something more like 2e3f (hex).  Even though the
underlying string is entirely 7-bit, mb_ereg() should be applying the regex
to the "actual" representation of the string.
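A rough sketch of the byte-level versus character-level distinction
described above.  It is written in Python rather than PHP, and the helper
names has_8bit_bytes() and has_non_ascii_chars() are hypothetical.
Python's charset-aware decode step stands in for what a multibyte-aware
function would need to do; it says nothing about how mb_ereg() actually
behaves.

```python
import re

def has_8bit_bytes(data: bytes) -> bool:
    """Byte-level check: is any single byte outside the range 00-7f?"""
    return re.search(rb"[\x80-\xff]", data) is not None

def has_non_ascii_chars(data: bytes, charset: str) -> bool:
    """Character-level check: decode with the charset first, then look
    for any character outside the ASCII range."""
    text = data.decode(charset)
    return re.search(r"[^\x00-\x7f]", text) is not None

# Two Japanese characters (U+65E5, U+672C) encoded as ISO-2022-JP.
sample = "\u65e5\u672c".encode("iso2022_jp")

print(has_8bit_bytes(sample))                     # byte-level view
print(has_non_ascii_chars(sample, "iso2022_jp"))  # character-level view
```

The byte-level check sees only 7-bit bytes, while the character-level check
finds non-ASCII characters once the escape sequences have been interpreted.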

All of this tells me that it is probably an error in the regex that is
causing the multibyte character to go unrecognized.  I would think a regex
like "/.{1}/" would match "ESCAPE_CHARACTER" for preg_*() and the Japanese
character for mb_ereg().  However, I haven't yet figured out a way to do
this in a single regex.  Input from anyone with ereg()-style regex
experience would be appreciated.
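To illustrate what "match one character" means at each level, here is a
small Python sketch (not PHP, and not a statement about what mb_ereg()
actually does).  It applies the pattern "." once to the raw ISO-2022-JP
bytes and once to the decoded text; the variable names are illustrative
only.

```python
import re

# Two Japanese characters (U+65E5, U+672C) encoded as ISO-2022-JP.
raw = "\u65e5\u672c".encode("iso2022_jp")

# Byte-level match: "." consumes exactly one byte, which is the ESC
# (0x1b) that opens the escape sequence.
byte_match = re.match(rb".", raw, re.DOTALL)

# Character-level match: decode first, so "." consumes one whole
# character of the decoded text.
char_match = re.match(r".", raw.decode("iso2022_jp"))

print(byte_match.group())                    # b'\x1b'
print(hex(ord(char_match.group())))          # 0x65e5
```

The same pattern therefore yields the escape byte in one case and the first
Japanese character in the other, depending only on whether the charset is
applied before matching.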



