Bug 1475 - zh_CN.gbk XLC_LOCALE file corrupt.
Alias: None
Product: xorg
Classification: Unclassified
Component: Lib/Xlib
Version: unspecified
Hardware: x86 (IA32) All
Importance: high normal
Assignee: Jim Gettys
Reported: 2004-09-26 20:58 UTC by Xie Qian
Modified: 2011-10-15 17:23 UTC

Description Xie Qian 2004-09-26 20:58:42 UTC
This is also forwarded from XFree86's Bugzilla; it was originally reported by
Su Yong <yoyosu@ustc.edu.cn>.

GBK <-> COMPOUND_TEXT translation in XFree86 is incorrect.

A few years ago I started a project, `mule-gbk', which aims to enable
Chinese GBK encoding support on Emacs 21.3/Mule (GBK support is
important to people from the People's Republic of China).
In the process of enabling X selection between Emacs 21 and other
applications on X11, I found the bug.

Normal X11 applications use the routines from Xlib to do
GBK <-> COMPOUND_TEXT translation when they exchange X selections
through Inter-Client Communication (ICC). Emacs/Mule's
COMPOUND_TEXT translation, however, is implemented in Emacs Lisp.
The point is, if both Mule and Xlib encoded GBK into COMPOUND_TEXT
correctly, there would be no difficulty in the ICC of X selections.
But experiments show that Emacs/Mule can't understand the
ctext translated from GBK text by normal X11 apps such as gedit,
mozilla, crxvt, etc. When you paste GBK text from these apps
into Emacs, the broken sequence "...GBK-0..." appears.
Note that my locale is set to zh_CN.GBK by
  export LANG=zh_CN
  export LC_ALL=zh_CN.GBK
and the locale `zh_CN.GBK' has been generated on my Debian
GNU/Linux box by
  dpkg-reconfigure locales

The version of my XFree86 is 4.3.0.

Because this bothered me so much, I started to analyze the messages from
the normal X11 apps by inserting debugging statements into the
clipboard program `xclip'. I found that the ctext from the normal X11 apps
contains redundant sequences, which also make the length counter in the
`extended segments' of the ctext wrong.

According to the document `Compound Text Encoding':
| 6.  Non-Standard Character Set Encodings
| Character set encodings that are not in the list of approved
| standard encodings can be included using ``extended seg-
| ments''.  An extended segment begins with one of the follow-
| ing sequences:
|      01/11 02/05 02/15 03/00 M L   variable number of octets per character
|      01/11 02/05 02/15 03/01 M L   1 octet per character
|      01/11 02/05 02/15 03/02 M L   2 octets per character
|      01/11 02/05 02/15 03/03 M L   3 octets per character
|      01/11 02/05 02/15 03/04 M L   4 octets per character
| [This uses the ``other coding system'' of ISO 2022, using
| private Final characters.]
| The ``M'' and ``L'' octets represent a 14-bit unsigned value
| giving the number of octets that appear in the remainder of
| the segment.  The number is computed as ((M - 128) * 128) +
| (L - 128).  The most significant bit of M and L are always set
| to one.  The remainder of the segment consists of two parts,
| the name of the character set encoding and the actual text.
| The name of the encoding comes first and is separated from
| the text by the octet 00/02 (STX, START OF TEXT).  Note that
| the length defined by M and L includes the encoding name and
| separator.
So the extended segment in ctext for GBK text begins with
    01/11 02/05 02/15 03/02 M L ,
because GBK is a non-standard character set with 2 octets
per character.
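
To make the M/L arithmetic concrete, here is a minimal Python sketch of how an extended segment for GBK text could be assembled from the rule quoted above. The helper names are hypothetical and are not part of Xlib or Emacs; only the header octets and the length formula come from the quoted document.

```python
# Sketch only: helper names are hypothetical, not an Xlib or Emacs API.
ESC_EXT_2_OCTETS = b"\x1b\x25\x2f\x32"  # 01/11 02/05 02/15 03/02
STX = b"\x02"                           # 00/02, separates name from text


def encode_length(n: int) -> bytes:
    """Encode a 14-bit octet count as the M and L octets (high bit set)."""
    if not 0 <= n < (1 << 14):
        raise ValueError("length must fit in 14 bits")
    return bytes([128 + (n >> 7), 128 + (n & 0x7F)])


def decode_length(m: int, l: int) -> int:
    """Inverse of encode_length: ((M - 128) * 128) + (L - 128)."""
    return (m - 128) * 128 + (l - 128)


def build_extended_segment(name: bytes, text: bytes) -> bytes:
    """Header, then M L, then encoding name, STX and the text itself."""
    body = name + STX + text  # M L counts the name and separator as well
    return ESC_EXT_2_OCTETS + encode_length(len(body)) + body


# One two-octet GBK character as sample text.
segment = build_extended_segment(b"GBK-0", b"\xd6\xd0")
print(" ".join(f"{b:02x}" for b in segment))
```

With a single two-octet character the count is 5 + 1 + 2 = 8, so M L comes out as 80 88; longer text needs a different M L, which is why the count has to be computed per segment.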

I have now found a simple way to solve this problem on my Debian GNU/Linux
Sid by modifying one line in an XFree86 system file.
The line:
 ct_encoding GBK-0:GLGR:\x1b\x25\x2f\x32\x80\x88\x47\x42\x4b\x2d\x30\x02
                           (The trailing octets are 128, 128+8, 'G','B','K','-','0', 2,
                           where "GBK-0" may be the charset name for GBK.
                           It is strange that they ended up here.
                           Maybe the author misunderstood
                           `Compound Text Encoding'?)
should be changed into (equivalently, remove these 8 octets):
 ct_encoding GBK-0:GLGR:\x1b\x25\x2f\x32
             (This is exactly the first 4 octets of the
             extended segment header defined in `Compound Text Encoding'.)
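
The octets in the two ct_encoding lines can be checked mechanically. The following Python sketch (hypothetical helper names, not the parser Xlib actually uses) turns the `\x..` escapes into bytes and reads them according to the extended segment rule quoted earlier.

```python
import re

# Sketch only: these helpers are hypothetical, not Xlib's XLC_LOCALE parser.
OLD = r"\x1b\x25\x2f\x32\x80\x88\x47\x42\x4b\x2d\x30\x02"
NEW = r"\x1b\x25\x2f\x32"


def parse_hex_escapes(s: str) -> bytes:
    """Convert a string of \\xNN escapes into the bytes it denotes."""
    return bytes(int(h, 16) for h in re.findall(r"\\x([0-9a-fA-F]{2})", s))


def describe(seq: bytes) -> dict:
    """Split a sequence into the 4-octet header and any hardcoded M L / name."""
    header, rest = seq[:4], seq[4:]
    info = {"header": header, "length": None, "name": None, "text_room": None}
    if len(rest) >= 2:
        m, l = rest[0], rest[1]
        info["length"] = (m - 128) * 128 + (l - 128)
        name = rest[2:].partition(b"\x02")[0]
        info["name"] = name.decode("ascii")
        # Octets left for text once the name and the STX separator are counted.
        info["text_room"] = info["length"] - len(name) - 1
    return info


old_info = describe(parse_hex_escapes(OLD))
new_info = describe(parse_hex_escapes(NEW))
print(old_info)
print(new_info)
```

Read this way, the old line hardcodes a count of 8 and the name "GBK-0", which leaves 2 octets for text after the name and separator. The new line carries only the 4-octet header and no count or name at all.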

So far, this method has been used by many mule-gbk users from the P.R.C.

However, I don't know the exact meaning of this line; maybe an
Xpert can figure it out :(

I have downloaded the file (it has been untouched for 3 years)
and made a patch for it:

*** zh_CN.gbk.orig	2004-05-06 23:33:06.000000000 +0800
--- zh_CN.gbk	2004-05-06 23:34:31.000000000 +0800
*** 62,68 ****
  	byte2		\x40,\x7e;\x80,\xfe
  	wc_encoding	\x00008000
! 	ct_encoding	GBK-0:GLGR:\x1b\x25\x2f\x32\x80\x88\x47\x42\x4b\x2d\x30
  	mb_conversion	[\x8140,\xfefe]->\x0140
  	ct_conversion	[\x0140,\x7efe]->\x8140
--- 62,68 ----
  	byte2		\x40,\x7e;\x80,\xfe
  	wc_encoding	\x00008000
! 	ct_encoding	GBK-0:GLGR:\x1b\x25\x2f\x32
  	mb_conversion	[\x8140,\xfefe]->\x0140
  	ct_conversion	[\x0140,\x7efe]->\x8140

SU Yong
Comment 1 Bugzilla Maintainer 2004-09-27 07:52:10 UTC
I personally have no understanding of these issues.

Markus, however, may, so I've cc'ed him.
Comment 2 Markus Kuhn 2004-09-27 09:21:32 UTC
I'm not an expert on this either. GBK is explained excellently in Ken Lunde's
book "CJKV Information Processing" (ISBN 1-56592-224-7). GBK is an extension of
the GB2312 (EUC-CN) encoding that covers all the Chinese characters from
Unicode, in a way sort-of backwards compatible with GB2312. Like GB2312, GBK is
a double-byte encoding, where the second byte can be from the range 0x40-0xfe.
As such, it does not follow the ISO 2022 standard (which allows only 0x21-0x7e
and 0xa0-0xff), and therefore anyone who wants to stuff GBK into a CTEXT string
has to prefix it with one of the 1b 25 2f 3x ll ll nn nn nn nn 02 ...
ESC-sequences defined in the compound text standard section 6, where ll ll is a
length-indicator for the entire string and encoding name, and nn nn is a name
for the encoding.
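
As a rough illustration of why GBK does not fit ISO 2022, the Python sketch below compares the second-byte ranges from the byte2 line of the patch in the description (\x40,\x7e;\x80,\xfe) with the ISO 2022 ranges given in this comment. The function names are made up for the example.

```python
# Sketch only: ranges are taken from this bug report, names are hypothetical.

def in_iso2022_range(b: int) -> bool:
    """Ranges this comment gives for ISO 2022: 0x21-0x7e and 0xa0-0xff."""
    return 0x21 <= b <= 0x7E or 0xA0 <= b <= 0xFF


def is_gbk_second_byte(b: int) -> bool:
    """byte2 ranges from the zh_CN.gbk patch: 0x40-0x7e and 0x80-0xfe."""
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFE


# GBK second bytes that fall outside the ISO 2022 ranges.
outside = [b for b in range(256)
           if is_gbk_second_byte(b) and not in_iso2022_range(b)]
print(f"{len(outside)} values: 0x{outside[0]:02x}-0x{outside[-1]:02x}")
```

Under these assumptions the offending second bytes are 0x80-0x9f, which is the range that forces GBK into an extended segment instead of a standard ISO 2022 designation.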

How exactly all this is specified in the file format of xc/nls/XLC_LOCALE/*, I
don't know. Where is this defined? The original ct_encoding sequence quoted here
includes a length indicator that allows only a single GBK character to be
attached to the prefix, which looks like a quite ugly hack to me. The
replacement sequence seems to specify the non-standard encoding name (here:
GBK-0) earlier in the line, so if that causes the library to actually add the
correct length indicator bytes, that certainly looks better to me.

Is it possible that the syntax/semantics of ct_encoding changed at some point
over the years, so that the library now adds the length bytes and encoding name
itself, and that updating this locale file was simply forgotten?
Comment 3 Xie Qian 2004-09-27 18:46:04 UTC
As Markus described above, GBK is processed as a non-standard charset encoding
and should use a "1b 25 2f 3x ll ll nn nn nn nn 02" style ESC sequence in CTEXT.
In the Xlib implementation, the "ll ll nn nn nn nn 02" part is attached
automatically according to the charset name. So the definition in the
XLC_LOCALE file should be just "1b 25 2f 3x", nothing more.

BTW, the length indicated in the "ll ll" part includes itself, so there is no
length restriction on the following characters. However, this sequence should
be generated dynamically, not defined in the XLC_LOCALE file.

This bug may be hidden by a Red Hat patch aimed at gb18030 support, in which
a definition of GBK-0 is added to the default_ct_data array, so that the
ct_encoding in the XLC_LOCALE file is overridden. I didn't check xorg's source
code, but as far as I know, xorg doesn't include gb18030 support, so I guess
this will cause some serious problems for GBK locale users.
Comment 4 Ienup Sung 2005-01-10 12:39:26 UTC
I just checked the following revision 1.3 and found the problem has already
been fixed as described by Xie:
http://cvs.freedesktop.org/xorg/xc/nls/XLC_LOCALE/zh_CN.gbk?rev=1.3&view=auto

Comment 5 Roland Mainz 2005-01-13 20:21:28 UTC
Ienup Sung wrote:
> I just checked the following revision 1.3 and found the problem has been 
> already fixed as described by Xie:
> http://cvs.freedesktop.org/xorg/xc/nls/XLC_LOCALE/zh_CN.gbk?rev=1.3&view=auto

Can this bug report be marked as FIXED then, or is there any issue left?
Comment 6 Xie Qian 2005-01-13 20:39:41 UTC
That's it. Mark it as FIXED.
