This is also forwarded from XFree86's bugzilla; it was originally reported by Su
Yong <firstname.lastname@example.org> .
GBK <-> COMPOUND_TEXT translation in XFree86 is incorrect.
A few years ago I started a project, `mule-gbk', which aims to enable
Chinese GBK encoding support in Emacs21.3/Mule (GBK support is important
to people from the People's Republic of China).
In the process of enabling X selection between Emacs21 and other
applications on X11, I found the bug.
Normal X11 applications use the routines from Xlib to do GBK <->
COMPOUND_TEXT translation when they exchange X selections with each
other (Inter-Client Communication). But Emacs/Mule's
COMPOUND_TEXT translation is implemented in Emacs Lisp. The point
is, if both (Mule and Xlib) encoded GBK into COMPOUND_TEXT
correctly, there would be no difficulty in the ICC of X selections.
But experiments show that Emacs/Mule can't understand the
ctext translated from GBK text by normal X11 apps, like gedit,
mozilla, crxvt, etc. When you paste GBK text from these apps
into Emacs, a broken sequence appears: "...GBK-0...".
Note that my locale is set to zh_CN.GBK by
and the locale `zh_CN.GBK' has been generated on my Debian
GNU/Linux box by
The version of my XFree86 is 4.3.0.
Because this was so annoying, I started to analyze the messages from
the normal X11 apps by inserting debugging statements into the
clipboard program `xclip'. I found that the ctext from the normal X11
apps contains redundant sequences, which also make the character
counter in the `extended segments' of the ctext wrong.
According to the document `Compound Text Encoding':
| 6. Non-Standard Character Set Encodings
| Character set encodings that are not in the list of approved
| standard encodings can be included using ``extended seg-
| ments''. An extended segment begins with one of the follow-
| ing sequences:
| 01/11 02/05 02/15 03/00 M L variable number of octets per character
| 01/11 02/05 02/15 03/01 M L 1 octet per character
| 01/11 02/05 02/15 03/02 M L 2 octets per character
| 01/11 02/05 02/15 03/03 M L 3 octets per character
| 01/11 02/05 02/15 03/04 M L 4 octets per character
| [This uses the ``other coding system'' of ISO 2022, using
| private Final characters.]
| The ``M'' and ``L'' octets represent a 14-bit unsigned value
| giving the number of octets that appear in the remainder of
| the segment. The number is computed as ((M - 128) * 128) +
| (L - 128). The most significant bit of M and L are always set
| to one. The remainder of the segment consists of two parts,
| the name of the character set encoding and the actual text.
| The name of the encoding comes first and is separated from
| the text by the octet 00/02 (STX, START OF TEXT). Note that
| the length defined by M and L includes the encoding name and
The extended segment in ctext for GBK text is therefore introduced by
01/11 02/05 02/15 03/02 M L ,
because GBK is a non-standard character set with 2 octets per character.
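The M/L arithmetic quoted above can be sketched in a few lines of Python. This is only an illustration of the formula from `Compound Text Encoding'; the function names `encode_length' and `decode_length' are made up for this example and are not part of Xlib.

```python
from typing import Tuple


def encode_length(n: int) -> Tuple[int, int]:
    """Split an octet count into the M and L octets (hypothetical helper).

    Both octets carry 7 bits of the value and have their top bit set,
    so the count must fit in 14 bits.
    """
    if not 0 <= n < 128 * 128:
        raise ValueError("length must fit in 14 bits")
    return 128 + n // 128, 128 + n % 128


def decode_length(m: int, l: int) -> int:
    """Recover the octet count with the formula quoted from the spec."""
    return (m - 128) * 128 + (l - 128)


if __name__ == "__main__":
    m, l = encode_length(8)
    print(hex(m), hex(l))          # the two octets for a length of 8
    print(decode_length(m, l))     # round-trips back to 8
```

For a count of 8 this yields the octets 0x80 0x88, which are the same two octets that follow the 4-octet prefix in the locale file discussed below.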
I have now found a simple way to solve this problem on my Debian GNU/Linux
Sid by modifying a line in one of the XFree86 system files:
(These are 128, 128+8, 'G', 'B', 'K', '-', '0', 2,
where "GBK-0" may be the character set name for GBK.
It is strange that these octets ended up here.
Maybe the author misunderstood `Compound Text Encoding'?)
should be changed into (equivalently, remove these 8 octets):
(This is exactly the first 4 octets of the
extended sequences defined in `Compound Text Encoding'.)
So far, this method has been used by many mule-gbk users from the P.R.C.
However, I don't know the exact meaning of this line; maybe an
Xpert can figure it out :(
I have downloaded
(this file has been untouched for 3 years) and made a patch for it:
*** zh_CN.gbk.orig 2004-05-06 23:33:06.000000000 +0800
--- zh_CN.gbk 2004-05-06 23:34:31.000000000 +0800
*** 62,68 ****
! ct_encoding GBK-0:GLGR:\x1b\x25\x2f\x32\x80\x88\x47\x42\x4b\x2d\x30
--- 62,68 ----
! ct_encoding GBK-0:GLGR:\x1b\x25\x2f\x32
I personally have no understanding of these issues.
Markus, however, may, so I've cc'ed him.
I'm not an expert on this either. GBK is explained excellently in Ken Lunde's
book "CJKV Information Processing" (ISBN 1-56592-224-7). GBK is an extension of
the GB2312 (EUC-CN) encoding that covers all the Chinese characters from
Unicode, in a way sort-of backwards compatible with GB2312. Like GB2312, GBK is
a double-byte encoding, where the second byte can be from the range 0x40-0xff.
As such, it does not follow the ISO 2022 standard (which allows only 0x21-0x7e
and 0xa0-0xff), and therefore anyone who wants to stuff GBK into a CTEXT string
has to prefix it with one of the 1b 25 2f 3x ll ll nn nn nn nn 02 ...
ESC-sequences defined in the compound text standard section 6, where ll ll is a
length-indicator for the entire string and encoding name, and nn nn is a name
for the encoding.
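To make the range argument concrete, here is a small Python sketch that lists which second-byte values fall outside the ISO 2022 graphic ranges. It uses exactly the ranges stated in the comment above (0x40-0xff for the GBK second byte, 0x21-0x7e and 0xa0-0xff for ISO 2022); the function names are invented for this illustration.

```python
def is_iso2022_graphic(byte: int) -> bool:
    """True if the octet lies in the ranges the comment gives for ISO 2022."""
    return 0x21 <= byte <= 0x7E or 0xA0 <= byte <= 0xFF


def offending_second_bytes(lo: int = 0x40, hi: int = 0xFF) -> list:
    """Second-byte values in [lo, hi] that ISO 2022 would not allow.

    The default range is the one quoted in the comment, taken as an
    assumption here rather than checked against the GBK tables.
    """
    return [b for b in range(lo, hi + 1) if not is_iso2022_graphic(b)]


if __name__ == "__main__":
    bad = offending_second_bytes()
    print(len(bad), hex(bad[0]), hex(bad[-1]))
```

Under these assumptions the offending values are the block 0x7f-0x9f, which is why GBK text cannot be designated like a standard ISO 2022 set and needs an extended segment.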
How exactly all this is specified in the file format of xc/nls/XLC_LOCALE/*, I
don't know. Where is this defined? The original ct_encoding sequence quoted here
includes a length indicator that allows only a single GBK character to be
attached to the prefix, which looks like quite an ugly hack to me. The
replacement sequence seems to specify the non-standard encoding name (here:
GBK-0) earlier in the line, so if that causes the library to actually add the
correct length indicator bytes, that certainly looks better to me.
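The "single GBK character" observation can be checked by decoding the original ct_encoding octets from the patch by hand. The Python sketch below does that; the variable names are illustrative, and it assumes the STX separator (the trailing 2 in Su Yong's list) follows the name even though it is not visible in the quoted line.

```python
# Octets of the original ct_encoding value, copied from the patch above.
ORIGINAL = b"\x1b\x25\x2f\x32\x80\x88\x47\x42\x4b\x2d\x30"

prefix = ORIGINAL[:4]            # ESC % / 2: extended segment, 2 octets per char
m, l = ORIGINAL[4], ORIGINAL[5]  # hard-coded length octets
name = ORIGINAL[6:]              # encoding name

# Length formula from `Compound Text Encoding', section 6.
length = (m - 128) * 128 + (l - 128)

# The length covers the name, the STX separator and the text itself.
text_octets = length - len(name) - 1

if __name__ == "__main__":
    print(name.decode("ascii"), length, text_octets)
```

With the hard-coded length of 8, only 2 octets remain for text after "GBK-0" and STX, that is, exactly one 2-octet GBK character per segment.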
Is it possible that the syntax/semantics of ct_encoding changed at some point
over the years, so that the library now adds the length bytes and encoding name
itself, and updating this locale file was simply forgotten?
As Markus described above, when processed as a non-standard charset encoding, GBK
should use a "1b 25 2f 3x ll ll nn nn nn nn 02" style ESC-sequence in CTEXT. In
the Xlib implementation, the "ll ll nn nn nn nn 02" part is automatically attached
according to the charset name. So the definition in the XLC_LOCALE file should be
just "1b 25 2f 3x", nothing more.
BTW, the length indicated in the "ll ll" part includes itself, so there is no
length restriction on the following characters. However, this sequence should
be generated dynamically, not defined in the XLC_LOCALE file.
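As a rough illustration of what "generated dynamically" could look like, the Python sketch below builds a complete extended segment from the 4-octet prefix, a charset name and the encoded text. It is not the Xlib code: `build_extended_segment' is a hypothetical helper, it follows the length definition in section 6 of `Compound Text Encoding' (the octets after M and L), and it does not split long text across several segments.

```python
EXT_2_OCTETS = b"\x1b\x25\x2f\x32"   # ESC % / 2, as in the locale file
STX = b"\x02"                        # separator between name and text


def build_extended_segment(prefix: bytes, charset_name: str, text: bytes) -> bytes:
    """Return prefix + M + L + name + STX + text (hypothetical helper)."""
    body = charset_name.encode("ascii") + STX + text
    if len(body) >= 128 * 128:
        # A real implementation would have to emit several segments here.
        raise ValueError("text too long for a single extended segment")
    m = 128 + len(body) // 128
    l = 128 + len(body) % 128
    return prefix + bytes([m, l]) + body


if __name__ == "__main__":
    gbk_text = "中文".encode("gbk")   # two GBK characters, four octets
    segment = build_extended_segment(EXT_2_OCTETS, "GBK-0", gbk_text)
    print(segment.hex(" "))
```

Because the length octets are computed from the actual text, the segment is no longer limited to the single character that a hard-coded 0x80 0x88 would permit.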
This bug may be hidden by a Red Hat patch aimed at gb18030 support, in which
a definition of GBK-0 is added to the default_ct_data array, so the ct_encoding
in the XLC_LOCALE file is overridden. I didn't check xorg's source code, but as
far as I know, xorg doesn't include gb18030 support, so I guess this will cause
some serious problems for GBK locale users.
I just checked the following revision 1.3 and found the problem has already
been fixed as described by Xie:
Ienup Sung wrote:
> I just checked the following revision 1.3 and found the problem has been
> already fixed as described by Xie:
Can this bug report be marked as FIXED then, or is there any issue left?
That's it. Marking as FIXED.