Bug 5129 - Non-unique entries in en_US.UTF-8 compose file
Summary: Non-unique entries in en_US.UTF-8 compose file
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Lib/Xlib
Version: unspecified
Hardware: All Linux (All)
Importance: high normal
Assignee: Matthias Hopf
Reported: 2005-11-23 04:54 UTC by Simos Xenitellis
Modified: 2007-12-21 06:05 UTC (History)

Attachments
Patch to remove conflict between cent/colon sequence (882 bytes, patch)
2006-04-18 13:55 UTC, Simos Xenitellis
Updated Compose.pre with 0x10000000 added to Unicode keysyms. (550.19 KB, text/plain)
2006-04-18 19:34 UTC, Simos Xenitellis
libX11/nls/en_US.UTF-8/Compose.pre.diff (541.03 KB, patch)
2007-10-30 10:01 UTC, Mike FABIAN

Description Simos Xenitellis 2005-11-23 04:54:08 UTC
The Compose file en_US.UTF-8 has some conflicts (same compose sequence,
different resulting character).

These are conflicts, that is, there are two compose sequences in the same
Compose file that produce the same character.

A. WARNING: Same keysyms for
  1: LATIN_CAPITAL_LETTER_O_WITH_MACRON_AND_ACUTE and
  2: GREEK_UPSILON_WITH_ACUTE_AND_HOOK_SYMBOL
  (GDK_Multi_key, GDK_apostrophe, GDK_Omacron, 0, 0)

<Multi_key> <apostrophe> <Omacron>      : "Ṓ" U1E52 # LATIN CAPITAL LETTER O WITH MACRON AND ACUTE
<Multi_key> <apostrophe> <U03d2>        : "ϓ" U03D3 # GREEK UPSILON WITH ACUTE AND HOOK SYMBOL

The issue is that Omacron has the value of "0x03d2" which conflicts with a
character from the Greek Unicode Block.

B. WARNING: Same keysyms for
  1: COLON_SIGN and
  2: CENT_SIGN
  (GDK_Multi_key, GDK_slash, GDK_C, 0, 0)

<Multi_key> <slash> <C>                 : "₡" U20a1 # COLON SIGN
<Multi_key> <slash> <C>                 : "¢" U00A2 # CENT SIGN

C. WARNING: Same keysyms for
  1: COLON_SIGN and
  2: CENT_SIGN
  (GDK_Multi_key, GDK_C, GDK_slash, 0, 0)

<Multi_key> <C> <slash>                 : "₡" U20a1 # COLON SIGN
<Multi_key> <C> <slash>                 : "¢" U00A2 # CENT SIGN

D. WARNING: Same keysyms for
  1: LATIN_CAPITAL_LETTER_O_WITH_MACRON_AND_ACUTE and
  2: GREEK_UPSILON_WITH_ACUTE_AND_HOOK_SYMBOL
  (GDK_Multi_key, GDK_acute, GDK_Omacron, 0, 0)

<Multi_key> <acute> <Omacron>   : "Ṓ" U1E52 # LATIN CAPITAL LETTER O WITH MACRON AND ACUTE
<Multi_key> <acute> <U03d2>     : "ϓ" U03D3 # GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
Comment 1 James Cloos 2005-11-24 06:25:15 UTC
Interesting.

Wrt A&D, <Omacron> should not conflict with <U03d2>; <U03d2> should be
         defined as 0x10003d2, not as 0x03d2, ya?  Is this instead a
         bug in the code that generated the list of conflicts?

Wrt B&C, cent and colon have the obvious fix of using /c and c/ for ¢ and
         using /C and C/ for ₡.  I bet most users would expect cents to use
         a minuscule c and colon a majuscule anyway.
Comment 2 Simos Xenitellis 2005-11-24 07:27:12 UTC
(In reply to comment #1)
> Interesting.
> 
> Wrt A&D, <Omacron> should not conflict with <U03d2>; <U03d2> should be
>          defined as 0x10003d2, not as 0x03d2, ya?  Is this instead a
>          bug in the code that generated the list of conflicts?

"Omacron" is defined in
http://cvs.freedesktop.org/xorg/xc/include/keysymdef.h?view=markup
In particular: 
#define XK_Omacron                       0x03d2  /* U+014C LATIN CAPITAL LETTER O WITH MACRON */

Therefore, it's a problem in keysymdef.h.
It would make sense to me for Omacron to have the value 0x014C. How come it has
a different value?
If for some reason it needs to be offset by 0x1000000, could you please add a
patch?

> 
> Wrt B&C, cent and colon have the obvious fix of using /c and c/ for ¢ and
>          using /C and C/ for ₡.  I bet most users would expect cents to use
>          a minuscule c and colon a majuscule anyway.

Could you please patch this up as well?
Comment 3 Daniel Stone 2006-04-13 22:16:23 UTC
(In reply to comment #2)
> (In reply to comment #1)
> > Interesting.
> > 
> > Wrt A&D, <Omacron> should not conflict with <U03d2>; <U03d2> should be
> >          defined as 0x10003d2, not as 0x03d2, ya?  Is this instead a
> >          bug in the code that generated the list of conflicts?
> 
> "Omacron" is defined in
> http://cvs.freedesktop.org/xorg/xc/include/keysymdef.h?view=markup
> In particular: 
> #define XK_Omacron                       0x03d2  /* U+014C LATIN CAPITAL LETTER O WITH MACRON */
> 
> Therefore, it's a problem in keysymdef.h.
> It would make sense to me for Omacron to have the value 0x014C. How come it has
> a different value?
> If for some reason it needs to be offset by 0x1000000, could you please add a
> patch?

Existing legacy keysyms are considered part of the protocol and will not -- not
-- be changed.  Before Unicode came around, characters like Omacron were defined
with arbitrary keysyms (e.g. 0x03D2).  Now Unicode has come, we can all use
that, but we cannot change the value of legacy keysyms.  So, if you want to use
a Unicode value for which no keysym is defined, you offset it by 0x10000000. 
So, U03D2 is 0x100003D2.  Omacron is 0x000003D2.  So there's no conflict.

> > Wrt B&C, cent and colon have the obvious fix of using /c and c/ for ¢ and
> >          using /C and C/ for ₡.  I bet most users would expect cents to use
> >          a minuscule c and colon a majuscule anyway.
> 
> Could you please patch this up as well?

Please submit a diff to the Compose file with the results of whatever you come
up with.  I think you're best-placed to deal with this kind of thing.
Comment 4 Simos Xenitellis 2006-04-18 08:27:05 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > Interesting.
> > > 
> > > Wrt A&D, <Omacron> should not conflict with <U03d2>; <U03d2> should be
> > >          defined as 0x10003d2, not as 0x03d2, ya?  Is this instead a
> > >          bug in the code that generated the list of conflicts?
> > 
> > "Omacron" is defined in
> > http://cvs.freedesktop.org/xorg/xc/include/keysymdef.h?view=markup
> > In particular: 
> > #define XK_Omacron                       0x03d2  /* U+014C LATIN CAPITAL LETTER O WITH MACRON */
> > 
> > Therefore, it's a problem in keysymdef.h.
> > It would make sense to me for Omacron to have the value 0x014C. How come it has
> > a different value?
> > If for some reason it needs to be offset by 0x1000000, could you please add a
> > patch?
> 
> Existing legacy keysyms are considered part of the protocol and will not -- not
> -- be changed.  Before Unicode came around, characters like Omacron were defined
> with arbitrary keysyms (e.g. 0x03D2).  Now Unicode has come, we can all use
> that, but we cannot change the value of legacy keysyms.  So, if you want to use
> a Unicode value for which no keysym is defined, you offset it by 0x10000000. 
> So, U03D2 is 0x100003D2.  Omacron is 0x000003D2.  So there's no conflict.
> 
> > > Wrt B&C, cent and colon have the obvious fix of using /c and c/ for ¢ and
> > >          using /C and C/ for ₡.  I bet most users would expect cents to use
> > >          a minuscule c and colon a majuscule anyway.
> > 
> > Could you please patch this up as well?
> 
> Please submit a diff to the Compose file with the results of whatever you come
> up with.  I think you're best-placed to deal with this kind of thing.

Thus, is the request for a patch that will add 0x100000 (if not already added)
to all Unicode keysyms in the Compose file?

I am happy to do it, and specifically provide a script for this, as the patch
will be very big.
Comment 5 Daniel Stone 2006-04-18 09:03:34 UTC
sure, adding all the missing entries sounds fine (though to resolve *this* bug,
one would still need to remove some of the multiple definitions, no?).  i just
need something that I can apply the result of to the tree.
Comment 6 Simos Xenitellis 2006-04-18 13:55:45 UTC
Created attachment 5353
Patch to remove conflict between cent/colon sequence

Multi_key + C + slash = Colon
Multi_key + C + slash = cent
Multi_key + slash + C = Colon
Multi_key + slash + C = cent

These are conflicts, so, following discussion with Daniel, we remove the option
for C (capital C) + slash to produce cent. For consistency, we do the same for
C + bar.
Comment 7 Simos Xenitellis 2006-04-18 19:30:28 UTC
(In reply to comment #5)
> sure, adding all the missing entries sounds fine (though to resolve *this* bug,
> one would still need to remove some of the multiple definitions, no?).  i just
> need something that I can apply the result of to the tree.

Quite luckily, the "conflicts" are only those shown in this bug report, which are
very few. By conflict I mean the situation where two identical sequences produce
different characters. Because of this, one of the two sequences is not available
at all to the end user.

There are also "multiple" definitions, that is, two different sequences
producing the same character. It's kind of redundancy. These are not fatal and
can be accepted. The Hebrew section has several of those.
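
The distinction between a conflict and a merely redundant definition can be sketched in Python. This is a minimal illustration, not the script used to produce the warnings in this report: the names ENTRY and find_conflicts are made up, and keysym names are compared as plain text, so <Omacron> and <U03d2> are treated as different keys.

```python
import re
from collections import defaultdict

# Matches entries such as: <Multi_key> <slash> <C> : "¢" U00A2 # CENT SIGN
ENTRY = re.compile(r'^\s*((?:<[^>]+>\s*)+):\s*"((?:[^"\\]|\\.)*)"')

def find_conflicts(lines):
    """Return sequences that map to more than one distinct result string."""
    results = defaultdict(list)
    for line in lines:
        match = ENTRY.match(line)
        if not match:
            continue  # skip comments, blank lines, anything that is not an entry
        sequence = tuple(re.findall(r'<([^>]+)>', match.group(1)))
        result = match.group(2)
        if result not in results[sequence]:
            results[sequence].append(result)
    return {seq: res for seq, res in results.items() if len(res) > 1}

# Entries copied from the description of this bug (cases A and B).
sample = [
    '<Multi_key> <apostrophe> <Omacron>      : "Ṓ" U1E52 # LATIN CAPITAL LETTER O WITH MACRON AND ACUTE',
    '<Multi_key> <apostrophe> <U03d2>        : "ϓ" U03D3 # GREEK UPSILON WITH ACUTE AND HOOK SYMBOL',
    '<Multi_key> <slash> <C>                 : "₡" U20a1 # COLON SIGN',
    '<Multi_key> <slash> <C>                 : "¢" U00A2 # CENT SIGN',
]

for seq, res in find_conflicts(sample).items():
    print(" ".join(seq), "->", res)
```

With this textual comparison only the slash/C sequence is reported. Two different sequences that yield the same character are not reported, since they are redundant rather than conflicting.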

Comment 8 Simos Xenitellis 2006-04-18 19:34:50 UTC
Created attachment 5357
Updated Compose.pre with 0x10000000 added to Unicode keysyms.

This is the updated Compose.pre.
Due to the changes in the spacing between the fields, there is no benefit in
providing a patch.

I reapplied the Unicode names of the characters from UnicodeData.txt
Comment 9 Daniel Stone 2006-06-01 15:50:10 UTC
committed to git
Comment 10 Simos Xenitellis 2006-06-01 16:20:08 UTC
(In reply to comment #9)
> committed to git

Is that both patches?
1. https://bugs.freedesktop.org/attachment.cgi?id=5353
2. https://bugs.freedesktop.org/attachment.cgi?id=5357
Comment 11 Daniel Stone 2006-06-01 16:48:10 UTC
no, because #5357 is a complete file, and AFAICT doesn't have the problem #5353
was fixing
Comment 12 Mike FABIAN 2007-10-30 06:02:28 UTC
Daniel Stone> So, if you want to use a Unicode value for which 
Daniel Stone> no keysym is defined, you offset it by 0x10000000. 

Here is a typo. The offset is 0x01000000!

Daniel Stone> So, U03D2 is 0x100003D2.  Omacron is 0x000003D2.  

Again the same typo. U03D2 is 0x010003D2.

Daniel Stone> So there's no conflict.

Yes.
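
The arithmetic can be checked with a short Python sketch. The offset 0x01000000 and the value 0x03d2 for XK_Omacron are taken from this thread; the function name unicode_keysym is made up for illustration.

```python
UNICODE_KEYSYM_OFFSET = 0x01000000  # corrected offset, per this comment
XK_OMACRON = 0x03D2                 # legacy keysym value quoted from keysymdef.h

def unicode_keysym(codepoint: int) -> int:
    """Keysym for a Unicode code point that has no legacy keysym (illustrative helper)."""
    return UNICODE_KEYSYM_OFFSET + codepoint

u03d2 = unicode_keysym(0x03D2)
print(hex(u03d2))           # 0x10003d2
print(u03d2 == XK_OMACRON)  # False: the two keysyms differ
```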
Comment 13 Mike FABIAN 2007-10-30 06:12:05 UTC
The entries using <U100xxxxx> in the Compose file attached in comment #8
are wrong. They just don’t work. <U000xxxxx> is correct. 

Comment 14 Mike FABIAN 2007-10-30 06:17:13 UTC
Simos Xenitellis> These are conflicts, that is, there are two 
Simos Xenitellis> compose sequences in the same Compose 
Simos Xenitellis> file that produce the same character.

Simos Xenitellis> A. WARNING: Same keysyms for
Simos Xenitellis>   1: LATIN_CAPITAL_LETTER_O_WITH_MACRON_AND_ACUTE and
Simos Xenitellis>   2: GREEK_UPSILON_WITH_ACUTE_AND_HOOK_SYMBOL
Simos Xenitellis>   (GDK_Multi_key, GDK_apostrophe, GDK_Omacron, 0, 0)

Simos Xenitellis> <Multi_key> <apostrophe> <Omacron>      : "Ṓ" U1E52 # LATIN CAPITAL LETTER O WITH MACRON AND ACUTE
Simos Xenitellis> <Multi_key> <apostrophe> <U03d2>        : "ϓ" U03D3 # GREEK UPSILON WITH ACUTE AND HOOK SYMBOL


I don’t see a conflict here. I can have both entries in the Compose file
and both of them work. The keysyms do not conflict. 


Comment 15 Mike FABIAN 2007-10-30 06:19:11 UTC
For example, if I add a line like 

    keysym 5 = 5 percent 0x010003d3 Omacron

to my ~/.Xmodmap for testing, I can type U+03D3 with AltGr+5 and Omacron
with Shift+AltGr+5. And using that, I verified that both Compose sequences
work and do *not* conflict.
Comment 16 Mike FABIAN 2007-10-30 09:25:21 UTC
see also 

 http://bugzilla.novell.com/show_bug.cgi?id=337760

for some more problems in the current Compose file.

Comment 17 Mike FABIAN 2007-10-30 10:00:10 UTC
The other problems mentioned in  

    http://bugzilla.novell.com/show_bug.cgi?id=337760

apart from the <U100xxxxx> -> <U000xxxxx> problem are apparently
already fixed in the latest git checkout.
Comment 18 Mike FABIAN 2007-10-30 10:01:19 UTC
Created attachment 12264
libX11/nls/en_US.UTF-8/Compose.pre.diff

Patch against libX11/nls/en_US.UTF-8/Compose.pre
Comment 19 Simos Xenitellis 2007-11-01 06:02:07 UTC
(In reply to comment #15)
> For example, if I add a line like 
> 
>     keysym 5 = 5 percent 0x010003d3 Omacron
> 
> to my ~/.Xmodmap for testing, I can type U+03D3 with AltGr+5 and Omacron
> with Shift+AltGr+5. And using that, I verified that both Compose sequences
> work and do *not* conflict.
> 

Simos Xenitellis> <Multi_key> <apostrophe> <Omacron>      : "Ṓ" U1E52 # LATIN CAPITAL LETTER O WITH MACRON AND ACUTE
Simos Xenitellis> <Multi_key> <apostrophe> <U03d2>        : "ϓ" U03D3 # GREEK UPSILON WITH ACUTE AND HOOK SYMBOL

You mean that for the first line one would have to press

Multi_key + apostrophe +  AltGr + 5 : Ṓ

and for the second line

Multi_key + apostrophe +  Shift + AltGr + 5 : ϓ

and it works?

It would look strange to me if it worked but I believe we have different things in mind when discussing this.
Comment 20 Mike FABIAN 2007-11-05 05:58:01 UTC
Simos Xenitellis> You mean that for the first line one would have to press

Simos Xenitellis> Multi_key + apostrophe +  AltGr + 5 : Ṓ

Simos Xenitellis> and for the second line

Simos Xenitellis> Multi_key + apostrophe +  Shift + AltGr + 5 : ϓ

Simos Xenitellis> and it works?

Yes, exactly. 
Comment 21 Simos Xenitellis 2007-11-16 15:57:32 UTC
(In reply to comment #20)
> Simos Xenitellis> You mean that for the first line one would have to press
> Simos Xenitellis> Multi_key + apostrophe +  AltGr + 5 : Ṓ
> Simos Xenitellis> and for the second line 
> Simos Xenitellis> Multi_key + apostrophe +  Shift + AltGr + 5 : ϓ
> Simos Xenitellis> and it works?
> 
> Yes, exactly. 

I am not quite sure if the average user will be able to make it with these sequences.
Those "duplicates" came about when I wrote a script to convert the Xorg Compose file into a format that the GTK+ Input Method can recognise. In GTK+ IM, the above "duplicates" do not work; the second in the duplicate is always hidden.
The same Xorg Compose file is replicated in SCIM (Afaik), and my guess is that there this problem also exists.



Comment 23 Mike FABIAN 2007-11-21 03:38:25 UTC
Simos Xenitellis> I am not quite sure if the average user will be able
Simos Xenitellis> to make it with these sequences.

That is a different problem from them being duplicates.

Simos Xenitellis> Those "duplicates" came about when I wrote a script
Simos Xenitellis> to convert the Xorg Compose file into a format that
Simos Xenitellis> the GTK+ Input Method can recognise. In GTK+ IM, the
Simos Xenitellis> above "duplicates" do not work; the second in the
Simos Xenitellis> duplicate is always hidden.

Isn’t this a GTK+ bug then?

Simos Xenitellis> The same Xorg Compose file is replicated in SCIM
Simos Xenitellis> (Afaik), and my guess is that there this problem
Simos Xenitellis> also exists.

SCIM (and, by the way, Qt as well) have the Compose file hardcoded.
In the case of SCIM, the author of SCIM converted the Xorg Compose file
to the header file “scim_compose_key_data.h” which is compiled
into SCIM.

Therefore, unfortunately, the Compose handling in Xorg and SCIM
may slightly differ because the current “scim_compose_key_data.h”
was created years ago and the Xorg Compose file has since
been changed somewhat.

The author of SCIM wanted to hardcode this to make it fast.
That might be a good idea.

Nevertheless it is not nice that this can cause subtle differences in
the compose handling.  I am just working on an improvement to SCIM to
parse the Xorg Compose file on the system where SCIM is compiled and
compile that into SCIM. That is probably a reasonable compromise
between speed and trying to make the Compose handling behave the same
in Xorg and SCIM.

Comment 24 Mike FABIAN 2007-11-22 07:28:41 UTC
Please see also the original comment in
bug #11930:


Alexandros Diamantidis> The problem can be cured by globally replacing, in the Compose file,
Alexandros Diamantidis>    U10000313 --> U0313
Alexandros Diamantidis> and U10000314 --> U0314

*All* appearances of U1000xxxx should be replaced with U0000xxxx
(or, probably even better, the shorter Uxxxx. Both work. But U1000xxxx
does *not* work).

U100xxxxx should be replaced with Uxxxxx (or U000xxxxx).

That is what my patch does.
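
The substitution described above can be sketched in Python as follows. This is only an illustration of the rewrite rule, not the attached patch or the script that produced it; the names BAD_UNICODE_KEYSYM and fix_keysym_names and the sample line are made up.

```python
import re

# <U100xxxxx>: a Unicode keysym name with a spurious leading "100".
BAD_UNICODE_KEYSYM = re.compile(r'<U100([0-9A-Fa-f]{5})>')

def fix_keysym_names(line: str) -> str:
    """Rewrite <U1000xxxx> as <Uxxxx> and <U1001xxxx> as <U1xxxx>."""
    def shorten(match):
        digits = match.group(1).lstrip("0")
        # Keep at least four hex digits, e.g. 00313 becomes 0313.
        return "<U" + digits.rjust(4, "0") + ">"
    return BAD_UNICODE_KEYSYM.sub(shorten, line)

# Hypothetical input line, for illustration only.
print(fix_keysym_names("<Multi_key> <U10000313> <U10000314>"))
# <Multi_key> <U0313> <U0314>
```

Only names in angle brackets are touched, so the result column (for example U1E52 after the quoted character) is left alone.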

Comment 25 Stefan Dirsch 2007-11-25 07:33:18 UTC
Matthias, please commit Mike's patch.
Comment 26 James Cloos 2007-12-04 14:30:36 UTC
The U1000XXXX → UXXXX and U1001XXXX → U1XXXX issue is fixed in commit 438d02ebc08ee171cf1d3936f4c81050d428ab92.

I believe that covers the last issues here?
Is this one ready to close?
Comment 27 Mike FABIAN 2007-12-05 01:24:20 UTC
James Cloos> I believe that covers the last issues here?
James Cloos> Is this one ready to close?

Yes, I also think this was the last issue here and this bug can be closed.
Comment 28 Matthias Hopf 2007-12-21 06:05:54 UTC
Verified. This is fixed in git.

Thanks Mike!

