Bug 64173

Summary:	PIM: hard-code collation (Pinyin, phonebook)
Product:	SyncEvolution	Reporter:	Patrick Ohly <patrick.ohly>
Component:	PIM Manager	Assignee:	SyncEvolution Community <syncevolution-issues>
Status:	RESOLVED MOVED	QA Contact:
Severity:	enhancement
Priority:	medium	CC:	syncevolution-issues, tristan.van.berkom
Version:	1.3.99.3
Hardware:	Other
OS:	All
Whiteboard:
Bug Depends on:
Bug Blocks:	56141

Description Patrick Ohly 2013-05-03 06:59:52 UTC

When sorting with Pinyin, names with Chinese characters need to be mixed with Western names using the Pinyin transliteration of the Chinese characters.

Supporting this might be as easy as selecting the Pinyin collation inside the zh_CN.UTF-* locale. This might be the default already. If not, we need an explicit additional setting (env variable?!) to select the collation (may be useful anyway) and/or hard-coded default collations for certain locales.

We also need test cases.

Comment 1 Patrick Ohly 2013-05-06 10:47:58 UTC

Here are four names, one per line:
Adams
Jeffries
江
Meadows

江 has Jiang as its Pinyin representation, so a collation based on Pinyin should sort as shown above (江 = Jiang after Jeffries and before Meadows). At least that's my understanding.

Unfortunately, I cannot reproduce this with the ICU web tool:
http://demo.icu-project.org/icu-bin/locexp?_=zh&d_=en&x=col&collation=pinyin

To reproduce, replace the "Source" text with the names above and hit "sort".
I get:
江
Adams
Jeffries
Meadows

Selecting and deselecting "Pinyin" as sort order has an effect. With the default sort order, 江 comes last.

Either the expected ordering above is wrong, ICU is designed to work differently than I expect, or there is a bug in ICU (not likely?!).

Comment 2 Patrick Ohly 2013-05-06 15:50:05 UTC

Need help from a localization expert. I've contacted some colleagues at Intel who work on that.

Comment 3 Patrick Ohly 2013-05-13 11:43:23 UTC

(In reply to comment #1)
> Here are four names, one per line:
> Adams
> Jeffries
> 江
> Meadows
> 
> 江 has Jiang as its Pinyin representation, so a collation based on Pinyin should
> sort as shown above (江 = Jiang after Jeffries and before Meadows). At least
> that's my understanding.

A Chinese colleague confirmed that this is indeed what he expects.

From the icu-support mailing list:

-----------------------

From: 	Mark Davis ☕ <mark@macchiato.com>
Reply-to: 	ICU support mailing list <icu-support@lists.sourceforge.net>
To: 	ICU support mailing list <icu-support@lists.sourceforge.net>
Subject: 	Re: [icu-support] pinyin sorting in zh_CN.UTF-8
Date: 	Mon, 13 May 2013 13:02:11 +0200


People have different expectations for pinyin. Some possibilities are:
 1. Sort Chinese characters in pinyin order, but separate from Latin.
 2. Sort them interleaved with Latin, by the first character.
 3. Sort them fully interleaved with Latin.
For #2, the easiest way to do it is with the Alphabetic index. For #3, the best is to use a Han-Latin transliterator to get a key, then sort by that key.

------------------------

We now know that ICU implements option 1, so implementing the expected outcome will be more work. We also need to determine whether #2 or #3 is expected.
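
To make the difference between the three options concrete, here is a minimal Python sketch. It is not ICU code: the PINYIN table is a tiny hand-written stand-in for a real Han-Latin transliterator, the helper names are made up for this example, and "Jones" is added to the names from comment #1 only so that options 2 and 3 produce different results. Placing Han names after Latin ones in option 1, and after Latin names inside a letter bucket in option 2, are assumptions of the sketch.

```python
# Hypothetical, hand-written Han -> Pinyin table for illustration only.
# A real implementation would use a Han-Latin transliterator instead.
PINYIN = {"江": "jiang"}


def is_han(ch):
    # Only checks the basic CJK Unified Ideographs block; enough for this sketch.
    return "\u4e00" <= ch <= "\u9fff"


def transliterate(name):
    # Replace each known Han character by its Pinyin, then fold case.
    return "".join(PINYIN.get(ch, ch) for ch in name).lower()


def key_separate(name):
    # Option 1: Han names in Pinyin order, but kept apart from Latin names
    # (assumption: Han group sorts after the Latin group).
    return (1 if is_han(name[0]) else 0, transliterate(name))


def key_first_letter(name):
    # Option 2: interleave by first-letter bucket only
    # (assumption: inside a bucket, Latin names precede Han names).
    t = transliterate(name)
    return (t[0], 1 if is_han(name[0]) else 0, t)


def key_interleaved(name):
    # Option 3: sort purely by the transliterated key.
    return transliterate(name)


names = ["Meadows", "江", "Adams", "Jones", "Jeffries"]
by_option = {
    1: sorted(names, key=key_separate),
    2: sorted(names, key=key_first_letter),
    3: sorted(names, key=key_interleaved),
}

for option, ordered in by_option.items():
    print(option, ordered)
```

With this input, only option 3 puts 江 between Jeffries and Jones, which matches the ordering expected in comment #1.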

Comment 4 Murray Cumming 2013-05-14 08:48:31 UTC

> A Chinese colleague confirmed that this is indeed what he expects.
[snip]

It would be nice if we could base this on some standard that's written down somewhere, or more thoroughly documented as being de-facto common.

> We now know that ICU implements option 1, so implementing the expected
> outcome will be more work. We also need to determine whether #2 or #3 are
> expected.

It seems a little odd that ICU doesn't do something that is apparently so common.

Comment 5 Patrick Ohly 2013-05-14 09:06:55 UTC

(In reply to comment #4)
> > A Chinese colleague confirmed that this is indeed what he expects.
> [snip]
> 
> It would be nice if we could base this on some standard that's written down
> somewhere, or more thoroughly documented as being de-facto common.

I suspect that there is no such document.

> > We now know that ICU implements option 1, so implementing the expected
> > outcome will be more work. We also need to determine whether #2 or #3 are
> > expected.
> 
> It seems a little odd that ICU doesn't do something that is apparently so common.

My understanding is that all three options are valid, so ICU simply picked one. Perhaps they didn't pick the most popular one.

Comment 6 Patrick Ohly 2013-06-12 06:41:27 UTC

LocaleFactoryBoost::genLocale() implements a hard-coded list of languages where "phonebook" collation is desirable. Currently this is "de" and "fi".
We could use it in all cases, except that ICU has a bug where it does not fall back properly to the base collation. See http://sourceforge.net/mailarchive/message.php?msg_id=30802924 and http://bugs.icu-project.org/trac/ticket/10149

In addition, fully interleaved Pinyin-based sorting is used for "zh". This requires an extra transliteration of Han->Latin, because ICU itself sorts Chinese characters after Latin ones when using the "Pinyin" collation.

EDS implements the same logic in the new ECollator utility class, scheduled for EDS 3.10 and included in the openismus-work-3-8 branch. SyncEvolution's PIM Manager should use these classes.
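
The selection logic described in this comment can be sketched roughly as follows. This is an illustration in Python, not the actual LocaleFactoryBoost::genLocale() code: the function name collation_locale and its return shape are hypothetical. The "@collation=..." suffix follows the ICU locale keyword syntax; how the real code hands the collation to ICU is not shown here.

```python
# Languages for which "phonebook" collation is requested, as listed above.
PHONEBOOK_LANGUAGES = {"de", "fi"}


def collation_locale(locale_id):
    """Map a POSIX-style locale to (ICU-style locale ID, needs_transliteration).

    Hypothetical helper for illustration only.
    """
    # Strip encoding and modifier: "de_DE.UTF-8@euro" -> "de_DE".
    base = locale_id.split(".", 1)[0].split("@", 1)[0]
    language = base.split("_", 1)[0].lower()

    if language in PHONEBOOK_LANGUAGES:
        # Hard-coded list instead of "always phonebook", because of the
        # missing fallback to the base collation mentioned above.
        return base + "@collation=phonebook", False
    if language == "zh":
        # The Pinyin collation alone keeps Han and Latin apart, so the caller
        # additionally has to transliterate Han -> Latin before sorting.
        return base + "@collation=pinyin", True
    return base, False


for loc in ("de_DE.UTF-8", "fi_FI.UTF-8", "zh_CN.UTF-8", "en_US.UTF-8"):
    print(loc, "->", collation_locale(loc))
```

Keeping the language list in one set means that once the ICU fallback problem is resolved, the special case can be dropped by requesting "phonebook" unconditionally.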

Comment 7 Patrick Ohly 2013-11-19 16:13:41 UTC

(In reply to comment #6)
> EDS implements the same logic in the new ECollator utility class, scheduled
> for EDS 3.10 and included in the openismus-work-3-8 branch. SyncEvolution's
> PIM Manager should use these classes.

The current EDS APIs lead to a slight performance degradation: ICU uses std::string, EDS copies that into a string of its own, and SyncEvolution then recreates a std::string. A C++ API in EDS using std::string would be more useful.

For performance reasons I kept the code which uses ICU directly.

Comment 8 GitLab Migration User 2018-10-13 12:45:21 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/SyncEvolution/syncevolution/issues/148.
