Summary: | PIM: hard-code collation (Pinyin, phonebook) | | |
---|---|---|---|
Product: | SyncEvolution | Reporter: | Patrick Ohly <patrick.ohly> |
Component: | PIM Manager | Assignee: | SyncEvolution Community <syncevolution-issues> |
Status: | RESOLVED MOVED | QA Contact: | |
Severity: | enhancement | | |
Priority: | medium | CC: | syncevolution-issues, tristan.van.berkom |
Version: | 1.3.99.3 | | |
Hardware: | Other | | |
OS: | All | | |
Whiteboard: | | | |
Bug Depends on: | | | |
Bug Blocks: | 56141 | | |
Description
Patrick Ohly
2013-05-03 06:59:52 UTC
Here are four names, one per line:

Adams
Jeffries
江
Meadows

江 has Jiang as its Pinyin representation, so a collation based on Pinyin should sort as shown above (江 = Jiang after Jeffries and before Meadows). At least that's my understanding.

Unfortunately, I cannot reproduce this with the ICU web tool:
http://demo.icu-project.org/icu-bin/locexp?_=zh&d_=en&x=col&collation=pinyin

To reproduce, replace the "Source" text with the names above and hit "sort". I get:

江
Adams
Jeffries
Meadows

Selecting and deselecting "Pinyin" as sort order has an effect. With the default sort order, 江 comes last.

Either the expected ordering above is wrong, ICU doesn't work as expected, or there is a bug in it (not likely?!).

Need help from a localization expert. I've contacted some colleagues at Intel working on that.

(In reply to comment #1)
> Here are four names, one per line:
> Adams
> Jeffries
> 江
> Meadows
>
> 江 has Jiang as its Pinyin representation, so a collation based on Pinyin should
> sort as shown above (江 = Jiang after Jeffries and before Meadows). At least
> that's my understanding.

A Chinese colleague confirmed that this is indeed what he expects.

From the icu-support mailing list:

-----------------------
From: Mark Davis ☕ <mark@macchiato.com>
Reply-to: ICU support mailing list <icu-support@lists.sourceforge.net>
To: ICU support mailing list <icu-support@lists.sourceforge.net>
Subject: Re: [icu-support] pinyin sorting in zh_CN.UTF-8
Date: Mon, 13 May 2013 13:02:11 +0200

People have different expectations for pinyin. Some possibilities are:

1. Sort Chinese characters in pinyin order, but separate from Latin.
2. Sort them interleaved with Latin, by the first character.
3. Sort them fully interleaved with Latin.

For #2, the easiest way to do it is with the Alphabetic index. For #3, the best is to use a Han-Latin transliterator to get a key, then sort by that key.
------------------------

We now know that ICU implements option 1, so implementing the expected outcome will be more work. We also need to determine whether #2 or #3 is expected.

> A Chinese colleague confirmed that this is indeed what he expects.
[snip]

It would be nice if we could base this on some standard that's written down somewhere, or more thoroughly documented as being de-facto common.

> We now know that ICU implements option 1, so implementing the expected
> outcome will be more work. We also need to determine whether #2 or #3 is
> expected.

It seems a little odd that ICU doesn't do something that is apparently so common.

(In reply to comment #4)
> > A Chinese colleague confirmed that this is indeed what he expects.
> [snip]
>
> It would be nice if we could base this on some standard that's written down
> somewhere, or more thoroughly documented as being de-facto common.

I suspect that there is no such document.

> > We now know that ICU implements option 1, so implementing the expected
> > outcome will be more work. We also need to determine whether #2 or #3 is
> > expected.
>
> It seems a little odd that ICU doesn't do something that is apparently so common.

My understanding is that all three options are valid, so ICU simply picked one. Perhaps they didn't pick the most popular one.

LocaleFactoryBoost::genLocale() implements a hard-coded list of languages where "phonebook" collation is desirable. Currently this is "de" and "fi". We could use it in all cases, except that ICU has a bug where it does not fall back properly to the base collation. See http://sourceforge.net/mailarchive/message.php?msg_id=30802924 and http://bugs.icu-project.org/trac/ticket/10149

In addition, fully interleaved Pinyin-based sorting is used for "zh". This requires an extra transliteration of Han->Latin, because ICU itself sorts Chinese characters after Latin ones when using the "Pinyin" collation.

EDS implements the same logic in the new ECollator utility class, scheduled for EDS 3.10 and included in the openismus-work-3-8 branch. SyncEvolution's PIM Manager should use these classes.
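
For readers following the thread, the difference between option 1 (Han sorted in Pinyin order but kept separate from Latin) and option 3 (fully interleaved via a transliterated sort key) can be sketched roughly as below. This is only an illustration in Python, not the SyncEvolution, EDS, or ICU code: the `PINYIN` table is a hand-written stand-in for a real Han-Latin transliterator, the function names are hypothetical, and placing Han after Latin in option 1 is an assumption taken from the comment above.

```python
# Toy illustration of two Pinyin sort strategies discussed in this bug.
# PINYIN is a hypothetical, hand-written table; a real implementation would
# use a proper Han->Latin transliterator instead.
PINYIN = {"江": "jiang", "李": "li", "王": "wang"}


def is_han(ch):
    """Rough check for the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def latin_key(name):
    """Replace known Han characters by their Pinyin, then case-fold."""
    return "".join(PINYIN.get(ch, ch) for ch in name).casefold()


def sort_separate(names):
    """Option 1: Latin names first, then Han names in Pinyin order."""
    return sorted(names, key=lambda n: (bool(n) and is_han(n[0]), latin_key(n)))


def sort_interleaved(names):
    """Option 3: every name is sorted by its transliterated key."""
    return sorted(names, key=latin_key)


names = ["Meadows", "江", "Adams", "Jeffries"]
print(sort_separate(names))     # Han names grouped after the Latin ones
print(sort_interleaved(names))  # 江 (jiang) lands between Jeffries and Meadows
```

The interleaved variant gives the ordering that the original description expects. A plain string comparison of the keys is used here only to keep the sketch short; real code would compare the transliterated keys with a locale-aware collator.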
(In reply to comment #6)
> EDS implements the same logic in the new ECollator utility class, scheduled
> for EDS 3.10 and included in the openismus-work-3-8 branch. SyncEvolution's
> PIM Manager should use these classes.

The current EDS APIs lead to a slight performance degradation: ICU uses std::string, EDS copies into a string, and SyncEvolution recreates a std::string. A C++ API in EDS using std::string would be more useful.

For performance reasons I kept the code which uses ICU directly.

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed to further activity.

You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/SyncEvolution/syncevolution/issues/148.