Bug 54843 - Bad righthyphenmin for 3-byte or more UTF-8 multibyte characters
Summary: Bad righthyphenmin for 3-byte or more UTF-8 multibyte characters
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
Version: 4.0.0.0.alpha0+ Master
Hardware: Other All
Importance: medium normal
Assignee: Not Assigned
QA Contact:
URL:
Whiteboard: target:3.7.0
Keywords:
Depends on:
Blocks:
 
Reported: 2012-09-13 07:47 UTC by László Németh
Modified: 2012-09-14 08:43 UTC


Attachments
Telugu test example (12.36 KB, application/vnd.oasis.opendocument.text)
2012-09-13 07:47 UTC, László Németh

Description László Németh 2012-09-13 07:47:34 UTC
Created attachment 67077
Telugu test example

(From the bug report by Steven Dickson:)

There appears to be a logic error in the hnj_hyphen_rhmin function in the file hyphen.c. The function is supposed to remove hyphens from the right-hand side of a word based on the value of RIGHTHYPHENMIN defined in the hyphenation pattern file for the language. It works properly for words containing only single-byte characters, but can fail if the word contains multi-byte characters.

 
The code erroneously assumes that the last character of the word is a single-byte character and starts scanning the word at the next-to-last byte of the word. This can be corrected by initializing the character count variable, i, to 0 rather than 1 and starting the for loop with j = word_size - 1 rather than j = word_size - 2.
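
To illustrate why the starting position matters, here is a minimal sketch (not code from hyphen.c). The helper names is_utf8_continuation and last_char_start are hypothetical, introduced only for this example. It assumes valid UTF-8 input and word_size > 0, and shows that the last character of a word can begin well before the next-to-last byte.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if b is a UTF-8 continuation byte,
 * i.e. it has the bit pattern 10xxxxxx. */
static int is_utf8_continuation(unsigned char b) {
    return (b & 0xc0) == 0x80;
}

/* Hypothetical helper: returns the byte index at which the last
 * character of the word begins. Assumes valid UTF-8 and word_size > 0.
 * For a word ending in a single-byte character this is word_size - 1;
 * for a word ending in a 3-byte character it is word_size - 3. */
static int last_char_start(const char *word, int word_size) {
    int j = word_size - 1;
    /* Step left over continuation bytes until a lead byte or a
     * single-byte character is reached. */
    while (j > 0 && is_utf8_continuation((unsigned char)word[j])) {
        j--;
    }
    return j;
}
```

For example, the Telugu letter KA (U+0C15) is encoded as the three bytes E0 B0 95. For the 4-byte word "a" followed by that letter, last_char_start returns 1, so a scan that treats the final byte as a complete character and resumes at word_size - 2 starts in the middle of the last character.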

 
The code also erroneously increments the character count variable, i, while still inside a multi-byte character. This can be corrected by incrementing i only at the first byte of a multi-byte character ((word[j] & 0xc0) == 0xc0) or at a single-byte character ((word[j] & 0x80) != 0x80).
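
The effect of the two conditions can be compared with a small standalone sketch. This is not the hnj_hyphen_rhmin function itself: count_chars_old and count_chars_fixed are hypothetical names, and each simply scans a whole UTF-8 word from right to left and applies one of the two increment conditions to every byte. Valid UTF-8 input is assumed.

```c
#include <stddef.h>

/* Hypothetical helper using the old condition: counts every byte that
 * is not a lead byte, so each continuation byte of a multi-byte
 * character is counted as if it were a character. */
static int count_chars_old(const char *word, int word_size) {
    int i = 0;
    for (int j = word_size - 1; j >= 0; j--) {
        unsigned char b = (unsigned char)word[j];
        if ((b & 0xc0) != 0xc0) i++;
    }
    return i;
}

/* Hypothetical helper using the corrected condition: counts a byte
 * only if it is the lead byte of a multi-byte character (11xxxxxx)
 * or a single-byte character (0xxxxxxx). */
static int count_chars_fixed(const char *word, int word_size) {
    int i = 0;
    for (int j = word_size - 1; j >= 0; j--) {
        unsigned char b = (unsigned char)word[j];
        if ((b & 0xc0) == 0xc0 || (b & 0x80) != 0x80) i++;
    }
    return i;
}
```

For the single Telugu letter KA (bytes E0 B0 95), count_chars_fixed returns 1 while count_chars_old returns 2, because the old condition counts both continuation bytes and skips the lead byte. For 2-byte characters the old condition happens to give the right count (one continuation byte per character), which is consistent with the problem showing up only for characters of 3 bytes or more.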

A diff of hyphen.c with the corrections follows.

737c737
<     int i = 1;
---
>     int i = 0;
743c743
<     for (j = word_size - 2; i < rhmin && j > 0; j--) {
---
>     for (j = word_size - 1; i < rhmin && j > 0; j--) {
756c756
<        if (!utf8 || (word[j] & 0xc0) != 0xc0) i++;
---
>        if (!utf8 || (word[j] & 0xc0) == 0xc0 || (word[j] & 0x80) != 0x80) i++;
Comment 1 László Németh 2012-09-13 08:00:10 UTC
Also fixed in the Hyphen CVS: http://hunspell.cvs.sourceforge.net/viewvc/hunspell/hyphen/
Comment 2 Not Assigned 2012-09-14 08:43:58 UTC
Laszlo Nemeth committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=3d654071413bc107e0730dd31261c252f71572bf

fdo#54843 righthyphenmin fix (patch by Steven Dickson)



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.

