Bug 9001

Summary:	Ligated characters are drawn multiple times when selected
Product:	poppler	Reporter:	Ed Catmur <ed>
Component:	general	Assignee:	poppler-bugs <poppler-bugs>
Status:	RESOLVED FIXED	QA Contact:
Severity:	normal
Priority:	medium	CC:	ajohnson, freedesktop, jason, mpsuzuki, uws+freedesktop
Version:	unspecified
Hardware:	x86 (IA32)
OS:	All
Whiteboard:
i915 platform:		i915 features:
Attachments:	ligature-selection-paint.patch ligature-selection-paint.patch ligature-selection-paint.patch ligature-selection-paint.patch ligature-selection-paint.patch ligature-overpaint.png ligature-selection-paint.patch ligature-selection-paint.patch

Description Ed Catmur 2006-11-13 01:55:50 UTC

When the selection covers a ligature presentation form where a single character
code corresponds to multiple Unicode codepoints, the glyph for the ligature is
drawn multiple times, once for each Unicode character.

This is particularly apparent when the patch to bug 8986 is applied.

Patch to follow.

Comment 1 Ed Catmur 2006-11-13 01:58:51 UTC

Created attachment 7769 [details] [review]
ligature-selection-paint.patch

Patch.

Comment 2 Ed Catmur 2006-11-13 01:59:43 UTC

Hm. Actually, that sucks; it draws ligature glyphs twice. I'll submit a better
patch when I have time.

Comment 3 Ed Catmur 2006-11-15 01:09:38 UTC

Created attachment 7795 [details] [review]
ligature-selection-paint.patch

Yes, much simpler.

Comment 4 Ed Catmur 2008-06-07 13:36:55 UTC

Created attachment 16984 [details] [review]
ligature-selection-paint.patch

Comment 5 Albert Astals Cid 2008-06-07 14:14:35 UTC

Doesn't changing "c" will result in pdftotext et al giving strange results?

Also i don't understand why the two loops, can you explain?

Comment 6 Ed Catmur 2013-07-16 22:54:53 UTC

Created attachment 82508 [details] [review]
ligature-selection-paint.patch

Updated patch.

Comment 7 Ed Catmur 2013-07-16 22:56:19 UTC

Actually, this isn't needed any more; the change to use the text matrix means the ligatures are painted in the same position each time, so it isn't worth the extra effort to skip them.  Closing.

Comment 8 Ed Catmur 2013-07-16 23:54:24 UTC

Scratch that, it is needed; the multiple painting is visible as darkening of the alpha-blended edges, especially when rendering the selection in a non-foreground window (so black on grey instead of white on blue).  Reopening.

Comment 9 Ed Catmur 2013-07-16 23:59:57 UTC

(In reply to comment #5)
> Doesn't changing "c" will result in pdftotext et al giving strange results?
charcode is only used by TextSelectionPainter::visitWord.  (If it was used by anything else we'd see ligatures being repeated.)

> Also i don't understand why the two loops, can you explain?
If the start of the selection is halfway through a ligature, we need to walk backwards to find the "real" character at the start of the ligature.  Cleaned up and simplified.

Comment 10 Carlos Garcia Campos 2013-07-18 08:43:09 UTC

How can I reproduce this? could you attach a screenshot showing the problem? Btw, the patch won't apply to current git master, please update your repo and rebase the patch. Thanks.

Comment 11 Ed Catmur 2013-07-18 23:05:19 UTC

Created attachment 82640 [details] [review]
ligature-selection-paint.patch

Comment 12 Ed Catmur 2013-07-18 23:06:20 UTC

Created attachment 82641 [details]
ligature-overpaint.png

Screenshot showing issue (using Evince).  pdf is from bug 7002.

Updated patch to current git.

Comment 13 Ed Catmur 2013-07-18 23:11:38 UTC

Created attachment 82642 [details] [review]
ligature-selection-paint.patch

Comment 14 Carlos Garcia Campos 2013-07-19 11:10:23 UTC

(In reply to comment #12)
> Created attachment 82641 [details]
> ligature-overpaint.png
> 
> Screenshot showing issue (using Evince).  pdf is from bug 7002.

Thanks! I see the problem now.

Comment 15 Carlos Garcia Campos 2013-07-19 11:13:36 UTC

Comment on attachment 82642 [details] [review]
ligature-selection-paint.patch

Review of attachment 82642 [details] [review]:
-----------------------------------------------------------------

Patch works. I wonder if we should return the ligatures as a single character instead so that we don't need a special case.

::: poppler/TextOutputDev.cc
@@ +2392,4 @@
>      w1 /= uLen;
>      h1 /= uLen;
>      for (i = 0; i < uLen; ++i) {
> +      if (i > 0) c = CHARCODE_LIGATED;

Could you explain why this means it's a ligature?

Comment 16 Ed Catmur 2013-07-19 15:04:03 UTC

(In reply to comment #15)
> Patch works. I wonder if we should return the ligatures as a single
> character instead so that we don't need a special case.

We currently support selecting individual characters within a ligature; it'd be a shame to lose that.

> ::: poppler/TextOutputDev.cc
> @@ +2392,4 @@
> >      w1 /= uLen;
> >      h1 /= uLen;
> >      for (i = 0; i < uLen; ++i) {
> > +      if (i > 0) c = CHARCODE_LIGATED;
> 
> Could you explain why this means it's a ligature?

uLen is greater than 1 when a single CharCode (i.e. a glyph) signifies multiple Unicode codepoints.  In English text, that typically occurs when the glyph is a ligature.  In other scripts that might not be the case (http://unicode.org/Public/UNIDATA/NamedSequences.txt) but if so we don't handle those correctly anyway; by chopping up the space occupied by the glyph we're assuming it's a ligature.

Maybe a better name like CHARCODE_GLYPH_CONTINUATION would be clearer?

Comment 17 Carlos Garcia Campos 2013-07-20 11:46:18 UTC

(In reply to comment #16)
> (In reply to comment #15)
> > Patch works. I wonder if we should return the ligatures as a single
> > character instead so that we don't need a special case.
> 
> We currently support selecting individual characters within a ligature; it'd
> be a shame to lose that.

You are right.

> > ::: poppler/TextOutputDev.cc
> > @@ +2392,4 @@
> > >      w1 /= uLen;
> > >      h1 /= uLen;
> > >      for (i = 0; i < uLen; ++i) {
> > > +      if (i > 0) c = CHARCODE_LIGATED;
> > 
> > Could you explain why this means it's a ligature?
> 
> uLen is greater than 1 when a single CharCode (i.e. a glyph) signifies
> multiple Unicode codepoints.  In English text, that typically occurs when
> the glyph is a ligature.  In other scripts that might not be the case
> (http://unicode.org/Public/UNIDATA/NamedSequences.txt) but if so we don't
> handle those correctly anyway; by chopping up the space occupied by the
> glyph we're assuming it's a ligature.

Thanks for the explanation.

> Maybe a better name like CHARCODE_GLYPH_CONTINUATION would be clearer?

Ok, I think I understand better the problem now. The unicode of words is normalized, but charcode of caharcters corresponds to the ligature, so that we render ligatures, but extracted text and characters are split. This means that for split characters we have the same charcode, but different text and bounding box. This allows us to select individual characters of a ligature or search for individual characters as well. Is this right? Assuming it's right, what is common in all characters of a ligature while rendering is the charcode and the transformation matrix, so I wonder if we could generalize it and avoid rendering the same character twice always with something liked this:

if (i > begin &&
    sel->word->charcode[i - 1] == sel->word->charcode[i] &&
    sel->word->textMat[i - 1].m[4] == sel->word->textMat[i].m[4] &&
    sel->word->textMat[i - 1].m[5] == sel->word->textMat[i].m[5])
      continue;

Your patch works fine, but changing the char looks a bit like a hack to me.

Comment 18 Ed Catmur 2013-07-21 00:23:03 UTC

Created attachment 82753 [details] [review]
ligature-selection-paint.patch

(In reply to comment #17)
> Ok, I think I understand better the problem now. The unicode of words is
> normalized, but charcode of caharcters corresponds to the ligature, so that
> we render ligatures, but extracted text and characters are split. This means
> that for split characters we have the same charcode, but different text and
> bounding box. This allows us to select individual characters of a ligature
> or search for individual characters as well. Is this right? Assuming it's
> right, what is common in all characters of a ligature while rendering is the
> charcode and the transformation matrix, so I wonder if we could generalize
> it and avoid rendering the same character twice always with something liked
> this:
> 
> if (i > begin &&
>     sel->word->charcode[i - 1] == sel->word->charcode[i] &&
>     sel->word->textMat[i - 1].m[4] == sel->word->textMat[i].m[4] &&
>     sel->word->textMat[i - 1].m[5] == sel->word->textMat[i].m[5])
>       continue;
> 
> Your patch works fine, but changing the char looks a bit like a hack to me.

Yes, definitely. I'm a bit uncomfortable using the transformation matrix to identify characters as being from the same glyph; I'd prefer to use charPos as that references glyph identity directly:

    if (i != begin && sel->word->charPos[i] == sel->word->charPos[i - 1])
      continue;

Comment 19 Carlos Garcia Campos 2013-07-21 09:09:30 UTC

(In reply to comment #18)
> Created attachment 82753 [details] [review] [review]
> ligature-selection-paint.patch
> 
> (In reply to comment #17)
> > Ok, I think I understand better the problem now. The unicode of words is
> > normalized, but charcode of caharcters corresponds to the ligature, so that
> > we render ligatures, but extracted text and characters are split. This means
> > that for split characters we have the same charcode, but different text and
> > bounding box. This allows us to select individual characters of a ligature
> > or search for individual characters as well. Is this right? Assuming it's
> > right, what is common in all characters of a ligature while rendering is the
> > charcode and the transformation matrix, so I wonder if we could generalize
> > it and avoid rendering the same character twice always with something liked
> > this:
> > 
> > if (i > begin &&
> >     sel->word->charcode[i - 1] == sel->word->charcode[i] &&
> >     sel->word->textMat[i - 1].m[4] == sel->word->textMat[i].m[4] &&
> >     sel->word->textMat[i - 1].m[5] == sel->word->textMat[i].m[5])
> >       continue;
> > 
> > Your patch works fine, but changing the char looks a bit like a hack to me.
> 
> Yes, definitely. I'm a bit uncomfortable using the transformation matrix to
> identify characters as being from the same glyph; I'd prefer to use charPos
> as that references glyph identity directly:
> 
>     if (i != begin && sel->word->charPos[i] == sel->word->charPos[i - 1])
>       continue;

Yes, this is a lot cleaner, thanks! I've just pushed it to git master.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.