When the selection covers a ligature presentation form where a single character code corresponds to multiple Unicode codepoints, the glyph for the ligature is drawn multiple times, once for each Unicode character. This is particularly apparent when the patch to bug 8986 is applied. Patch to follow.
Created attachment 7769 [details] [review] ligature-selection-paint.patch Patch.
Hm. Actually, that sucks; it draws ligature glyphs twice. I'll submit a better patch when I have time.
Created attachment 7795 [details] [review] ligature-selection-paint.patch Yes, much simpler.
Created attachment 16984 [details] [review] ligature-selection-paint.patch
Doesn't changing "c" will result in pdftotext et al giving strange results? Also i don't understand why the two loops, can you explain?
Created attachment 82508 [details] [review] ligature-selection-paint.patch Updated patch.
Actually, this isn't needed any more; the change to use the text matrix means the ligatures are painted in the same position each time, so it isn't worth the extra effort to skip them. Closing.
Scratch that, it is needed; the multiple painting is visible as darkening of the alpha-blended edges, especially when rendering the selection in a non-foreground window (so black on grey instead of white on blue). Reopening.
(In reply to comment #5) > Doesn't changing "c" will result in pdftotext et al giving strange results? charcode is only used by TextSelectionPainter::visitWord. (If it was used by anything else we'd see ligatures being repeated.) > Also i don't understand why the two loops, can you explain? If the start of the selection is halfway through a ligature, we need to walk backwards to find the "real" character at the start of the ligature. Cleaned up and simplified.
How can I reproduce this? could you attach a screenshot showing the problem? Btw, the patch won't apply to current git master, please update your repo and rebase the patch. Thanks.
Created attachment 82640 [details] [review] ligature-selection-paint.patch
Created attachment 82641 [details] ligature-overpaint.png Screenshot showing issue (using Evince). pdf is from bug 7002. Updated patch to current git.
Created attachment 82642 [details] [review] ligature-selection-paint.patch
(In reply to comment #12) > Created attachment 82641 [details] > ligature-overpaint.png > > Screenshot showing issue (using Evince). pdf is from bug 7002. Thanks! I see the problem now.
Comment on attachment 82642 [details] [review] ligature-selection-paint.patch Review of attachment 82642 [details] [review]: ----------------------------------------------------------------- Patch works. I wonder if we should return the ligatures as a single character instead so that we don't need a special case. ::: poppler/TextOutputDev.cc @@ +2392,4 @@ > w1 /= uLen; > h1 /= uLen; > for (i = 0; i < uLen; ++i) { > + if (i > 0) c = CHARCODE_LIGATED; Could you explain why this means it's a ligature?
(In reply to comment #15) > Patch works. I wonder if we should return the ligatures as a single > character instead so that we don't need a special case. We currently support selecting individual characters within a ligature; it'd be a shame to lose that. > ::: poppler/TextOutputDev.cc > @@ +2392,4 @@ > > w1 /= uLen; > > h1 /= uLen; > > for (i = 0; i < uLen; ++i) { > > + if (i > 0) c = CHARCODE_LIGATED; > > Could you explain why this means it's a ligature? uLen is greater than 1 when a single CharCode (i.e. a glyph) signifies multiple Unicode codepoints. In English text, that typically occurs when the glyph is a ligature. In other scripts that might not be the case (http://unicode.org/Public/UNIDATA/NamedSequences.txt) but if so we don't handle those correctly anyway; by chopping up the space occupied by the glyph we're assuming it's a ligature. Maybe a better name like CHARCODE_GLYPH_CONTINUATION would be clearer?
(In reply to comment #16) > (In reply to comment #15) > > Patch works. I wonder if we should return the ligatures as a single > > character instead so that we don't need a special case. > > We currently support selecting individual characters within a ligature; it'd > be a shame to lose that. You are right. > > ::: poppler/TextOutputDev.cc > > @@ +2392,4 @@ > > > w1 /= uLen; > > > h1 /= uLen; > > > for (i = 0; i < uLen; ++i) { > > > + if (i > 0) c = CHARCODE_LIGATED; > > > > Could you explain why this means it's a ligature? > > uLen is greater than 1 when a single CharCode (i.e. a glyph) signifies > multiple Unicode codepoints. In English text, that typically occurs when > the glyph is a ligature. In other scripts that might not be the case > (http://unicode.org/Public/UNIDATA/NamedSequences.txt) but if so we don't > handle those correctly anyway; by chopping up the space occupied by the > glyph we're assuming it's a ligature. Thanks for the explanation. > Maybe a better name like CHARCODE_GLYPH_CONTINUATION would be clearer? Ok, I think I understand better the problem now. The unicode of words is normalized, but charcode of caharcters corresponds to the ligature, so that we render ligatures, but extracted text and characters are split. This means that for split characters we have the same charcode, but different text and bounding box. This allows us to select individual characters of a ligature or search for individual characters as well. Is this right? Assuming it's right, what is common in all characters of a ligature while rendering is the charcode and the transformation matrix, so I wonder if we could generalize it and avoid rendering the same character twice always with something liked this: if (i > begin && sel->word->charcode[i - 1] == sel->word->charcode[i] && sel->word->textMat[i - 1].m[4] == sel->word->textMat[i].m[4] && sel->word->textMat[i - 1].m[5] == sel->word->textMat[i].m[5]) continue; Your patch works fine, but changing the char looks a bit like a hack to me.
Created attachment 82753 [details] [review] ligature-selection-paint.patch (In reply to comment #17) > Ok, I think I understand better the problem now. The unicode of words is > normalized, but charcode of caharcters corresponds to the ligature, so that > we render ligatures, but extracted text and characters are split. This means > that for split characters we have the same charcode, but different text and > bounding box. This allows us to select individual characters of a ligature > or search for individual characters as well. Is this right? Assuming it's > right, what is common in all characters of a ligature while rendering is the > charcode and the transformation matrix, so I wonder if we could generalize > it and avoid rendering the same character twice always with something liked > this: > > if (i > begin && > sel->word->charcode[i - 1] == sel->word->charcode[i] && > sel->word->textMat[i - 1].m[4] == sel->word->textMat[i].m[4] && > sel->word->textMat[i - 1].m[5] == sel->word->textMat[i].m[5]) > continue; > > Your patch works fine, but changing the char looks a bit like a hack to me. Yes, definitely. I'm a bit uncomfortable using the transformation matrix to identify characters as being from the same glyph; I'd prefer to use charPos as that references glyph identity directly: if (i != begin && sel->word->charPos[i] == sel->word->charPos[i - 1]) continue;
(In reply to comment #18) > Created attachment 82753 [details] [review] [review] > ligature-selection-paint.patch > > (In reply to comment #17) > > Ok, I think I understand better the problem now. The unicode of words is > > normalized, but charcode of caharcters corresponds to the ligature, so that > > we render ligatures, but extracted text and characters are split. This means > > that for split characters we have the same charcode, but different text and > > bounding box. This allows us to select individual characters of a ligature > > or search for individual characters as well. Is this right? Assuming it's > > right, what is common in all characters of a ligature while rendering is the > > charcode and the transformation matrix, so I wonder if we could generalize > > it and avoid rendering the same character twice always with something liked > > this: > > > > if (i > begin && > > sel->word->charcode[i - 1] == sel->word->charcode[i] && > > sel->word->textMat[i - 1].m[4] == sel->word->textMat[i].m[4] && > > sel->word->textMat[i - 1].m[5] == sel->word->textMat[i].m[5]) > > continue; > > > > Your patch works fine, but changing the char looks a bit like a hack to me. > > Yes, definitely. I'm a bit uncomfortable using the transformation matrix to > identify characters as being from the same glyph; I'd prefer to use charPos > as that references glyph identity directly: > > if (i != begin && sel->word->charPos[i] == sel->word->charPos[i - 1]) > continue; Yes, this is a lot cleaner, thanks! I've just pushed it to git master.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.