Bug 46603

Summary:	some glyphs not displayed when text is selected
Product:	poppler	Reporter:	Jason Crain <jason>
Component:	general	Assignee:	poppler-bugs <poppler-bugs>
Status:	RESOLVED MOVED	QA Contact:
Severity:	normal
Priority:	medium	CC:	christopher.m.penalver
Version:	unspecified
Hardware:	x86 (IA32)
OS:	Linux (All)
URL:	https://bugzilla.gnome.org/show_bug.cgi?id=654473
Whiteboard:
i915 platform:		i915 features:
Attachments:	test pdf incorrect display in evince correct display in evince Fixed display for selected glyph in ActualText span Enable displayed chars to map to any number of text chars Fixes display for selected glyphs in ActualText span fix selection of glyphs in actualtext Enable displayed chars to map to any number of text chars The said pdf convert utf-16 to ucs-4 when reading ToUnicode move text string to unicode conversion into a separate function fix selection of glyphs in actualtext fix regressions Enable displayed chars to map to any number of text chars Don't reverse order of words with same xMin The pdf with the "white X in patch" problem don't start a new word if the previous char is a control char

Description Jason Crain 2012-02-25 01:13:44 UTC

Created attachment 57622 [details]
test pdf

Forwarding from gnome bugzilla: https://bugzilla.gnome.org/show_bug.cgi?id=654473

When selecting text in this pdf, some glyphs are not visible.  Text is displayed correctly when not selected.

Comment 1 Jason Crain 2012-02-25 01:14:58 UTC

Created attachment 57623 [details]
incorrect display in evince

Comment 2 Jason Crain 2012-02-25 01:15:35 UTC

Created attachment 57624 [details]
correct display in evince

Comment 3 Jason Crain 2012-02-25 01:25:40 UTC

Created attachment 57625 [details] [review]
Fixed display for selected glyph in ActualText span

This patch seems to correct the issue.  It sets the correct CharCode.

Comment 4 Albert Astals Cid 2012-02-25 07:45:38 UTC

I don't think this patch is correct, you are only setting the charcode once in ActualText::addChar so if multiple ActualText::addChar calls happen before ActualText::end the other charcodes are lost, no?

Comment 5 Adrian Johnson 2012-02-26 04:03:40 UTC

The patch doesn't work when the ActualText span contains more than one glyph. There is a test case in the test suite that demonstrates the problem:

http://cgit.freedesktop.org/poppler/test/tree/unittestcases/WithActualText.pdf

Comment 6 Jason Crain 2012-03-03 16:40:29 UTC

Created attachment 57989 [details] [review]
Enable displayed chars to map to any number of text chars

It's tricky when the length of the ActualText does not match the number of displayed glyphs.  This first patch modifies the TextWord, TextLine, TextLineFrag, TextBlock, and TextPage classes to suport displayed characters that can be mapped to any number of text characters.

Comment 7 Jason Crain 2012-03-03 16:46:38 UTC

Created attachment 57990 [details] [review]
Fixes display for selected glyphs in ActualText span

This sets the correct CharCode for each glyph in an ActualText span.  Attempts to match one text character to each glyph.  If there are more glyphs, they are added without matching text.  If there is more text, the remaining is added to the last glyph.

Comment 8 Adrian Johnson 2012-03-03 21:27:44 UTC

Created attachment 57993 [details] [review]
fix selection of glyphs in actualtext

Thanks for these patches. I have started looking at some of the text related bugs and the inability of TextOuputDev to understand the difference between glyphs and characters is the cause of some of these bugs. The first patch is very similar to the solution I had in mind. The first patch also fixes bug 9001.

Some comments on the first patch:

The following code in TextBlock::coalesce() needs fixing:
	  if (word2->len == word0->len &&
	      !memcmp(word2->text, word0->text,
		      word0->len * sizeof(Unicode))) {
len need to be replaced with textLen.

I don't think addChar should be renamed to addChars. My understanding of the code is that 'Char' is referring to the CharCodes and only one CharCode is added per call.

I would move the surrogate decoding to TextWord:addChar() and do the decoding as the unicode values are copied into the text array. This avoids having to make a copy of the unicode array in TextPage::addChar().

Some comments on the second patch:

I don't like the way the second patch is mapping the replacement text to the charcodes. I am attaching an alternative patch that distributes the replacement text evenly across the charcodes.

Comment 9 Jason Crain 2012-03-04 20:38:26 UTC

Comment on attachment 57993 [details] [review]
fix selection of glyphs in actualtext

Review of attachment 57993 [details] [review]:
-----------------------------------------------------------------

::: poppler/TextOutputDev.cc
@@ +5331,5 @@
> +    // If this is the last glyph ensure all remaining text is included
> +    // as pos may be < length due to rounding errors.
> +    if (i == lenGlyphs - 1)
> +      count = length - first;
> +    text->addChar(state, glyphs[i].x, glyphs[i].y, glyphs[i].dx, glyphs[i].dy,

This needs to make sure that a surrogate pair is not split

Comment 10 Jason Crain 2012-03-04 21:09:24 UTC

Created attachment 58013 [details] [review]
Enable displayed chars to map to any number of text chars

updated patch

Comment 11 Albert Astals Cid 2012-03-05 15:04:19 UTC

Guys, i'm a bit lost, sorry, are both patches supposed to fix the same issue in a different way?

Comment 12 Adrian Johnson 2012-03-05 15:16:54 UTC

Both patches are required.

The "fix selection of glyphs in actualtext" ensures all charcodes are passed through to TextPage. I need to update this to correctly handle surrogates.

The "enable displayed chars to map to any number of text chars" is a prerequisite for the "fix selection of glyphs in actualtext" patch. It also fixes text selection of ligatures (bug 9001).

Comment 13 Albert Astals Cid 2012-03-05 15:41:38 UTC

When using Jason's patch on a pdf that i will be attaching in a moment, pdftotext changes from extracting

 Remark 8. Ordering a line pencil. Let 𝑥 be a point and 𝐿 a noncompact line passing
through 𝑥. We want to define an order on the set ℒ 𝑥 ∖ {𝐿} in the same way as we
did in the proof of Proposition 5. In the disk model ̃
𝑃 with boundary circle 𝐿, every
line 𝐻 ∈ ℒ 𝑥 ∖{𝐿} separates ̃
𝑃 into an upper part 𝐻 + and a lower part 𝐻 − (the parts may
 be disconnected). Since we know from Propositions 6 and 7 that lines always intersect
 transversally, it follows that for two such lines, one of the respective lower parts is entirely

to

 Remark 8. Ordering a line pencil. Let 𝑥 be a point and 𝐿 a noncompact line passing
through 𝑥. We want to define an order on the set ℒ𝑥 ∖ {𝐿} in the same way as we
 with boundary circle 𝐿, every
did in the proof of Proposition 5. In the disk model 𝑃
 into an upper part 𝐻 + and a lower part 𝐻 − (the parts may
line 𝐻 ∈ ℒ𝑥 ∖{𝐿} separates 𝑃
 be disconnected). Since we know from Propositions 6 and 7 that lines always intersect
 transversally, it follows that for two such lines, one of the respective lower parts is entirely


You can see that in the second case some weird/unwanted reordering happened

Comment 14 Albert Astals Cid 2012-03-05 15:42:20 UTC

Created attachment 58045 [details]
The said pdf

Comment 15 Adrian Johnson 2012-03-08 04:05:41 UTC

Created attachment 58178 [details] [review]
convert utf-16 to ucs-4 when reading ToUnicode

The next two patches fix the problem of "fix selection of glyphs in actualtext" not handling surrogates. The "Unicode" type is meant to be UCS-4 so the solution is to convert UTF-16 to UCS-4 when it the ToUnicode cmap is parsed.

This patch does the UTF-16 conversion in CharCodeToUnicode.cc. As a result the special surrogate handling in TextOutputDev and HtmlOutputDev can be removed.

Comment 16 Adrian Johnson 2012-03-08 04:09:18 UTC

Created attachment 58179 [details] [review]
move text string to unicode conversion into a separate function

This patch adds a new function for converting PDF text strings to UCS-4. As a result the duplicated code in TextOutputDev and pdfinfo can be replaced by a call to this function. 

This patch is to ensure that my updated "fix selection of glyphs in actualtext" does not have to care about surrogates.

Comment 17 Adrian Johnson 2012-03-08 04:12:30 UTC

Created attachment 58180 [details] [review]
fix selection of glyphs in actualtext

The updated version of "fix selection of glyphs in actualtext".

The patch order is:
1 - convert utf-16 to ucs-4 when reading ToUnicode
2 - move text string to unicode conversion into a separate function
3 - Enable displayed chars to map to any number of text chars
4 - fix selection of glyphs in actualtext

Patch 3 needs to be updated to remove the surrogate handling and fix the regression.

Comment 18 Adrian Johnson 2012-03-18 15:18:23 UTC

Created attachment 58643 [details] [review]
fix regressions

This patch fixes the regressions in "Enable displayed chars to map to any number of text chars".

The problem is the changes now allow glyphs that map to zero length unicode strings to be added to TextWords. Often these glyphs have overlapping bounding boxes or are not on the same baseline. This confuses TextOutputDev when trying to determine the layout of the text.

This patch does two things:
- it avoids breaking words when one of these glyphs with an empty mapping is encountered
- it increases the tolerance for overlapping bounding boxes.

With the attached PDF the result the text output is still different but checking the differences it is actually an improvement.

However I suspect the changes could potentially break other PDFs. If this patch causes problems, plan B is to change TextOutputDev to ignore the glyphs with zero mapping when determining the text layout (but still add these glyphs to the words to make text selection work correctly). This should emulate the old behavior as closely as possible.

Comment 19 Adrian Johnson 2012-03-18 15:21:09 UTC

Created attachment 58644 [details] [review]
Enable displayed chars to map to any number of text chars

This is Jason's patch rebased so it applies on top of my first two patches.

The patch order is:
1 - convert utf-16 to ucs-4 when reading ToUnicode
2 - move text string to unicode conversion into a separate function
3 - Enable displayed chars to map to any number of text chars
4 - fix selection of glyphs in actualtext
5 - fix regressions

Comment 20 Albert Astals Cid 2012-03-19 15:30:08 UTC

With all the patches applied the pdftotext extraction of https://www.libreoffice.org/bugzilla/attachment.cgi?id=41459
changes from

  • 
Patches may be grouped with other patches to test the whole of a

to

  •  P
 atches may be grouped with other patches to test the whole of a

The correct extraction would be having no newline, but if we are going to have a newline i very much prefer the old way than the new one that breaks the word after the P

Comment 21 Adrian Johnson 2012-03-22 05:07:01 UTC

Created attachment 58860 [details] [review]
Don't reverse order of words with same xMin

This fixes the regression. Output is now:

•  Patches may be grouped with other patches to test the whole of a

Comment 22 Albert Astals Cid 2012-03-22 15:29:49 UTC

Tehre's still some problems, in the file that i will be attaching

• 
white X in patch b:

gets changed to

•w
 hite X in patch b:

Comment 23 Albert Astals Cid 2012-03-22 15:32:29 UTC

Created attachment 58892 [details]
The pdf with the "white X in patch" problem

Comment 24 Adrian Johnson 2012-03-25 04:48:16 UTC

Created attachment 59003 [details] [review]
don't start a new word if the previous char is a control char

Problem is caused by a ^G that overlaps the first letter of the word. This patch avoids started a new word if a control characters overlaps other character. Although it would probably be better to strip out control characters from the extracted text. Acroread does not include the ^G in extracted text.

Comment 25 Albert Astals Cid 2012-03-25 08:17:29 UTC

With this last patch now we get
       P
atch creation date
  instead of        
Patch creation date
again in https://www.libreoffice.org/bugzilla/attachment.cgi?id=41459

Comment 26 Jason Crain 2014-12-17 13:33:13 UTC

*** Bug 87401 has been marked as a duplicate of this bug. ***

Comment 27 GitLab Migration User 2018-08-21 10:44:56 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/350.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.