Bug 6923

Summary: characters change when selecting text
Product: poppler Reporter: Sebastien Bacher <seb128>
Component: cairo backendAssignee: poppler-bugs <poppler-bugs>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium CC: a9016009, carlosgc, damian01w, freedesktop, gasche.dylc, jason, jose.aliste, juanj.marin, mail, martin.pitt, mitya57, mpsuzuki, phuang, tinka
Version: unspecified   
Hardware: x86 (IA32)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments: Minimal pdf that reproduces the problem to help debugging
Patch fixing the issue
alternate patch comparing gfxFont instead of pointers to FontInfos
updated alternate patch
The pdf with the pdftotext regression
Allow multiple fonts in a TextWord
Allow multiple fonts in a TextWord
Another small test case
Change default fallback character width
The pdf where the patch crashes
Check for NaN in TextPage::addChar
Allow multiple fonts in a TextWord
Unselected text with evince 3.10.3 (Ubuntu 14.04)
Selected/Corrupted text with evince 3.10.3 (Ubuntu 14.04)

Description Sebastien Bacher 2006-05-15 05:53:22 UTC
That bug has been described on
https://launchpad.net/distros/ubuntu/+source/evince/+bug/39890

"I have problems with the following pdf:
www.lacim.uqam.ca/~plouffe/OEIS/citations/MAS-R9821.pdf

When marking text some characters change. This happens on many places in this
pdf-file. To reproduce the bug just mark some random places in the pdf. In some
places the text disappears. In some the characters change totally. In others the
italics is removed.

I use the latest update of dapper. Evince 0.5.2
...
> Thank you for reporting this issue.

I can also reproduce it. I don't think that the different characters are really
different characters, just a different font. For example, the character ']'
(right square bracket) looks like the character '#' in the font cmmi10, and that
is the character that appears when it is highlighted."
Comment 1 Sebastien Bacher 2007-10-18 01:26:40 UTC
That's still happening using the cairo backend and poppler 0.6
Comment 2 Alkis Georgopoulos 2008-08-05 23:31:11 UTC
Created attachment 18147 [details]
Minimal pdf that reproduces the problem to help debugging

I've located exactly when the bug appears, and I've hand-coded a ***minimal*** PDF as a testcase. I'm attaching it, when selecting the text you should see
BCDΓ
transforming to
ABC<display artifact>

(notice the change from BCD to ABC).

Kind regards,
Alkis Georgopoulos
Comment 3 Leonardo Damián Barberón 2009-02-16 15:20:15 UTC
Anyone?
Comment 4 Albert Astals Cid 2009-06-17 14:08:44 UTC
Can you try with a newer poppler?
Comment 5 Sebastien Bacher 2009-09-03 03:07:29 UTC
The example on this bug is still buggy on 0.11.3
Comment 6 Juanjo Marin 2009-12-07 03:00:52 UTC
The example on this bug is still buggy on Evince 2.29.3 with poppler 0.12.0

Comment 7 Andrew Tinka 2010-07-19 18:16:07 UTC
evincebug.pdf still displays the buggy behavior in evince 2.30.3, poppler 0.12.4
Comment 8 Jose Aliste 2011-02-14 16:26:15 UTC
For the problem in the minimal example, the culprit seems to be in TextSelectionPainter::visitWord, which assumes that all the chars in a word use the same font. This is demonstrated by adding a space After "Word:" in the bad line. This also explain why this bug is very common with LaTeX files.
Comment 9 Jose Aliste 2011-02-14 18:24:06 UTC
Created attachment 43360 [details] [review]
Patch fixing the issue

Actually, All the chars in a TextWord MUST share the same font, so I added a check so we also create a new word when the font changes. The attached patch fixes the bug in the minimal example and many other problems when selecting text in PdfLatex-generated files, but there are still issues in rendering selections with Bitmapped fonts. In this case, the glyph seems right but the size is wrong.
Comment 10 Jose Aliste 2011-02-15 00:16:56 UTC
The last patch does not fix either some problems when selecting ligatures in some fonts, see http://java.sun.com/docs/books/jvms/second_edition/ClassFileFormat-final-draft.pdf, page 54 and look for the word "reflective" which is rendered as "reßective" (althought the text select is indeed "reflective". This does not happen with all fonts, for instance, in a pdflatex generated file, I could not reproduce the problem.
Comment 11 Albert Astals Cid 2011-02-15 15:28:23 UTC
You are comparing the pointers of the font infos, shouldn't you be comparing the contents of the pointers?
Comment 12 Jose Aliste 2011-02-16 00:34:26 UTC
I assumed that the only code that is updating the curFont is TextPage::updateFont. If this is true, from what I understood from the code, the fonts array holds unique FontInfos, so it would be safe to compare pointers instead of FontInfos. So, is this right? If so, and you still want me to rework the patch so it compares the fonts infos instead of the pointers, I won't argue and will update the patch, but I'd appreciate an answer so I can have a better understanding of the code, Thanks
Comment 13 Jose Aliste 2011-02-16 02:06:35 UTC
Created attachment 43420 [details] [review]
alternate patch comparing gfxFont instead of pointers to FontInfos

Even if what I say in last comment may be true, I followed your advice and 
updated the patch to compare the gfxFonts instead of the pointers to the TextInfos.
Comment 14 Jose Aliste 2011-02-16 02:14:11 UTC
Created attachment 43421 [details] [review]
updated alternate patch

I apologize for the spam.  Previous patch was clearly wrong. Updated patch.
Comment 15 Albert Astals Cid 2011-02-16 13:24:58 UTC
The patch changes (breaks) the behaviour of pdftotext so that is not acceptable.  You can see that in the attached pdf
"bold strike:"
changes to be extracted to
"bold strike :"
Comment 16 Albert Astals Cid 2011-02-16 13:25:48 UTC
Created attachment 43455 [details]
The pdf with the pdftotext regression
Comment 17 Jose Aliste 2011-02-16 16:22:44 UTC
Thanks for the regression. The problem here is that "bold:" has two fonts since "bold" is italicised and  ":" not, so before the patch, "bold:" is a TextWord and pdftotext get the text right, but drawing the selection is bad since the selected ":"  would be drawn italicised. After the patch, "bold:" gets splitted up into "bold" and ":" so it gets drawn correctly when selected, but you have the regression you pointed out. So, I believe I am stuck with this choice: 

a). I could either allow more than one font on a TextWord, and adapt the code that draws the TextWord to use that fact, so the  selected ":" does not get transform to a italicised ":" when drawing it. 

b) or I could fix the TextDumper to be aware of the fact in some cases there is no space between two TextWord. 

c) Do you have another way? 

I think I want to take approach a) even if it could more complicated, but approach b) seems that it could break more things that it would fix... thinking that when doing selection by words would not work at all. What do you think?
Comment 18 Paul Sladen 2011-02-16 22:47:55 UTC
It's perfectly natural that every letter in a word be in a different font, colour, or style so that needs to be allowed for (the text could even a scan of some handwriting, or several scans), .  ...what pdftotext is trying to do is to replicate what would be found in a Tagged PDF (Section 9.7 of the PDF Reference), effectively synthesising what would ideally be found in '/BMC.../EMC' '/ActualText' contents.

To that end I'm inclined to agree that (b) sounds like non-ideal kludge on top of non-ideal kludge.
Comment 19 Albert Astals Cid 2011-02-17 12:45:47 UTC
I tend to agree with Paul that is fine for a word to have different fonts, but on the other hand one can argue if the ":" in that situation is part of the word or not.

I'd say 1 is a better solution but if you can get 2 with a "clean enough" code i won't be oposed to it either.

Personally i think text selection should have been done in client side (like we do in Okular) and not "modyfing" poppler internal structures but that's a bit too late i guess :D
Comment 20 Paul Sladen 2011-02-17 21:01:53 UTC
Albert: given the lack of a space ( ) in the input, there's shouldn't be one in the output! :)
Comment 21 Albert Astals Cid 2011-02-18 00:18:00 UTC
Paul: Of course, i never said this is an acceptable behaviour. (What makes you think i did?)

I only said that from the technical point of view you can argue that ":" is of course a different word since you won't find the word "strike:" in a dictionary.
Comment 22 gasche.dylc 2011-02-24 03:51:24 UTC
As there been any additional progress on this issue? I must say I have been encountering it quite frequently for a long time, and for one would be very happy to see a fix applied.

Anyway, thanks to anyone working on this.
Comment 23 Dmitry Shachnev 2011-06-27 04:13:53 UTC
An interesting thing:
When I select the "Wrong" line in PDF from comment 2, BCDГ is changing not to ABC<display artifact>, but to ABC<non-selected letter Г>, and when I then copy-paste that line, I get "Wrong:BCD", without the last letter.

@José: Please, do something with this! Sometimes selecting text does really terrible things (take a look at 3 screenshots in this archive: http://ubuntuone.com/p/11Pa/, for example).
Comment 24 Albert Astals Cid 2011-06-27 11:16:31 UTC
Dmitry please to not modify those fields, that's not for users to choose.
Comment 25 Jason Crain 2012-08-12 03:56:52 UTC
Created attachment 65449 [details] [review]
Allow multiple fonts in a TextWord

This patch modifies TextWord so that each character can be a different font.  Function poppler_page_get_text_attributes is updated to allow for attributes to change mid-word.
Comment 26 Carlos Garcia Campos 2012-08-14 14:56:28 UTC
(In reply to comment #25)
> Created attachment 65449 [details] [review] [review]
> Allow multiple fonts in a TextWord
> 
> This patch modifies TextWord so that each character can be a different font. 
> Function poppler_page_get_text_attributes is updated to allow for attributes to
> change mid-word.

Thanks for the patch!, Albert could you run the regtests for text backend with this patch, please?
Comment 27 Albert Astals Cid 2012-08-14 21:10:57 UTC
Sure, on my todo (man i'm behind again in poppler :-()
Comment 28 Carlos Garcia Campos 2012-08-15 10:20:34 UTC
Sure, no hurry, thanks!
Comment 29 Albert Astals Cid 2012-08-15 22:52:16 UTC
Good news first, it does not change the output of pdftotext (well it does it in a file, but i don't consider it a regression)

Yay :-)

Now onto the review of the code itself.

From what i can read it seems it is possible to have in the same word fonts with different WMode (i.e. fonts written vertically and horizontally) i don't think this makes any sense.

What do you think?

Note i have *not* verified it fixes the problem mentioned in this bug. Carlos can you do that part?
Comment 30 Albert Astals Cid 2012-08-16 23:25:46 UTC
There's also a weird thing.

If i get all the character positions and dump them to file, sometimes there's minor changes, e.g. in https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/881407/+attachment/2571595/+files/Manning-ExtJS_in_Action.pdf

the diff says

--- old/foo     2012-08-17 01:16:21.566960179 +0200
+++ new/foo     2012-08-17 01:16:21.566960179 +0200
@@ -51275,13 +51275,16 @@
 QRectF(289.28,579.535 2.4948x4.725) 
 true 
 QRectF(289.28,579.535 2.4948x4.725) 
-"FumiK" 
-QRectF(294.666,579.535 10.0606x4.725) 
-true 
+"Fumi" 
+QRectF(294.666,579.535 7.64003x4.725) 
+false 
 QRectF(294.666,579.535 2.08791x4.725) 
 QRectF(296.754,579.535 1.73864x4.725) 
 QRectF(298.492,579.535 2.8904x4.725) 
-QRectF(301.383,579.535 0.349866x4.725) 
+QRectF(301.383,579.535 0.923076x4.725) 
+"K" 
+QRectF(301.733,579.535 2.99376x4.725) 
+true 
 QRectF(301.733,579.535 2.99376x4.725) 
 "®" 
 QRectF(308.586,580.274 3.06445x3.885) 

the more interesting part is 

-QRectF(301.383,579.535 0.349866x4.725) 
+QRectF(301.383,579.535 0.923076x4.725) 

it seems the "i" moved from being 0.349866 wide to being 0.923076. Any idea why those changes happen? Looking at the code i don't see any obvious reason for them.
Comment 31 Carlos Garcia Campos 2012-08-18 17:20:40 UTC
(In reply to comment #29)

> Note i have *not* verified it fixes the problem mentioned in this bug. Carlos
> can you do that part?

it's weird, I don't see any difference with and without the patch with poppler-glib-demo and the minimal example attached to this bug.
Comment 32 Jason Crain 2012-08-20 11:27:11 UTC
The difference in character width is because of this change:

@@ -863,7 +839,7 @@ void TextLine::coalesce(UnicodeMap *uMap) {
 	word0->spaceAfter = gTrue;
 	word0 = word1;
 	word1 = word1->next;
-      } else if (word0->font == word1->font &&
+      } else if (word0->font[word0->len - 1] == word1->font[0] &&
 		 word0->underlined == word1->underlined &&
 		 fabs(word0->fontSize - word1->fontSize) <
 		   maxWordFontSizeDelta * words->fontSize &&


Before, the words "Fumi" and "K" were being merged.  Letters "F" and "K" are using the font BookmanOldStyle.  Letters "umi" are using ArialMT.  Because of this font check, the words are no longer being merged, making the width of the i different.

Yes, it is possible for one word to have fonts with different WModes.  Still looking into it, but I suppose it will need to start a new word on WMode changes.

Carlos: I don't know why you do not see a difference.  Without this patch, selecting text in the minimal example changes the text to "ABC".  With this patch, it stays "BCD".
Comment 33 Albert Astals Cid 2012-08-20 18:07:25 UTC
(In reply to comment #32)
> The difference in character width is because of this change:
> 
> @@ -863,7 +839,7 @@ void TextLine::coalesce(UnicodeMap *uMap) {
>      word0->spaceAfter = gTrue;
>      word0 = word1;
>      word1 = word1->next;
> -      } else if (word0->font == word1->font &&
> +      } else if (word0->font[word0->len - 1] == word1->font[0] &&
>           word0->underlined == word1->underlined &&
>           fabs(word0->fontSize - word1->fontSize) <
>             maxWordFontSizeDelta * words->fontSize &&
> 
> 
> Before, the words "Fumi" and "K" were being merged.  Letters "F" and "K" are
> using the font BookmanOldStyle.  Letters "umi" are using ArialMT.  Because of
> this font check, the words are no longer being merged, making the width of the
> i different.

I understand why they are not merged (by code not, why the code is put there), but why the "i" has a different width that makes it overlap with the "K" when previously didn't?
Comment 34 Jason Crain 2012-08-20 22:46:19 UTC
(In reply to comment #33)
> I understand why they are not merged (by code not, why the code is put there),
> but why the "i" has a different width that makes it overlap with the "K" when
> previously didn't?

Before being merged, the width of the 'i' is 0.923076.  In TextWord::merge, the right edge of 'i' is set equal to the left edge of 'K', reducing its width to 0.349866.
Comment 35 Albert Astals Cid 2012-08-20 22:56:06 UTC
Ok, let's not worry about that for the moment. I'm more concerned about Carlos not being able to verify this indeed fixes the problem it's supposed to be fixing.

Carlos can you give it a second try?
Comment 36 Jose Aliste 2012-08-21 04:04:09 UTC
I tried patched poppler with the minimal example and I don't get the change of the glyphs anymore under the selection, which is VERY GOOD!! But I can't select the Gamma glyph...
Comment 37 Jason Crain 2012-08-23 04:07:27 UTC
Created attachment 65992 [details] [review]
Allow multiple fonts in a TextWord

It was possible for a TextWord to have fonts for both horizontal and vertical writing modes.  This version of the patch starts a new word when the WMode changes.
Comment 38 Jason Crain 2012-08-23 05:05:12 UTC
I am not sure how to fix the problem with the Gamma glyph not being selectable.  It is not selectable because it has a zero width, because there is no Widths array in the file.  This is a violation of the spec.  Poppler has a mechanism to provide default widths, but this does not work for symbols.
Comment 39 Dmitry Shachnev 2012-08-25 11:53:23 UTC
Created attachment 66101 [details]
Another small test case

The patch doesn't fully solve the problem for me. I have attached another PDF file (which only contains one math formula) with which I still can reproduce the issue.

When I try to select the Sigma ("∑") in Evince, limits ("m=0" and "∞") disappear; when I try to select "x/2" fraction, I get an orange or black rectangle (I use Ubuntu's Ambiance theme, so normally selection color should be always orange).

I can confirm that the "Minimal pdf" attached here is fixed by this patch, though.
Comment 40 Albert Astals Cid 2012-08-26 22:02:56 UTC
Jason want me to have another look at your patch or you want to try to fix the issue by Dmitry first?
Comment 41 Jason Crain 2012-08-27 06:25:16 UTC
Yes, please review the patch as it is.

I have looked at the document from Dmitry.  It exposes two issues, but these are separate from minimal pdf.  1) TextLines may be drawn over each other, obscuring other text.  2) In CairoOutputDev::updateFillColor, the color may not be set correctly.
Comment 42 Jason Crain 2012-08-27 06:39:17 UTC
Created attachment 66163 [details] [review]
Change default fallback character width

Also, I don't know how you feel about fixing something caused by violating the PDF specification, but this patch changes the default width for a missing character, making the gamma symbol selectable in the minimal pdf.
Comment 43 Albert Astals Cid 2012-08-30 21:14:17 UTC
(In reply to comment #37)
> Created attachment 65992 [details] [review] [review]
> Allow multiple fonts in a TextWord
> 
> It was possible for a TextWord to have fonts for both horizontal and vertical
> writing modes.  This version of the patch starts a new word when the WMode
> changes.

When running pdftotext with this patch on a pdf i'm going to attach it crashes (without the patch it works), can you please fix it?
Comment 44 Albert Astals Cid 2012-08-30 21:14:56 UTC
Created attachment 66375 [details]
The pdf where the patch crashes
Comment 45 Jason Crain 2012-09-01 15:10:29 UTC
Created attachment 66450 [details] [review]
Check for NaN in TextPage::addChar

I don't think this is related to my earlier patch. For me, this pdf crashes both with and without it.  This document is doing very strange things with the current transformation matrix (CTM) and inline images.  Pages 6 and 15 are filled with lines like this:

    q 18 0 0 -1 2782 6350 cm
    q BI
    <IMAGE DICT>
    ID <IMAGE DATA>
    EI Q
    q 19 0 0 -1 2782 6350 cm
    q BI
    <IMAGE DICT>
    ID <IMAGE DATA>
    EI Q

Note the unbalanced q/Q for saving/restoring the graphics state.  This means that the graphics state is not ever being properly restored and the `cm' operator is scaling the CTM until its components become NaN.  This leads to TextWord::base being NaN.  This breaks calculations in TextPool::addWord, causing wordBaseIdx to be INT_MIN, causing the text pool to not be initialized to NULLs, which causes a crash when an invalid pointer is read and dereferenced from the pool.

As a test, adding a call to restoreState() in Gfx::opBeginImage allows the page to render properly and without crashing.  Otherwise, poppler either crashes or places text in an invalid location.

The attached patch adds a check for NaN to TextPage::addChar and throws away chars with invalid positions.
Comment 46 Jason Crain 2012-09-01 17:37:04 UTC
Created attachment 66454 [details] [review]
Allow multiple fonts in a TextWord

Updated patch to apply to current git master.
Comment 47 Albert Astals Cid 2012-09-11 17:32:29 UTC
I've commited this patch to master (what will be 0.22) and not to 0.20 since it breaks the public glib api

I'm closing this bug since it fixes the original reported issue. If there are issues not fixed by this bug please open a new one, big bugs get difficult to keep track of otherwise.
Comment 48 Carlos Garcia Campos 2012-09-11 17:58:09 UTC
(In reply to comment #47)
> I've commited this patch to master (what will be 0.22) and not to 0.20 since it
> breaks the public glib api

Really? why? I agree with not committing it to the stable branch, but I don't think the patch breaks the API, or am I missing something? Breaking the API should be avoided even in master.
Comment 49 Albert Astals Cid 2012-09-11 18:06:30 UTC
Oh, wait, didn't realize the functions it's adding new parameters are private, ignore my "breaks API" part.
Comment 50 Carlos Garcia Campos 2012-09-11 18:23:07 UTC
(In reply to comment #49)
> Oh, wait, didn't realize the functions it's adding new parameters are private,
> ignore my "breaks API" part.

No problem :-) but please, ask me before committing any change that might break the public glib API even in master.
Comment 51 Dmitry Shachnev 2013-04-28 10:26:14 UTC
Unfortunately, this bug is still present in Poppler 0.20.5 on Ubuntu 13.04.

When I open my test case (attachment 66101 [details]) in Evince and try to select the "2m+α" part, it changes to "20+=" every time. Should I reopen this bug or file a new one instead?
Comment 52 Dmitry Shachnev 2013-04-28 10:30:07 UTC
(In reply to comment #51)
> Unfortunately, this bug is still present in Poppler 0.20.5 on Ubuntu 13.04.

Please ignore that, it seems that Poppler version is too old in Ubuntu. Will prepare an update now :)
Comment 53 Niklas Hambüchen 2014-08-17 20:43:11 UTC
Created attachment 104775 [details]
Unselected text with evince 3.10.3 (Ubuntu 14.04)
Comment 54 Niklas Hambüchen 2014-08-17 20:44:19 UTC
Created attachment 104776 [details]
Selected/Corrupted text with evince 3.10.3 (Ubuntu 14.04)

I'm not convinced this bug is fixed; see the attachment images from my evince 3.10.3 in Ubuntu 14.04.

Or is this a different issue?
Comment 55 Jason Crain 2014-08-18 00:00:09 UTC
(In reply to comment #54)
> I'm not convinced this bug is fixed; see the attachment images from my
> evince 3.10.3 in Ubuntu 14.04.
> 
> Or is this a different issue?

That is a different issue.  Please open a new bug, attach the PDF, and include your libpoppler version.
Comment 56 Jason Crain 2015-04-17 07:43:31 UTC
*** Bug 9608 has been marked as a duplicate of this bug. ***
Comment 57 Jason Crain 2015-04-18 00:12:15 UTC
*** Bug 13441 has been marked as a duplicate of this bug. ***

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.