Description
Sebastien Bacher
2006-05-15 05:53:22 UTC
That's still happening using the cairo backend and poppler 0.6 Created attachment 18147 [details]
Minimal pdf that reproduces the problem to help debugging
I've located exactly when the bug appears, and I've hand-coded a ***minimal*** PDF as a testcase. I'm attaching it, when selecting the text you should see
BCDΓ
transforming to
ABC<display artifact>
(notice the change from BCD to ABC).
Kind regards,
Alkis Georgopoulos
Anyone? Can you try with a newer poppler? The example on this bug is still buggy on 0.11.3 The example on this bug is still buggy on Evince 2.29.3 with poppler 0.12.0 evincebug.pdf still displays the buggy behavior in evince 2.30.3, poppler 0.12.4 For the problem in the minimal example, the culprit seems to be in TextSelectionPainter::visitWord, which assumes that all the chars in a word use the same font. This is demonstrated by adding a space After "Word:" in the bad line. This also explain why this bug is very common with LaTeX files. Created attachment 43360 [details] [review] Patch fixing the issue Actually, All the chars in a TextWord MUST share the same font, so I added a check so we also create a new word when the font changes. The attached patch fixes the bug in the minimal example and many other problems when selecting text in PdfLatex-generated files, but there are still issues in rendering selections with Bitmapped fonts. In this case, the glyph seems right but the size is wrong. The last patch does not fix either some problems when selecting ligatures in some fonts, see http://java.sun.com/docs/books/jvms/second_edition/ClassFileFormat-final-draft.pdf, page 54 and look for the word "reflective" which is rendered as "reßective" (althought the text select is indeed "reflective". This does not happen with all fonts, for instance, in a pdflatex generated file, I could not reproduce the problem. You are comparing the pointers of the font infos, shouldn't you be comparing the contents of the pointers? I assumed that the only code that is updating the curFont is TextPage::updateFont. If this is true, from what I understood from the code, the fonts array holds unique FontInfos, so it would be safe to compare pointers instead of FontInfos. So, is this right? If so, and you still want me to rework the patch so it compares the fonts infos instead of the pointers, I won't argue and will update the patch, but I'd appreciate an answer so I can have a better understanding of the code, Thanks Created attachment 43420 [details] [review] alternate patch comparing gfxFont instead of pointers to FontInfos Even if what I say in last comment may be true, I followed your advice and updated the patch to compare the gfxFonts instead of the pointers to the TextInfos. Created attachment 43421 [details] [review] updated alternate patch I apologize for the spam. Previous patch was clearly wrong. Updated patch. The patch changes (breaks) the behaviour of pdftotext so that is not acceptable. You can see that in the attached pdf "bold strike:" changes to be extracted to "bold strike :" Created attachment 43455 [details]
The pdf with the pdftotext regression
Thanks for the regression. The problem here is that "bold:" has two fonts since "bold" is italicised and ":" not, so before the patch, "bold:" is a TextWord and pdftotext get the text right, but drawing the selection is bad since the selected ":" would be drawn italicised. After the patch, "bold:" gets splitted up into "bold" and ":" so it gets drawn correctly when selected, but you have the regression you pointed out. So, I believe I am stuck with this choice: a). I could either allow more than one font on a TextWord, and adapt the code that draws the TextWord to use that fact, so the selected ":" does not get transform to a italicised ":" when drawing it. b) or I could fix the TextDumper to be aware of the fact in some cases there is no space between two TextWord. c) Do you have another way? I think I want to take approach a) even if it could more complicated, but approach b) seems that it could break more things that it would fix... thinking that when doing selection by words would not work at all. What do you think? It's perfectly natural that every letter in a word be in a different font, colour, or style so that needs to be allowed for (the text could even a scan of some handwriting, or several scans), . ...what pdftotext is trying to do is to replicate what would be found in a Tagged PDF (Section 9.7 of the PDF Reference), effectively synthesising what would ideally be found in '/BMC.../EMC' '/ActualText' contents. To that end I'm inclined to agree that (b) sounds like non-ideal kludge on top of non-ideal kludge. I tend to agree with Paul that is fine for a word to have different fonts, but on the other hand one can argue if the ":" in that situation is part of the word or not. I'd say 1 is a better solution but if you can get 2 with a "clean enough" code i won't be oposed to it either. Personally i think text selection should have been done in client side (like we do in Okular) and not "modyfing" poppler internal structures but that's a bit too late i guess :D Albert: given the lack of a space ( ) in the input, there's shouldn't be one in the output! :) Paul: Of course, i never said this is an acceptable behaviour. (What makes you think i did?) I only said that from the technical point of view you can argue that ":" is of course a different word since you won't find the word "strike:" in a dictionary. As there been any additional progress on this issue? I must say I have been encountering it quite frequently for a long time, and for one would be very happy to see a fix applied. Anyway, thanks to anyone working on this. An interesting thing: When I select the "Wrong" line in PDF from comment 2, BCDГ is changing not to ABC<display artifact>, but to ABC<non-selected letter Г>, and when I then copy-paste that line, I get "Wrong:BCD", without the last letter. @José: Please, do something with this! Sometimes selecting text does really terrible things (take a look at 3 screenshots in this archive: http://ubuntuone.com/p/11Pa/, for example). Dmitry please to not modify those fields, that's not for users to choose. Created attachment 65449 [details] [review] Allow multiple fonts in a TextWord This patch modifies TextWord so that each character can be a different font. Function poppler_page_get_text_attributes is updated to allow for attributes to change mid-word. (In reply to comment #25) > Created attachment 65449 [details] [review] [review] > Allow multiple fonts in a TextWord > > This patch modifies TextWord so that each character can be a different font. > Function poppler_page_get_text_attributes is updated to allow for attributes to > change mid-word. Thanks for the patch!, Albert could you run the regtests for text backend with this patch, please? Sure, on my todo (man i'm behind again in poppler :-() Sure, no hurry, thanks! Good news first, it does not change the output of pdftotext (well it does it in a file, but i don't consider it a regression) Yay :-) Now onto the review of the code itself. From what i can read it seems it is possible to have in the same word fonts with different WMode (i.e. fonts written vertically and horizontally) i don't think this makes any sense. What do you think? Note i have *not* verified it fixes the problem mentioned in this bug. Carlos can you do that part? There's also a weird thing. If i get all the character positions and dump them to file, sometimes there's minor changes, e.g. in https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/881407/+attachment/2571595/+files/Manning-ExtJS_in_Action.pdf the diff says --- old/foo 2012-08-17 01:16:21.566960179 +0200 +++ new/foo 2012-08-17 01:16:21.566960179 +0200 @@ -51275,13 +51275,16 @@ QRectF(289.28,579.535 2.4948x4.725) true QRectF(289.28,579.535 2.4948x4.725) -"FumiK" -QRectF(294.666,579.535 10.0606x4.725) -true +"Fumi" +QRectF(294.666,579.535 7.64003x4.725) +false QRectF(294.666,579.535 2.08791x4.725) QRectF(296.754,579.535 1.73864x4.725) QRectF(298.492,579.535 2.8904x4.725) -QRectF(301.383,579.535 0.349866x4.725) +QRectF(301.383,579.535 0.923076x4.725) +"K" +QRectF(301.733,579.535 2.99376x4.725) +true QRectF(301.733,579.535 2.99376x4.725) "®" QRectF(308.586,580.274 3.06445x3.885) the more interesting part is -QRectF(301.383,579.535 0.349866x4.725) +QRectF(301.383,579.535 0.923076x4.725) it seems the "i" moved from being 0.349866 wide to being 0.923076. Any idea why those changes happen? Looking at the code i don't see any obvious reason for them. (In reply to comment #29) > Note i have *not* verified it fixes the problem mentioned in this bug. Carlos > can you do that part? it's weird, I don't see any difference with and without the patch with poppler-glib-demo and the minimal example attached to this bug. The difference in character width is because of this change: @@ -863,7 +839,7 @@ void TextLine::coalesce(UnicodeMap *uMap) { word0->spaceAfter = gTrue; word0 = word1; word1 = word1->next; - } else if (word0->font == word1->font && + } else if (word0->font[word0->len - 1] == word1->font[0] && word0->underlined == word1->underlined && fabs(word0->fontSize - word1->fontSize) < maxWordFontSizeDelta * words->fontSize && Before, the words "Fumi" and "K" were being merged. Letters "F" and "K" are using the font BookmanOldStyle. Letters "umi" are using ArialMT. Because of this font check, the words are no longer being merged, making the width of the i different. Yes, it is possible for one word to have fonts with different WModes. Still looking into it, but I suppose it will need to start a new word on WMode changes. Carlos: I don't know why you do not see a difference. Without this patch, selecting text in the minimal example changes the text to "ABC". With this patch, it stays "BCD". (In reply to comment #32) > The difference in character width is because of this change: > > @@ -863,7 +839,7 @@ void TextLine::coalesce(UnicodeMap *uMap) { > word0->spaceAfter = gTrue; > word0 = word1; > word1 = word1->next; > - } else if (word0->font == word1->font && > + } else if (word0->font[word0->len - 1] == word1->font[0] && > word0->underlined == word1->underlined && > fabs(word0->fontSize - word1->fontSize) < > maxWordFontSizeDelta * words->fontSize && > > > Before, the words "Fumi" and "K" were being merged. Letters "F" and "K" are > using the font BookmanOldStyle. Letters "umi" are using ArialMT. Because of > this font check, the words are no longer being merged, making the width of the > i different. I understand why they are not merged (by code not, why the code is put there), but why the "i" has a different width that makes it overlap with the "K" when previously didn't? (In reply to comment #33) > I understand why they are not merged (by code not, why the code is put there), > but why the "i" has a different width that makes it overlap with the "K" when > previously didn't? Before being merged, the width of the 'i' is 0.923076. In TextWord::merge, the right edge of 'i' is set equal to the left edge of 'K', reducing its width to 0.349866. Ok, let's not worry about that for the moment. I'm more concerned about Carlos not being able to verify this indeed fixes the problem it's supposed to be fixing. Carlos can you give it a second try? I tried patched poppler with the minimal example and I don't get the change of the glyphs anymore under the selection, which is VERY GOOD!! But I can't select the Gamma glyph... Created attachment 65992 [details] [review] Allow multiple fonts in a TextWord It was possible for a TextWord to have fonts for both horizontal and vertical writing modes. This version of the patch starts a new word when the WMode changes. I am not sure how to fix the problem with the Gamma glyph not being selectable. It is not selectable because it has a zero width, because there is no Widths array in the file. This is a violation of the spec. Poppler has a mechanism to provide default widths, but this does not work for symbols. Created attachment 66101 [details]
Another small test case
The patch doesn't fully solve the problem for me. I have attached another PDF file (which only contains one math formula) with which I still can reproduce the issue.
When I try to select the Sigma ("∑") in Evince, limits ("m=0" and "∞") disappear; when I try to select "x/2" fraction, I get an orange or black rectangle (I use Ubuntu's Ambiance theme, so normally selection color should be always orange).
I can confirm that the "Minimal pdf" attached here is fixed by this patch, though.
Jason want me to have another look at your patch or you want to try to fix the issue by Dmitry first? Yes, please review the patch as it is. I have looked at the document from Dmitry. It exposes two issues, but these are separate from minimal pdf. 1) TextLines may be drawn over each other, obscuring other text. 2) In CairoOutputDev::updateFillColor, the color may not be set correctly. Created attachment 66163 [details] [review] Change default fallback character width Also, I don't know how you feel about fixing something caused by violating the PDF specification, but this patch changes the default width for a missing character, making the gamma symbol selectable in the minimal pdf. (In reply to comment #37) > Created attachment 65992 [details] [review] [review] > Allow multiple fonts in a TextWord > > It was possible for a TextWord to have fonts for both horizontal and vertical > writing modes. This version of the patch starts a new word when the WMode > changes. When running pdftotext with this patch on a pdf i'm going to attach it crashes (without the patch it works), can you please fix it? Created attachment 66375 [details]
The pdf where the patch crashes
Created attachment 66450 [details] [review] Check for NaN in TextPage::addChar I don't think this is related to my earlier patch. For me, this pdf crashes both with and without it. This document is doing very strange things with the current transformation matrix (CTM) and inline images. Pages 6 and 15 are filled with lines like this: q 18 0 0 -1 2782 6350 cm q BI <IMAGE DICT> ID <IMAGE DATA> EI Q q 19 0 0 -1 2782 6350 cm q BI <IMAGE DICT> ID <IMAGE DATA> EI Q Note the unbalanced q/Q for saving/restoring the graphics state. This means that the graphics state is not ever being properly restored and the `cm' operator is scaling the CTM until its components become NaN. This leads to TextWord::base being NaN. This breaks calculations in TextPool::addWord, causing wordBaseIdx to be INT_MIN, causing the text pool to not be initialized to NULLs, which causes a crash when an invalid pointer is read and dereferenced from the pool. As a test, adding a call to restoreState() in Gfx::opBeginImage allows the page to render properly and without crashing. Otherwise, poppler either crashes or places text in an invalid location. The attached patch adds a check for NaN to TextPage::addChar and throws away chars with invalid positions. Created attachment 66454 [details] [review] Allow multiple fonts in a TextWord Updated patch to apply to current git master. I've commited this patch to master (what will be 0.22) and not to 0.20 since it breaks the public glib api I'm closing this bug since it fixes the original reported issue. If there are issues not fixed by this bug please open a new one, big bugs get difficult to keep track of otherwise. (In reply to comment #47) > I've commited this patch to master (what will be 0.22) and not to 0.20 since it > breaks the public glib api Really? why? I agree with not committing it to the stable branch, but I don't think the patch breaks the API, or am I missing something? Breaking the API should be avoided even in master. Oh, wait, didn't realize the functions it's adding new parameters are private, ignore my "breaks API" part. (In reply to comment #49) > Oh, wait, didn't realize the functions it's adding new parameters are private, > ignore my "breaks API" part. No problem :-) but please, ask me before committing any change that might break the public glib API even in master. Unfortunately, this bug is still present in Poppler 0.20.5 on Ubuntu 13.04.
When I open my test case (attachment 66101 [details]) in Evince and try to select the "2m+α" part, it changes to "20+=" every time. Should I reopen this bug or file a new one instead?
(In reply to comment #51) > Unfortunately, this bug is still present in Poppler 0.20.5 on Ubuntu 13.04. Please ignore that, it seems that Poppler version is too old in Ubuntu. Will prepare an update now :) Created attachment 104775 [details]
Unselected text with evince 3.10.3 (Ubuntu 14.04)
Created attachment 104776 [details]
Selected/Corrupted text with evince 3.10.3 (Ubuntu 14.04)
I'm not convinced this bug is fixed; see the attachment images from my evince 3.10.3 in Ubuntu 14.04.
Or is this a different issue?
(In reply to comment #54) > I'm not convinced this bug is fixed; see the attachment images from my > evince 3.10.3 in Ubuntu 14.04. > > Or is this a different issue? That is a different issue. Please open a new bug, attach the PDF, and include your libpoppler version. *** Bug 9608 has been marked as a duplicate of this bug. *** *** Bug 13441 has been marked as a duplicate of this bug. *** |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.