Bug 102760 - pdftops generates crumbled text
Summary: pdftops generates crumbled text
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: general (show other bugs)
Version: unspecified
Hardware: Other All
: medium normal
Assignee: poppler-bugs
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-09-14 20:36 UTC by Freek de Kruijf
Modified: 2017-09-20 17:38 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
screen shot from PDF screen (11.50 KB, image/png)
2017-09-14 20:36 UTC, Freek de Kruijf
Details
screen shot from ps file (3.95 KB, image/png)
2017-09-14 20:37 UTC, Freek de Kruijf
Details
proposed patch (2.17 KB, patch)
2017-09-18 19:57 UTC, William Bader
Details | Splinter Review

Description Freek de Kruijf 2017-09-14 20:36:58 UTC
Created attachment 134244 [details]
screen shot from PDF screen

Attached are two screenshots, one is the text of the PDF file, the other is the text generated, apparently using poppler, by pdftops.
Most likely the bar code causes the trouble.
On request the PDF file can be made available, but only to the developer.
Comment 1 Freek de Kruijf 2017-09-14 20:37:51 UTC
Created attachment 134245 [details]
screen shot from ps file
Comment 2 Albert Astals Cid 2017-09-14 20:51:40 UTC
Please share the document with me, though i can't guarantee a quick fix
Comment 3 Albert Astals Cid 2017-09-16 11:11:42 UTC
William this regression is caused by https://cgit.freedesktop.org/poppler/poppler/commit/?id=2cf901c817fc99e1fa57745c11aa79cdfb4e8c99 i.e. the fix for bug https://bugs.freedesktop.org/show_bug.cgi?id=63963

Freek would you mind if I share the document with William too if he's willing to have a look why he's patch caused this regression?
Comment 4 William Bader 2017-09-16 21:31:59 UTC
I can look at it if someone can send me the PDF.
The basic idea of my change is the line
  } else if (maxGlyph > 0 && code > maxGlyph) {
Some fonts define only the non-empty glyphs but reference zero-size glyphs way off the end of the list. Without the code > maxGlyph test, the generated postscript will be invalid. Without seeing the PDF, I suppose that somehow maxGlyph isn't correct.
William
Comment 5 Freek de Kruijf 2017-09-17 20:40:42 UTC
(In reply to Albert Astals Cid from comment #3)
> William this regression is caused by
> https://cgit.freedesktop.org/poppler/poppler/commit/
> ?id=2cf901c817fc99e1fa57745c11aa79cdfb4e8c99 i.e. the fix for bug
> https://bugs.freedesktop.org/show_bug.cgi?id=63963
> 
> Freek would you mind if I share the document with William too if he's
> willing to have a look why he's patch caused this regression?

It's OK to share the file with William.
Comment 6 Albert Astals Cid 2017-09-17 21:08:44 UTC
Sent
Comment 7 William Bader 2017-09-18 03:53:07 UTC
When I run
  pdffonts -f 1 -l 1 bug.pdf
it lists ArialBold twice.
The first copy has 19 glyphs, and the second copy has 15 glyphs.
The missing characters are glyphs in positions 16 to 19 of the first copy because after reading the second copy, it thinks that the font has only 15 glyphs and avoids accessing positions 16 to 19.

The question is how to handle it.
In my test files, the last defined glyph is under 200, the referenced glyphs are in the 1000's, and each font is in the pdf only once.

1. When I look at the PDF, object 59 starts /BaseFont/ArialBold/ and objects 57 and 58 start /DescendantFonts ... /BaseFont/ArialBold, so maybe there is a way to differentiate the copy with 19 glyphs from the copy with 15 glyphs, although I think that information is lost by the time that it gets into PSOutputDev::drawString(), which is why I had create the hash and couldn't just add a maxGlyphs field to GfxFont.

2. I could change the code in PSOutputDev::setupExternalCIDTrueTypeFont() and PSOutputDev::setupEmbeddedCIDTrueTypeFont() so that when it sees a font for the second time, instead of always updating the hash mapping font names to glyph counts, it could update the glyph count only if the new number is larger. That should fix this file without breaking my test files. If this is OK, I can submit a patch within a day or two.

William
Comment 8 Freek de Kruijf 2017-09-18 09:24:41 UTC
The choice for a solution is up to you. As a user I am only interested in a proper print. In case the PDF is improperly generated I can report back to the organization I got the PDF from, with a description of what is at fault.
Comment 9 Albert Astals Cid 2017-09-18 18:00:30 UTC
We could try #2 if that's easier but this means that your patch maybe broke lots of files more since it's quite common to have duplicated names in pdf files.

Let's try patch #2 and then i'll run a full regtest of pdftops without your original patch and after the original patch + this one to see if there's still regressions.

Does that sound ok?
Comment 10 William Bader 2017-09-18 19:57:24 UTC
Created attachment 134321 [details] [review]
proposed patch

This patch moves the code to update the max valid glyph hash into its own function and updates the max valid glyph only if the new value is higher than the previous value.
This fixes a problem with pages that have multiple copies of the same font with different glyph counts. If poppler processed the font with the smaller count last, and then the PDF wrote text in the font with the larger count, pdftops would not show the glyphs above the maximum of the smaller font.
I suspect that this issue is rare because the original issue https://bugs.freedesktop.org/show_bug.cgi?id=63963 was reported in 2013, not touched for several years in poppler, and still exists in the recently released xpdf-4.00. Also, my patch was applied in December 2016 (9 months ago), and this is the first reported regression.
Regards, William
Comment 11 Albert Astals Cid 2017-09-20 17:38:54 UTC
Pushed.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.