Bug 104565 - UK tax form PDF has jumbled text
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
Duplicates: 91004
Reported: 2018-01-10 04:58 UTC by Jason Crain
Modified: 2018-01-21 21:26 UTC

Attachments
blank United Kingdom tax return form (408.71 KB, application/pdf)
2018-01-10 04:58 UTC, Jason Crain
result of "pdftocairo -png 'blank return.pdf' bad" (119.29 KB, image/png)
2018-01-10 05:01 UTC, Jason Crain
result of "pdftocairo -png -f 3 -l 3 'blank return.pdf' good" (208.67 KB, image/png)
2018-01-10 05:02 UTC, Jason Crain
GfxFontDict: merge reference generation from xpdf 4.00 (4.41 KB, patch)
2018-01-18 18:02 UTC, Jason Crain

Description Jason Crain 2018-01-10 04:58:32 UTC
Created attachment 136640
blank United Kingdom tax return form

Forwarding from https://bugzilla.gnome.org/792393

----------
The UK tax return form shows jumbled text when opened in Evince 3.18.2. It displays normally on Mac OS X with the default viewer, and I assume it does on Windows.
The jumbling starts on page 3, with oddly-spaced commas replacing most, but not all, text. The page footer is replaced with random letters.
On page 6, the footer is back to normal, but all the body text is replaced with random letters and numbers.
The file is a fillable form. The problem was first noted with a copy containing my personal data. The attachment is a blank copy that also shows the problem.
----------

I've confirmed this with both pdftoppm and pdftocairo from poppler master.  Running "pdftocairo -png 'blank return.pdf' bad" produces a page 3 with much of the text replaced with commas.  Oddly, rendering just page 3 with "pdftocairo -png -f 3 -l 3 'blank return.pdf' good" works correctly.
Comment 1 Jason Crain 2018-01-10 05:01:20 UTC
Created attachment 136641
result of "pdftocairo -png 'blank return.pdf' bad"

This image shows an incorrect rendering of page 3 from running "pdftocairo -png 'blank return.pdf' bad".
Comment 2 Jason Crain 2018-01-10 05:02:45 UTC
Created attachment 136642
result of "pdftocairo -png -f 3 -l 3 'blank return.pdf' good"

This image shows the correct rendering of page 3 from running "pdftocairo -png -f 3 -l 3 'blank return.pdf' good".
Comment 3 Jason Crain 2018-01-10 20:37:19 UTC
CairoFontEngine.cc caches fonts keyed by the indirect reference number and generation, on the assumption that these are unique. However, a font on page 2 and a font on page 3 alias each other, so the wrong font is used.  Splash is probably doing something similar.

Two different fonts have the same number and generation because these fonts don't really have an indirect reference, due to the way the PDF defines the resources and font dictionaries:

7 0 obj
<<
  /Resources 8 0 R
  /Type /Page
  ... other page entries ...
>>
endobj

8 0 obj
<<
  /Font <<
    /T1 <<
      ... font dictionary entries ...
    >>
  >>
>>

The GfxFontDict constructor has code to generate a fake reference based on the /Font dictionary's number, but that doesn't work well in this PDF because the /Font dictionary doesn't have an indirect reference either.
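
To make the aliasing concrete, here is a minimal sketch of how it could happen. It is a simplified model, not poppler's actual code: Ref, makeFakeRef, FontCache and lookupFont are hypothetical names, and the constants are only illustrative. It assumes the fake reference is built from the font's index plus the number of the enclosing /Font dictionary, with a fixed fallback when that dictionary has no indirect reference.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a PDF indirect reference (object number, generation).
struct Ref {
    int num;
    int gen;
};

// Simplified model of the fallback: build a fake reference from the font's
// index within its /Font dictionary and that dictionary's object number.
// When the /Font dictionary has no reference either, a fixed value is used.
// The constants are illustrative assumptions, not taken from poppler.
Ref makeFakeRef(int fontIndex, const Ref *fontDictRef) {
    Ref r;
    r.num = fontIndex;
    r.gen = fontDictRef ? 100000 + fontDictRef->num : 999999;
    return r;
}

// Hypothetical renderer-side font cache keyed by (num, gen).
using FontCache = std::map<std::pair<int, int>, std::string>;

// Returns the cached font for ref; fontName is stored only on a cache miss.
const std::string &lookupFont(FontCache &cache, Ref ref, const std::string &fontName) {
    auto it = cache.emplace(std::make_pair(ref.num, ref.gen), fontName).first;
    return it->second;
}

// Walks through the failure: the first font on page 2 and the first font on
// page 3 both sit in inline /Font dictionaries, so both get the same key.
void demoAliasing() {
    FontCache cache;
    Ref page2 = makeFakeRef(0, nullptr);
    Ref page3 = makeFakeRef(0, nullptr);
    assert(page2.num == page3.num && page2.gen == page3.gen);
    lookupFont(cache, page2, "Page2Font");
    // Page 3 asks for its own font but gets page 2's cached entry.
    assert(lookupFont(cache, page3, "Page3Font") == "Page2Font");
}
```

Under this model the cache is still empty when only page 3 is rendered, so page 3's font is the first one stored under the shared key. That would be consistent with the single-page run rendering correctly.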

This appears to be fixed in XPDF 4.00 because the GfxFontDict constructor now includes code to generate the fake reference based on a hash instead.
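
A hash-based fake reference could look roughly like the sketch below. This is an illustration under assumptions, not the xpdf or poppler implementation: the names are hypothetical, the font dictionary is represented as an already-serialized string, and FNV-1a is simply a convenient hash chosen for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a PDF indirect reference (object number, generation).
struct Ref {
    int num;
    int gen;
};

// 32-bit FNV-1a over the bytes of the input.
std::uint32_t fnv1a(const std::string &data) {
    std::uint32_t h = 2166136261u;  // FNV offset basis
    for (unsigned char c : data) {
        h ^= c;
        h *= 16777619u;             // FNV prime
    }
    return h;
}

// Derives the fake reference from the font's content instead of its position,
// so two different inline fonts are very unlikely to share a key.
// The generation value is an illustrative assumption: any value outside the
// range of real generation numbers would serve to mark the reference as fake.
Ref makeHashedFakeRef(const std::string &serializedFontDict) {
    Ref r;
    // Mask to 31 bits so the value fits in a non-negative int.
    r.num = static_cast<int>(fnv1a(serializedFontDict) & 0x7fffffffu);
    r.gen = 100000;
    return r;
}
```

With this scheme, identical font dictionaries map to the same key, which is harmless for a cache, while differing dictionaries get different keys except in the rare case of a hash collision.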
Comment 4 Jason Crain 2018-01-18 17:34:26 UTC
*** Bug 91004 has been marked as a duplicate of this bug. ***
Comment 5 Jason Crain 2018-01-18 18:02:05 UTC
Created attachment 136832
GfxFontDict: merge reference generation from xpdf 4.00

The GfxFontDict constructor generates a fake indirect reference if the
font dictionary doesn't have a real indirect reference.  It sometimes
assigns the same reference to two different fonts leading to a wrong
font being used.  XPDF 4.00 fixes this by using the hash of the font
data to create the fake reference.
Comment 6 Albert Astals Cid 2018-01-21 21:26:35 UTC
Pushed :)

