Bug 75232 - poppler can't parse seemingly OK PDF file
Summary: poppler can't parse seemingly OK PDF file
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
Reported: 2014-02-19 23:09 UTC by Roland Dreier
Modified: 2014-02-27 05:43 UTC

Attachments
Patch for the issue (2.97 KB, text/plain)
2014-02-20 00:04 UTC, Albert Astals Cid

Description Roland Dreier 2014-02-19 23:09:00 UTC
I've found that the version of poppler in current Ubuntu 14.04 (I have poppler-utils 0.24.5-2ubuntu1) has perhaps become too strict and won't parse PDF files that work fine in other readers.  For example with the SPC-4 SCSI spec available from http://www.t10.org/cgi-bin/ac.pl?t=f&f=spc4r36q.pdf , I get:

$ pdfinfo spc4r36q.pdf 
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table

However, the built-in Firefox PDF renderer works with that file, as does mupdf.  Both seem to be able to render the whole document and don't produce any warnings.
Comment 1 Albert Astals Cid 2014-02-19 23:52:47 UTC
The file is very broken; Adobe Acrobat for Linux won't open it either. We can relax one of the checks and open it, but I am not sure whether that will break other files.

Basically, the problem is that the length given in the linearization dict and the file length do not match. If I comment out that check in PDFDoc::isLinearized, everything works, but I am scared it may break some other files.

OTOH, the PDF spec says it may happen that the linearization dict length and the file length do not match, and speaks about how to recover, but...

Hib, what's your opinion? Shall we relax that check in PDFDoc::isLinearized and run a regtest?

Or maybe turn it into at least an isOk that checks all the mandated fields of the dict are there?
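
A minimal sketch of the check under discussion, written in Python rather than poppler's C++. It looks for a linearization dictionary in the first 1024 bytes of the file and compares its /L entry (the declared file length) with the actual length. The function names and the regex-based parsing are assumptions for illustration only; this is not poppler's PDFDoc::isLinearized.

```python
import re


def linearization_length(head: bytes):
    """Return the /L value of a linearization dict found in `head`, or None.

    Simplified: a real parser would tokenize the first object properly
    instead of pattern-matching on raw bytes.
    """
    if b"/Linearized" not in head:
        return None
    match = re.search(rb"/L\s+(\d+)", head)
    return int(match.group(1)) if match else None


def looks_linearized(data: bytes) -> bool:
    """Strict check: the declared length must equal the real file length."""
    declared = linearization_length(data[:1024])
    return declared is not None and declared == len(data)
```

With a strict comparison like this, a file that has had anything appended after linearization no longer counts as linearized, which is the situation described above.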
Comment 2 Albert Astals Cid 2014-02-20 00:04:09 UTC
Created attachment 94392
Patch for the issue

Actually, this seems like it should be pretty safe, since we're only relaxing the condition if everything else has failed. Hib, comments?
Comment 3 Hib Eris 2014-02-20 16:09:19 UTC
The reason a document length is allowed to not match a linearization dict length is that a PDF document can be modified by adding extra objects and a new xref to the end of the document. Clearly, such a *modified* document has a length that is larger than the length specified in the linearization dict of the original document. When parsing a *modified* document, one should not rely on the information in the linearization dict and/or hints table, and should fall back to the parsing method for non-linearized documents.

In the particular case of the document at http://www.t10.org/cgi-bin/ac.pl?t=f&f=spc4r36q.pdf, the document is probably modified and therefore treated as a non-linearized document. However, when parsing it as a non-linearized document, it appears to be very broken, and therefore poppler tries to reconstruct an xref table. That does not seem to work well, so poppler fails to render the *modified* document.

Albert's patch adds a fallback which causes poppler to render the original *unmodified* linearized document. 

Now, the question is: is it useful to present the original *unmodified* document to the user when the *modified* document is broken?

I think it is not, because the document is clearly modified for a reason and presenting a document without the modifications is giving the user a false representation of it.

But maybe that is what we always do to some extent with broken documents, so for me it can go either way.
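
The order of fallbacks described above can be sketched as a small decision function. This is one reading of the thread, not poppler's actual control flow: the function name, its boolean parameters, and the returned labels are all hypothetical.

```python
def choose_strategy(has_lin_dict: bool, length_matches: bool,
                    xref_ok: bool, reconstruct_ok: bool) -> str:
    """Pick a parsing strategy for a PDF (hypothetical sketch).

    has_lin_dict   -- a linearization dict was found at the start of the file
    length_matches -- its declared length equals the real file length
    xref_ok        -- the regular (non-linearized) xref/trailer parse succeeded
    reconstruct_ok -- rebuilding the xref table by scanning the file succeeded
    """
    if has_lin_dict and length_matches:
        return "linearized"          # unmodified linearized file
    if xref_ok:
        return "standard"            # modified or never linearized
    if reconstruct_ok:
        return "reconstructed"       # broken, but the xref could be rebuilt
    if has_lin_dict:
        # Last resort: ignore the length mismatch and show the original,
        # *unmodified* linearized document.
        return "linearized-relaxed"
    return "error"
```

Because the relaxed branch is reached only after every other strategy has failed, files that open today keep taking the same path as before.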
Comment 4 Hib Eris 2014-02-20 16:14:27 UTC
(In reply to comment #2)
> Created attachment 94392
> Patch for the issue
> 
> Actually, this seems like it should be pretty safe, since we're only relaxing
> the condition if everything else has failed. Hib, comments?

Seems safe indeed. If you think it is a good idea to present the *unmodified* document, I think it will have no (other) undesired side effects.
Comment 5 James Cloos 2014-02-20 22:18:39 UTC
That file (or at least the one I got from the uri) works fine here with
current poppler master.
Comment 6 Albert Astals Cid 2014-02-20 22:59:10 UTC
James

tsdgeos@xps:~$ sha1sum spc4r36q.pdf
59b1c3af7215071b0331c805651b6cfdcd3b7eae  spc4r36q.pdf

?
Comment 7 James Cloos 2014-02-21 00:12:39 UTC
> tsdgeos@xps:~$ sha1sum spc4r36q.pdf
> 59b1c3af7215071b0331c805651b6cfdcd3b7eae  spc4r36q.pdf

I got a file with that name with sha1sum:

244fe65d0b07fe02d7f56c7e2ca27f9c0dbc651b

and:

  Title:          SCSI Primary Commands -4
  CreationDate:   Fri Feb  7 21:30:13 2014
  ModDate:        Wed Feb 12 07:11:18 2014
  File size:      4591300 bytes

according to pdfinfo(1).

So they seem to have pushed another update that fixed the format issues.
Comment 8 Albert Astals Cid 2014-02-26 21:10:52 UTC
I've committed the patch. Maybe what we can do is expose the wasReconstructed bool up so that graphical clients can show a message saying something like "The PDF is broken but we're still showing something". What do you think?
Comment 9 Hib Eris 2014-02-27 05:43:19 UTC
That's probably the best thing poppler can do.
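
A rough sketch of how a graphical client might consume such a flag, assuming it were exposed. The OpenResult type, its fields, and status_message are hypothetical names for illustration; they are not part of any existing poppler API.

```python
from dataclasses import dataclass


@dataclass
class OpenResult:
    """Hypothetical result of opening a document."""
    page_count: int
    was_reconstructed: bool  # assumed mirror of the wasReconstructed bool


def status_message(result: OpenResult) -> str:
    """Return a warning for the UI, or an empty string if nothing to report."""
    if result.was_reconstructed:
        return "The PDF is broken but we're still showing something."
    return ""
```

A viewer could show the returned string in an info bar and hide the bar when the string is empty.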