I've found that the version of poppler in current Ubuntu 14.04 (I have poppler-utils 0.24.5-2ubuntu1) has perhaps become too strict and won't parse PDF files that work fine in other readers. For example with the SPC-4 SCSI spec available from http://www.t10.org/cgi-bin/ac.pl?t=f&f=spc4r36q.pdf , I get:

$ pdfinfo spc4r36q.pdf
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table

However, the built-in Firefox PDF renderer works with that file, as does mupdf. Both seem to be able to render the whole document and don't produce any warnings etc.
The file is very broken; Adobe Acrobat for Linux won't open it either. We can relax one of the checks and open it, but I am not sure whether that will break other files. Basically, the problem is that the length recorded in the linearization dict and the actual file length do not match. If I comment out that check in PDFDoc::isLinearized everything works, but I am scared it may break some other files. OTOH the PDF spec says that the linearization dict length and the file length may legitimately not match, and describes how to recover, but... Hib, what's your opinion: shall we relax that check in PDFDoc::isLinearized and run a regtest? Or maybe at least turn it into an isOk that checks that all the mandated fields of the dict are there?
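To make the check under discussion concrete, here is a minimal sketch in Python. It is not poppler's PDFDoc::isLinearized; the function names are made up for illustration, and a regex scan stands in for a real object parser. It relies only on two facts from the PDF spec: the linearization parameter dictionary sits within the first 1024 bytes of the file, and its /L entry records the length of the file as originally written.

```python
import re

# The linearization parameter dictionary must be within the first 1024 bytes.
HEADER_WINDOW = 1024


def linearized_length(data: bytes):
    """Return the /L value of the linearization dict, or None if absent.

    Hypothetical helper: a regex scan, not a real PDF object parser.
    """
    head = data[:HEADER_WINDOW]
    if re.search(rb"/Linearized\s+[0-9.]+", head) is None:
        return None
    match = re.search(rb"/L\s+(\d+)", head)
    return int(match.group(1)) if match else None


def length_matches(data: bytes) -> bool:
    """True only if the file declares itself linearized and /L equals its size."""
    declared = linearized_length(data)
    return declared is not None and declared == len(data)
```

With the strict form of the check, any file whose /L differs from its actual size is simply not treated as linearized, which is the behaviour being questioned here.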
Created attachment 94392
Patch for the issue

Actually, this seems like it should be pretty safe, since we're only relaxing the condition when everything else has failed. Hib, comments?
The reason a document length is allowed to not match the linearization dict length is that a PDF document can be modified by appending extra objects and a new xref to the end of the document. Clearly such a *modified* document has a length that is larger than the length specified in the linearization dict of the original document. When parsing a *modified* document one should not rely on the information in the linearization dict and/or hints table, and should fall back to the parsing method used for non-linearized documents.

In the particular case of the document at http://www.t10.org/cgi-bin/ac.pl?t=f&f=spc4r36q.pdf, the document is probably modified and is therefore treated as a non-linearized document. However, when parsed as a non-linearized document it turns out to be very broken, so poppler tries to reconstruct an xref table. That does not seem to work well, and rendering the *modified* document fails. Albert's patch adds a fallback which causes poppler to render the original *unmodified* linearized document.

Now, the question is: is it useful to present the original *unmodified* document to the user when the *modified* document is broken? I think it is not, because the document was clearly modified for a reason, and presenting it without the modifications gives the user a false representation of it. But maybe that is what we always do, to some extent, with broken documents, so for me it can go either way.
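The order of attempts described above can be sketched as follows. This is only an illustration in Python, not poppler's code: the three parser callables are hypothetical stand-ins passed in by the caller, each returning a document object or None on failure, and the last branch is an assumption about how "relax the check only if all else failed" might look.

```python
from typing import Callable, Optional, Tuple

Parser = Callable[[bytes], Optional[object]]


def open_pdf(
    data: bytes,
    declared_length: Optional[int],  # /L from the linearization dict, if any
    parse_linearized: Parser,        # hypothetical: uses linearization dict / hints
    parse_plain: Parser,             # hypothetical: uses the trailing xref
    reconstruct_xref: Parser,        # hypothetical: rebuilds the xref by scanning
) -> Tuple[str, Optional[object]]:
    """Return (strategy_name, document), trying the safest strategy first."""
    # 1. Trust the linearization data only when /L matches the real size.
    if declared_length == len(data):
        doc = parse_linearized(data)
        if doc is not None:
            return "linearized", doc
    # 2. Otherwise treat the file as a normal, possibly updated document.
    doc = parse_plain(data)
    if doc is not None:
        return "plain", doc
    # 3. If that is broken too, try to rebuild the xref table.
    doc = reconstruct_xref(data)
    if doc is not None:
        return "reconstructed", doc
    # 4. Last resort: ignore the length mismatch and show the original,
    #    unmodified revision described by the linearization data.
    if declared_length is not None:
        doc = parse_linearized(data)
        if doc is not None:
            return "original-revision", doc
    return "failed", None
```

Because step 4 is reached only after every other strategy has failed, files that open today would take the same path as before; the open question is only whether showing the unmodified revision is better than showing nothing.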
(In reply to comment #2)
> Created attachment 94392 [details]
> Patch for the issue
>
> Actually this seems it should be pretty safe since we're only relaxing the
> condition if all failed. Hib, comments?

Seems safe indeed. If you think it is a good idea to present the *unmodified* document, I think it will have no (other) undesired side effects.
That file (or at least the one I got from the uri) works fine here with current poppler master.
James,

tsdgeos@xps:~$ sha1sum spc4r36q.pdf
59b1c3af7215071b0331c805651b6cfdcd3b7eae spc4r36q.pdf

?
> tsdgeos@xps:~$ sha1sum spc4r36q.pdf
> 59b1c3af7215071b0331c805651b6cfdcd3b7eae spc4r36q.pdf

I got a file with that name with sha1sum 244fe65d0b07fe02d7f56c7e2ca27f9c0dbc651b and:

Title: SCSI Primary Commands -4
CreationDate: Fri Feb 7 21:30:13 2014
ModDate: Wed Feb 12 07:11:18 2014
File size: 4591300 bytes

according to pdfinfo(1). So they seem to have pushed another update which fixed the format issues.
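For anyone comparing downloads on a system without sha1sum, the same digest can be computed with Python's standard hashlib module. This is just a convenience sketch; the function name and the chunk size are arbitrary choices.

```python
import hashlib


def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling sha1_of_file("spc4r36q.pdf") on each copy and comparing the strings shows whether two people are testing the same revision of the file.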
I've committed the patch. Maybe what we can do is expose the wasReconstructed bool up, so that graphical clients can show a message saying something like "The PDF is broken but we're still showing something". What do you think?
That's probably the best thing poppler can do.
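As a rough illustration of the idea of exposing the flag to clients, the sketch below shows how a viewer might react to it. Everything here is hypothetical: the LoadedDocument class and its was_reconstructed field only mirror the wasReconstructed bool mentioned above and are not an existing poppler API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadedDocument:
    """Hypothetical client-side view of an opened PDF."""
    page_count: int
    was_reconstructed: bool  # assumed to mirror poppler's internal bool


def status_message(doc: LoadedDocument) -> Optional[str]:
    """Return a warning for the UI, or None if the file parsed cleanly."""
    if doc.was_reconstructed:
        return "The PDF is broken but we're still showing something"
    return None
```

A client would call status_message() once after loading and display the text in an info bar, leaving rendering itself untouched.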