Bug 85919

Summary: wrong render result
Product: poppler Reporter: nameX <lijunling>
Component: generalAssignee: poppler-bugs <poppler-bugs>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium    
Version: unspecified   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments: the pdf file
the test pdf file
the correct result
the wrong result
patch file

Description nameX 2014-11-05 15:08:12 UTC
Created attachment 108966 [details]
the pdf file

page 2 is bad
Comment 1 Albert Astals Cid 2014-11-05 15:12:20 UTC
You'll have to give more details
Comment 2 nameX 2014-11-06 00:33:45 UTC
Created attachment 108998 [details]
the test pdf file
Comment 3 nameX 2014-11-06 00:35:04 UTC
Created attachment 108999 [details]
the correct result
Comment 4 nameX 2014-11-06 00:35:33 UTC
Created attachment 109000 [details]
the wrong result
Comment 5 nameX 2014-11-19 05:57:12 UTC
Found why,  'startxref' is not in the last 1024bytes.

So I think should modify PDFDoc::getStartXRef
Comment 6 Albert Astals Cid 2014-11-19 19:07:12 UTC
How is this resolved?
Comment 7 nameX 2014-11-19 22:14:46 UTC
continue looking the second last 1024bytes and so on, until find it.
Comment 8 Albert Astals Cid 2014-11-19 22:33:02 UTC
Well, there's nothing resolved until the patch has landed master.
Comment 9 nameX 2014-11-20 07:38:43 UTC
Created attachment 109752 [details] [review]
patch file
Comment 10 Albert Astals Cid 2014-12-14 17:50:41 UTC
The patch looks like it fixes the issue and it does not cause any regression.

Can you please provide your name for proper copyright attribution and one line describing what the patch actually does?
Comment 11 nameX 2014-12-22 00:18:05 UTC
Julius Li(lijunling@sina.com)

Find last 'startxref' in whole file,not just in last 1024bytes.
Comment 12 Thomas Freitag 2014-12-22 08:27:10 UTC
I'm not sure if it is a good idea to search for the last startxref in a PDF without at least a warning that it wasn't found at the end of the PDF, it could break the recovery algorithm:
According to the spec the startxref entry should be at the end of a PDF file:

% Offset of first-page cross-reference table (part 3)
startxref
257
%% EOF

And there could be several startxref entries in a PDF file: in case of an incremental update new PDF objects and the new xref table and the new pointer to the xref table are just appended. So if You don't find the startxref at the end of the file there is a good chance that the PDF file is broken and should be recovered!
Comment 13 Albert Astals Cid 2014-12-22 15:12:56 UTC
Good idea Thomas, maybe this should be only done in case of tryingToReconstruct==true ?
Comment 14 Albert Astals Cid 2014-12-24 14:51:20 UTC
No, trying tryingToReconstruct == true won't work as it's never true in this case. I wonder if the difference is because we're having linearization into account and adobe not?
Comment 15 Thomas Freitag 2014-12-28 11:12:44 UTC
(In reply to Albert Astals Cid from comment #14)
> No, trying tryingToReconstruct == true won't work as it's never true in this
> case. I wonder if the difference is because we're having linearization into
> account and adobe not?

No, Albert. I found now the time to look at the PDF: it is as I guessed, it was an incremental update, but the incremental update section after the original startxref is damaged. It starts with binary data, then have some correct obj's, but also the new startxref is missing.
The poppler code without the patch now does not find the xref table and immediately reconstructs the xref table (because of startXRefPos = 0) with a scan of the complete PDF file. This is possible with the PDF, so xref->isOk() is gTrue. But some of the defect objects (esp. not complete objects) of the incremental update overwrites the original objects, therefore it's then no more possible to render the second page correctly.
So thinking a little bit more Julius Li patch I think it is okay: if it doesn't find a xref section at the end it tries to find one before, so it finds the xref table if existing of the last correct incremental update. If it doesn't find any it runs in the old manner and reconstructs the xref table by parsing the PDF.
Comment 16 Albert Astals Cid 2014-12-30 22:42:43 UTC
I was wondering, should we just not do something like

diff --git a/poppler/PDFDoc.cc b/poppler/PDFDoc.cc
index ec8d3df..c4dd43d 100644
--- a/poppler/PDFDoc.cc
+++ b/poppler/PDFDoc.cc
@@ -92,7 +92,7 @@
                                        // file to look for linearization
                                        // dictionary
 
-#define xrefSearchSize 1024    // read this many bytes at end of file
+#define xrefSearchSize 16384   // read this many bytes at end of file
                                //   to look for 'startxref'
 
 //------------------------------------------------------------------------


?

After all PDFDoc::getStartXRef will read from the back anyway, no?
Comment 17 Albert Astals Cid 2015-01-02 16:24:43 UTC
I commited nameX fix but limiting to 24K instead of the whole file, didn't want us to be reading too much from disk if we're passed some very long crap.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.