Created attachment 108966 [details] the PDF file; page 2 renders incorrectly
You'll have to give more details
Created attachment 108998 [details] the test pdf file
Created attachment 108999 [details] the correct result
Created attachment 109000 [details] the wrong result
Found why: 'startxref' is not in the last 1024 bytes. So I think we should modify PDFDoc::getStartXRef.
How is this resolved?
Continue looking in the second-to-last 1024 bytes, and so on, until it is found.
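For illustration, a minimal self-contained sketch of that backward chunked search (this is not poppler's actual PDFDoc::getStartXRef; the function name, I/O handling and overlap logic here are my own assumptions):

#include <cstddef>
#include <fstream>
#include <string>

// Returns the byte offset of the last "startxref" keyword in the file,
// or -1 if none is found. Scans backwards in 1024-byte chunks, keeping a
// small overlap so a keyword split across a chunk boundary is not missed.
long long findLastStartXRef(const char *fileName)
{
    static const std::size_t chunkSize = 1024;
    static const char keyword[] = "startxref";
    const std::size_t keyLen = sizeof(keyword) - 1;

    std::ifstream f(fileName, std::ios::binary);
    if (!f)
        return -1;

    f.seekg(0, std::ios::end);
    long long fileSize = f.tellg();

    long long pos = fileSize;
    std::string buf;
    while (pos > 0) {
        long long start = pos - (long long)chunkSize;
        if (start < 0)
            start = 0;
        // Read the chunk plus keyLen-1 extra bytes so a "startxref"
        // spanning the chunk boundary is still found.
        std::size_t len = (std::size_t)(pos - start) + keyLen - 1;
        if (start + (long long)len > fileSize)
            len = (std::size_t)(fileSize - start);
        buf.resize(len);
        f.seekg(start);
        f.read(&buf[0], len);

        // Because we scan from the end of the file, the first hit we see
        // is the last occurrence in the file.
        std::size_t hit = buf.rfind(keyword);
        if (hit != std::string::npos)
            return start + (long long)hit;

        pos = start;
    }
    return -1;
}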
Well, nothing is resolved until the patch has landed in master.
Created attachment 109752 [details] [review] patch file
The patch looks like it fixes the issue and it does not cause any regression. Can you please provide your name for proper copyright attribution and one line describing what the patch actually does?
Julius Li (lijunling@sina.com). Find the last 'startxref' in the whole file, not just in the last 1024 bytes.
I'm not sure it is a good idea to search for the last startxref in a PDF without at least a warning that it wasn't found at the end of the PDF; it could break the recovery algorithm.

According to the spec, the startxref entry should be at the end of a PDF file:

% Offset of first-page cross-reference table (part 3)
startxref
257
%%EOF

And there can be several startxref entries in a PDF file: in the case of an incremental update, the new PDF objects, the new xref table and the new pointer to the xref table are simply appended. So if you don't find the startxref at the end of the file, there is a good chance that the PDF file is broken and should be recovered!
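For illustration, this is roughly how an incrementally updated file is laid out (offsets, object counts and dictionary entries are made up):

%PDF-1.4
... original body objects ...
xref
... original cross-reference table ...
trailer
<< /Size 10 /Root 1 0 R >>
startxref
257
%%EOF
... objects added or changed by the update ...
xref
... cross-reference section for the update ...
trailer
<< /Size 12 /Root 1 0 R /Prev 257 >>
startxref
4096
%%EOF

A reader that looks for startxref near the end of the file therefore picks up the newest xref chain; having to fall back to an earlier startxref suggests the tail of the file (the newest update) is damaged.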
Good idea, Thomas. Maybe this should only be done in the case of tryingToReconstruct == true?
No, tryingToReconstruct == true won't work, as it's never true in this case. I wonder if the difference is because we're taking linearization into account and Adobe is not?
(In reply to Albert Astals Cid from comment #14)
> No, tryingToReconstruct == true won't work, as it's never true in this
> case. I wonder if the difference is because we're taking linearization
> into account and Adobe is not?

No, Albert. I have now found the time to look at the PDF, and it is as I guessed: it was an incremental update, but the incremental update section after the original startxref is damaged. It starts with binary data, then has some correct objects, but the new startxref is also missing.

Without the patch, the poppler code does not find the xref table and immediately reconstructs it (because of startXRefPos = 0) with a scan of the complete PDF file. This is possible with this PDF, so xref->isOk() is gTrue. But some of the defective objects (especially the incomplete ones) of the incremental update overwrite the original objects, so it is then no longer possible to render the second page correctly.

So, thinking a little bit more about Julius Li's patch, I think it is okay: if it doesn't find an xref section at the end, it tries to find one earlier, so it finds the xref table of the last correct incremental update, if one exists. If it doesn't find any, it behaves as before and reconstructs the xref table by parsing the PDF.
I was wondering, should we not just do something like

diff --git a/poppler/PDFDoc.cc b/poppler/PDFDoc.cc
index ec8d3df..c4dd43d 100644
--- a/poppler/PDFDoc.cc
+++ b/poppler/PDFDoc.cc
@@ -92,7 +92,7 @@
                                         //   file to look for linearization
                                         //   dictionary
 
-#define xrefSearchSize 1024             // read this many bytes at end of file
+#define xrefSearchSize 16384            // read this many bytes at end of file
                                         //   to look for 'startxref'
 
 //------------------------------------------------------------------------

? After all, PDFDoc::getStartXRef will read from the back anyway, no?
I committed the nameX fix, but limited it to 24K instead of the whole file; I didn't want us to read too much from disk in case we're passed some very long crap.