Created attachment 111220 [details] The said file With the attached file, if you split it with pdfseparate, the file that represents the third page will have an extra square on the top right part of the page when rendered with pdftoppm that is not there if you render the third page of the original file.
Thomas is this something you can have a look?
(In reply to Albert Astals Cid from comment #1) > Thomas is this something you can have a look? I will have a look at it
Created attachment 111850 [details] [review] use copy of xref in savePageAs so that it can be safely modified The problem was the removing of page annotations which were not referenced by the separated page. This modified the xref of the document. So during the separation of page 1 the page annotations of page 3 were removed and were therefore no more available when page 3 was separated. You can see that when You use i.e. pdfseparate -f 3 -l 3 with the said PDF. In this case the separated PDF of page 3 is okay. To solve this problem the xref of the document in this patch first will copied and so now can be safely modified, and after the work is done the xref will be restored.
This seems to break with the attached file
Created attachment 113165 [details] File that gets broken with proposed patch
Created attachment 113535 [details] [review] use always an unique instance for PDFDoc for savePageAs To solve this problem now a unique instance of PDFDoc is used so that it can be safely modified!
Yep, that one fixes this one and has no regression i can find. There's still some files that generate different content on pdfseparated files than orignal, i'll attach the next one in a sec.
Created attachment 113803 [details] File that renders different separated than in original
(In reply to Albert Astals Cid from comment #8) > Created attachment 113803 [details] > File that renders different separated than in original Funny use of a /P entry in an annotation dictionary. The spec says: (Optional except as noted below; PDF 1.3; not used in FDF files) An indirect reference to the page object with which this annotation is associated. But here the /P doesn't point to a page object (or at least not to a dictionary of type page), it points to a dictionary without any type and just some information which normally are use in a page dictionary. I used the /P entry to decide if I can remove that annotation because it references to a page which will not be separated. Seems as if I need to refine that test a little bit :-)
Created attachment 113837 [details] [review] Refinement of the /P annotation test This patch solves the issue of comment 8
Patch from comment 10 commited!
Next file that breaks, first page of https://bugs.kde.org/attachment.cgi?id=78512 renders white after pdfseparate
Created attachment 118096 [details] [review] refine resource detection In case of the PDF from comment 11 the resource dict used by the first page is neither refered by the catalog dict not by the page dict itself. Therefore we need to refine the search for the resource dict used by the page. This is done by this new patch.
Commited patch from comment #13
Unfortunately there's some more files that render slightly different, the next one i'm attaching i can't see difference visually but a pixel per pixel comparison does yield some differences.
Created attachment 121505 [details] File mentioned on comment #15
Hmmh, Albert, I'm not able to see any differences between the png's created with following commands, even not by a pixel by pixel comparison: ./utils/pdfseparate 87637.open/bug306008.pdf output/bug306008.pdf ./utils/pdftoppm -png -cropbox 87637.open/bug306008.pdf output/bug306008-orig ./utils/pdftoppm -png -cropbox output/bug306008.pdf output/bug306008-sep But there is a difference between the PDF separated and the original: MediaBox[0 0 455.3858267716536 595.1692913385828] MediaBox[0 0 455.3858267717 595.1692913386] The reason for it is in void PDFDoc::writeObject (Object* obj, OutStream* outStr, XRef *xRef, Guint numOffset, Guchar *fileKey, CryptAlgorithm encAlgorithm, int keyLength, int objNum, int objGen) : : : s.appendf("{0:.10g}", obj->getReal()); Can You change that in Your test to s.appendf("{0:.15g}", obj->getReal()); and look, if there are still pixel changes?
No that doesn't help, somehow it ends up breaking the file output. I realized it goes away if i don't use -r 72 on the pdftoppm call, any idea why that may be?
(In reply to Albert Astals Cid from comment #18) > No that doesn't help, somehow it ends up breaking the file output. > Oh, there is a bug in the goo formatter which produces illegal floating point strings in this case. > I realized it goes away if i don't use -r 72 on the pdftoppm call, any idea > why that may be? Yes, it's definetely the difference in the mediabox. If You have an exact look at the pixel which changed You will encounter that they are always at the beginning or end of a line which is drawn by the stroke command. The comes from the different mediabox and anti aliasing effects computed for the start or the end of a stroke path. With 144 dpi the difference is small enough so that the output is the same. With pdfcairo there are no differences in the output. Also if You compile poppler with #define USE_FIXEDPOINT 1 You will see no difference. But to prove it beyond doubt I will upload a small patch just for testing purposes with rounds the mediabox at the tenth decimal place :-)
Created attachment 121536 [details] [review] round mediabox at tenth decimal place If You apply this test patch, create the png's once again You will see no difference anymore
ok, it's mediabox, i think i have other files that still fail, will try to find one later.
Next one, the text on https://bugs.freedesktop.org/attachment.cgi?id=50980 after running through pdfseparate is all broken.
I'm not sure, if I can solve this without breaking the fact, that pdfseparate just looks at the PDF structure: The PDF of comment 22 is broken, pdftoppm says "Syntax Error: Couldn't find trailer dictionary" and if You have a look into it, the xref section at the end just contains 12 entries, where the page needs at least 40 to render it correctly. But I will look if I can detect defect PDFs early enough and call the repair algorithm BEFORE the separate algorithm starts.
I found now the reason: In case of the PDF of comment 22 the xref table will never be reconstructed by pdfseparate, because the trailer, the startxref as well as the xref section are valid, just that the xref section doesn't contain all objects needed to render the PDF. And since pdfseparate first calls xref->scanSpecialFlags() which calls readXRefUntil(-1), the constructXRef branch in readXRefUntil will never be reached anymore. I'm not sure if this could not cause problems in other areas, too, but I finally decided to write a new method getEntryAndReconstruct() which now is called only by PDFDoc::markObject instead of simple getEntry() and which tries to reconstruct the xref table if a num should be fetched which doesn't exist in the xref table.
Created attachment 121586 [details] [review] reconstruct xref table if xrefs are missing but needed This patch also has the advantage that it repairs the PDF of comment 22, so the resulting PDF is sanitized J
patch of comment 25 is obsolete with solution for bug 94260, so PDF of comment 22 will be handled correctly when bug 94260 is fixed!
Found a better solution attached to bug 92508
Created attachment 122326 [details] [review] try to repair a PDF if a referenced object is missing Since the solution for bug 92508 does no more solve this issue, I go back to the former solution but with the knowledge I have now J So this patch now works for the PDF of comment 22.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/603.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.