87637 – pdfseparate+pdftoppm renders different than original file

Bug 87637 - pdfseparate+pdftoppm renders different than original file

Summary: pdfseparate+pdftoppm renders different than original file

Status:	RESOLVED MOVED

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	general (show other bugs)
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	poppler-bugs
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2014-12-23 15:01 UTC by Albert Astals Cid
Modified:	2018-08-21 11:17 UTC (History)
CC List:	1 user (show)

See Also:
i915 platform:
i915 features:

Attachments
The said file (134.31 KB, text/plain) 2014-12-23 15:01 UTC, Albert Astals Cid	Details
use copy of xref in savePageAs so that it can be safely modified (725 bytes, patch) 2015-01-06 14:11 UTC, Thomas Freitag	Details \| Splinter Review
File that gets broken with proposed patch (40.60 KB, text/plain) 2015-02-04 16:12 UTC, Albert Astals Cid	Details
use always an unique instance for PDFDoc for savePageAs (812 bytes, patch) 2015-02-16 13:46 UTC, Thomas Freitag	Details \| Splinter Review
File that renders different separated than in original (6.05 KB, application/pdf) 2015-02-24 22:38 UTC, Albert Astals Cid	Details
Refinement of the /P annotation test (1.23 KB, patch) 2015-02-26 08:37 UTC, Thomas Freitag	Details \| Splinter Review
refine resource detection (787 bytes, patch) 2015-09-06 10:00 UTC, Thomas Freitag	Details \| Splinter Review
File mentioned on comment #15 (1.72 MB, application/force-download) 2016-02-04 00:14 UTC, Albert Astals Cid	Details
round mediabox at tenth decimal place (825 bytes, patch) 2016-02-05 13:04 UTC, Thomas Freitag	Details \| Splinter Review
reconstruct xref table if xrefs are missing but needed (2.76 KB, patch) 2016-02-08 13:47 UTC, Thomas Freitag	Details \| Splinter Review
try to repair a PDF if a referenced object is missing (2.77 KB, patch) 2016-03-15 14:45 UTC, Thomas Freitag	Details \| Splinter Review
Show Obsolete (4) View All

Description Albert Astals Cid 2014-12-23 15:01:10 UTC

Created attachment 111220 [details]
The said file

With the attached file, if you split it with pdfseparate, the file that represents the third page will have an extra square on the top right part of the page when rendered with pdftoppm that is not there if you render the third page of the original file.

Comment 1 Albert Astals Cid 2014-12-23 15:02:07 UTC

Thomas is this something you can have a look?

Comment 2 Thomas Freitag 2014-12-28 11:38:58 UTC

(In reply to Albert Astals Cid from comment #1)
> Thomas is this something you can have a look?

I will have a look at it

Comment 3 Thomas Freitag 2015-01-06 14:11:54 UTC

Created attachment 111850 [details] [review]
use copy of xref in savePageAs so that it can be safely modified

The problem was the removing of page annotations which were not referenced by the separated page. This modified the xref of the document. So during the separation of page 1 the page annotations of page 3 were removed and were therefore no more available when page 3 was separated. You can see that when You use i.e. pdfseparate -f 3 -l 3 with the said PDF. In this case the separated PDF of page 3 is okay.

To solve this problem the xref of the document in this patch first will copied and so now can be safely modified, and after the work is done the xref will be restored.

Comment 4 Albert Astals Cid 2015-02-04 16:12:06 UTC

This seems to break with the attached file

Comment 5 Albert Astals Cid 2015-02-04 16:12:44 UTC

Created attachment 113165 [details]
File that gets broken with proposed patch

Comment 6 Thomas Freitag 2015-02-16 13:46:14 UTC

Created attachment 113535 [details] [review]
use always an unique instance for PDFDoc for savePageAs

To solve this problem now a unique instance of PDFDoc is used so that it can be safely modified!

Comment 7 Albert Astals Cid 2015-02-24 22:35:51 UTC

Yep, that one fixes this one and has no regression i can find. There's still some files that generate different content on pdfseparated files than orignal, i'll attach the next one in a sec.

Comment 8 Albert Astals Cid 2015-02-24 22:38:52 UTC

Created attachment 113803 [details]
File that renders different separated than in original

Comment 9 Thomas Freitag 2015-02-26 08:35:52 UTC

(In reply to Albert Astals Cid from comment #8)
> Created attachment 113803 [details]
> File that renders different separated than in original

Funny use of a /P entry in an annotation dictionary. The spec says:

(Optional except as noted below; PDF 1.3; not used in FDF files) An indirect reference to the page object with which this annotation is associated.

But here the /P doesn't point to a page object (or at least not to a dictionary of type page), it points to a dictionary without any type and just some information which normally are use in a page dictionary.

I used the /P entry to decide if I can remove that annotation because it references to a page which will not be separated. Seems as if I need to refine that test a little bit :-)

Comment 10 Thomas Freitag 2015-02-26 08:37:59 UTC

Created attachment 113837 [details] [review]
Refinement of the /P annotation test

This patch solves the issue of comment 8

Comment 11 Albert Astals Cid 2015-08-27 20:17:53 UTC

Patch from comment 10 commited!

Comment 12 Albert Astals Cid 2015-08-27 20:26:37 UTC

Next file that breaks, first page of https://bugs.kde.org/attachment.cgi?id=78512 renders white after pdfseparate

Comment 13 Thomas Freitag 2015-09-06 10:00:01 UTC

Created attachment 118096 [details] [review]
refine resource detection

In case of the PDF from comment 11 the resource dict used by the first page is neither refered by the catalog dict not by the page dict itself. Therefore we need to refine the search for the resource dict used by the page. This is done by this new patch.

Comment 14 Albert Astals Cid 2016-02-04 00:13:08 UTC

Commited patch from comment #13

Comment 15 Albert Astals Cid 2016-02-04 00:14:08 UTC

Unfortunately there's some more files that render slightly different, the next one i'm attaching i can't see difference visually but a pixel per pixel comparison does yield some differences.

Comment 16 Albert Astals Cid 2016-02-04 00:14:59 UTC

Created attachment 121505 [details]
File mentioned on comment #15

Comment 17 Thomas Freitag 2016-02-04 10:55:13 UTC

Hmmh, Albert, I'm not able to see any differences between the png's created with following commands, even not by a pixel by pixel comparison:

./utils/pdfseparate 87637.open/bug306008.pdf output/bug306008.pdf
./utils/pdftoppm -png -cropbox 87637.open/bug306008.pdf output/bug306008-orig
./utils/pdftoppm -png -cropbox output/bug306008.pdf output/bug306008-sep

But there is a difference between the PDF separated and the original:

MediaBox[0 0 455.3858267716536 595.1692913385828]
MediaBox[0 0 455.3858267717 595.1692913386]

The reason for it is in 

void PDFDoc::writeObject (Object* obj, OutStream* outStr, XRef *xRef, Guint numOffset, Guchar *fileKey,
                          CryptAlgorithm encAlgorithm, int keyLength, int objNum, int objGen)

: : :

s.appendf("{0:.10g}", obj->getReal());

Can You change that in Your test to

s.appendf("{0:.15g}", obj->getReal());

and look, if there are still pixel changes?

Comment 18 Albert Astals Cid 2016-02-04 23:51:19 UTC

No that doesn't help, somehow it ends up breaking the file output.

I realized it goes away if i don't use -r 72 on the pdftoppm call, any idea why that may be?

Comment 19 Thomas Freitag 2016-02-05 13:01:35 UTC

(In reply to Albert Astals Cid from comment #18)
> No that doesn't help, somehow it ends up breaking the file output.
> 

Oh, there is a bug in the goo formatter which produces illegal floating point strings in this case. 

> I realized it goes away if i don't use -r 72 on the pdftoppm call, any idea
> why that may be?

Yes, it's definetely the difference in the mediabox. If You have an exact look at the pixel which changed You will encounter that they are always at the beginning or end of a line which is drawn by the stroke command. The comes from the different mediabox and anti aliasing effects computed for the start or the end of a stroke path.

With 144 dpi the difference is small enough so that the output is the same.

With pdfcairo there are no differences in the output.

Also if You compile poppler with 

#define USE_FIXEDPOINT 1

You will see no difference. But to prove it beyond doubt I will upload a small patch just for testing purposes with rounds the mediabox at the tenth decimal place :-)

Comment 20 Thomas Freitag 2016-02-05 13:04:12 UTC

Created attachment 121536 [details] [review]
round mediabox at tenth decimal place

If You apply this test patch, create the png's once again You will see no difference anymore

Comment 21 Albert Astals Cid 2016-02-07 12:53:45 UTC

ok, it's mediabox, i think i have other files that still fail, will try to find one later.

Comment 22 Albert Astals Cid 2016-02-07 22:43:25 UTC

Next one, the text on https://bugs.freedesktop.org/attachment.cgi?id=50980 after running through pdfseparate is all broken.

Comment 23 Thomas Freitag 2016-02-08 09:57:59 UTC

I'm not sure, if I can solve this without breaking the fact, that pdfseparate just looks at the PDF structure:
The PDF of comment 22 is broken, pdftoppm says "Syntax Error: Couldn't find trailer dictionary" and if You have a look into it, the xref section at the end just contains 12 entries, where the page needs at least 40 to render it correctly.

But I will look if I can detect defect PDFs early enough and call the repair algorithm BEFORE the separate algorithm starts.

Comment 24 Thomas Freitag 2016-02-08 13:45:36 UTC

I found now the reason:

In case of the PDF of comment 22 the xref table will never be reconstructed by pdfseparate, because the trailer, the startxref as well as the xref section are valid, just that the xref section doesn't contain all objects needed to render the PDF.
And since pdfseparate first calls xref->scanSpecialFlags() which calls readXRefUntil(-1), the constructXRef branch in readXRefUntil will never be reached anymore.

I'm not sure if this could not cause problems in other areas, too, but I finally decided to write a new method getEntryAndReconstruct() which now is called only by PDFDoc::markObject instead of simple getEntry() and which tries to reconstruct the xref table if a num should be fetched which doesn't exist in the xref table.

Comment 25 Thomas Freitag 2016-02-08 13:47:56 UTC

Created attachment 121586 [details] [review]
reconstruct xref table if xrefs are missing but needed

This patch also has the advantage that it repairs the PDF of comment 22, so the resulting PDF is sanitized J

Comment 26 Thomas Freitag 2016-03-10 16:45:09 UTC

patch of comment 25 is obsolete with solution for bug 94260, so PDF of comment 22 will be handled correctly when bug 94260 is fixed!

Comment 27 Thomas Freitag 2016-03-14 13:59:11 UTC

Found a better solution attached to bug 92508

Comment 28 Thomas Freitag 2016-03-15 14:45:54 UTC

Created attachment 122326 [details] [review]
try to repair a PDF if a referenced object is missing

Since the solution for bug 92508 does no more solve this issue, I go back to the former solution but with the knowledge I have now J

So this patch now works for the PDF of comment 22.

Comment 29 GitLab Migration User 2018-08-21 11:17:36 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/603.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.