Bug 30692

Summary: [PATCH] pdftops does not crop
Product: poppler Reporter: William Bader <williambader>
Component: generalAssignee: poppler-bugs <poppler-bugs>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium CC: williambader
Version: unspecified   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments: proposed patch
pdf file with a block of text outside the crop box
alternate patch
patch against the 19 Nov 12 git source
Alternate patch against the 1 Dec 12 git source that passes a crop flag
screen capture to show the problem
Reduced version of your patch

Description William Bader 2010-10-07 15:30:47 UTC
Created attachment 39271 [details] [review]
proposed patch

pdftops sets the bounding box but does not crop anything outside the bounding box.  I have a pdf with a block of descriptive text outside the crop box, and the text gets into the eps (i.e. is not clipped out), whether or not I specify -nocrop.  With pdftops from xpdf-3.02pl4, the text gets in only if I specify -nocrop.  The pdf has a crop box that is smaller than the media box, and the text is in the space above the crop box but inside the media box.

It looks like Page::makeBox() sets *crop to gFALSE if useMediaBox is false.
pdftops calls doc->displayPages() with noCrop for the useMediaBox (because cropping should use the crop box, so noCrop should use the media box).
Since noCrop defaults to gFalse (which means that pdftops crops by default), useMediaBox is false, and makeBox() sets *crop = gFalse.

xpdf-3.02 has the same assignment of *crop = gFalse in Page.cc, but it stores the crop flag in GlobalParams (poppler does not), and it has a separate pageCrop flag (poppler reuses noCrop for pageCrop).

--- poppler-30sep10/poppler/Page.cc-    2010-09-30 22:48:46.463386325 +0200
+++ poppler-30sep10/poppler/Page.cc     2010-10-08 00:02:27.034792873 +0200
@@ -678,7 +678,7 @@
     *box = *mediaBox;
   } else {
     *box = *cropBox;
-    *crop = gFalse;
+    /* *crop = gFalse; */
   }
 }
Comment 1 William Bader 2010-10-07 15:40:23 UTC
Created attachment 39274 [details]
pdf file with a block of text outside the crop box

run
pdftops -eps pdfad.pdf
then view pdfad.eps with a larger bounding box, either with options in the ps viewer or by increasing the height in the bounding box comments.
The block of text above the image shows with poppler pdftops, but it should be cropped out by default, and it should show only if you run
pdftops -eps -nocrop pdfad.pdf
This causes problems in applications like lout http://savannah.nongnu.org/projects/lout that embed eps files without clipping them.
Comment 2 Albert Astals Cid 2010-10-12 15:20:42 UTC
I don't see how that patch can make any difference since Page::makeBox is used in PSOutputDev.cc:3006 and the crop variable there is not used anymore after the call to makeBox, so what is not changing it going to do?
Comment 3 William Bader 2010-10-13 10:49:12 UTC
The patch to Page::makeBox() works.  You can test it with the PDF attachment that I sent along with the patch.
PSOutputDev::checkPageSlice() is not the only place that calls Page::makeBox().
Page::createGfx() also calls Page::makeBox() and then calls new Gfx() with "crop ? cropBox : (PDFRectangle *)NULL".
The alternate patch below to Page::createGfx() also works for that PDF.

--- poppler-30sep10-alt/poppler/Page.cc-        2010-09-30 22:48:46.463386325 +0200
+++ poppler-30sep10-alt/poppler/Page.cc 2010-10-13 19:29:46.109048269 +0200
@@ -427,8 +427,9 @@
     rotate += 360;
   }

+  GBool tempCrop = crop;
   makeBox(hDPI, vDPI, rotate, useMediaBox, out->upsideDown(),
-         sliceX, sliceY, sliceW, sliceH, &box, &crop);
+         sliceX, sliceY, sliceW, sliceH, &box, &tempCrop);
   cropBox = getCropBox();
   mediaBox = getMediaBox();
Comment 4 William Bader 2010-10-13 10:49:56 UTC
Created attachment 39416 [details] [review]
alternate patch
Comment 5 Albert Astals Cid 2010-10-13 11:06:41 UTC
I know there are more places.

The code in Page::createGfx pases crop by pointer on purpose so it can get upadted, your code simply breaks that, it might gix some issue with printing to PS but i'm more than certain that it will break splash/cairo backends.

The code is logically correct, if Page::makeBox goes by the last if, you don't need to crop since you are already using the cropbox as box for your rendering.
Comment 6 William Bader 2010-10-13 11:26:30 UTC
Yes, I realize that the patch might break something.  I was just experimenting to track down the location of the problem.

>if Page::makeBox goes by the last if, you don't need to crop since you are already using the cropbox as box for your rendering.

By inspection of the generated EPS, it is more complicated.  pdftops always writes the correct BoundingBox comment, but it writes the postscript code to create a clipping region only if "crop" is true.

If you view the EPS and the viewer clips the display window to the BoundingBox, everything is OK, but if you print the EPS, the postscript clipping code is necessary or else things might show that should have been clipped out.

This issue is a regression from the behavior of pdftops in xpdf.

William
Comment 7 Albert Astals Cid 2010-10-13 11:55:28 UTC
Downloaded xpdf-3.02 and the ps it generates crashes gs. Looked at the code and didn't see any noticeable difference (yes they use two flags and a they store one of them in globalparams, nothing that should make the ps writing code different from what i can see)
Comment 8 William Bader 2010-10-13 14:04:07 UTC
Thanks for testing with xpdf.  I'm not sure why you had problems, because it works for me.
I am using xpdf-3.02pl4, which I think is the latest release.  I think that earlier versions work the same as this one.
I am using gs9.00.  I am running Fedora 13, and the distributed gs8.71 can also view the eps.
I can send the eps files or gv screen captures or other information if you want.

Here is what happens if I run normally and with -nocrop and then compare the results.  "re" makes a rectangular path from four numbers.  "/W" clips and then starts a new path.

$ /usr/local/bin/pdftops-3.02pl4 -v
pdftops version 3.02
Copyright 1996-2007 Glyph & Cog, LLC
$ /usr/local/bin/pdftops-3.02pl4 -eps 20185846-cropbug.pdf crop.eps
$ /usr/local/bin/pdftops-3.02pl4 -eps -nocrop 20185846-cropbug.pdf nocrop.eps
$ ls -l crop.eps nocrop.eps
-rw-rw-rw- 1 william william 1254692 Oct 13 22:40 crop.eps
-rw-rw-rw- 1 william william 1254653 Oct 13 22:40 nocrop.eps
$ diff -U 10 crop.eps nocrop.eps
--- crop.eps    2010-10-13 22:40:14.022258605 +0200
+++ nocrop.eps  2010-10-13 22:40:18.782008510 +0200
@@ -412,22 +412,20 @@
 0 J
 10 M
 1 w
 /DeviceGray {} cs
 [0] sc
 /DeviceGray {} CS
 [0] SC
 false op
 false OP
 {} settransfer
-7.8891 16.1876 415.7905 1009.8124 re
-W
 q
 q
 1 i
 9.072 17.001 413.856 1008 re
 W
 /DeviceCMYK {} cs
 [0 0 0 0.2] sc
 false OP
 false op
 {} settransfer
$

If you have gs9.00, try the patch

--- gs9.00/base/gxccman.c-      2010-08-10 12:20:19.000000000 -0400
+++ gs9.00/base/gxccman.c       2010-09-28 15:50:14.053145974 -0400
@@ -609,7 +609,7 @@
        gs_make_mem_mono_device(pdev, pdev->memory, target);
        rc_decrement_only(target, "gx_alloc_char_bits"); /* can't go to 0 */
         /* Decrement the ICC profile also.  Same device is getting reinitialized */
-        rc_decrement(target->device_icc_profile,"gx_alloc_char_bits(icc profile)");
+        if (target != NULL) rc_decrement(target->device_icc_profile,"gx_alloc_char_bits(icc profile)");
        pdev->rc = rc;
        pdev->retained = retained;
        pdev->width = iwidth;

If you have gs8.71, try the patch

--- gs8.71/lib/pdf2dsc.ps-      2008-02-25 06:48:45.000000000 +0100
+++ gs8.71/lib/pdf2dsc.ps       2010-02-19 22:14:55.000000000 +0100
@@ -116,7 +116,7 @@
    DSCfile PDFname write==only
    ( \(r\) file { DELAYSAFER { .setsafe } if } stopped pop\n) puts
    ( pdfopen begin\n) puts
-   ( copy_trailer_attrs\n) puts
+   ( process_trailer_attrs\n) puts
    (%%EndSetup\n) puts

    /.hasPageLabels false def % see "Page Labels" in the PDF Reference

If you have gs8.70, try the patch

--- gs8.70/Resource/Init/pdf_draw.ps-   2009-07-08 20:28:02.000000000 -0400
+++ gs8.70/Resource/Init/pdf_draw.ps    2009-08-17 17:42:54.000000000 -0400
@@ -262,7 +262,7 @@
 } bdef

 /resolvehalftone {     % <dict> resolvehalftone <halftone>
-  dup /HalftoneType get
+  dup /HalftoneType oget
   dup //htrdict exch .knownget {
     exch pop exec
   } {

If gs crashes with a floating point exception, try the patch below (I can give you versions for other versions of gs).

--- gs9.00/base/gsmisc.c-       2010-05-06 12:04:27.000000000 -0400
+++ gs9.00/base/gsmisc.c        2010-09-28 14:17:42.677102939 -0400
@@ -883,17 +883,25 @@
  */
 static double const_90_degrees = 90.;

+/* const_90_degrees no longer works for GCC 4.1.2.
+ * As old fortran programmers know, to avoid rounding problems,
+ * real numbers must be compared as "fabs(x-y) < eps"
+ * and not for equality like "floor(x) == y".
+ */
+
+#define        is_integer(x)   (fabs(floor((x) + 0.5) - (x)) <= 1.0e-15)
+
 double
 gs_sin_degrees(double ang)
 {
     double quot = ang / const_90_degrees;

-    if (floor(quot) == quot) {
+    if (is_integer(quot)) {
         /*
          * We need 4.0, rather than 4, here because of non-ANSI compilers.
          * The & 3 is because quot might be negative.
          */
-        return isincos[(int)fmod(quot, 4.0) & 3];
+       return isincos[(int)fmod(floor(quot+0.5), 4.0) & 3];
     }
     return sin(ang * (M_PI / 180));
 }
@@ -903,9 +911,9 @@
 {
     double quot = ang / const_90_degrees;

-    if (floor(quot) == quot) {
+    if (is_integer(quot)) {
         /* See above re the following line. */
-        return isincos[((int)fmod(quot, 4.0) & 3) + 1];
+       return isincos[((int)fmod(floor(quot+0.5), 4.0) & 3) + 1];
     }
     return cos(ang * (M_PI / 180));
 }
@@ -915,9 +923,9 @@
 {
     double quot = ang / const_90_degrees;

-    if (floor(quot) == quot) {
+    if (is_integer(quot)) {
         /* See above re the following line. */
-        int quads = (int)fmod(quot, 4.0) & 3;
+       int quads = (int)fmod(floor(quot+0.5), 4.0) & 3;

         psincos->sin = isincos[quads];
         psincos->cos = isincos[quads + 1];
Comment 9 dror_lev 2012-10-07 11:46:25 UTC
The last post is from a couple of years ago, but the problem seems to persist, at least in Libre-Office (https://bugs.freedesktop.org/show_bug.cgi?id=40163).

Some recent "playing around" and experiments for workarounds can be seen in this LibreOfficeForum thread: http://libreofficeforum.org/node/4183

Is there any work on solving this issue, or is planned for an incoming version?

Thanks,
dror
Comment 10 William Bader 2012-10-07 15:38:30 UTC
When I build poppler pdftops for my own use, I continue to apply the patch that I suggested in comment 3.  Can you check if that fixes the problem in LibreOffice?
Comment 11 Albert Astals Cid 2012-11-19 21:25:34 UTC
Is this patch still needed? Can you push a rebased (i guess this one no longer applies) version?
Comment 12 William Bader 2012-11-19 23:46:46 UTC
Created attachment 70289 [details] [review]
patch against the 19 Nov 12 git source

I made a version of the patch against 19 Nov 12 source.  The patch still works for me with the PDF that I attached previously.
createGfx() calls makeBox() and then Gfx().
makeBox() sometimes sets crop to FALSE, and then createGfx() does not pass a crop box to Gfx().
The patch uses the original value of crop in the call to Gfx().
Some files, like the one that I attached, draw outside the crop box, so even if the PDF sets a crop box of a given size, you still need to generate the postscript commands to crop to that size.
William
Comment 13 Albert Astals Cid 2012-12-01 14:23:16 UTC
(In reply to comment #12)
> Created attachment 70289 [details] [review] [review]
> patch against the 19 Nov 12 git source
> 
> I made a version of the patch against 19 Nov 12 source.  The patch still
> works for me with the PDF that I attached previously.
> createGfx() calls makeBox() and then Gfx().
> makeBox() sometimes sets crop to FALSE, and then createGfx() does not pass a
> crop box to Gfx().
> The patch uses the original value of crop in the call to Gfx().
> Some files, like the one that I attached, draw outside the crop box, so even
> if the PDF sets a crop box of a given size, you still need to generate the
> postscript commands to crop to that size.

Right, this still does not make sense, you are creating a tempvalue so that the returned value for makeBox is not used, if that is what you really want, why just not remove that parameter?

On the other hand if this is about the PS conversion i do think that it is the PSOutputDev that should be modified and for example cropping itself on the box passed in the state of startPage ?
Comment 14 William Bader 2012-12-01 17:57:44 UTC
>if that is what you really want, why just not remove that parameter?

There are several calls to makeBox().  One of the calls to makeBox() in PSOutputDev::checkPageSlice() uses the crop parameter.

>i do think that it is the PSOutputDev that should be modified

I think that Page::createGfx() is the right place because it calls makeBox() (which returns both the box and the crop flag) and it makes the crop decision when it sets arguments to new Gfx().  The further that you separate the place that has the information from the place that makes the decision, the messier it gets.

Wouldn't doing it anywhere else require splitting Page::createGfx() into two parts or making more internals of Page or Gfx public?

William
Comment 15 Albert Astals Cid 2012-12-01 18:21:38 UTC
(In reply to comment #14)
> >if that is what you really want, why just not remove that parameter?
> 
> There are several calls to makeBox().  One of the calls to makeBox() in
> PSOutputDev::checkPageSlice() uses the crop parameter.

Ah, right, didn't see them

> 
> >i do think that it is the PSOutputDev that should be modified
> 
> I think that Page::createGfx() is the right place because it calls makeBox()
> (which returns both the box and the crop flag) and it makes the crop
> decision when it sets arguments to new Gfx().  The further that you separate
> the place that has the information from the place that makes the decision,
> the messier it gets.
> 
> Wouldn't doing it anywhere else require splitting Page::createGfx() into two
> parts or making more internals of Page or Gfx public?

Why? The issue as I see it is that you give the PSOutputDev a box, and it allows commands outside that box to be rendered, that is wrong and shall be fixed at the PSOutputDev level, not somewhere else.
Comment 16 William Bader 2012-12-01 19:04:00 UTC
>The issue as I see it is that you give the PSOutputDev a box, and it allows commands outside that box to be rendered, that is wrong and shall be fixed at the PSOutputDev level, not somewhere else.

With the patch, pdftops crops by generating a single crop box in postscript at the beginning of the page.  I think that xpdf (which works) does that also.  Checking the crop box inside poppler before rendering each object could be expensive, and it could be tricky with text that was part inside and part outside.

In theory, a pdf should not attempt to draw much outside the crop box other than a small amount of descriptive information (as in the example pdf that I attached) or registration marks for printing.  The example pdf has an image of an ad that was intended to be placed on a newspaper page.  The crop box marks the part of the image that should appear on the page.  The objects outside the crop box are for proofing the image as a stand-alone item.

I have the same problem as you that I have very little time.  If you can tell me exactly where poppler should make the crop test, I can try it.  The patch has been working for me for years, and one of the libreoffice people said that it worked for them also, and it is only a very small change.  Isn't a bigger change more likely risk side-effects that could take a while to discover?
Comment 17 Albert Astals Cid 2012-12-01 19:14:04 UTC
(In reply to comment #16)
> >The issue as I see it is that you give the PSOutputDev a box, and it allows commands outside that box to be rendered, that is wrong and shall be fixed at the PSOutputDev level, not somewhere else.
> 
> With the patch, pdftops crops by generating a single crop box in postscript
> at the beginning of the page.  I think that xpdf (which works) does that
> also.  Checking the crop box inside poppler before rendering each object
> could be expensive, and it could be tricky with text that was part inside
> and part outside.

I never suggested checking the for each object. I suggested that PSOutputDev::startPage generates the cropbox with the given box in the state

> 
> In theory, a pdf should not attempt to draw much outside the crop box other
> than a small amount of descriptive information (as in the example pdf that I
> attached) or registration marks for printing.  The example pdf has an image
> of an ad that was intended to be placed on a newspaper page.  The crop box
> marks the part of the image that should appear on the page.  The objects
> outside the crop box are for proofing the image as a stand-alone item.

Yes I know what a crop box is for.

> I have the same problem as you that I have very little time.  If you can
> tell me exactly where poppler should make the crop test, I can try it.  The
> patch has been working for me for years, and one of the libreoffice people
> said that it worked for them also, and it is only a very small change. 
> Isn't a bigger change more likely risk side-effects that could take a while
> to discover?

Because to me the patch looks like a hack, yes, hacks work, but they are not proper fixes.
Comment 18 William Bader 2012-12-01 20:48:38 UTC
>I suggested that PSOutputDev::startPage generates the cropbox with the given box in the state

I think that with the patch, the postscript clip is generated by calls in Gfx::Gfx() when Page::createGfx() calls new Gfx().

I agree that my patch is hacky.  So I understand, you want to keep Gfx::Gfx() and have PSOutputDev::startPage() generate the clip line?

Can startPage() use page->getCropBox() ?  That would save passing it.

Gfx::Gfx() calls 
  out->startPage(pageNum, state);
  out->setDefaultCTM(state->getCTM());
before calling
  out->clip(state);
so that the CTM is initialized before generating the clip.

If PSOutputDev::startPage() sets up the clip, it would be generated before the CTM is initialized. Is that a problem?

Since Page::createGfx() is generic to all output, while PSOutputDev::startPage() is specific to postscript, would moving the patch from createGfx() to startPage() mean other devices that wanted to crop to the cropbox would need to implement it in their own startPage()>

William
Comment 19 Albert Astals Cid 2012-12-01 23:05:05 UTC
(In reply to comment #18)
> >I suggested that PSOutputDev::startPage generates the cropbox with the given box in the state
> 
> I think that with the patch, the postscript clip is generated by calls in
> Gfx::Gfx() when Page::createGfx() calls new Gfx().

Yes

> I agree that my patch is hacky.  So I understand, you want to keep
> Gfx::Gfx() and have PSOutputDev::startPage() generate the clip line?

Yes, well not startPage specifically but something in PSOutputDev and startPage seemed like a good place with a quick look

> Can startPage() use page->getCropBox() ?  That would save passing it.

No, but you can use the pageBox of the state (ok, now i see that the GfxState takes the pageBox but doesn't really store it, but i wouldn't mind it getting stored if we need to)

> 
> Gfx::Gfx() calls 
>   out->startPage(pageNum, state);
>   out->setDefaultCTM(state->getCTM());
> before calling
>   out->clip(state);
> so that the CTM is initialized before generating the clip.
> 
> If PSOutputDev::startPage() sets up the clip, it would be generated before
> the CTM is initialized. Is that a problem?

Hmmm, that might be a problem yes, it is weird that this is done in two steps :-/

> Since Page::createGfx() is generic to all output, while
> PSOutputDev::startPage() is specific to postscript, would moving the patch
> from createGfx() to startPage() mean other devices that wanted to crop to
> the cropbox would need to implement it in their own startPage()>

No, other devices that crop to the cropbox already work, note that the situation you are mentioning is not cropping to the cropbox but using the cropbox as box for your rendering (the only way for Page::makeBox to set crop to false is when you say you don't want to use the mediabox) so effectively you don't need to crop since your box is already the cropbox, i understand you say that drawing outside the box when using the cropbox as box creates wrong renderings, and i think this is a bug in the PSOutputDev and that's where I'd prefer the fix to be located there.

This is the situation we want to fix, right?

Now another solution would be adding a virtual doINeedAClipAlways to OutputDev and make PSOutputDev say true and hook this up in the clip calls of Gfx::Gfx
Comment 20 William Bader 2012-12-02 02:00:44 UTC
Created attachment 70901 [details] [review]
Alternate patch against the 1 Dec 12 git source that passes a crop flag

This patch adds virtual GBool needClipToCropBox() to OutputDev and PSOutputDev.
It returns gFalse in OutputDev and the value of a class variable in PSOutputDev.
pdftops sets the flag when it creates a new PSOutputDev.
Page::createGfx() checks out->needClipToCropBox().
This changes the signature of PSOutputDev().  If that is bad, I can make a function to set the flag.
William
Comment 21 Albert Astals Cid 2012-12-08 17:28:04 UTC
This is not really what i had in mind, but before i propose what i had in mind, can you explain why the "doEPS && !noCrop"? I thought we needed this all the time (i.e the initial patch doesn't seem to do anything with doEPS)
Comment 22 William Bader 2012-12-09 07:01:35 UTC
When pdftops makes a PS instead of an EPS, it already crops.  The problem with not cropping happens only when it is making an EPS, so I set the force crop flag with doEPS && !noCrop.  Page::makeBox() sets *crop = gFalse (which is the core of the problem) only when using the cropbox instead of the mediabox.
William
Comment 23 Albert Astals Cid 2012-12-11 18:28:44 UTC
Interesting, so this is only about EPS?

Because i see in PSOutputDev::writeHeader case psModeEPS: the cropbox is being used to set the BoundingBox already

Also i've run 
pdftops -eps file_in_this_bug.pdf output.eps
and the result looks fine (to me) when opening such eps in gimp for example

Could you attach an image of gimp opening the output of the and pointing to where the problem lies?
Comment 24 William Bader 2012-12-11 22:34:14 UTC
Created attachment 71362 [details]
screen capture to show the problem

The initial PDF that I attached to this bug is sufficient to reproduce the problem.  After running pdftops, you need to view the EPS with a program that is capable of showing all of the image and not only what is inside the area specified in the %%BoundingBox comment.

In the screen capture, I used the unpatched pdftops 0.18.4 distributed with Fedora 17.  I made both a PS and an EPS and then viewed them with "gv".

gimp might not be a good test, because as far as I know, it does not natively support PDF, PS or EPS, and so it must rasterize them, and it probably rasterizes them to the size in the %%BoundingBox comment, which means that you will never see anything outside the crop box.

gv allows you to override the %%BoundingBox comment.  I selected a large size that I used for previewing broadsheet newspaper pages.
pdftops crops PS.  If you try this with my test PDF and then compare the PS and EPS files, the PS has the additional code
0.5 55 translate
7 16 417 1010 re W
immediately before the %%EndPageSetup comment.
My guess is that someone thought that it was not necessary to crop the EPS because application embedding the EPS on a page would always crop it to the size in the %%BoundingBox.

William
Comment 25 Albert Astals Cid 2012-12-13 17:11:35 UTC
Created attachment 71452 [details] [review]
Reduced version of your patch

Here's a reduced version of your patch, what do you think?
Comment 26 William Bader 2012-12-13 18:10:25 UTC
Thanks, I tested your reduced patch, and it works fine for me.
Comment 27 Albert Astals Cid 2012-12-13 18:26:25 UTC
Nice, commited!

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.