Bug 77549 - pdfseparate takes a long time and does not reduce file size
Summary: pdfseparate takes a long time and does not reduce file size
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: utils (show other bugs)
Version: unspecified
Hardware: x86-64 (AMD64) Mac OS X (All)
: medium normal
Assignee: poppler-bugs
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-04-17 00:16 UTC by Cory
Modified: 2014-07-25 22:13 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
remove annotations from separated pages (803 bytes, patch)
2014-04-24 11:54 UTC, Thomas Freitag
Details | Splinter Review
just remove page annotations which reference a page not separated (9.09 KB, patch)
2014-04-25 12:44 UTC, Thomas Freitag
Details | Splinter Review
additonal handling for annotationas (1.84 KB, patch)
2014-06-20 13:11 UTC, Thomas Freitag
Details | Splinter Review
additonal handling for annotationas (8.37 KB, patch)
2014-07-07 11:42 UTC, Thomas Freitag
Details | Splinter Review

Description Cory 2014-04-17 00:16:16 UTC
Hello, 

I am working with Poppler version 0.24.5 installed via Homebrew on Mavericks.

I have a PDF that is 203 MB, and each page separated (there are hundreds) via pdfseparate comes out at 197MB. 

The data in the PDF is sensitive, so if someone could e-mail me at cory at plangrid dot com, I could send them a Dropbox link. 

Thanks!
Cory Lorenz
Comment 1 Thomas Freitag 2014-04-24 11:54:08 UTC
Created attachment 97897 [details] [review]
remove annotations from separated pages

The problem is the usage of page annotations: Because page annotations have references to the page which have a reference to the parent and so on and so on all objects become marked and are written to the output file, and it takes a lot of time because we get endless loops and it takes a while after the endless loop detection fires.
Pdfseparate is just a small tool to separate single pages, and I think we can live with the fact that annoations are removed. Since acroforms can reference in their fields also annotations, I removed the acroform entry, too.

@Albert: if You want to test the patch yourself I think I can forward the mail from Cory.
Comment 2 Albert Astals Cid 2014-04-24 21:14:05 UTC
I'm not so sure it is a good idea dropping annotations tbh, i mean annotations are part of the page, as a user i'd be surprised if i asked the tool to extract page 3 and then the annotations weren't there.
Comment 3 Cory 2014-04-24 21:30:46 UTC
Hi Thomas, are you talking about pdfmarks style annotations (e.g. lines, curves, etc)? In that case, we would need to keep them.
Comment 4 Thomas Freitag 2014-04-25 12:44:11 UTC
Created attachment 97953 [details] [review]
just remove page annotations which reference a page not separated

I'd already feared that You give such comments. But I couldn't resist the temptation to make it easy for me :-)
So here a solution which removes just those page annotations which reference not separated pages. It was much harder to implement, and even not easy to test it. But I can open the separated PDF with acrobat reader as well as with okular.

But I can't see any differences when open them to those I created with my former patch, beside that acrobat reader gives me the form message if I opened the first separated page (but I can't see a form, but that's the same as if I open the original page).

@ALbert: Can You also have a look at the results? I'll forward Cory's mail.
Comment 5 Albert Astals Cid 2014-05-06 23:47:27 UTC
Looks good, and there's interesting size changes.

Well done as always :)

Now the interesting question, do we want this is stable as a bugfix or is it a feature and should land to master?

I can see it both ways and i'm undecided.
Comment 6 Thomas Freitag 2014-05-07 09:14:04 UTC
(In reply to comment #5)
> Looks good, and there's interesting size changes.
> 
> Well done as always :)
> 
> Now the interesting question, do we want this is stable as a bugfix or is it
> a feature and should land to master?
> 
> I can see it both ways and i'm undecided.

How would You classify a solution for "100% CPU usage"? I think, this is comparable.

But I would appreciate if You classify it as a bugfix. Since the changes only affects pdfseparate/pdfunite the risk is negligible and we get a earlier feedback if we need additional adjustments.
Comment 7 Albert Astals Cid 2014-05-07 19:55:52 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > Looks good, and there's interesting size changes.
> > 
> > Well done as always :)
> > 
> > Now the interesting question, do we want this is stable as a bugfix or is it
> > a feature and should land to master?
> > 
> > I can see it both ways and i'm undecided.
> 
> How would You classify a solution for "100% CPU usage"? I think, this is
> comparable.
> 
> But I would appreciate if You classify it as a bugfix. Since the changes
> only affects pdfseparate/pdfunite the risk is negligible and we get a
> earlier feedback if we need additional adjustments.

I honestly think it's not exactly the same, and even for a 100% cpu bug it may be more of a feature than a bugfix, but let's commit it to stable anyway :)
Comment 8 Cory 2014-05-07 21:46:29 UTC
Thanks guys, that patch looks like it does the trick!
Comment 9 Cory 2014-06-12 19:26:01 UTC
Hi, 

Sorry to reopen the issue, but a bunch of files came up displaying similar behavior. I e-mailed them to Thomas, and can send them to anyone else who is interested. Thanks!
Comment 10 Thomas Freitag 2014-06-20 13:11:42 UTC
Created attachment 101431 [details] [review]
additonal handling for annotationas

The problem are still page annotations, but in the new PDF these annotaions were not referenced in the page dictionnary itself but in one of the stream dictionnaries of content, so I now changed the markDictionnary to handle them everywhere correctly.
Comment 11 Thomas Freitag 2014-07-07 11:42:23 UTC
Created attachment 102369 [details] [review]
additonal handling for annotationas

Reviewed this patch once again and didn't like the usage of public class members anymore, Use them as parameter of different methods, even if this is some more work.
Comment 12 Albert Astals Cid 2014-07-24 22:28:34 UTC
Thomas your patch only applies to master and not to poppler-0.26. Do you want me to apply it only to master or can you provide also a patch for poppler-0.26 ?
Comment 13 Thomas Freitag 2014-07-25 07:22:38 UTC
(In reply to comment #12)
> Thomas your patch only applies to master and not to poppler-0.26. Do you
> want me to apply it only to master or can you provide also a patch for
> poppler-0.26 ?

I don't know when I get time to have a look on this again, so for me it would be sufficient if You just commit it to master.
Comment 14 Albert Astals Cid 2014-07-25 22:13:45 UTC
master it is then :) Pushed!


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.