Bug 71213

Summary:	Add new PDFWriter class for writing PDF with printing options
Product:	poppler	Reporter:	Adrian Johnson <ajohnson>
Component:	general	Assignee:	poppler-bugs <poppler-bugs>
Status:	RESOLVED MOVED	QA Contact:
Severity:	normal
Priority:	medium	CC:	bugzilla, felix+freedesktop, mkasik, rishi.is, till.kamppeter
Version:	unspecified
Hardware:	Other
OS:	All
Whiteboard:
i915 platform:		i915 features:
Attachments:	Add PDFWriter pdftopdf test program Add DeflateStream glib API for PDFWriter encrypted document fix compress decrypted streams fix typo Use consistent line ending in PDFDoc Add PDFWriter glib API for PDFWriter PDFWriter changes to address review comments Font embedding Use of multiple PDFs by single PDFWriter 1 - Add PDFWriter 2 - pdftopdf test program 3 - Add DeflateStream 4 - glib API for PDFWriter 5 - encrypted document fix 6 - compress decrypted streams 7 - fix typo 8 - Use consistent line ending in PDFDoc 9 - gitignore 10 multiple documents

Description Adrian Johnson 2013-11-04 10:54:39 UTC

Created attachment 88591 [details]
Add PDFWriter

In order to get evince to print in PDF format without restricting the output to only what cairo supports we need a way to rewrite the PDF with the print settings from the print dialog applied. This will avoid stripping color management from the PDF, avoid cairo bugs, and result in much faster printing.

The first patch adds a new PDFWriter class for writing the PDF file with the selected printing options. These options include, page selection, number of copies, collation, number-up, scaling, shrink/fit page, paper size selection, orientation etc.

Comment 1 Adrian Johnson 2013-11-04 10:55:56 UTC

Created attachment 88592 [details] [review]
pdftopdf test program

This patch is a test program for the PDFWriter class.

Comment 2 Adrian Johnson 2013-11-04 10:57:21 UTC

Created attachment 88593 [details] [review]
Add DeflateStream

This patch adds a DeflateStream output stream filter for compressing the XObject streams in PDFWriter.

Comment 3 Adrian Johnson 2013-11-04 10:58:18 UTC

Created attachment 88594 [details] [review]
glib API for PDFWriter

glib api for writing PDFs with printing options

Comment 4 Adrian Johnson 2013-11-04 10:59:29 UTC

Created attachment 88595 [details] [review]
encrypted document fix

I'm not sure if this patch is correct. It seems to fix the problem with PDFWriter and encrypted documents.

Comment 5 Adrian Johnson 2013-11-04 11:00:58 UTC

Created attachment 88596 [details] [review]
compress decrypted streams

This uses DeflateStream to recompress streams that have been decrypted.

Comment 6 Adrian Johnson 2013-11-04 11:01:48 UTC

Created attachment 88597 [details] [review]
fix typo

Fix the incorrect spelling of 'Dictionary' in PDFDoc.

Comment 7 Adrian Johnson 2013-11-04 11:02:43 UTC

Created attachment 88598 [details] [review]
Use consistent line ending in PDFDoc

Use only '\n' in PDFDoc.

Comment 8 Adrian Johnson 2013-11-10 07:40:08 UTC

Created attachment 88957 [details] [review]
Add PDFWriter

Fix a bug.

Comment 9 Adrian Johnson 2013-11-10 07:41:50 UTC

Created attachment 88958 [details] [review]
glib API for PDFWriter

Fix some bugs in glib api. I've tested this api with evince. https://bugzilla.gnome.org/show_bug.cgi?id=711761 contains the patches.

Comment 10 Carlos Garcia Campos 2013-11-10 14:47:33 UTC

Comment on attachment 88957 [details] [review]
Add PDFWriter

Review of attachment 88957 [details] [review]:
-----------------------------------------------------------------

This looks great to me. There are several memory leaks and some other things that could be improved a bit.

::: poppler/PDFWriter.cc
@@ +62,5 @@
> +  pages.push_back(page);
> +}
> +
> +struct PageScale {
> +  int paper;

This can be unsigned

@@ +104,5 @@
> +    std::swap(width, height);
> +
> +  // find smallest paper size that will fit this page
> +  std::priority_queue<PageScale> queue;
> +  for (int i = 0; i < (int)paperSizes.size(); i++) {

Why not using unsigned for i instead of casting?

@@ +186,5 @@
> +      mediaSize->y2 = paperSizes[0].width;
> +      margins->x1 = paperSizes[0].bottom;
> +      margins->x2 = mediaSize->y2 - paperSizes[0].top;
> +      margins->y1 = paperSizes[0].left;
> +      margins->y2 = mediaSize->x2 - paperSizes[0].right;

This block is repeated a lot, I wonder if we could add a helper function, something like setSizeAndMargins(PDFRectangle *mediaSize, PDFRectangle *margins, double with, double height, double top, double bottom, double left, double right); I could be static inline

@@ +376,5 @@
> +void PDFWriter::writePageObject(int pageNum, int copy, PDFRectangle *mediaSize)
> +{
> +  PDFRectangle margins;
> +  Matrix ctm;
> +  Object ctmObj;

This looks unused.

@@ +432,5 @@
> +      obj2.free();
> +    }
> +  } else {
> +    writeObject(&contentsObj);
> +  }

contentsObj.free();

@@ +445,5 @@
> +	outputKey = gFalse;
> +	break;
> +      }
> +      excludeKey++;
> +    }

I think this could be easier to read if you use a helper function to check whether the entry should be excluded:

if (excludePageEntry(keyName))
  continue;

@@ +453,5 @@
> +      writeObject(pageDict->getValNF(i, &obj1));
> +      outputStr->printf("\n");
> +      obj1.free();
> +    }
> +  }

pageObj.free();

@@ +490,5 @@
> +  outputStr->printf("/Subtype /Form\n");
> +
> +  pageDict->lookup("MediaBox", &obj);
> +  outputStr->printf("/BBox ");
> +  writeObject(&obj);

obj.free();

@@ +494,5 @@
> +  writeObject(&obj);
> +  outputStr->printf("\n");
> +
> +  pageDict->lookup("Resources", &obj);
> +  if (!obj.isNull()) {

This could be a single line:

if (!pageDict->lookup("Resources", &obj)->isNull())
...

@@ +498,5 @@
> +  if (!obj.isNull()) {
> +    outputStr->printf("/Resources ");
> +    writeObject(&obj);
> +    outputStr->printf("\n");
> +  }

obj.free();

@@ +501,5 @@
> +    outputStr->printf("\n");
> +  }
> +
> +  pageDict->lookup("Group", &obj);
> +  if (!obj.isNull()) {

This could be a single line too.

@@ +505,5 @@
> +  if (!obj.isNull()) {
> +    outputStr->printf("/Group ");
> +    writeObject(&obj);
> +    outputStr->printf("\n");
> +  }

obj.free();

@@ +552,5 @@
> +    Stream *str = obj.getStream();
> +    writeStream(str);
> +  } else {
> +    error(errSyntaxError, -1, "Weird page contents");
> +  }

obj.free();

@@ +586,5 @@
> +  outputStr->printf("endobj\n");
> +
> +  beginIndirectObject(&resourcesRef);
> +  outputStr->printf("<< /XObject <<\n");
> +  for (int i = 0; i < (int)xobjectRefs.size(); i++)

You could use unsigned for i here too.

@@ +592,5 @@
> +  outputStr->printf(">> >>\n");
> +  outputStr->printf("endobj\n");
> +
> +  GooString content;
> +  for (int i = 0; i < (int)xobjectRefs.size(); i++) {

Ditto.

@@ +632,5 @@
> +  outputStr->printf("/Kids [");
> +  if (collate) {
> +    for (int cp = 0; cp < copies; cp++) {
> +      if (reverse) {
> +        for (int pg = (int)pageRefs[0].size() - 1; pg >= 0; pg--)

Ditto.

@@ +635,5 @@
> +      if (reverse) {
> +        for (int pg = (int)pageRefs[0].size() - 1; pg >= 0; pg--)
> +          outputStr->printf(" %d %d R", pageRefs[cp][pg].num, pageRefs[cp][pg].gen);
> +      } else {
> +        for (int pg = 0; pg < (int)pageRefs[0].size(); pg++)

Ditto.

@@ +641,5 @@
> +      }
> +    }
> +  } else {
> +    if (reverse) {
> +      for (int pg = (int)pageRefs[0].size() - 1; pg >= 0; pg--) {

Ditto.

@@ +646,5 @@
> +        for (int cp = 0; cp < copies; cp++)
> +          outputStr->printf(" %d %d R", pageRefs[cp][pg].num, pageRefs[cp][pg].gen);
> +      }
> +    } else {
> +      for (int pg = 0; pg < (int)pageRefs[0].size(); pg++)

Ditto.

@@ +662,5 @@
> +  PDFRectangle mediaSize;
> +  int outputPageNum;
> +  int nupPageNum; // 0..numberUp-1
> +  int pageNum;
> +  Object pageObj;

This is unused.

@@ +677,5 @@
> +  // are not marked as they will be included in the XObjects created for
> +  // each page.
> +  outputPageNum = 1;
> +  nupPageNum = 0;
> +  for (int i = 0; i < (int)pages.size(); i++) {

Ditto.

@@ +686,5 @@
> +      Object pageObj;
> +      Ref *refPage = doc->getCatalog()->getPageRef(pageNum);
> +      doc->getXRef()->fetch(refPage->num, refPage->gen, &pageObj);
> +      Dict *pageDict = pageObj.getDict();
> +      doc->markPageObjects(pageDict, yRef, countRef, 0, numberUp > 1 ? gFalse : gTrue);

You don't need a variable for this, you can just pass the dict to the function:

doc->markPageObjects(pageObj.getDict(), ....);

and then

pageObj.free();

@@ +711,5 @@
> +    pageRefs.push_back(refs);
> +  }
> +  outputPageNum = 1;
> +  nupPageNum = 0;
> +  for (int i = 0; i < (int)pages.size(); i++) {

unsigned

@@ +756,5 @@
> +  delete trailerDict;
> +  delete yRef;
> +  delete countRef;
> +
> +  outputStr->close();

delete outputStr;

@@ +757,5 @@
> +  delete yRef;
> +  delete countRef;
> +
> +  outputStr->close();
> +}

I think we could make this function more generic by writing a stream, and then add writeFile, that creates a FileOutpoutStream and calls the generic one passing the stream. That way frontends can add a method to write to a stream instead of creating a file in disk.

::: poppler/PDFWriter.h
@@ +27,5 @@
> +
> +  // order to layout multiple pages on a sheet.
> +  // LR_TB = Left to Right then Top to Bottom
> +  // LR_BT = Left to Right then Bottom to Top etc
> +  enum NumberUpOrder { LR_TB, LR_BT, RL_TB, RL_BT, TB_LR, TB_RL, BT_LR, BT_RL };

Even if it's a bit longer I think it's easier to read if you use full words instead of abbreviations, and you don't need the comment explaining what the abbreviations stand for.

::: poppler/Stream.cc
@@ +371,5 @@
> +void OutStream::format(const char *format, ...)
> +{
> +  va_list argptr;
> +  va_start (argptr, format);
> +  GooString *s = GooString::formatv(format, argptr);

Can we do something like this?

GooString s;
s.appendfv(format, argptr);

This way we avoid the alloc/dealloc of the GooString.

Comment 11 Adrian Johnson 2013-11-11 12:02:03 UTC

Created attachment 89020 [details] [review]
PDFWriter changes to address review comments

I've addressed most of the review comments and have attached a patch with the changes. I'll wait for the other patches to be reviewed before rolling this patch back into the original PDFWriter patch so I only have to rebase all the patches once.

The comments that I didn't fix are:

1) 

>@@ +186,5 @@
>> +      mediaSize->y2 = paperSizes[0].width;
>> +      margins->x1 = paperSizes[0].bottom;
>> +      margins->x2 = mediaSize->y2 - paperSizes[0].top;
>> +      margins->y1 = paperSizes[0].left;
>> +      margins->y2 = mediaSize->x2 - paperSizes[0].right;
>
>This block is repeated a lot, I wonder if we could add a helper function,
> something like setSizeAndMargins(PDFRectangle *mediaSize, PDFRectangle
> *margins, double with, double height, double top, double bottom, double left, 
> double right); I could be static inline

The problem I had with this is we will have a function with six parameters. When looking at a call to this function it will be hard to keep track which parameter does what. In the example quoted above the parameter list will end up being quite long so I don't see the value in turning it into a function.

2) signed vs unsigned
> > +  for (int i = 0; i < (int)paperSizes.size(); i++) {
> Why not using unsigned for i instead of casting?

I prefer to always use signed integers unless I am doing modular arithmetic or bit manipulation. Mixing signed and unsigned can produce unexpected results. So it is better to just use signed all the time. The cast adds an extra 5 characters but then so does replacing 'int' with 'unsigned'.

>> +   for (int pg = (int)pageRefs[0].size() - 1; pg >= 0; pg--) {
> Ditto.

If this is changed to unsigned the loop never exits.

Comment 12 Albert Astals Cid 2013-11-14 23:04:24 UTC

(In reply to comment #7)
> Created attachment 88598 [details] [review] [review]
> Use consistent line ending in PDFDoc
> 
> Use only '\n' in PDFDoc.

I need an english native speaker here :D

************
The carriage return (CR) and line feed (LF) characters, also called newline
characters, are treated as end-of-line (EOL) markers. The combination of a
carriage return followed immediately by a line feed is treated as one EOL marker.
For the most part, EOL markers are treated the same as any other white-space
characters. However, sometimes an EOL marker is required or recommended—
that is, the following token must appear at the beginning of a line.
************

Is it saying that only a CRLF is a EOL marker or that all CR, LF or CRLF are 1 EOL marker?

Comment 13 Albert Astals Cid 2013-11-14 23:09:45 UTC

(In reply to comment #2)
> Created attachment 88593 [details] [review] [review]
> Add DeflateStream
> 
> This patch adds a DeflateStream output stream filter for compressing the
> XObject streams in PDFWriter.

You are missing the CMake part and also you need to fail safely if zlib is not avaliable, no?

Comment 14 Albert Astals Cid 2013-11-14 23:15:54 UTC

Does this allow to write a pdf totally uncompressed? It's really helpful when you want to debug some files and at the moment i have a nasty patch around that does it, but it'd be cool if pdftopdf does it :=)

Comment 15 James Cloos 2013-11-14 23:40:10 UTC

>> The carriage return (CR) and line feed (LF) characters, also called newline
>> characters, are treated as end-of-line (EOL) markers. The combination of a
>> carriage return followed immediately by a line feed is treated as one EOL marker.

b> Is it saying that only a CRLF is a EOL marker or that all CR, LF or
b> CRLF are 1 EOL marker?

The latter.

Comment 16 Adrian Johnson 2013-11-17 11:33:12 UTC

(In reply to comment #14)
> Does this allow to write a pdf totally uncompressed? It's really helpful
> when you want to debug some files and at the moment i have a nasty patch
> around that does it, but it'd be cool if pdftopdf does it :=)

It should be easy to add this option. I'll look at it after I get the embedding of non-embedded fonts working as this is required for printing.

Comment 17 Adrian Johnson 2013-11-18 12:40:14 UTC

Created attachment 89404 [details] [review]
Font embedding

This patch adds support for embedding non-embedded fonts. Currently it only supports Type1 fonts.

The remaining issues I'm still working on are:
 - embedding of other font types
 - disable DeflateStream when zlib not available
 - cmake support for DeflateStream
 - pdftopdf option to uncompress streams

Comment 18 Marek Kasik 2013-11-25 16:18:55 UTC

Created attachment 89762 [details] [review]
Use of multiple PDFs by single PDFWriter

Hi,

regarding https://bugs.freedesktop.org/show_bug.cgi?id=69965#c5, I've added ability to combine multiple PDFs in one PDFWriter. It is done mainly by change of addition of pages. They are added as "Page *page" in "addPage(Page *page)" now instead of "int page". It is also possible to add whole documents by "addDoc(PDFDoc *doc)".
I've added corresponding glib API.

Regards

Marek

Comment 19 Adrian Johnson 2013-11-25 20:56:42 UTC

(In reply to comment #18)
>   for (int i = 0; i < (int)pages.size(); i++) {
>-    pageNum = pages[i];
>+    pageNum = pages[i]->getNum();
>+    doc = pages[i]->getDoc();
>     if (pageSet == ALL ||
>         (pageSet == ODD && outputPageNum % 2 == 1) ||
>         (pageSet == EVEN && outputPageNum % 2 == 0)) {
>       Object pageObj;
>       Ref *refPage = doc->getCatalog()->getPageRef(pageNum);
>       doc->getXRef()->fetch(refPage->num, refPage->gen, &pageObj);
>-      markPageObjects(pageObj.getDict(), numberUp > 1 ? gFalse : gTrue);
>+      markPageObjects(doc, pageObj.getDict(), numberUp > 1 ? gFalse : gTrue);
>       if (fontEmbedRequired)
>-        replaceNonEmbeddedFonts(pageObj.getDict(), &pageObj, *refPage);
>+        replaceNonEmbeddedFonts(doc, pageObj.getDict(), &pageObj, *refPage);
>       pageObj.free();
>     }

I don't see how this works. The objects from each document will be mixed together and overwrite each other. All pages from each document need to be marked and written separately with an offset applied to the object numbering of each document to prevent the object numbers from clashing.

I have already updated my patches to include multiple documents but am still reworking the font embedding patch. I wanted to get it all working before reposting but I'll go ahead and post what I already have working.

Comment 20 Adrian Johnson 2013-11-25 20:58:16 UTC

Created attachment 89774 [details] [review]
1 - Add PDFWriter

Comment 21 Adrian Johnson 2013-11-25 20:58:53 UTC

Created attachment 89775 [details] [review]
2 - pdftopdf test program

Comment 22 Adrian Johnson 2013-11-25 20:59:28 UTC

Created attachment 89776 [details] [review]
3 - Add DeflateStream

Comment 23 Adrian Johnson 2013-11-25 21:00:03 UTC

Created attachment 89777 [details] [review]
4 - glib API for PDFWriter

Comment 24 Adrian Johnson 2013-11-25 21:00:56 UTC

Created attachment 89778 [details] [review]
5 - encrypted document fix

Comment 25 Adrian Johnson 2013-11-25 21:01:34 UTC

Created attachment 89779 [details] [review]
6 - compress decrypted streams

Comment 26 Adrian Johnson 2013-11-25 21:02:07 UTC

Created attachment 89780 [details] [review]
7 - fix typo

Comment 27 Adrian Johnson 2013-11-25 21:02:46 UTC

Created attachment 89781 [details] [review]
8 - Use consistent line ending in PDFDoc

Comment 28 Adrian Johnson 2013-11-25 21:03:15 UTC

Created attachment 89782 [details] [review]
9 - gitignore

Comment 29 Adrian Johnson 2013-11-25 21:08:00 UTC

Created attachment 89784 [details] [review]
10 multiple documents

The glib API in this patch needs more work. It looks like the glib API from Marek's patch could be used.

I'm not sure whether the addPage should take PopplerPage or a page number. I avoided using PopplerPage as I thought this would be less efficient.

Comment 30 Adrian Johnson 2013-11-25 21:10:05 UTC

I'm still reworking the font embedding patch so the current patch will not apply. I'm also thinking of changing the embed fonts boolean to a choice of none/all/only embed non standard14 fonts.

Comment 31 Marek Kasik 2013-11-29 13:02:52 UTC

(In reply to comment #19)
> (In reply to comment #18)
> >   for (int i = 0; i < (int)pages.size(); i++) {
> >-    pageNum = pages[i];
> >+    pageNum = pages[i]->getNum();
> >+    doc = pages[i]->getDoc();
> >     if (pageSet == ALL ||
> >         (pageSet == ODD && outputPageNum % 2 == 1) ||
> >         (pageSet == EVEN && outputPageNum % 2 == 0)) {
> >       Object pageObj;
> >       Ref *refPage = doc->getCatalog()->getPageRef(pageNum);
> >       doc->getXRef()->fetch(refPage->num, refPage->gen, &pageObj);
> >-      markPageObjects(pageObj.getDict(), numberUp > 1 ? gFalse : gTrue);
> >+      markPageObjects(doc, pageObj.getDict(), numberUp > 1 ? gFalse : gTrue);
> >       if (fontEmbedRequired)
> >-        replaceNonEmbeddedFonts(pageObj.getDict(), &pageObj, *refPage);
> >+        replaceNonEmbeddedFonts(doc, pageObj.getDict(), &pageObj, *refPage);
> >       pageObj.free();
> >     }
> 
> I don't see how this works. The objects from each document will be mixed
> together and overwrite each other. All pages from each document need to be
> marked and written separately with an offset applied to the object numbering
> of each document to prevent the object numbers from clashing.

I checked just order of pages, I forgot to look at individual streams.


> I have already updated my patches to include multiple documents but am still
> reworking the font embedding patch. I wanted to get it all working before
> reposting but I'll go ahead and post what I already have working.

Thank you.

Comment 32 Albert Astals Cid 2013-12-12 21:32:26 UTC

Patches still not finished, right?

Comment 33 Adrian Johnson 2013-12-12 21:45:41 UTC

(In reply to comment #32)
> Patches still not finished, right?

Yes, still working on it.

Comment 34 Marek Kasik 2013-12-13 09:02:23 UTC

*** Bug 69965 has been marked as a duplicate of this bug. ***

Comment 35 Bastien Nocera 2015-01-06 16:37:10 UTC

(In reply to Adrian Johnson from comment #33)
> (In reply to comment #32)
> > Patches still not finished, right?
> 
> Yes, still working on it.

Hey Adrian, are you still working on it? This could would be very useful to write a PDF modification app (similar to PDFMod or PDF-shuffler).

Comment 36 Adrian Johnson 2015-01-07 06:19:12 UTC

(In reply to Bastien Nocera from comment #35)
> Hey Adrian, are you still working on it? This could would be very useful to
> write a PDF modification app (similar to PDFMod or PDF-shuffler).

Yes I am intending to finish this. Just need to find the time to work on it.

Comment 37 Till Kamppeter 2016-04-01 23:51:23 UTC

Adrian, any chance that you finish this? It would be great for using in cups-filters to flatten out filled PDF forms.

See https://bugs.linuxfoundation.org/show_bug.cgi?id=1315

Comment 38 Adrian Johnson 2016-04-02 02:16:19 UTC

Yes, I still intend to finish this.

Comment 39 Till Kamppeter 2017-08-13 23:28:43 UTC

Really still any plan to finish this, Adrian? There was no new comment for more than a year.

Comment 40 Adrian Johnson 2017-08-14 22:35:21 UTC

I'll revisit these patches. Maybe it would be easier to incrementally merge this functionality instead of trying to get everything working all at once. Last I recall I was trying to get the embedding of non-embedded fonts working. But this feature could be added later.

Comment 41 Albert Astals Cid 2017-08-14 22:48:52 UTC

+1 merging in small steps is always a good idea.

Comment 42 GitLab Migration User 2018-08-20 22:03:22 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/174.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.