Bug 90066

Summary: pdfunite: embedded files discarded during merge
Product: poppler Reporter: Vovodroid <qa23d-vvd>
Component: utilsAssignee: poppler-bugs <poppler-bugs>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium    
Version: unspecified   
Hardware: All   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments: Test files
insert embedded files in result pdf
insert embedded files in result pdf
two embedded files
insert embedded files in result pdf
insert embedded files in result pdf
insert embedded files in result pdf

Description Vovodroid 2015-04-17 06:49:54 UTC
Created attachment 115149 [details]
Test files

When merging pdf file with embedded attachment it not present in resulted pdf (I use Acrobat Reader to see it).

Two files attached in zip: attach.pdf with embedded file, and merged without
Comment 1 Thomas Freitag 2015-04-17 13:33:12 UTC
Created attachment 115160 [details] [review]
insert embedded files in result pdf

This patch solves the issue with pdfunite
Comment 2 Thomas Freitag 2015-04-17 13:34:17 UTC
Created attachment 115161 [details] [review]
insert embedded files in result pdf

This patch solves the issue with pdfseparate
Comment 3 Vovodroid 2015-04-17 14:42:42 UTC
Created attachment 115163 [details]
two embedded files

I testes patch for pdfunite, it better now, thanks!

But attachment in first file is preserved, attachments in next files are discarded.
Comment 4 Vovodroid 2015-04-17 14:44:24 UTC
I tested patch for pdfunite, it better now, thanks!

But only attachment in first file is preserved, attachments in second file is discarded.
Comment 5 Thomas Freitag 2015-04-18 11:52:36 UTC
Created attachment 115174 [details] [review]
insert embedded files in result pdf

This patch merges the name trees in the catalog dicts correctly, so all embedded files remains in the result pdf created by pdfunite.

One hint to pdfseparate: since pdfseparate is not able to detect which embedded files are used by the extracted page, all embedded files are inserted in every separated pdf.
Comment 6 Vovodroid 2015-04-18 13:16:34 UTC
Great, thanks Thomas!

Can this fix be committed to git?
Comment 7 Jason Crain 2015-04-21 02:24:31 UTC
Comment on attachment 115174 [details] [review]
insert embedded files in result pdf

Review of attachment 115174 [details] [review]:
-----------------------------------------------------------------

::: utils/pdfunite.cc
@@ +42,5 @@
> +  Object srcNameArray;
> +  mergeNameTree->lookup("Names", &mergeNameArray);
> +  srcNameTree->lookup("Names", &srcNameArray);
> +  if (mergeNameArray.isArray() && srcNameArray.isArray()) {
> +    Object *newNameArray = new Object();

You are leaking newNameArray in a couple places.  srcNameTree->set/add will make a shallow copy so you shouldn't call newNameArray->free, but you still need to delete the Object.  Or use a stack variable.

@@ +61,5 @@
> +            if (mkey.getString()->cmp(key.getString()) < 0) {
> +              Object *newKey = new Object();
> +	      newKey->initString(new GooString(mkey.getString()->getCString()));
> +              newNameArray->arrayAdd(newKey);
> +              Object *newValue = new Object();

same, newKey and newValue are being leaked in a few places.  Object::arrayAdd makes a shallow copy.

@@ +249,5 @@
>      }
> +    catDict->lookup("Names", &names);
> +    if (!names.isNull() && names.isDict()) {
> +      docs[0]->markPageObjects(names.getDict(), yRef, countRef, 0, refPage->num, refPage->num);
> +    }

You are processing the first document twice, first here and again below.
Comment 8 Thomas Freitag 2015-04-21 08:02:54 UTC
(In reply to Jason Crain from comment #7)

Thanks for reviewing. I hadn't done a memory leak check, sorry, normally I do that. The next patch will remove the memory leaks.

BUT:
> @@ +249,5 @@
> >      }
> > +    catDict->lookup("Names", &names);
> > +    if (!names.isNull() && names.isDict()) {
> > +      docs[0]->markPageObjects(names.getDict(), yRef, countRef, 0, refPage->num, refPage->num);
> > +    }
> 
> You are processing the first document twice, first here and again below.

That's not true. The problem is probably the name "markPageObjects". But the first parameter is a dictionary, so probably a better name for this method should be markDictObjects. So here it marks every object referenced by the "Names"-dict in the catalog dict, and this is the only place where markPageObjects is called with it (in doMergeNameTree it will be called for the "Names"-dict of the othere pages!).
The name for the method "markPageObjects" has historical reasons. When I developped it first it was just called with the page dictionary!
Comment 9 Thomas Freitag 2015-04-21 08:07:19 UTC
Created attachment 115244 [details] [review]
insert embedded files in result pdf

This patch removes the memory leaks by my first patch. In additions it closes also two other small leaks in pdfunite.

Tested with valgrind
Comment 10 Vovodroid 2015-05-27 06:23:48 UTC
This patch works well, so any plans to merge patch to trunk?

Thanks.
Comment 11 Albert Astals Cid 2015-07-15 07:41:40 UTC
Thomas, don't
 GooString *newKey = new GooString("Names");
and
 GooString *newKey = new GooString(key);

leak too?
Comment 12 Albert Astals Cid 2015-07-15 07:44:18 UTC
Also is "This patch solves the issue with pdfseparate" unneeded?
Comment 13 Thomas Freitag 2015-07-20 10:11:46 UTC
(In reply to Albert Astals Cid from comment #11)
> Thomas, don't
>  GooString *newKey = new GooString("Names");
> and
>  GooString *newKey = new GooString(key);
> 
> leak too?

Sorry for the delay. I just looked into it again, and yes, in case of keys in a Dict the choice of "new GooString" is definitely wrong, it could not cause only a memory leak, in case of the static version of GooString it could also cause memory corruption. I have to use copyString in this cases. I wonder a little bit why valgrind didn't show me that problem.

I will upload a new patch in a few minutes.
Comment 14 Thomas Freitag 2015-07-20 10:24:46 UTC
(In reply to Albert Astals Cid from comment #12)
> Also is "This patch solves the issue with pdfseparate" unneeded?

Yes. At the beginning I uploaded two patches. These were combined in one patch after the first correction.
Comment 15 Thomas Freitag 2015-07-20 14:37:21 UTC
Created attachment 117257 [details] [review]
insert embedded files in result pdf

Use copyString where necessary
Comment 16 Albert Astals Cid 2015-08-27 21:02:41 UTC
Pushed

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.