Bug 99167

Summary: do not create duplicates of the same objects
Product: poppler
Reporter: Trevor Spiteri <tspiteri>
Component: utils
Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED MOVED
QA Contact:
Severity: enhancement    
Priority: medium    
Version: unspecified   
Hardware: Other   
OS: All   
Attachments: The pdf document used in the example output.

Description Trevor Spiteri 2016-12-21 13:58:31 UTC
Created attachment 128605
The pdf document used in the example output.

When I separate a PDF into pages and then unite it again, fonts are duplicated. The attached document a.pdf uses a regular font on page 1, and on page 2 it uses the same regular font as well as a bold font. After separating and uniting, the united file contains two copies of the regular font. It would be nice if the tools removed duplicate copies of identical fonts.

Details of an example are included below. Note that I removed the following columns from the pdffonts output, since their values were the same for all fonts:

type              encoding         emb sub uni
----------------- ---------------- --- --- ---
Type 1            Custom           yes yes no

File sizes and font information:

$ wc -c a.pdf; pdffonts a.pdf
14244 a.pdf
name                                 object ID
------------------------------------ ---------
ULOTVD+NimbusRomNo9L-Regu                 4  0
EBGCWF+NimbusRomNo9L-Medi                10  0

$ pdfseparate a.pdf b%d.pdf

$ wc -c b1.pdf; pdffonts b1.pdf
8404 b1.pdf
name                                 object ID
------------------------------------ ---------
ULOTVD+NimbusRomNo9L-Regu                 4  0

$ wc -c b2.pdf; pdffonts b2.pdf
15120 b2.pdf
name                                 object ID
------------------------------------ ---------
ULOTVD+NimbusRomNo9L-Regu                 4  0
EBGCWF+NimbusRomNo9L-Medi                10  0

$ pdfunite b1.pdf b2.pdf c.pdf

$ wc -c c.pdf; pdffonts c.pdf
22916 c.pdf
name                                 object ID
------------------------------------ ---------
ULOTVD+NimbusRomNo9L-Regu                 4  0
ULOTVD+NimbusRomNo9L-Regu                23  0
EBGCWF+NimbusRomNo9L-Medi                29  0

$ pdfseparate c.pdf d%d.pdf

$ wc -c d1.pdf; pdffonts d1.pdf
8061 d1.pdf
name                                 object ID
------------------------------------ ---------
ULOTVD+NimbusRomNo9L-Regu                 4  0

$ wc -c d2.pdf; pdffonts d2.pdf
14778 d2.pdf
name                                 object ID
------------------------------------ ---------
ULOTVD+NimbusRomNo9L-Regu                23  0
EBGCWF+NimbusRomNo9L-Medi                29  0

$ pdfunite d1.pdf d2.pdf e.pdf

$ wc -c e.pdf; pdffonts e.pdf
23296 e.pdf
name                                 object ID
------------------------------------ ---------
ULOTVD+NimbusRomNo9L-Regu                 4  0
ULOTVD+NimbusRomNo9L-Regu                42  0
EBGCWF+NimbusRomNo9L-Medi                48  0
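
The size figures in the transcript above can be compared directly. The following is a small sketch that only restates the `wc -c` numbers already shown; the `sizes` dictionary and the `growth` helper are names introduced here for illustration and are not part of any poppler tool.

```python
# File sizes in bytes, copied from the `wc -c` output in the transcript.
sizes = {
    "a.pdf": 14244,
    "b1.pdf": 8404,
    "b2.pdf": 15120,
    "c.pdf": 22916,
    "d1.pdf": 8061,
    "d2.pdf": 14778,
    "e.pdf": 23296,
}


def growth(original, reunited):
    """Return how many bytes larger `reunited` is than `original`."""
    return sizes[reunited] - sizes[original]


for name in ("c.pdf", "e.pdf"):
    print(f"{name} is {growth('a.pdf', name)} bytes larger than a.pdf")
```

Each reunited file is several kilobytes larger than a.pdf. That is consistent with a second copy of the regular font being kept, although other duplicated per-page objects may also contribute to the difference.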

Comment 1 Thomas Freitag 2017-01-06 14:50:02 UTC
You know that the fonts are identical! But if you can provide me with a good algorithm that compares two embedded fonts and returns whether they are identical, i.e.

- the encoding is the same
- all glyphs needed for both usages of the fonts are included
- all glyphs have the same character width and height and look the same

and convince me that your algorithm works in all use cases, I will think about providing a patch for pdfunite.
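
One conservative way to meet the three criteria above is to treat two fonts as identical only when their encoding, width data and embedded font program are byte-for-byte equal. The sketch below illustrates that idea. It is not poppler code and is not what pdfunite does: `EmbeddedFont`, `fingerprint` and `dedupe_map` are hypothetical names, and the sketch assumes the font data has already been extracted from the PDF. It will miss fonts that are visually equivalent but were subsetted or serialized differently, so it can only under-merge, never over-merge (barring a SHA-256 collision).

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddedFont:
    """Hypothetical stand-in for a font object already read from a PDF."""
    object_id: int
    base_font: str    # e.g. "ULOTVD+NimbusRomNo9L-Regu"
    encoding: bytes   # serialized encoding entry (assumed canonical form)
    first_char: int
    widths: tuple     # glyph widths, one per character code
    program: bytes    # decoded embedded font program


def fingerprint(font):
    """Hash every field that affects rendering; equal bytes give equal hashes."""
    parts = (
        font.base_font.encode("utf-8"),
        font.encoding,
        repr((font.first_char, font.widths)).encode("ascii"),
        font.program,
    )
    digest = hashlib.sha256()
    for part in parts:
        # Length-prefix each part so field boundaries cannot be confused.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


def dedupe_map(fonts):
    """Map each object ID to the ID of the first font with the same fingerprint."""
    first_seen = {}
    remap = {}
    for font in fonts:
        key = fingerprint(font)
        remap[font.object_id] = first_seen.setdefault(key, font.object_id)
    return remap
```

With made-up data mirroring the object IDs reported for c.pdf (4, 23 and 29), two byte-identical regular fonts would map to object 4 and the bold font would keep its own ID. Whether the two copies in a real united file are in fact byte-identical is an assumption that would need to be checked against the actual streams.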

Comment 2 GitLab Migration User 2018-08-20 21:47:56 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/76.
