Bug 11612 - pack as many glyphs as possible in each cairo_show_glyphs() call
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: cairo backend
Version: unspecified
Hardware: Other All
Importance: medium major
Assignee: Kristian Høgsberg
QA Contact: cairo-bugs mailing list
URL: http://ousia.iespana.es/pdf/tesis.utf...
Reported: 2007-07-15 10:09 UTC by Pablo Rodríguez
Modified: 2009-08-04 22:02 UTC

Description Pablo Rodríguez 2007-07-15 10:09:49 UTC
Hi there,

I'm using cairo to print PDF documents (through evince), and I'm afraid that it generates PDF documents twice as big as the originals.

At the URL you will find a document generated using dvipdfmx; the cairo-generated version is almost twice as big as the original.

I guess the cairo-generated document must include superfluous or redundant information, so there might be room for improvement in the PDF generation process.

I hope it helps,


Pablo
Comment 1 Kristian Høgsberg 2007-07-16 12:28:51 UTC
We don't actually use cairo for printing in poppler, but it would be a good idea.  Not sure how the sizes are going to compare in that case, but it's worth a try.  I'm moving this bug to poppler, but it's probably as much an evince bug/feature.
Comment 2 Pablo Rodríguez 2007-07-16 12:49:45 UTC
Thanks for the answer, Kristian.

Are you sure that evince-0.9.2 (with poppler-0.5.9 + cairo-backend) doesn't use cairo for printing to PDF files?

I printed a dissertation using evince and the output is 4.5 times bigger than the original file (this is the worst case I've seen):

$ ls -lh output.pdf TesisRosa-B5-Uni.pdf 
-rw-r--r-- 1 ousia ousia 6,3M Jul 16 21:43 output.pdf
-rw-r--r-- 1 ousia ousia 1,4M Dec  1  2006 TesisRosa-B5-Uni.pdf

The PDF document shows that the creator and generator is cairo-1.4.10:

$ pdfinfo output.pdf 
Creator:        cairo 1.4.10 (http://cairographics.org)
Producer:       cairo 1.4.10 (http://cairographics.org)
Tagged:         no
Pages:          366
Encrypted:      no
Page size:      595.276 x 841.89 pts (A4)
File size:      6581079 bytes
Optimized:      no
PDF version:    1.4

And those CairoFont entries must have been renamed by cairo itself.

$ pdffonts output.pdf 
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
CairoFont-8-0                        CID TrueType      yes no  yes   1105  0
NimbusRomanNo9L                      Type 1            yes no  yes   1110  0
NimbusSansL                          Type 1            yes no  yes   1115  0
NimbusRomanNo9L                      Type 1            yes no  yes   1120  0
CairoFont-0-0                        CID TrueType      yes no  yes   1126  0
CairoFont-1-0                        CID TrueType      yes no  yes   1132  0
NimbusSansL                          Type 1            yes no  yes   1137  0
CairoFont-4-0                        CID TrueType      yes no  yes   1143  0
NimbusSansL                          Type 1            yes no  yes   1148  0

Doesn't cairo generate the PDF file when printing from evince to a PDF file?

Thanks for your help,


Pablo
Comment 3 Kristian Høgsberg 2007-07-16 15:31:51 UTC
Oh, I guess I'm not up to date on things here.  It's good that evince/poppler uses the cairo backend for printing but the output is pretty big.  Cairo used to output a lot of code for regular text, but that's been optimized since.

I'm not sure what's wrong here, and I haven't worked with the cairo code in a while... Carl, maybe someone else should be the owner of PDF backend bugs?
Comment 4 Adrian Johnson 2007-07-16 17:24:27 UTC
> I printed a dissertation using evince and the output is 4.5 times bigger than the
> original file (this is the worst case I've seen):
> 
> $ ls -lh output.pdf TesisRosa-B5-Uni.pdf 
> -rw-r--r-- 1 ousia ousia 6,3M Jul 16 21:43 output.pdf
> -rw-r--r-- 1 ousia ousia 1,4M Dec  1  2006 TesisRosa-B5-Uni.pdf
> 
> The PDF document shows that the creator and generator is cairo-1.4.10:

Could you provide a link to the original PDF and the output from cairo?

> And those CairoFont entries must have been renamed by cairo itself.
> 
> $ pdffonts output.pdf 
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- ---------
> CairoFont-8-0                        CID TrueType      yes no  yes   1105  0
> NimbusRomanNo9L                      Type 1            yes no  yes   1110  0
> NimbusSansL                          Type 1            yes no  yes   1115  0
> NimbusRomanNo9L                      Type 1            yes no  yes   1120  0
> CairoFont-0-0                        CID TrueType      yes no  yes   1126  0
> CairoFont-1-0                        CID TrueType      yes no  yes   1132  0
> NimbusSansL                          Type 1            yes no  yes   1137  0
> CairoFont-4-0                        CID TrueType      yes no  yes   1143  0
> NimbusSansL                          Type 1            yes no  yes   1148  0
> 

Looks like what is happening here is that the embedded TrueType fonts in the original PDF have had the "name" tables stripped out during subsetting. The font names exist only in the PDF font dictionaries. When cairo embeds the font in the new PDF there is no fontname available in the font so the CairoFont-x-y name is used instead.
Comment 5 Pablo Rodríguez 2007-07-17 11:54:36 UTC
(In reply to comment #4)
> > I printed a dissertation using evince and the output is 4.5 times bigger than the
> > original file (this is the worst case I've seen):
> > 
> > $ ls -lh output.pdf TesisRosa-B5-Uni.pdf 
> > -rw-r--r-- 1 ousia ousia 6,3M Jul 16 21:43 output.pdf
> > -rw-r--r-- 1 ousia ousia 1,4M Dec  1  2006 TesisRosa-B5-Uni.pdf
> > 
> > The PDF document shows that the creator and generator is cairo-1.4.10:
> 
> Could you provide a link to the original PDF and the output from cairo?

You can find the original PDF at http://ousia.en.eresmas.com/TesisRosa-B5-Uni.pdf and the output from cairo at http://ousia.en.eresmas.com/output.pdf.

Please, those files are released for testing purposes only. As soon as the files have been checked I would appreciate a note to erase the files from the website.

Thanks for your work and your help,


Pablo
Comment 6 Adrian Johnson 2007-07-17 15:43:30 UTC
(In reply to comment #5)
> and the output from cairo at
> http://ousia.en.eresmas.com/output.pdf.

I get a page not found on this file.

Comment 7 Pablo Rodríguez 2007-07-18 08:57:15 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > and the output from cairo at
> > http://ousia.en.eresmas.com/output.pdf.
> 
> I get a page not found on this file.

Sorry, you can find it at http://ousia.iespana.es/pdf/output.pdf.

At least it works for me now.


Pablo
Comment 8 Adrian Johnson 2007-07-18 14:49:10 UTC
What's happening here is that evince is calling cairo_show_glyphs() with one glyph at a time. Each time cairo_show_glyphs() is called, the PDF output selects the pattern, selects the font, and initializes the text matrix. The result is about 110 bytes of overhead per glyph before compression.
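
To make the scale of that overhead concrete, here is a rough back-of-the-envelope model. It is only a sketch: the 110 bytes of setup per call is the figure quoted above, while the 4 bytes per glyph and the glyph counts are assumptions chosen for illustration, not measurements from the test file.

```python
# Rough model of uncompressed content-stream size for text output.
SETUP_BYTES_PER_CALL = 110  # figure quoted above: pattern + font + text matrix
BYTES_PER_GLYPH = 4         # assumption: a 2-byte glyph index written as hex


def stream_bytes(num_glyphs, glyphs_per_call):
    """Estimate bytes emitted when glyphs are sent in calls of a given size."""
    calls = -(-num_glyphs // glyphs_per_call)  # ceiling division
    return calls * SETUP_BYTES_PER_CALL + num_glyphs * BYTES_PER_GLYPH


# 1000 glyphs, one glyph per call versus 50 glyphs per call.
print(stream_bytes(1000, 1))   # 114000
print(stream_bytes(1000, 50))  # 6200
```

Under these assumptions the per-call setup dominates the output as soon as each call carries only one glyph.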
Comment 9 Pablo Rodríguez 2007-07-19 10:39:18 UTC
Sorry, but I'm not a developer and I don't understand the issue.

Is this a bug in cairo or in the way evince invokes cairo?

Thanks for your help,


Pablo
Comment 10 Behdad Esfahbod 2007-09-25 14:17:40 UTC
Adrian, reassign to Poppler and let Jeff deal with it then?
Comment 11 Adrian Johnson 2007-09-25 14:27:00 UTC
(In reply to comment #10)
> Adrian, reassign to Poppler and let Jeff deal with it then?
> 

Reassigning to poppler. Poppler should pack as many glyphs as possible into each show_glyphs() call to get efficient PS/PDF output from cairo.
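
As an illustration of what packing could look like, the sketch below groups consecutive glyphs that share the same drawing state, so that each group could be passed to a single show_glyphs() call. This is not poppler's code: the tuple layout and the pack_runs name are hypothetical.

```python
from itertools import groupby


def pack_runs(glyphs):
    """Group consecutive glyphs that share font, size and color.

    Each glyph is a hypothetical tuple (font, size, color, glyph_id, x, y).
    Returns a list of (state, [(glyph_id, x, y), ...]) runs; each run is a
    candidate for one show_glyphs() call.
    """
    runs = []
    for state, group in groupby(glyphs, key=lambda g: g[:3]):
        runs.append((state, [g[3:] for g in group]))
    return runs


glyphs = [
    ("A", 10, "black", 1, 0, 0),
    ("A", 10, "black", 2, 5, 0),
    ("B", 10, "black", 3, 10, 0),
    ("A", 10, "black", 4, 15, 0),
]
for state, run in pack_runs(glyphs):
    print(state, len(run))
```

Only consecutive glyphs are merged, so drawing order is preserved; the last glyph in font "A" starts a new run rather than joining the first one.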
Comment 12 Pablo Rodríguez 2007-12-21 08:20:20 UTC
Just in case it helps.

Using poppler-0.6.3, cairo-1.5.4 and evince 2.21.1, the resulting file is even bigger:

ls -lh otput.pdf TesisRosa-B5-Uni.pdf 
-rw-r--r-- 1 ousia guest 6,6M 2007-12-21 17:11 otput.pdf
-rw-r--r-- 1 ousia guest 1,4M 2007-12-21 16:54 TesisRosa-B5-Uni.pdf


Pablo
Comment 13 Pablo Rodríguez 2008-02-03 13:14:27 UTC
Updating summary for a more accurate description.

This seems to be required for a more efficient PS/PDF generation from cairo.
Comment 14 Behdad Esfahbod 2008-02-04 13:00:08 UTC
(In reply to comment #13)
> Updating summary for a more accurate description.
> 
> This seems to be required for a more efficient PS/PDF generation from cairo.

Both that, and it increases viewing performance too.
Comment 15 Adrian Johnson 2008-02-04 13:33:04 UTC
One of the problems with calling cairo_show_glyphs() with one glyph at a time is that overlapping transparent glyphs within a text object will composite with each other, which is wrong when text knockout (TK) is true (the default). Poppler needs to call cairo_show_glyphs() with all glyphs in the text object to ensure that the glyphs do not composite with each other. If TK is false, poppler needs to call cairo_show_glyphs() with one glyph at a time.
Comment 16 Adrian Johnson 2008-02-04 14:27:29 UTC
(In reply to comment #15)
> One of the problems with calling cairo_show_glyphs() with one glyph at a time
> is that when text knockout is true (the default) overlapping transparent glyphs
> in each text object will composite with each other. Poppler needs to call
> cairo_show_glyphs() with all glyphs in the text object to ensure that the
> glyphs do not composite with each other. If TK is false poppler needs to call
> cairo_show_glyphs() with one glyph at a time.
> 

I should also add that PDF can change the font, font scale, and maybe other graphics state (the PDF reference is not clear on this) inside a text object. As cairo only supports cairo_show_glyphs() with the same font, font scale, and pattern, poppler should, when TK=true, draw all text in a group with CAIRO_OPERATOR_SOURCE and then paint the group onto the page. This would be slower and would result in image fallbacks when printing, so poppler should only do this if it is not possible to draw all the text in the text object with a single cairo_show_glyphs() call.
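
The decision described in the last two comments can be summarized as a small function. This is only a sketch of the reasoning, not poppler code: choose_strategy, the strategy labels, and the (font, scale, pattern) tuples are hypothetical names introduced for illustration.

```python
def choose_strategy(text_knockout, glyph_states):
    """Pick a drawing strategy for one PDF text object.

    glyph_states holds one hypothetical (font, scale, pattern) tuple per
    glyph in the text object.
    """
    if not text_knockout:
        # TK false: glyphs are expected to composite with each other.
        return "one-glyph-per-call"
    if len(set(glyph_states)) <= 1:
        # TK true and uniform state: a single call draws the whole object.
        return "single-call"
    # TK true but the state changes inside the text object: draw into a
    # group with the SOURCE operator, then paint the group (slower, and
    # may cause image fallbacks when printing).
    return "group-with-source-operator"


uniform = [("F1", 10.0, "black")] * 3
mixed = uniform + [("F2", 10.0, "black")]
print(choose_strategy(True, uniform))
print(choose_strategy(True, mixed))
print(choose_strategy(False, mixed))
```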
Comment 17 Adrian Johnson 2008-06-04 07:22:21 UTC
I've committed some changes to cairo that pack glyphs from multiple calls to show_glyphs into a single string. Using your test case, the changes to the PDF output size are as follows:

Before:
 1.4M  TesisRosa-B5-Uni.pdf
 6.3M  output.pdf

After:
 1.4M  TesisRosa-B5-Uni.pdf
 1.8M  output.pdf
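
For reference, the ratios implied by those rounded sizes can be computed as below. The size_ratio helper is just an illustrative name; the inputs are the megabyte figures listed above, so the results are approximate.

```python
def size_ratio(output_mb, original_mb):
    """Return how many times larger the output is than the original."""
    return round(output_mb / original_mb, 2)


print(size_ratio(6.3, 1.4))  # before the change: 4.5
print(size_ratio(1.8, 1.4))  # after the change: 1.29
```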
Comment 18 Pablo Rodríguez 2008-06-04 10:39:39 UTC
Many thanks for the improvement, Adrian.

Just out of curiosity (and I'm no PDF expert, so I don't know whether the following question is nonsense), wouldn't it be possible that poppler/cairo generates a smaller PDF document than the original one?

Thanks again for your excellent work,


Pablo 
Comment 19 Adrian Johnson 2008-06-05 03:45:33 UTC
(In reply to comment #18)
> Many thanks for the improvement, Adrian.
> 
> Just out of curiosity (and I'm no PDF expert, so I don't know whether the
> following question is nonsense), wouldn't it be possible that poppler/cairo
> generates a smaller PDF document than the original one?

Only if the original PDF creator was very inefficient in the way it generated the PDF and the particular inefficiencies are things that cairo can optimize - such as the string merging that I recently committed.

Generally you are going to see an increase in size when doing PDF->Poppler->Cairo->PDF. Of course we would like to keep the increase as small as possible, and there are a few more optimizations that can be done to further reduce the size without losing any information. For example, keeping JPEG images in JPEG format is one such optimization planned for cairo.

But converting a PDF to PDF is generally not an interesting operation. You already have the file in PDF format. If you are looking to further reduce the size of your PDF there are specialist tools for processing PDF files that can do this.
Comment 20 Carlos Garcia Campos 2009-08-04 10:03:54 UTC
Is this still valid? Poppler uses a glyph array for every string, and cairo produces smaller PDF output files now. I think we can just close this
Comment 21 Carl Worth 2009-08-04 12:56:35 UTC
(In reply to comment #5) 
> You can find the original PDF at
> http://ousia.en.eresmas.com/TesisRosa-B5-Uni.pdf and the output from cairo at
> http://ousia.en.eresmas.com/output.pdf.
> 
> Please, those files are released for testing purposes only. As soon as the
> files have been checked I would appreciate a note to erase the files from the
> website.

Thanks very much for the bug report, Pablo. You can certainly remove those files from the website now.

(In reply to comment #20)
> Is this still valid? Poppler uses a glyph array for every string, and cairo
> produces smaller PDF output files now. I think we can just close this

It sure looks ready to close to me, so I'll go ahead and do that.

-Carl
Comment 22 Behdad Esfahbod 2009-08-04 22:02:15 UTC
I know at some point I noticed this bug, and at the time the poppler cairo code indeed seemed to do the right thing, but its higher level called it with one glyph at a time.  The recent shrinkage in the PDF may just be caused by cairo now merging multiple show_glyphs() calls.  So, unless someone can actually point to a commit that has fixed this, I think the issue should be investigated further.

