Bug 107303 - "8" shown instead of "x" inside checkbox when converting LibreOffice-generated form to PostScript
Status: RESOLVED MOVED
Alias: None
Product: poppler
Classification: Unclassified
Component: utils
Version: unspecified
Hardware: All All
Importance: medium normal
Assignee: poppler-bugs
 
Reported: 2018-07-20 08:11 UTC by Michael Weghorn
Modified: 2018-08-21 11:09 UTC

Attachments
Sample form generated by LibreOffice (2.03 KB, application/pdf)
2018-07-20 08:11 UTC, Michael Weghorn
Form with ticked checkbox and run through "mutool clean" (2.22 KB, application/pdf)
2018-07-20 08:12 UTC, Michael Weghorn
Result after running "pdftops" on the file (13.09 KB, application/postscript)
2018-07-20 08:15 UTC, Michael Weghorn
Patch to allow font tags other than "ZaDb" for ZapfDingbats (1.38 KB, patch)
2018-07-20 08:18 UTC, Michael Weghorn
Patch to allow font tags other than "ZaDb" for ZapfDingbats (1.37 KB, patch)
2018-07-20 08:42 UTC, Michael Weghorn

Description Michael Weghorn 2018-07-20 08:11:48 UTC
Created attachment 140724 [details]
Sample form generated by LibreOffice

Converting a LibreOffice-generated PDF form with a ticked checkbox to PostScript leads to an "8" being shown inside the check box rather than the expected "x" sign.

Steps to reproduce:

1) Open attached PDF form "simple_form.pdf" in Okular
2) tick the checkbox
3) print (either to a real printer or use "Print to File (PDF)")
4) Look at the output/printout

Result:

An "8" is shown inside of the checkbox that has been ticked.

Expected result:

The same checkmark ("x") as displayed in Okular is shown inside the checkbox on the printout.


This can also be reproduced by directly calling 'pdftops' on a PDF form saved after ticking the checkbox:

    $ pdftops simple_form_CHECKBOX_TICKED_CLEANED.pdf
    Syntax Error: Unknown font tag 'ZaDb'
    Syntax Error: Unknown font tag 'ZaDb'

(In addition to ticking the checkbox, the document has been run through 'mutool clean' to make analysis easier.)
Comment 1 Michael Weghorn 2018-07-20 08:12:37 UTC
Created attachment 140725 [details]
Form with ticked checkbox and run through "mutool clean"
Comment 2 Michael Weghorn 2018-07-20 08:15:17 UTC
Created attachment 140726 [details]
Result after running "pdftops" on the file

This is the resulting PostScript file that shows the "8" instead of the "x". I can reproduce with current git master (as of Poppler 0.67, commit 20d89699b35397f23352d0e60a3e19da2ce6b410).
Comment 3 Michael Weghorn 2018-07-20 08:18:29 UTC
Created attachment 140727 [details] [review]
Patch to allow font tags other than "ZaDb" for ZapfDingbats

As the 'pdftops' output indicates, there's a problem with the ZapfDingbats font. Poppler expects "ZaDb" to be used as the font tag, and replaces anything else with this. The LibreOffice-generated form, however, uses "ZaDi" instead.

As far as I understand it so far, either LibreOffice uses an invalid name or Poppler makes invalid assumptions.

While the PDF specification does use "ZaDb" in its own example (in section 12.7.4.2.3), I did not find any place where it speaks of what tag has to be used, so it appears to be the latter case (but maybe I have missed something in the PDF spec or anywhere else).

The attached patch makes Poppler accept other tags for the ZapfDingbats font as well.

I'd be happy about feedback on the patch or other clarifications. (I'm far from being a Poppler expert...)
Comment 4 Michael Weghorn 2018-07-20 08:42:50 UTC
Created attachment 140728 [details] [review]
Patch to allow font tags other than "ZaDb" for ZapfDingbats

(updated patch to use correct email address)
Comment 5 Tobias Deiminger 2018-07-20 18:58:36 UTC
(In reply to Michael Weghorn from comment #3)
> The attached patch makes Poppler accept other tags for the ZapfDingbats font
> as well.
> 
> I'd be happy about feedback on the patch or other clarifications. (I'm far
> from being a Poppler expert...)
Hi Michael! I'm also far from being a Poppler expert, but I'd like to confirm your approach. From a standards perspective it doesn't matter whether the tag is named /ZaDb or /ZaDi or /whatever. What actually matters is that the tag has a corresponding entry in the font sub-dictionary of a resource dictionary.

From Poppler's perspective, we sometimes set forceZapfDingbats = true. This happens with the example document, where AnnotAppearanceBuilder::drawText is reached via drawFormField => drawFormFieldButton [case formButtonCheck] => drawText. Whenever forceZapfDingbats == true, the appearance Tf operand must match our fake font resource, whose name is hardcoded as "ZaDb". If the original DA Tf operand was different, we need to replace it with "ZaDb".

Your patch ensures this, if I got it right, and therefore it's a good patch:)
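
The tag replacement described above can be illustrated with a minimal Python sketch. This is not Poppler's C++ code: `force_font_tag` and `TF_PATTERN` are hypothetical names, and the regular expression only handles the simple `/Name size Tf` form seen in the sample document.

```python
import re

# Hypothetical pattern (not from Poppler): a font name operand followed by
# a numeric size and the Tf operator, e.g. "/ZaDi 0 Tf".
TF_PATTERN = re.compile(r"/(\S+)(\s+[-+]?\d*\.?\d+\s+Tf)")


def force_font_tag(da, forced_tag="ZaDb"):
    """Return the /DA string with the font tag of its first Tf replaced."""
    return TF_PATTERN.sub(lambda m: "/" + forced_tag + m.group(2), da, count=1)


# /DA string taken from the widget annotation of the sample form.
da = "0.13725 0.14901 0.15294 rg /ZaDi 0 Tf"
print(force_font_tag(da))
```

A replaced tag is only useful if a font resource is registered under that same name, which is the point made above about the resource font sub-dictionary.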

We're having a similar discussion at the moment here [0], and also here [1], because of our current GSoC project. Maybe you could have a look, especially at the UML diagram [2] that shows the relationship between the different font objects, and give your two cents on whether we're on the right track.

By the way, your attached PDF document is actually strange because it has a /DR entry in the Annot dictionary, which is not specified for widget annotation dictionaries, at least not in PDF 1.7 (ISO 32000-1:2008). The /DR entry is meant to be in the global AcroForm dictionary. Has this changed in PDF 2.0?

[0] https://bugs.freedesktop.org/show_bug.cgi?id=81748.
[1] https://cgit.kde.org/scratch/dileepsankhla/okular-gsoc2018-typewriter.git/tree/bugs/poppler_81748
[2] https://cgit.kde.org/scratch/dileepsankhla/okular-gsoc2018-typewriter.git/plain/bugs/poppler_81748/font_object_graph.dia
Comment 6 Michael Weghorn 2018-07-30 07:24:01 UTC
Hi Tobias,

thanks for your reply with all the additional information and sorry for the delay in responding.

(In reply to Tobias Deiminger from comment #5)
> From popplers perspective, we sometimes set forceZapfDingbats = true. Like
> with the example document, where AnnotAppearanceBuilder::drawText is reached
> via drawFormField => drawFormFieldButton [case formButtonCheck] => drawText.
> Whenever forceZapfDingbats == true, the appearance Tf operand must match our
> fake font resource that we hardcoded named "ZaDb". If the original DA Tf
> operand was different, we need to replace it with "ZaDb".
> 
> Your patch ensures this, if I got it right, and therefore it's a good patch:)

It doesn't really. E.g. for the example document, the font resource is no longer replaced with the fake one. It was before (i.e. without the patch), but the font resource was not found. Now, the original font resource with the "ZaDi" tag is used -- but if I understand you correctly, this might not be desirable if Poppler relies on the "ZaDb" being used at other places for the 'forceZapfDingbats' case...
Should I rather have a look why the "ZaDb" one is not found (like indicated by the pdftops output: "Syntax Error: Unknown font tag 'ZaDb'")?

I'll try to have a closer look at all the points you mentioned sometime soon, but only have limited time available at the moment, so can't really say when that will be.
Comment 7 Tobias Deiminger 2018-07-30 13:06:20 UTC
(In reply to Michael Weghorn from comment #6)
> > Your patch ensures this, if I got it right, and therefore it's a good patch:)
> 
> It doesn't really. E.g. for the example document, the font resource is no
> longer replaced with the fake one. It was before (i.e. without the patch),
> but the font resource was not found. Now, the original font resource with
> the "ZaDi" tag is used -- but if I understand you correctly, this might not
> be desirable if Poppler relies on the "ZaDb" being used at other places for
> the 'forceZapfDingbats' case...

I was on the wrong track too (I mixed up the return value of GooString::cmp). Now I think the original code without the patch is already fine, at least with respect to my assertions above.

I've just learned new things about Poppler. When printing to PDF, Poppler apparently removes widget annotations and replaces them with plain content items. I guess this is required because PostScript doesn't support annotations? Anyway, AnnotAppearanceBuilder is then no longer responsible for displaying the "8" in the printout.

The original simple_form_CHECKBOX_TICKED_CLEANED.pdf contains a widget annotation, representing the check button:

3 0 obj
<<
  /Type /Annot
  /Subtype /Widget

  /DR <<
    /Font <<
      /ZaDi 4 0 R
    >>
  >>
  /DA (0.13725 0.14901 0.15294 rg /ZaDi 0 Tf)
  /MK <<
    /CA (8)
  >>
>>
endobj

Notably, /CA is string "8" ("the widget annotation's normal caption which shall be displayed when it is not interacting with the user").

Now, when printed, the Widget object is gone. My decompressed printout.pdf instead contains this:

5 0 obj
<< /Contents 6 0 R ... >>
6 0 obj
<< /Length 376 >>
stream
/R8 10.9815 Tf
[(8)-77]TJ
... % shortened
ET
10 0 obj
<< /BaseFont /DZBPUO+F1348788328_100000 /Encoding /WinAnsiEncoding /FirstChar 56 /FontDescriptor 11 0 R /LastChar 56 /Subtype /Type1 /Type /Font /Widths [ 600 ] >>
11 0 obj
<< /Ascent 616 /AvgWidth 600 /CapHeight 616 /CharSet (/eight) /Descent -15 /Flags 65569 /FontBBox [ 0 -15 493 616 ] /FontFile3 12 0 R /FontName /DZBPUO+F1348788328_100000 /ItalicAngle 0 /MaxWidth 600 /MissingWidth 600 /StemV 73 /Type /FontDescriptor >>
endobj
12 0 obj
<< /Subtype /Type1C /Length 396 >>
... % embedded font here

Here, the TJ ("show text") operator writes the string "8". The "8" was copied from /MK /CA of the original document simple_form.pdf. Tf selects font /R8, which maps to the font dictionary in object 10 0. This is an embedded font that defines only a single character, 56, which is ASCII for "8".

So the "8" sign appears on screen and in the printout, and that's exactly what the PDF asks for. I'm not sure whom we should blame then. Maybe the software that originally wrote "8" into /CA? Does this longish post make sense at all?
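
As a small illustration of why object 10 0 defines exactly one character, here is a Python sketch that derives the defined character codes from /FirstChar, /LastChar and /Widths. `defined_char_codes` is a hypothetical helper, and the dictionary is a hand-copied subset of the font dictionary shown above.

```python
from typing import Dict, List


def defined_char_codes(font: Dict[str, object]) -> List[int]:
    """List the character codes covered by a simple font's Widths array."""
    first = int(font["FirstChar"])
    last = int(font["LastChar"])
    widths = list(font["Widths"])
    codes = list(range(first, last + 1))
    # Widths must hold one entry per code from FirstChar to LastChar.
    if len(codes) != len(widths):
        raise ValueError("Widths length does not match FirstChar..LastChar")
    return codes


# Values copied by hand from object 10 0 of the decompressed printout.
font_10 = {"FirstChar": 56, "LastChar": 56, "Widths": [600]}
codes = defined_char_codes(font_10)
print(codes, [chr(c) for c in codes])
```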

> Should I rather have a look why the "ZaDb" one is not found (like indicated
> by the pdftops output: "Syntax Error: Unknown font tag 'ZaDb'")?

I believe the Syntax Error is unrelated to the problem. But would be interesting where it originates anyway.
Comment 8 Tobias Deiminger 2018-07-30 16:47:44 UTC
Ah, I see... character 56 (="8" in ASCII) is a cross symbol in the ZapfDingbats font. So it makes some sense to have [(8)-77]TJ in the printed variant. Sadly, the embedded font became Nimbus Mono PS, which has no cross symbol at 56, so "8" is drawn as a digit.

I have not yet found the place where [(8)-77]TJ gets formed. An obvious location to generate the stream is AnnotAppearanceBuilder::drawText, but I debugged it and it produces slightly different content:
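
The effect of the font substitution can be sketched in Python: the same character code yields a different glyph depending on the font's encoding. The `ZAPF_DINGBATS` table is a tiny hand-written excerpt and should be treated as an assumption to check against the ZapfDingbats encoding table in the PDF specification; `glyph_for` is a hypothetical helper.

```python
import unicodedata

# Assumed excerpt of the ZapfDingbats built-in encoding (code -> Unicode
# equivalent of the glyph). Only the codes discussed in this report.
ZAPF_DINGBATS = {
    0x33: "\u2713",  # "3": check mark
    0x38: "\u2718",  # "8": heavy ballot X
}


def glyph_for(code, font_name):
    """Return the character a viewer would draw for `code` in `font_name`."""
    if font_name == "ZapfDingbats":
        return ZAPF_DINGBATS.get(code, "?")
    # Text fonts such as the substituted monospace font: codes in the
    # printable ASCII range map to the ASCII character itself.
    return chr(code)


for font in ("ZapfDingbats", "NimbusMonoPS"):
    glyph = glyph_for(56, font)
    print(font, repr(glyph), unicodedata.name(glyph))
```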
q
BT
0.13725 0.14901 0.15294 rg /ZaDb 11.00 Tf 1 0 0 1 2.43 1.55 Tm
(8) Tj
ET
Q

Maybe AnnotAppearanceBuilder::drawText is used, and there is some post processing that I'm not aware of? Michael, do you know?

Anyway, there seems to be a fundamental problem. All the Annotation classes dynamically generate in-memory appearance streams and may depend on in-memory resources. If we simply take these generated appearance streams and write them into a PDF file for printing, then dependent in-memory resources like the fake font are missing. We would have to write the resource objects to the PDF too, but that's not yet done.

In your patch you prefer an existing ZapfDingbats font over the in-memory fake font, which then works. If we had a document with no ZapfDingbats font and no CA defined, then GooString checkMark("3") would be used (see AnnotAppearanceBuilder::drawFormFieldButton) and we would get the same bug again, wouldn't we?
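
The missing-resource problem can be made concrete with a short Python sketch that lists the font tags a content stream selects but that have no entry in the font resources written alongside it. `missing_font_tags` is a hypothetical helper, not part of Poppler, and the resource dictionary is simplified to a plain mapping from tag to object reference.

```python
import re
from typing import Dict, List

# Hypothetical pattern: captures the name operand of each Tf operator.
TF_RE = re.compile(r"/(\S+)\s+[-+]?\d*\.?\d+\s+Tf")


def missing_font_tags(content: str, font_resources: Dict[str, str]) -> List[str]:
    """Return font tags used by Tf in `content` that lack a resource entry."""
    used = TF_RE.findall(content)
    # dict.fromkeys removes duplicates while keeping first-use order.
    return [tag for tag in dict.fromkeys(used) if tag not in font_resources]


# Stream as produced by drawText in the debugging session quoted above.
stream = (
    "q\nBT\n"
    "0.13725 0.14901 0.15294 rg /ZaDb 11.00 Tf 1 0 0 1 2.43 1.55 Tm\n"
    "(8) Tj\nET\nQ\n"
)
# Only the document's own font is on disk; the fake font lives in memory.
written_fonts = {"ZaDi": "4 0 R"}
print(missing_font_tags(stream, written_fonts))
```

Whether this is what triggers the "Unknown font tag" message from pdftops has not been established in this report; the sketch only shows the dependency being discussed.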
Comment 9 Michael Weghorn 2018-08-03 07:17:57 UTC
I have had a closer look at some aspects now.

(In reply to Tobias Deiminger from comment #5)
> We're having a similar discussion atm. here [0], and also here [1], because
> of our current GSoC project. Maybe you have a look, esp. at the UML diagram
> [2] that shows the relationship of the different font objects and give your
> two cents if we're on the right track.
Thanks for mentioning these, there's lots of helpful information.
The UML diagram looks good to me. (I just realized that not all members are shown for all types, e.g. the 'CapHeight' member for the 'Font descriptor' is not mentioned, and some of the font dictionary members mentioned in section 9.6.2 in the PDF spec, but that may be intentional.)


> Btw., your attached PDF document is actually strange because it has a /DR
> entry in the Annot dictionary, which is not specified for Widget Annotation
> Dictionaries. At least not in PDF 1.7 (ISO 32000-1:2008). The /DR entry is meant to
> be in the global AcroForm dictionary. Has this changed in PDF 2.0?
I also can't find a specification for the '/DR' entry in the Annot dictionary in the PDF 1.7 spec. I don't know about PDF 2.0, but at a quick glance, the corresponding code in LibreOffice has been there for a long time, so I doubt it's related to any newer PDF standard.
However (as far as I can see), the behaviour is still the same after manually removing the '/DR' entry from the Annot dictionary (object '3 0 obj'). (The AcroForm dictionary also specifies the font in its '/DR' entry.)

(In reply to Tobias Deiminger from comment #8)
> I could not yet discover the place where the [(8)-77]TJ gets formed. An
> obvious location to generate the stream is AnnotAppearanceBuilder::drawText,
> but I debugged it and it produces slightly different content
> q
> BT
> 0.13725 0.14901 0.15294 rg /ZaDb 11.00 Tf 1 0 0 1 2.43 1.55 Tm
> (8) Tj
> ET
> Q
> 
> Maybe AnnotAppearanceBuilder::drawText is used, and there is some post
> processing that I'm not aware of? Michael, do you know?

Just to be sure: Are you using the "Print to File (PDF)" option from Okular to print to PDF? (I can reproduce the behaviour when doing so.) In this case, Okular first generates a PostScript file using Poppler's PSConverter, and then runs `ps2pdf` on that file (see method `FilePrinter::doPrintFiles` in `core/fileprinter.cpp`). The related PDF code should therefore be formed in the conversion done by Ghostscript (`ps2pdf` being a Ghostscript tool), so two conversions are actually involved (PDF -> PS -> PDF).
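
A rough way to reproduce the two-step conversion outside of Okular is sketched below in Python. It is an approximation: Okular calls Poppler's PSConverter in-process, whereas this sketch assumes the `pdftops` command-line tool as a stand-in for the first step. `roundtrip_commands` and the intermediate file names are hypothetical.

```python
from typing import List


def roundtrip_commands(src_pdf: str, ps_path: str, out_pdf: str) -> List[List[str]]:
    """Build the command lines for the PDF -> PS -> PDF round trip."""
    return [
        ["pdftops", src_pdf, ps_path],  # Poppler: PDF -> PostScript
        ["ps2pdf", ps_path, out_pdf],   # Ghostscript: PostScript -> PDF
    ]


commands = roundtrip_commands(
    "simple_form_CHECKBOX_TICKED_CLEANED.pdf", "intermediate.ps", "printout.pdf"
)
for cmd in commands:
    print(" ".join(cmd))
```

Each command list can be passed to `subprocess.run(cmd, check=True)` on a machine where both tools are installed.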



> 
> Anyway there seems to be a fundamental problem. All the Annotation classes
> dynamically generate in-memory appearance streams and may depend on
> in-memory resources. If we simply take this generated appearance streams and
> write them into a PDF file for printing, then dependent in-memory resources
> like the fake font are missing. We would have to write the resource objects
> to the PDF too, but that's not yet done.
> 
> In your patch you prefer an existing zapf dingbats font over the in-memory
> fake font which works then. If we had a document with no zapf dingbat font
> and no CA defined, then GooString checkMark("3") will be used (see
> AnnotAppearanceBuilder::drawFormFieldButton) and we get the same bug again,
> is it?

Yes, I think the problem reappears then. So if I understand correctly,
what should be done is to write the objects currently only created
in-memory to the PDF document and this would solve the problem for both
cases (the original document and the case you describe here).

Still, one aspect that I currently haven't understood is why
`forceZapfDingbats` is always set to 'true' whenever a checkbox is drawn
via `AnnotAppearanceBuilder::drawFormFieldButton [case formButtonCheck]`.
Do you know why?

My (maybe naive) expectation without further examination would have been
that an explicitly specified font is used if there is any, rather than
always forcing ZapfDingbats (using the interactive form dicts `DR` entry
as specified in Section 12.7.2 of the PDF 1.7 spec, table 218).

In that case, I'd currently see two cases that could be distinguished:

1) If the document supplies proper information and resources for the font,
those should be used (e.g. as with the given sample document here).

2) Otherwise ZapfDingbats is used and all required resources are saved in
the document as well.
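
The two cases above could be sketched as follows in Python. This is only an illustration of the proposed decision, not existing Poppler behaviour; `choose_checkbox_font` is a hypothetical helper and the font resources are simplified to a mapping from tag to object reference.

```python
from typing import Dict, Optional, Tuple


def choose_checkbox_font(
    da_tag: Optional[str],
    doc_fonts: Dict[str, str],
    fallback_tag: str = "ZaDb",
) -> Tuple[str, bool]:
    """Return (font tag to use, whether a font resource must be written)."""
    # Case 1: the document names a font and supplies a resource for it.
    if da_tag is not None and da_tag in doc_fonts:
        return da_tag, False
    # Case 2: fall back to ZapfDingbats and save its resource in the output.
    return fallback_tag, True


print(choose_checkbox_font("ZaDi", {"ZaDi": "4 0 R"}))
print(choose_checkbox_font(None, {}))
```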

Does this make sense or did I miss any reason for using ZapfDingbats
unconditionally? (like one would never want anything other than
ZapfDingbats' '8' (check mark) in a checkbox anyway)

As far as I understand, the visual result would be the same for
implementing a solution for either 1) or 2) for the given sample document
(since ZapfDingbats is used in both cases), but other documents might behave differently.

Please also let me know if I failed to reply to any other question or aspect you mentioned.

Another interesting thing I realized is that using Poppler's 'pdftocairo' results in a PDF file that has the check mark shown properly (even though the same warning about the unknown font tag is being shown); command:

    $ pdftocairo -pdf simple_form_CHECKBOX_TICKED_CLEANED.pdf fromCairo.pdf
    Syntax Error: Unknown font tag 'ZaDb'

I haven't had a closer look at this so far.
Comment 10 GitLab Migration User 2018-08-21 11:09:45 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/541.

