101855 – Embedded TrueType Symbols with accents not rendered correctly

Summary: Embedded TrueType Symbols with accents not rendered correctly

Status:	RESOLVED FIXED

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	general
Version:	unspecified
Hardware:	Other Linux (All)

Importance:	medium normal
Assignee:	poppler-bugs
QA Contact:

URL:
Whiteboard:
Keywords:	patch

Duplicates (1):	101624
Depends on:
Blocks:

Reported:	2017-07-20 16:57 UTC by Simon Shugar
Modified:	2017-08-10 07:41 UTC
CC List:	2 users

Attachments
The PDF with the problem to reproduce. (2.26 MB, application/pdf) 2017-07-20 16:57 UTC, Simon Shugar
Expected Result (408.65 KB, image/png) 2017-07-20 16:58 UTC, Simon Shugar
Actual Result (404.30 KB, image/png) 2017-07-20 16:59 UTC, Simon Shugar
Patch to add useMacRoman flag (796 bytes, patch) 2017-07-20 18:05 UTC, Simon Shugar
Use unicode cmap if it exists (1.93 KB, patch) 2017-08-01 12:59 UTC, Thomas Freitag

Description Simon Shugar 2017-07-20 16:57:56 UTC

Created attachment 132790
The PDF with the problem to reproduce.

Summary:
The uploaded PDF "pdfwithproblem" doesn't render correctly when converted to an image. The character ë, for example, is not rendered correctly for Arial Narrow and Calibri. All fonts in the document are embedded TrueType fonts with WinAnsi encoding. However, ArialNarrow and Calibri are rendered using the macRomanCmap.

Steps:
1. Run "pdfwithproblem" through pdftoppm to create an image.

Expected:
Accented characters render correctly. Please see EXPECTED_pdfwithproblem

Actual:
Accented characters are not rendered correctly. Please see ACTUAL_pdfwithproblem.

Comment 1 Simon Shugar 2017-07-20 16:58:48 UTC

Created attachment 132791
Expected Result

Comment 2 Simon Shugar 2017-07-20 16:59:04 UTC

Created attachment 132792
Actual Result

Comment 3 Simon Shugar 2017-07-20 17:08:25 UTC

Update:

After investigating the code with a colleague, we discovered that in GfxFont.cc, in Gfx8BitFont::getCodeToGIDMap(FoFiTrueType *ff), ArialNarrow and Calibri go through the branch for ((flags & fontSymbolic) && macRomanCmap >= 0) (line 1747 in 0.56.0). In that branch the useMacRoman flag is never set. I have limited knowledge of the code base, but when I set this flag within that else-if branch my document renders correctly.

Interestingly, all my fonts are meant to be WinAnsi encoded, but ArialNarrow and Calibri come through as MacRoman (see the pdffonts output below). However, after a bit of searching I found this article (https://blog.idrsolutions.com/2010/01/embedded-pdf-truetype-fonts-are-always-mac-encoded-unless-they-are-not/), which states that most embedded TrueType fonts come through as MacRoman. If anyone is able to clarify this, please let me know.

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
Times New Roman                      TrueType          WinAnsi          yes no  no       8  0
ArialNarrow-Bold                     TrueType          WinAnsi          yes no  no       9  0
Arial,Bold                           TrueType          WinAnsi          yes no  no      10  0
Calibri-Bold                         TrueType          WinAnsi          yes no  no      11  0

Comment 4 Simon Shugar 2017-07-20 17:28:05 UTC

Update:

Debugging GfxFont.cc#*Gfx8BitFont::getCodeToGIDMap(FoFiTrueType *ff) shows me what path each font takes. 

hasEncoding: 1
type: 5
fontType1: 1
flags: 32
fontSymbolic: 4
msSymbolCmap: -1
macRomanCmap: 1
cmap: 2
logic: (!(flags & fontSymbolic) || embFontID.num < 0
========================================================
hasEncoding: 1
type: 5
fontType1: 1
flags: 4
fontSymbolic: 4
msSymbolCmap: -1
macRomanCmap: 0
cmap: 0
logic: (flags & fontSymbolic) && macRomanCmap >= 0)
========================================================
hasEncoding: 1
type: 5
fontType1: 1
flags: 32
fontSymbolic: 4
msSymbolCmap: -1
macRomanCmap: 1
cmap: 2
logic: (!(flags & fontSymbolic) || embFontID.num < 0
========================================================
hasEncoding: 1
type: 5
fontType1: 1
flags: 4
fontSymbolic: 4
msSymbolCmap: -1
macRomanCmap: 1
cmap: 1
logic: (flags & fontSymbolic) && macRomanCmap >= 0)
========================================================
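
For readers following the debug output above: the two flags values that appear (32 and 4) correspond to the Nonsymbolic and Symbolic bits of the font descriptor. The following is a minimal sketch for decoding them, assuming the bit positions from PDF 32000-1 Table 123 (bit 3 = Symbolic, bit 6 = Nonsymbolic). The constant and function names are hypothetical and are not poppler's.

```cpp
#include <cassert>
#include <string>

// Font descriptor flag bits, assumed from PDF 32000-1, Table 123.
const unsigned kFontSymbolic = 1u << 2;    // bit 3, value 4
const unsigned kFontNonsymbolic = 1u << 5; // bit 6, value 32

// Hypothetical helper: classifies a Flags value by its Symbolic and
// Nonsymbolic bits. The spec requires exactly one of the two to be set.
std::string describeSymbolicFlags(unsigned flags) {
    const bool symbolic = (flags & kFontSymbolic) != 0;
    const bool nonsymbolic = (flags & kFontNonsymbolic) != 0;
    if (symbolic && !nonsymbolic) return "symbolic";
    if (!symbolic && nonsymbolic) return "nonsymbolic";
    return "invalid"; // both set or both clear
}

// Usage example with the two values seen in the debug output.
void usageExampleFlags() {
    assert(describeSymbolicFlags(32) == "nonsymbolic");
    assert(describeSymbolicFlags(4) == "symbolic");
}
```

In the output above, the fonts that print flags: 4 are the ones that reach the ((flags & fontSymbolic) && macRomanCmap >= 0) branch.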

Comment 5 Simon Shugar 2017-07-20 18:05:34 UTC

Created attachment 132794
Patch to add useMacRoman flag

Added useMacRoman = gTrue to ((flags & fontSymbolic) && macRomanCmap >= 0)  else if. 

My question is: should we just remove ((flags & fontSymbolic) && macRomanCmap >= 0) and rely on (macRomanCmap >= 0), or does it have a purpose? I looked at the history: this logic was added in a merge from Xpdf, but the comment describing the logic was never updated.

Comment Logic:

  // To match up with the Adobe-defined behaviour, we choose a cmap
  // like this:
  // 1. If the PDF font has an encoding:
  //    1a. If the PDF font specified MacRomanEncoding and the
  //        TrueType font has a Macintosh Roman cmap, use it, and
  //        reverse map the char names through MacRomanEncoding to
  //        get char codes.
  //    1b. If the PDF font is not symbolic or the PDF font is not
  //        embedded, and the TrueType font has a Microsoft Unicode
  //        cmap or a non-Microsoft Unicode cmap, use it, and use the
  //        Unicode indexes, not the char codes.
  //    1c. If the PDF font is symbolic and the TrueType font has a
  //        Microsoft Symbol cmap, use it, and use char codes
  //        directly (possibly with an offset of 0xf000).
  //    1d. If the TrueType font has a Macintosh Roman cmap, use it,
  //        as in case 1a.
  // 2. If the PDF font does not have an encoding or the PDF font is
  //    symbolic:
  //    2a. If the TrueType font has a Macintosh Roman cmap, use it,
  //        and use char codes directly (possibly with an offset of
  //        0xf000).
  //    2b. If the TrueType font has a Microsoft Symbol cmap, use it,
  //        and use char codes directly (possible with an offset of
  //        0xf000).
  // 3. If none of these rules apply, use the first cmap and hope for
  //    the best (this shouldn't happen).
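
To make the rules in that comment easier to compare with the branches seen in the debug output, here is a rough sketch that follows the comment text literally. It is a model of the comment only, not poppler's actual code (which, as noted, has drifted from the comment). The struct, enum, and function names are hypothetical, and the lookup mode returned for rule 3 is an assumption, since the comment does not specify one.

```cpp
#include <cassert>

// Hypothetical inputs; field names are illustrative, not poppler's.
struct FontInfo {
    bool hasEncoding;     // PDF font has an encoding
    bool usesMacRomanEnc; // encoding is MacRomanEncoding
    bool symbolic;        // font descriptor Symbolic flag
    bool embedded;        // font program is embedded
    int unicodeCmap;      // index of a Unicode cmap, or -1
    int msSymbolCmap;     // index of the Microsoft Symbol cmap, or -1
    int macRomanCmap;     // index of the Macintosh Roman cmap, or -1
};

enum class Lookup { ViaMacRomanNames, ViaUnicode, DirectCharCode };

struct Choice {
    int cmap;
    Lookup lookup;
};

// Follows rules 1a-1d, 2a-2b and 3 of the comment, in order.
Choice chooseCmapPerComment(const FontInfo &f) {
    if (f.hasEncoding) {
        if (f.usesMacRomanEnc && f.macRomanCmap >= 0)           // 1a
            return {f.macRomanCmap, Lookup::ViaMacRomanNames};
        if ((!f.symbolic || !f.embedded) && f.unicodeCmap >= 0) // 1b
            return {f.unicodeCmap, Lookup::ViaUnicode};
        if (f.symbolic && f.msSymbolCmap >= 0)                  // 1c
            return {f.msSymbolCmap, Lookup::DirectCharCode};
        if (f.macRomanCmap >= 0)                                // 1d
            return {f.macRomanCmap, Lookup::ViaMacRomanNames};
    }
    if (!f.hasEncoding || f.symbolic) {
        if (f.macRomanCmap >= 0)                                // 2a
            return {f.macRomanCmap, Lookup::DirectCharCode};
        if (f.msSymbolCmap >= 0)                                // 2b
            return {f.msSymbolCmap, Lookup::DirectCharCode};
    }
    return {0, Lookup::DirectCharCode};                         // 3 (mode assumed)
}

// Usage example: a symbolic, embedded font with an encoding, no Symbol
// cmap, and a Macintosh Roman cmap at index 1 (Unicode index assumed).
void usageExampleComment() {
    FontInfo f{true, false, true, true, 2, -1, 1};
    Choice c = chooseCmapPerComment(f);
    assert(c.cmap == 1);
    assert(c.lookup == Lookup::ViaMacRomanNames);
}
```

Under this reading of the comment, such a font would land in rule 1d and be reverse mapped through MacRomanEncoding, whereas the branch the code actually takes for these fonts does not set useMacRoman.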

Comment 6 Simon Shugar 2017-07-20 18:10:31 UTC

As I do not have a large amount of experience with the Poppler code-base I do not know the repercussions of this patch. Is there any test suite / test PDF library I could use to test this fix further?

Comment 7 Adrian Johnson 2017-07-21 12:33:44 UTC

The first thing I noticed is that the PDF embeds the entire original fonts, making the PDF file unnecessarily large. The font descriptor for two of the fonts is incorrect: it lists the encoding as WinAnsiEncoding but has flags = 4 (symbolic). Flags should be 32 (nonsymbolic) for WinAnsiEncoding. Changing flags to 32 for Calibri-Bold and Arial-Narrow causes the PDF to display correctly with poppler. These two bugs should be reported to whoever is responsible for generating this PDF.

There is a regression suite that can be used to check your patch. The patch doesn't look correct to me. The useMacRoman flag is for the case where the encoding is MacRomanEncoding. When flags = symbolic, this means look up the character code directly in the (1,0) subtable without performing any decoding. The problem in your PDF is that flags is symbolic, so it maps character codes directly to the (1,0) table. But in a standard (non-subsetted) TTF font the (1,0) subtable is in MacRoman encoding. The characters in your PDF content assume WinAnsi encoding, so outside of the common ASCII subset some characters are going to get messed up.
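
To illustrate the mismatch with the ë from the report, here is a small sketch using partial encoding tables. Only a handful of entries are included, taken from the standard WinAnsi (Windows-1252) and Mac OS Roman code charts. The table and function names are hypothetical.

```cpp
#include <cassert>
#include <map>

// Partial tables: character code -> Unicode code point.
const std::map<unsigned, unsigned> kWinAnsi = {
    {0xE9, 0x00E9}, // e acute
    {0xEB, 0x00EB}, // e dieresis
};
const std::map<unsigned, unsigned> kMacRoman = {
    {0x8E, 0x00E9}, // e acute
    {0x91, 0x00EB}, // e dieresis
    {0xE9, 0x00C8}, // E grave
    {0xEB, 0x00CE}, // I circumflex
};

// Returns the Unicode value for a code, or 0 if this partial table
// does not cover it.
unsigned lookupCode(const std::map<unsigned, unsigned> &table, unsigned code) {
    auto it = table.find(code);
    return it == table.end() ? 0 : it->second;
}

// Usage example: the content stream byte 0xEB is meant as WinAnsi
// e dieresis, but a direct lookup in a MacRoman-ordered (1,0) subtable
// treats the same byte as I circumflex.
void usageExampleMismatch() {
    assert(lookupCode(kWinAnsi, 0xEB) == 0x00EB);
    assert(lookupCode(kMacRoman, 0xEB) == 0x00CE);
}
```

The two encodings agree on the ASCII range, which is why only the accented characters are affected.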

I tested with Adobe Reader and it displays correctly. I'm not sure why. Looking at section 9.6.6.4 of the PDF32000 spec, this case of an encoding being specified together with flags = symbolic is not covered. So the PDF is likely non-conforming, but Adobe Reader either has some additional logic to handle it or its default handling just happens to display it as intended.

Comment 8 Simon Shugar 2017-07-21 17:16:23 UTC

Thank you Adrian, your reply was detailed and helps me out a lot. I'll look into the PDF generation and see where I go from there. Is there anywhere you would recommend where I can read up on font encoding within PDFs?

As for embedding the entire original fonts, that is by design.

Comment 9 Adrian Johnson 2017-07-21 21:56:12 UTC

(In reply to Simon Shugar from comment #8)
> Is there any where you would recommend that I can read up upon font encoding
> within PDFs?

https://www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf

Section 9.6.6.

In particular for TrueType fonts you should read 9.6.6.4 where it starts with:

"Because some aspects of TrueType glyph selection are dependent on the
conforming reader or the operating system, PDF files that use TrueType fonts
should follow certain guidelines to ensure predictable behaviour across all
conforming readers:"

and also the note:

"Some popular TrueType font programs contain incorrect encoding information.
Implementations of TrueType font interpreters have evolved heuristics for
dealing with such problems; those heuristics are not described here. For
maximum portability, only well-formed TrueType font programs should be used
in PDF files. Therefore, a TrueType font program in a PDF file may need to be
modified to conform to these guidelines."

and:

"If a character cannot be mapped in any of the ways described previously, a
conforming reader may supply a mapping of its choosing."

When I developed the cairo font embedding I would test with Adobe Reader, Ghostscript, and poppler. If the PDF did not work with all three that was usually an indication I did something wrong.

Comment 10 Albert Astals Cid 2017-07-31 15:30:20 UTC

Interestingly mupdf also works. So maybe someone with time can have a look at what they are doing?

Comment 11 Thomas Freitag 2017-08-01 12:49:43 UTC

(In reply to Albert Astals Cid from comment #10)
> Interestingly mupdf also works. So maybe someone with time can have a look
> at what they are doing?

It's the same as I already explained in bug 101624:
mupdf displays it correctly because it uses the LAST cmap in the font file that fits, and in these files that is always the Unicode one!

Comment 12 Thomas Freitag 2017-08-01 12:53:55 UTC

Ghostscript also displays it correctly, and of course so does Acrobat. So it seems to me that gs and Acrobat always use a Unicode cmap if one is present and, contrary to my patch for bug 101624, also ignore the symbolic flag in this case.

Comment 13 Thomas Freitag 2017-08-01 12:59:56 UTC

Created attachment 133170
Use unicode cmap if it exists

This patch solves this bug and also bug 101624. Looking at section 9.6.6.4 of the PDF32000 spec, I can't really decide whether the rule "If a (3, 1) “cmap” subtable (Microsoft Unicode) is present" should apply only if the font doesn't specify MacRomanEncoding, or always!

Comment 14 Thomas Freitag 2017-08-01 13:01:40 UTC

*** Bug 101624 has been marked as a duplicate of this bug. ***

Comment 15 Albert Astals Cid 2017-08-02 20:58:02 UTC

(In reply to Thomas Freitag from comment #13)
> Created attachment 133170 [details] [review] [review]
> Use unicode cmap if it exists
> 
> This patch solves this bug and also bug 101624. And looking at the PDF32000
> spec section 9.6.6.4 I can't really decide "If a (3, 1) “cmap” subtable
> (Microsoft Unicode) is present" should only be done if the font doesn't
> specify MacRomanEncoding or always!

So the text says

If the font has a named Encoding entry of either MacRomanEncoding or WinAnsiEncoding, or if the font descriptor’s Nonsymbolic flag (see Table 123) is set, the conforming reader shall create a table that maps from character codes to glyph names:
 * If the Encoding entry is one of the names MacRomanEncoding or WinAnsiEncoding, the table shall be initialized with the mappings described in Annex D.
 * If the Encoding entry is a dictionary, the table shall be initialized with the entries from the dictionary’s BaseEncoding entry (see Table 114). Any entries in the Differences array shall be used to update the table. Finally, any undefined entries in the table shall be filled using StandardEncoding.

If a (3, 1) “cmap” subtable (Microsoft Unicode) is present:
 * A character code shall be first mapped to a glyph name using the table described above.
 * The glyph name shall then be mapped to a Unicode value by consulting the Adobe Glyph List (see the Bibliography).
 * Finally, the Unicode value shall be mapped to a glyph description according to the (3, 1) subtable.

If no (3, 1) subtable is present but a (1, 0) subtable (Macintosh Roman) is present:
 * A character code shall be first mapped to a glyph name using the table described above.
 * The glyph name shall then be mapped back to a character code according to the standard Roman encoding used on Mac OS.
 * Finally, the code shall be mapped to a glyph description according to the (1, 0) subtable.

In any of these cases, if the glyph name cannot be mapped as specified, the glyph name shall be looked up in the font program’s “post” table (if one is present) and the associated glyph description shall be used.

The standard Roman encoding that is used on Mac OS is the same as the MacRomanEncoding described in Annex D, with the addition of 15 entries and the replacement of the currency glyph with the Euro glyph, as shown in Table 115.
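
As a rough illustration of the (3, 1) path quoted above (character code, then glyph name, then Unicode value, then glyph), here is a minimal sketch. The three maps stand in for the encoding table, the Adobe Glyph List, and the font's (3, 1) subtable. All names are hypothetical, the glyph index in the example is made up, and the "post" table fallback from the spec text is left out.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-ins for the three lookup stages named in the spec text.
struct GlyphTables {
    std::map<unsigned, std::string> codeToName;    // encoding table (e.g. WinAnsiEncoding)
    std::map<std::string, unsigned> nameToUnicode; // Adobe Glyph List
    std::map<unsigned, unsigned> unicodeToGid;     // font's (3, 1) cmap subtable
};

// Maps a character code to a glyph index; returns 0 (.notdef) if any
// stage has no entry. The "post" table fallback is not modelled.
unsigned gidViaUnicodeCmap(const GlyphTables &t, unsigned code) {
    auto name = t.codeToName.find(code);
    if (name == t.codeToName.end()) return 0;
    auto uni = t.nameToUnicode.find(name->second);
    if (uni == t.nameToUnicode.end()) return 0;
    auto gid = t.unicodeToGid.find(uni->second);
    return gid == t.unicodeToGid.end() ? 0 : gid->second;
}

// Usage example: WinAnsi code 0xEB -> "edieresis" -> U+00EB -> glyph.
void usageExampleUnicodePath() {
    GlyphTables t;
    t.codeToName[0xEB] = "edieresis";
    t.nameToUnicode["edieresis"] = 0x00EB;
    t.unicodeToGid[0x00EB] = 123; // hypothetical glyph index
    assert(gidViaUnicodeCmap(t, 0xEB) == 123);
    assert(gidViaUnicodeCmap(t, 0x41) == 0); // not in this tiny table
}
```

The (1, 0) path in the spec text has the same shape, except that the glyph name is mapped back to a Mac OS Roman character code instead of a Unicode value.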


********

My understanding of the first point of the "(3, 1) “cmap” subtable" section, i.e. "A character code shall be first mapped to a glyph name using the table described above.", is that it should always use the table described in the first paragraph ("the conforming reader shall create a table"), which, if there is no "MacRomanEncoding or WinAnsiEncoding", falls back to the second point of the first paragraph, "If the Encoding entry is a dictionary", no?

Comment 16 Thomas Freitag 2017-08-04 10:46:22 UTC

(In reply to Albert Astals Cid from comment #15)
> ********
> 
> My understanding of the first point of the "(3, 1) “cmap” subtable"
> section", i.e. "A character code shall be first mapped to a glyph name using
> the table described above." is that it should always use the table described
> in the first paragraph "the conforming reader shall create a table", which
> if there is no "MacRomanEncoding or WinAnsiEncoding" fallsback to the second
> point of the first paragraph "If the Encoding entry is a dictionary", no?

But this is also done with my patch: please have a look at the code. If useMacRoman is false but useUnicode is true, a table is created on line 1773ff.
The only difference the patch makes is the following:
Without the patch, if usesMacRomanEnc is true and a macRomanCmap exists, the unicodeCmap is ignored even if it exists. With the patch, the unicodeCmap is always used if it exists, though not directly: it then behaves as it did before when usesMacRomanEnc was false.
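
The difference can be sketched roughly as follows. This is a simplified model based only on the descriptions in this thread (the conditions printed in comment 4 and the behaviour described in comments 12, 13 and 16), not the patch itself, which remains the authoritative source. All names are hypothetical, and every branch after the first two is collapsed into Other.

```cpp
#include <cassert>

enum class Cmap { Unicode, MacRoman, Other };

// Before the patch (as described in this thread): MacRomanEncoding wins
// when a Macintosh Roman cmap exists, and the Unicode cmap is only
// considered for fonts that are not symbolic or not embedded.
Cmap pickWithoutPatch(bool usesMacRomanEnc, bool symbolic, bool embedded,
                      int unicodeCmap, int macRomanCmap) {
    if (usesMacRomanEnc && macRomanCmap >= 0) return Cmap::MacRoman;
    if ((!symbolic || !embedded) && unicodeCmap >= 0) return Cmap::Unicode;
    return Cmap::Other; // remaining branches not modelled
}

// After the patch (as described in this thread): a Unicode cmap is
// used whenever it exists, regardless of the other inputs.
Cmap pickWithPatch(bool usesMacRomanEnc, bool symbolic, bool embedded,
                   int unicodeCmap, int macRomanCmap) {
    (void)symbolic;
    (void)embedded;
    if (unicodeCmap >= 0) return Cmap::Unicode;
    if (usesMacRomanEnc && macRomanCmap >= 0) return Cmap::MacRoman;
    return Cmap::Other; // remaining branches not modelled
}

// Usage example: a symbolic, embedded WinAnsi font that has both a
// Unicode cmap (index assumed) and a Macintosh Roman cmap.
void usageExamplePatch() {
    assert(pickWithoutPatch(false, true, true, 2, 1) == Cmap::Other);
    assert(pickWithPatch(false, true, true, 2, 1) == Cmap::Unicode);
}
```

In this model the fonts from the report stop falling through to the symbolic Macintosh Roman branch as soon as a Unicode cmap is available.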

Comment 17 Thomas Freitag 2017-08-04 11:09:51 UTC

But let us ignore for a moment whether we understand the spec correctly. The main question in my eyes is what users expect when they compare the output of poppler with the output of Acrobat or Ghostscript.

Even if poppler follows the spec correctly and Acrobat doesn't follow its own spec, you can of course try to argue that poppler behaves correctly and Acrobat does not. But that would probably only be accepted if poppler's output looked more like what the user expects than Acrobat's does, and that is not the case here.

I also had a look at the Ghostscript code today to figure out how they handle it, but I couldn't find where they do it.

And I don't like mupdf's solution of taking the last cmap that fits. What happens if the cmaps are in a different order? Would Acrobat then also render it "wrong"? It would be nice if someone could produce such a PDF; I'm not able to do that without great effort.

My conclusion: if my patch doesn't disturb anything else, why not use it? And if it disturbs any of the PDFs in your regression test, I offer to have another look at it.

Comment 18 Albert Astals Cid 2017-08-09 21:02:24 UTC

Honestly, have I ever rejected a patch that improved interoperability with Acrobat?

I don't think so. I'm just trying to make sure we do things because we understand why we do them and not because "I just switched a bit and suddenly it looks better", because sometimes those changes cause problems down the road.

I'll run the regtest.

Comment 19 Albert Astals Cid 2017-08-10 07:22:38 UTC

Pushed

Comment 20 Thomas Freitag 2017-08-10 07:41:25 UTC

(In reply to Albert Astals Cid from comment #18)
> Honestly, have i ever rejected a patch that improved inter-operability with
> Acrobat?
> 
> I don't think so, i'm juts trying to make sure we do things because we
> understand why we do them and not because "i just swiched a bit and suddenly
> looks better", because sometimes down the road those changes have problems.
> 
> I'll run the regtest.

Sorry for any misunderstanding, Albert. To clarify, in other words: I never argued for my patch on the grounds that "i just switched a bit and suddenly looks better". I argued for it because I became convinced that Acrobat always starts with the Unicode cmap if it exists, regardless of whether the spec says so or not.

On the contrary: in all the years I have been providing patches to poppler, I have been happy to have you as my counterpart, someone who tries to get to the bottom of things. This often made my patches better, and when I was on the wrong track you guided me back.
