Bug 34522 - mapping problem with ligatures and accents
Summary: mapping problem with ligatures and accents
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: Other Linux (All)
Importance: medium normal
Assignee: poppler-bugs
 
Reported: 2011-02-21 01:47 UTC by Werner Lemberg
Modified: 2011-02-24 10:45 UTC (History)

Attachments
encoding fix (445 bytes, patch)
2011-02-23 03:42 UTC, Adrian Johnson

Description Werner Lemberg 2011-02-21 01:47:10 UTC
[self-compiled poppler 0.16.2]


Please have a look at this document:

  http://diwww.epfl.ch/w3lsp/publications/typography/frsa.pdf

Ligatures like `fi' or `fl' are mapped incorrectly to `Æ' or `Ø', and
accents like the acute over `e' are mapped to `¬'.

BTW, I don't mention the horrible spacing and kerning :-)

acroread displays this document just fine.
Comment 1 Albert Astals Cid 2011-02-21 11:24:34 UTC
The document uses a font (Optima) that is not on your system and is not embedded in the PDF. There is nothing we can do; either install the font on your system or configure a proper font substitution in fontconfig.
Comment 2 Werner Lemberg 2011-02-21 13:58:16 UTC
Hmm.  Then how do you explain that acroread gets it right?  For display, it replaces Optima with Adobe Sans MM on my GNU/Linux box.

I can imagine that Adobe maintains a database of common fonts and how they should be substituted...  At least this would be a solution to this problem.
Comment 3 Albert Astals Cid 2011-02-21 16:33:14 UTC
Yes, Adobe maintains a database of substitutions; on Linux the tool for that is called fontconfig (I guess you already knew that ;-)). If you set it up correctly, it should work (reopen this bug if it does not).

Maintaining such a database of matches is out of scope for us in poppler.
Comment 4 Werner Lemberg 2011-02-21 22:29:08 UTC
Sorry to say, but your solution is not acceptable for Joe User.  Fiddling with fontconfig is extremely difficult since there are no GUI programs (that I'm aware of) which allow the necessary manipulations needed to resolve the problem.

Another complication is that the frequently used Optima fonts are not freely available, and the URWClassico clone isn't either, as far as I know.

On the other hand, your solution works, which surprises me:

  <alias binding="same">
    <family>Optima</family>
    <accept>
      <family>URWClassico</family>
    </accept>
  </alias>

I expected the ligature issues to remain, but obviously URWClassico has the right encoding vector.  The same is true if I use, say, `Century Schoolbook L' as a replacement.

However, it fails if I try a TrueType font like `Liberation Sans'.  How shall Joe User know this?  For me, this is indeed a very good reason to maintain a database for resolving cmap issues – fontconfig can't do this job for poppler.
Comment 5 Brad Hards 2011-02-22 00:43:55 UTC
This is clearly the responsibility of fontconfig, and poppler can't (in general) solve the problem, because it depends on an arbitrary installed set of fonts on each machine.

If tools are missing for fontconfig, then that is a bug in fontconfig, not in poppler.
Comment 6 Werner Lemberg 2011-02-22 01:17:34 UTC
Hmm, hmm, I need to be persistent here, since we are probably miscommunicating.

Are you sure that this is the responsibility of fontconfig?  I doubt it.
It seems that poppler directly accesses the encoding vector of a Type 1 font instead of relying on proper cmap access as provided by, say, FreeType.

Note that the problem is not a general font selection issue but how fonts get processed – poppler apparently accesses Type 1 fonts and TTFs differently, otherwise there wouldn't be a problem if I substitute the Type 1 font with a TTF.  And THIS is definitely not a fontconfig issue.

Am I missing something?

BTW, I won't play the game with resetting the bug status from `resolved' to `reopened'. :-)
Comment 7 Albert Astals Cid 2011-02-22 11:23:42 UTC
Let's start with the fact that PDFs with non-standard, non-embedded fonts are by definition non-interoperable.

As far as I know, this is what we do: we know we have to render glyph 3 of a font called Optima, so we ask fontconfig for it and then go and render glyph 3. If what we got is not Optima, it might work or it might not.
Comment 8 Werner Lemberg 2011-02-22 12:14:37 UTC
My knowledge of PDF internals is small: Do you access fonts in SFNT format by index also, bypassing the cmap?
Comment 9 Werner Lemberg 2011-02-22 12:18:00 UTC
(In reply to comment #8)
> My knowledge of PDF internals is small: Do you access fonts in SFNT format by
> index also, bypassing the cmap?

I was imprecise: How does the PDF specification mandate glyph access in SFNT fonts?
Comment 10 Albert Astals Cid 2011-02-22 12:21:03 UTC
Not that I really know much either :-D
As far as I know we don't bypass the cmap, but we use the cmap specified by the PDF file, not the one in the font.
Comment 11 Adrian Johnson 2011-02-22 13:46:33 UTC
(In reply to comment #9)
> I was imprecise: How does the PDF specification mandate glyph access in SFNT
> fonts?

You can read all about it in Section 9.6.6 of the PDF Reference [1].

Character encoding is the worst part of the PDF standard, probably because it started out supporting only Type 1 fonts in PDF 1.0 and has been extended in each revision to support additional font types while maintaining backwards compatibility.

The frsa.pdf file uses text strings with 8-bit characters. [A-Za-z0-9] are all encoded with the standard ASCII values. The character used for the fi ligature is encoded as character 174. The font dictionary for the font used by most of the text (including the fi ligature) specifies a Type 1 font named "Optima". The font dictionary overrides the encoding built into the font with a custom encoding. In PDF, Type 1 font encodings are keyed by glyph name. The dictionary maps character 174 to glyph name "/AE". Since this font is not embedded in the PDF, this is all the information the PDF viewer has to work with.

[1] http://www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/adobe_supplement_iso32000.pdf
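
For readers unfamiliar with how a font dictionary overrides an encoding, the mechanism can be sketched in a few lines of Python. This is an illustration only, not poppler code: the function name apply_differences, the trimmed BUILT_IN table, and the sample override list are all hypothetical. The override list follows the shape of a PDF /Differences array (an integer sets the current code, each following name is assigned to consecutive codes), and the built-in entries are taken from Adobe StandardEncoding.

```python
# Trimmed stand-in for a Type 1 font's built-in encoding (code -> glyph name).
# Entries follow Adobe StandardEncoding; all other codes are omitted here.
BUILT_IN = {102: "f", 105: "i", 174: "fi", 175: "fl"}


def apply_differences(base, differences):
    """Return a new code -> glyph-name table with PDF-style overrides applied.

    `differences` mimics a /Differences array: an int sets the current
    character code, and every name after it is assigned to that code,
    which is then incremented by one.
    """
    table = dict(base)  # never mutate the built-in table
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            if code is None:
                raise ValueError("glyph name appears before any code")
            table[code] = item
            code += 1
    return table


# Hypothetical override matching the description above: code 174 -> "AE".
custom = apply_differences(BUILT_IN, [174, "AE"])
print(custom[174], custom[175])
```

With such an override in place, code 174 names the glyph "AE" while the untouched code 175 keeps its built-in name "fl". A viewer that has no embedded font can only look these names up in whatever substitute font it finds.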
Comment 12 Werner Lemberg 2011-02-22 21:22:11 UTC
Thanks.  Would it be possible to make poppler try to substitute a Type 1 font with another Type 1 font, and a TTF with a TTF?  As my tests have shown, this can improve the result.
Comment 13 Adrian Johnson 2011-02-23 03:42:41 UTC
Created attachment 43699
encoding fix

I had another look at the PDF file to see why it works when the substitute font is Type 1 but not when it is TrueType. There are two non-embedded fonts in the PDF named "Optima". One has a modified encoding while the other does not. I was looking at the wrong font. The PDF file does correctly map the character used for the fi ligature to the glyph name "/fi".

The problem is a bug in poppler. When Optima is substituted with a TrueType font, Gfx8BitFont::getCodeToGIDMap() is called to get the mapping from the PDF character codes to the font GID. This function assumes that the font in the PDF is always a TrueType font. When the PDF font is a Type 1 font it needs to go through the glyph names in the enc array to map them to the TrueType GID numbers. It is not doing this when no /Encoding is defined in the Type 1 font dictionary.

The attached patch seems to fix this. I have not done much testing with it or done enough analysis of the code to be confident that this works for all cases without causing regressions. But so far it has worked on the PDF files I have tested the patch with.
Comment 14 Werner Lemberg 2011-02-23 08:38:52 UTC
Glad to know that my persistence has helped identify a bug :-)

Thanks for working on this!
Comment 15 Albert Astals Cid 2011-02-24 10:45:03 UTC
Will be in poppler 0.16.3

