Bug 91286

Summary: [patch] to fix Syntax Warning: Could not parse ligature component "BE" of "S_BE" in parseCharName
Product: poppler Reporter: William Bader <williambader>
Component: general Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED MOVED QA Contact:
Severity: normal    
Priority: medium CC: cwolfe, williambader
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
Attachments: patch to add three glyph name mappings
Sample PDF that gets Could not parse ligature component "BE" of "S_BE" in parseCharName

Description William Bader 2015-07-09 21:11:15 UTC
Created attachment 117019
patch to add three glyph name mappings

I have a PDF that I can post privately that gets the warning
  Syntax Warning: Could not parse ligature component "BE" of "S_BE" in parseCharName
multiple times when processed by poppler.
pdfinfo shows that the "Producer" is "Scribus PDF Library 1.4.3.svn".

I think that underscores are delimiters for splitting glyph names into components, but the "liberation-fonts" package has a few glyph names that contain underscores.  I found a comment on a bug report that I have included below.

The attached patch adds mappings for those glyphs.

I am using Fedora 20 Linux on x86_64. Neither gs 9.16 nor acroread 9.5.5 shows warnings when opening the PDF.

William

https://bugzilla.redhat.com/show_bug.cgi?id=1009650
"The second bug concerns the serbian be, pe and te.  They are named S_BE, S_PE and S_TE. Applications that rely on AGL-conformant glyph names to identify the original codepoints read these names as ligatures of a letter S and a letter BE (or PE or TE). As no BE, PE or TE named glyphs exist in the font, no mapping can be found and copying will fail.  To fix this, the glyphs should be renamed to uni0431.srb, uni043F.srb and uni0442.srb (or any other suffix)."
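To illustrate the point made in the quoted report, here is a minimal sketch of AGL-style name mapping. It is not poppler's code: the table `kGlyphTable` and the functions `mapComponent` and `mapGlyphName` are hypothetical, and the `uniXXXX` handling is simplified to a single group of four uppercase hex digits. It shows why "S_BE" cannot be resolved while a name such as "uni0431.srb" can.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a glyph list; a real one has thousands of entries.
static const std::map<std::string, char32_t> kGlyphTable = {
    {"S", 0x0053}, {"f", 0x0066}, {"i", 0x0069},
};

// Map one component: table lookup first, then the "uniXXXX" form
// (simplified here to exactly four uppercase hex digits).
bool mapComponent(const std::string &comp, char32_t &cp) {
    auto it = kGlyphTable.find(comp);
    if (it != kGlyphTable.end()) {
        cp = it->second;
        return true;
    }
    if (comp.size() == 7 && comp.compare(0, 3, "uni") == 0) {
        char32_t value = 0;
        for (std::string::size_type i = 3; i < 7; ++i) {
            char ch = comp[i];
            int digit;
            if (ch >= '0' && ch <= '9') digit = ch - '0';
            else if (ch >= 'A' && ch <= 'F') digit = ch - 'A' + 10;
            else return false;
            value = value * 16 + static_cast<char32_t>(digit);
        }
        cp = value;
        return true;
    }
    return false;
}

// Drop everything from the first '.', split the rest at '_', and map each
// component. Fails as soon as one component cannot be mapped.
bool mapGlyphName(std::string name, std::u32string &out) {
    out.clear();
    std::string::size_type dot = name.find('.');
    if (dot != std::string::npos) name.erase(dot);
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type us = name.find('_', start);
        std::string comp = name.substr(
            start, us == std::string::npos ? std::string::npos : us - start);
        char32_t cp;
        if (!mapComponent(comp, cp)) {
            out.clear();
            return false;
        }
        out.push_back(cp);
        if (us == std::string::npos) break;
        start = us + 1;
    }
    return true;
}
```

With this sketch, `mapGlyphName("S_BE", out)` fails because no component named "BE" exists, whereas `mapGlyphName("uni0431.srb", out)` yields U+0431 once the ".srb" suffix is dropped.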

https://github.com/adobe-type-tools/agl-specification
Adobe Glyph List Specification
Comment 1 Albert Astals Cid 2015-07-09 21:23:18 UTC
Can we have that pdf?
Comment 2 William Bader 2015-07-10 14:48:57 UTC
Created attachment 117031
Sample PDF that gets Could not parse ligature component "BE" of "S_BE" in parseCharName

atril, pdftops, pdftoppm, pdftocairo and others show the diagnostic below multiple times when processing this file.

Syntax Warning: Could not parse ligature component "BE" of "S_BE" in parseCharName
Comment 3 William Bader 2015-07-10 14:53:06 UTC
The PDF was made with Scribus 1.4.2.dfsg+r18267-1ubuntu2 on Ubuntu 14.04 LTS.
Comment 4 Albert Astals Cid 2015-07-10 20:14:29 UTC
What's the patch useful for?
Comment 5 William Bader 2015-07-10 20:51:16 UTC
1) The patch stops a stream of cryptic warnings from poppler utilities and from viewers like evince that use libpoppler when opening a PDF that uses a font package included in the LTS release of a major Linux distribution.

2) Presumably the patch makes those three glyph names accessible.

Acrobat Reader does not complain about the PDF.

The question is whether you want poppler to be lenient or strict about processing glyph names that are valid but don't conform to naming recommendations.

William
Comment 6 Albert Astals Cid 2015-09-01 22:42:15 UTC
What do you mean by "makes those three glyph names accessible"?
Comment 7 William Bader 2015-09-02 00:10:41 UTC
poppler follows a convention (which is not a required rule) of splitting glyph names at underscores to handle ligatures; see parseCharName() in GfxFont.cc.

  // Step 2: split the remaining string into a sequence of components, using
  // underscore (U+005F LOW LINE) as the delimiter.
  if (ligatures && strchr(charName, '_')) {
    // parse names of the form A_a (e.g. f_i, T_h, l_quotesingle)


If a glyph name with an embedded underscore is not in nameToUnicodeTextTab[], poppler will split it at the underscore and won't be able to find it.
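The lookup order described above can be sketched as follows. This is an illustration under stated assumptions, not poppler's implementation: `kNameToUnicode`, `lookupName` and `glyphNameToUnicode` are hypothetical names, and the three S_* code points are taken from the Red Hat report quoted in the description (U+0431, U+043F, U+0442), since the patch contents are not reproduced in this thread.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical miniature of a name-to-Unicode table. The three S_* entries
// are assumed from the code points given in the Red Hat report.
static const std::map<std::string, char32_t> kNameToUnicode = {
    {"S", 0x0053}, {"T", 0x0054}, {"h", 0x0068},
    {"S_BE", 0x0431}, {"S_PE", 0x043F}, {"S_TE", 0x0442},
};

// Look up a complete name (or a single component) in the table.
bool lookupName(const std::string &name, char32_t &cp) {
    auto it = kNameToUnicode.find(name);
    if (it == kNameToUnicode.end()) return false;
    cp = it->second;
    return true;
}

// Try the whole name first; split at underscores only when that fails.
bool glyphNameToUnicode(const std::string &name, std::u32string &out) {
    out.clear();
    char32_t cp;
    if (lookupName(name, cp)) {  // whole-name hit, e.g. "S_BE"
        out.push_back(cp);
        return true;
    }
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type us = name.find('_', start);
        std::string comp = name.substr(
            start, us == std::string::npos ? std::string::npos : us - start);
        if (!lookupName(comp, cp)) {  // unknown component: give up
            out.clear();
            return false;
        }
        out.push_back(cp);
        if (us == std::string::npos) break;
        start = us + 1;
    }
    return true;
}
```

In this sketch "S_BE" resolves to U+0431 through the whole-name entry, "T_h" still resolves as a two-component ligature, and a name such as "S_XX" fails because "XX" is in neither form. Without the three S_* entries, "S_BE" would fall through to the split and fail on "BE", which corresponds to the warning shown in this report.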

When poppler-based applications run on the attached PDF (which has a glyph named "S_BE"), they display

Syntax Warning: Could not parse ligature component "BE" of "S_BE" in parseCharName

because parseCharName() treats "S_BE" as a ligature of two components, "S" and "BE", and there is no glyph called "BE": the glyph's full name is "S_BE", with the underscore embedded.

When parseCharName() prints that syntax warning, it has failed to parse the glyph name, and it has placed either nothing or the wrong value in uBuf[].

That is what I meant by saying that "S_BE" is inaccessible.

The comment for parseCharName() says

// This function is in part a derived work of the Adobe Glyph Mapping
// Convention: http://www.adobe.com/devnet/opentype/archives/glyph.html
// Algorithmic comments are excerpted from that document to aid
// maintainability.

but Acrobat displays the attached file without showing error messages, so Adobe's document (the basis of the code in parseCharName()) does not fully describe how Acrobat works.

Very few fonts have glyph names with embedded underscores because it violates Adobe's recommendations.  The attached patch includes three of them that are relatively widespread.
Comment 8 GitLab Migration User 2018-08-20 21:52:24 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/102.
