Bug 19487 - Not all the relevant documents are followed when making a ToUnicode map for PDF
Status: RESOLVED NOTABUG
Alias: None
Product: cairo
Classification: Unclassified
Component: pdf backend
Version: 1.8.6
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Adrian Johnson
QA Contact: cairo-bugs mailing list
Reported: 2009-01-09 13:53 UTC by Barry Schwartz
Modified: 2015-10-18 05:10 UTC (History)

Attachments
a example in lua (4.79 KB, application/octet-stream)
2009-01-10 01:17 UTC, Barry Schwartz
Python code (a fontforge script) that I use to make ToUnicode cmaps (123.50 KB, application/octet-stream)
2009-01-10 01:21 UTC, Barry Schwartz

Description Barry Schwartz 2009-01-09 13:53:39 UTC
Cairo should follow the standard glyph-name rules from http://www.adobe.com/devnet/opentype/archives/glyph.html when generating ToUnicode maps. Otherwise PDF readers cannot handle alternate glyphs and ligatures in OpenType fonts.

This is very important for PDF e-books, because otherwise "Find" or "Search" doesn't work with ligatures, small caps, lining or oldstyle figures, etc.

(One might also, optionally, wish to map older glyph names similarly to their newer equivalents, such as treating "Asmall" the same as "a.sc".)

Note that, although a few ligatures (ff, fi, fl, etc.) do have their own code points, newer OpenType fonts (I think including all from Adobe) often don't use them; instead the ligatures f_f, f_i, f_l, etc., are treated like any other ligatures.
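
A minimal sketch of the glyph-name rules referred to above, as they are commonly summarised: drop everything from the first period, split the rest at underscores, and map each component through the Adobe Glyph List, the "uniXXXX" form, or the "uXXXX" to "uXXXXXX" form. This is not the attached script. AGL_SUBSET is a tiny hypothetical stand-in for the real glyph list, and the function names are made up for illustration.

```python
import re

# Hypothetical stand-in for the Adobe Glyph List; the real list is far larger.
AGL_SUBSET = {"a": "a", "f": "f", "i": "i", "l": "l", "one": "1", "fi": "\uFB01"}

_UNI = re.compile(r"uni((?:[0-9A-F]{4})+)")   # "uni" + groups of four hex digits
_U = re.compile(r"u([0-9A-F]{4,6})")          # "u" + four to six hex digits


def _is_scalar(cp):
    """True for code points that are in range and not surrogates."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def component_to_text(component, agl=AGL_SUBSET):
    """Map one underscore-separated component to a string (possibly empty)."""
    if component in agl:
        return agl[component]
    m = _UNI.fullmatch(component)
    if m:
        digits = m.group(1)
        cps = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        if all(_is_scalar(cp) for cp in cps):
            return "".join(chr(cp) for cp in cps)
        return ""
    m = _U.fullmatch(component)
    if m:
        cp = int(m.group(1), 16)
        return chr(cp) if _is_scalar(cp) else ""
    return ""


def glyph_name_to_text(name, agl=AGL_SUBSET):
    """Derive the text for a glyph name such as 'f_f_i' or 'a.sc'."""
    base = name.split(".", 1)[0]          # the suffix after the first period is ignored
    return "".join(component_to_text(c, agl) for c in base.split("_"))
```

With this subset, "f_f_i" maps to the three characters "ffi" and "a.sc" maps to "a". An older name such as "Asmall" maps to nothing unless an extra alias table is added, which is the optional step mentioned above.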
Comment 1 Adrian Johnson 2009-01-09 23:25:37 UTC
It was not clear to me in your bug report whether you are asking for the PDF backend to:

a) embed the Adobe glyph names in the PDF file to facilitate text searching, or

b) parse the Adobe glyph names in the fonts to obtain the unicode to glyph mapping and use this information to create the ToUnicode map.

In the case of a), we don't use glyph names in the PDF file. The ToUnicode map associates each glyph with one or more Unicode characters, using hexadecimal numbers for the glyph index and the Unicode character(s).
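
As a rough illustration of what such hexadecimal entries look like, the sketch below formats a glyph-index-to-text table as bfchar-style lines. It assumes two-byte glyph codes and UTF-16BE destination strings; the function names are hypothetical, the surrounding CMap boilerplate is omitted, and this is not cairo's own code.

```python
def utf16be_hex(text):
    """Hex digits of the UTF-16BE encoding of text, e.g. 'ff' -> '00660066'."""
    return text.encode("utf-16-be").hex().upper()


def bfchar_lines(glyph_to_text):
    """Format {glyph index: text} as '<glyph> <text>' lines, sorted by glyph."""
    return [
        "<%04X> <%s>" % (gid, utf16be_hex(glyph_to_text[gid]))
        for gid in sorted(glyph_to_text)
    ]


if __name__ == "__main__":
    # Glyph 3 is "a"; glyph 5 is an f_f_i ligature that maps to three characters.
    for line in bfchar_lines({3: "a", 5: "ffi"}):
        print(line)
```

A ligature glyph simply gets a longer right-hand side, which is how one glyph can map to several characters without any glyph names appearing in the file.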

In the case of b), the cairo_show_text_glyphs() API is the preferred means of providing the unicode to glyph mapping to the PDF backend. This information is used to generate the ToUnicode entries for the one glyph to one or more unicode character mappings while ActualText is used for the many to many mappings.
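
The division described above can be sketched as follows. The sketch models clusters as (num_bytes, num_glyphs) pairs, mirroring the two fields of cairo_text_cluster_t, and sorts them into single-glyph clusters (candidates for ToUnicode entries) and multi-glyph clusters (candidates for ActualText). It is a simplified model with a hypothetical function name, not cairo's implementation; among other things it ignores the case where one glyph appears with different text in different clusters.

```python
def split_clusters(utf8, glyphs, clusters):
    """Walk text and glyphs in parallel, one cluster at a time.

    utf8     -- the text as UTF-8 bytes
    glyphs   -- list of glyph indices in visual order
    clusters -- list of (num_bytes, num_glyphs) pairs
    Returns (to_unicode, actual_text).
    """
    to_unicode = {}      # glyph index -> text, for one-glyph clusters
    actual_text = []     # (glyph tuple, text), for multi-glyph clusters
    byte_pos = glyph_pos = 0
    for num_bytes, num_glyphs in clusters:
        text = utf8[byte_pos:byte_pos + num_bytes].decode("utf-8")
        cluster_glyphs = glyphs[glyph_pos:glyph_pos + num_glyphs]
        if num_glyphs == 1:
            to_unicode[cluster_glyphs[0]] = text
        else:
            actual_text.append((tuple(cluster_glyphs), text))
        byte_pos += num_bytes
        glyph_pos += num_glyphs
    return to_unicode, actual_text
```

For the word "office" shaped with an f_f_i ligature, the clusters would be (1, 1), (3, 1), (1, 1), (1, 1), and the ligature glyph ends up with the text "ffi" without any reference to its name.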

As a fallback for the case where cairo_show_glyphs() is used, cairo does a reverse lookup of the cmap in the font. However, this only works for 1 to 1 mappings. I don't think it would be worth the effort to extend the current fallback to parse the Adobe glyph names, which would mean including a complete list of glyph names in cairo and dealing with the various names that do not comply with the Adobe glyph naming convention, just to extract the n to 1 ligatures.
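
The limitation of a reverse cmap lookup can be seen in a small sketch. The font's cmap is modelled here as a plain dict from code point to glyph index; the function name and the choice of keeping the lowest code point when several map to the same glyph are assumptions for illustration, not a description of cairo's code.

```python
def reverse_cmap(cmap):
    """Invert {code point: glyph index} into {glyph index: code point}.

    When several code points share a glyph, the lowest one is kept
    (an arbitrary choice made for this sketch).
    """
    glyph_to_cp = {}
    for cp in sorted(cmap):
        glyph_to_cp.setdefault(cmap[cp], cp)
    return glyph_to_cp


if __name__ == "__main__":
    # f -> glyph 10, i -> glyph 11. Glyph 13 is an f_i ligature that is only
    # reachable through OpenType substitution, so the cmap never mentions it.
    table = reverse_cmap({0x66: 10, 0x69: 11})
    print(table.get(10))   # code point of "f"
    print(table.get(13))   # nothing to find: the ligature has no cmap entry
```

A glyph reached only through substitution has no entry to invert, and even if it had one, a single code point could not express the two characters it stands for.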

I do want cairo to be able to easily generate searchable PDF files. If you can provide any more relevant information such as what application you are using, what PDF viewer you are using and a sample PDF file we can try to help you resolve this issue.
Comment 2 Barry Schwartz 2009-01-10 00:16:27 UTC
I am asking for (b). The PDF reader is Adobe Reader 8 (or 9 if I run it under Wine), and I am using OpenType Latin fonts. There is no specific application; I was trying out Cairo to decide whether to use it myself; I make e-books and want searches to work. Also I occasionally make fonts and do trouble myself to name all the glyphs according to the rules at http://www.adobe.com/devnet/opentype/archives/glyph.html and with Cairo all that care goes to waste.

If the algorithm at the web page is followed, then regardless of all OpenType substitutions, no matter how intricate, Reader, and every other application I have used (okular, evince, pdftotext, etc.), can search and extract text. It's like magic.

Looking over http://www.adobe.com/devnet/acrobat/pdfs/5411.ToUnicode.pdf it appears to me that this is, implicitly, a recommendation for ToUnicode maps of CJK fonts.

See also bullet point one in the "Extraction of Text Content" section (sect. 5.9) of the PDF Reference. No ActualText is needed, at least for recent Latin-Greek-Cyrillic OpenType fonts.
Comment 3 Barry Schwartz 2009-01-10 01:17:48 UTC
Created attachment 21858
a example in lua
Comment 4 Barry Schwartz 2009-01-10 01:21:25 UTC
Created attachment 21859
Python code (a fontforge script) that I use to make ToUnicode cmaps

This example isn't complete: it doesn't handle "u000000" glyph names and doesn't use ranges to optimize the ToUnicode (unless I added that ability and don't remember doing it!).
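
The range optimisation mentioned here could look roughly like the sketch below, which merges runs of consecutive glyph indices that map to consecutive code points. It is not taken from the attached script. It assumes two-byte glyph codes and single BMP code points, and it conservatively refuses to let a run cross a 256 boundary on either side; the function names are hypothetical.

```python
def to_unicode_runs(glyph_to_cp):
    """Group {glyph index: BMP code point} into (first_gid, last_gid, first_cp) runs."""
    runs = []
    for gid in sorted(glyph_to_cp):
        cp = glyph_to_cp[gid]
        if runs:
            first_gid, last_gid, first_cp = runs[-1]
            extends = (
                gid == last_gid + 1                       # glyph indices are adjacent
                and cp == first_cp + (gid - first_gid)    # code points advance in step
                and gid >> 8 == first_gid >> 8            # source stays within one high byte
                and cp >> 8 == first_cp >> 8              # destination low byte does not overflow
            )
            if extends:
                runs[-1] = (first_gid, gid, first_cp)
                continue
        runs.append((gid, gid, cp))
    return runs


def format_runs(runs):
    """Split runs into bfchar-style and bfrange-style lines."""
    bfchar, bfrange = [], []
    for first_gid, last_gid, first_cp in runs:
        if first_gid == last_gid:
            bfchar.append("<%04X> <%04X>" % (first_gid, first_cp))
        else:
            bfrange.append("<%04X> <%04X> <%04X>" % (first_gid, last_gid, first_cp))
    return bfchar, bfrange
```

Glyphs 3, 4, 5 mapping to A, B, C collapse into one range line, while an isolated glyph stays a single-character entry.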
Comment 5 Barry Schwartz 2009-01-10 01:23:32 UTC
Sorry for making a mess of the Bugzilla. I don't use it very often.

I've attached some Lua code and Python code examples. The Python code (a fontforge script) doesn't handle the "u000000" cases.
Comment 6 Adrian Johnson 2009-01-10 02:08:34 UTC
Are you calling cairo_show_text() or cairo_show_glyphs()? Is there any reason you can not use cairo_show_text_glyphs()?

I am still not convinced that cairo should parse all the glyph names. The list of glyph names in your fontforge script is over 100KB. That is a huge amount of bloat to add to cairo just for applications that do not want to use cairo_show_text_glyphs().
Comment 7 Barry Schwartz 2009-01-10 02:31:51 UTC
(In reply to comment #6)
> Are you calling cairo_show_text() or cairo_show_glyphs()? Is there any reason
> you can not use cairo_show_text_glyphs()?
> 
> I am still not convinced that cairo should parse all the glyph names. The list
> of glyph names in your fontforge script is over 100KB. That is a huge amount of
> bloat to add to cairo just for applications that do not want to use
> cairo_show_text_glyphs().
> 

Then what about the ability for an application to supply a ToUnicode of its own? This wouldn't cause any bloat in the library; however, application authors would have to do more work (though likely much less work than with ActualText).

Also the table could be converted to a trie and maybe wouldn't take up nearly so much space. In C code that would make a lot of sense. It's how TeX compresses hyphenation dictionaries.
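
To make the trie idea concrete, here is a minimal sketch using nested dicts, where names that share a prefix ("a", "aacute", "ae") share nodes. It only shows the lookup structure; a C version would pack the nodes into a static array or string, and the function names and the use of the empty string as an end-of-name key are choices made for this illustration.

```python
def build_trie(mapping):
    """Build a character trie from {glyph name: text}."""
    root = {}
    for name, text in mapping.items():
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node[""] = text      # the empty key holds the payload for a complete name
    return root


def trie_lookup(root, name):
    """Return the text stored for name, or None if the name is not in the trie."""
    node = root
    for ch in name:
        node = node.get(ch)
        if node is None:
            return None
    return node.get("")


if __name__ == "__main__":
    trie = build_trie({"a": "a", "aacute": "\u00e1", "ae": "\u00e6"})
    print(trie_lookup(trie, "aacute"))
    print(trie_lookup(trie, "aa"))     # a prefix only, not a complete name
```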
Comment 8 Adrian Johnson 2009-01-10 03:06:32 UTC
(In reply to comment #7)
> Then what about the ability for an application to supply a ToUnicode of its
> own? This wouldn't cause any bloat in the library; however, application authors
> would have to do more work (though likely much less work than with ActualText).

That's what the cairo_show_text_glyphs() API function does.
Comment 9 Barry Schwartz 2009-01-11 02:20:02 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > Then what about the ability for an application to supply a ToUnicode of its
> > own? This wouldn't cause any bloat in the library; however, application authors
> > would have to do more work (though likely much less work than with ActualText).
> 
> That's what the cairo_show_text_glyphs() API function does.
> 


I've fitted a lookup trie into a string 126023 bytes long; for the price of those 126023 bytes, plus a few exception cases, and some tiny bits of C code yet to be written, one can have smaller PDFs, considerably simpler application code, and text extraction for all applications when using modern LGC fonts.

Admittedly, most Cairo apps aren't going to use it for typesetting efficient e-books.

Alright, well, I can patch my own or use something else. Thanks for considering.
Comment 10 Chris Wilson 2009-01-11 02:47:07 UTC
(In reply to comment #9)
> I've fitted a lookup trie into a string 126023 bytes long; for the price of
> those 126023 bytes, plus a few exception cases, and some tiny bits of C code
> yet to be written, one can have smaller PDFs, considerably simpler application
> code, and text extraction for all applications when using modern LGC fonts.

That's interesting.

From Cairo's point of view, we need to keep the library as simple as possible, but no simpler. That is, we must provide enough flexibility for you to create the PDF you want, whilst keeping the API sane and minimal. (Admittedly Cairo is not there yet, but Adrian has grand plans. :-)

So the common response has been (for the other backends) that Cairo provides the primitives, and if you have something useful that builds on top of those primitives then that should go into a cairo-[pdf-]utils. For the same reasons Cairo has strongly resisted becoming an image library, a geometry library, a matrix library etc.

So please do write up the routines you have suggested and propose them to the mailing list - that will open up the discussion beyond the few that read bugzilla.
Comment 11 Barry Schwartz 2009-01-12 02:02:17 UTC
(In reply to comment #10)
> So please do write up the routines you have suggested and propose them to the
> mailing list - that will open up the discussion beyond the few that read
> bugzilla.

I might write a patch and mail it to the list for people to do with as they wish. Being disabled and having to ration my work I might not do any followup, though.

BTW I have the lookup trie down to a 51300-byte string, which I _think_ is actually within the latest standard max object size for C programs. :) I'm not sure how to get it compiled into a program, though. In any case maybe they'll find a use for this in ConTeXt/LuaTeX.
Comment 12 Behdad Esfahbod 2009-01-12 03:54:40 UTC
Barry, you're simply ignoring what Adrian's been saying.  cairo_show_text_glyphs() is the right API to use and *that* works like magic (considering that pango automatically uses it).  Hacks based on glyph names are hacks, and don't work for most complex scripts.  Using the word magic to describe something about the Latin script is in fact very funny.  You should try Indic...
Comment 13 Barry Schwartz 2009-01-13 12:01:34 UTC
(In reply to comment #12)
> Barry, you're simply ignoring what Adrian's been saying. 
> cairo_show_text_glyphs() is the right API to use and *that* works like magic
> (considering that pango automatically uses it).  Hacks based on glyph names are
> hacks, and don't work for most complex scripts.  Using the word magic to
> describe something about the Latin script is in fact very funny.  You should
> try Indic...
> 

I’ve been very careful to restrict what I say to Latin and maybe Greek and Cyrillic; I haven't had any involvement with scripts other than Latin and Greek since Hebrew school, in a time when ‘computers’ were made of small stones and depressions in the earth. But it is true anyway that Adrian is right and probably ought to avoid putting in this special hack. Really what I’m frustrated about is that the free software applications that are out there do a miserable job with LGC OpenType fonts, but recruiting back-end writers to second-guess application authors isn't fair.

(It hurts especially when the fonts being handled miserably are one’s own.)
Comment 14 Behdad Esfahbod 2009-01-14 11:13:38 UTC
It's the app's fault for not using a real text rendering engine.  Pango is one, Qt has another, ICU too.
Comment 15 Barry Schwartz 2009-01-14 12:16:28 UTC
(In reply to comment #14)
> It's the app's fault for not using a real text rendering engine.  Pango is one,
> Qt has another, ICU too.
> 

Well, I think XeTeX uses ICU, and XeTeX does a good job, though it is too much of a black box for my taste. I think the LuaTeX approach is more ideal.
Comment 16 Adrian Johnson 2015-10-18 05:10:39 UTC
As mentioned in the comments, cairo_show_text_glyphs() is the function to use for populating the ToUnicode map.

