http://bugzilla.gnome.org/show_bug.cgi?id=341947 and bug 7002 have attached a
PDF where the font (Minion) has entries of the form A_a in its mapping tables.
Some of these (e.g. f_l, f_f_i) are in Unicode and would be easy to add to
nameToUnicodeTab, but others (T_h, f_f_t, f_j) are not in Unicode and can only
be represented as strings. This will probably require some rearchitecting.
Created attachment 7750
Patch, also fixes bug 8985.
See also bug 9001.
Hm. That patch sucks; it can't handle ligated codepoint references. (e.g.
A_uni030B for A with U+030B COMBINING DOUBLE ACUTE ACCENT.) I'll put together a
better patch when I have time.
Also, we should probably complain to stderr when we can't understand a mapping
table character name. It's only polite.
Idea for improved handling:
Change the two-pass flow (currently recognised names, then numeric codes, writing
into a Unicode array used to initialise CharCodeToUnicode::make8BitToUnicode) to
have a preliminary pass that decides whether to use hex or decimal, and a single
main pass that reads character names, numeric codes and ligatures by passing
each component to a small function which takes the component and a "hex" bool
and writes into uBuf if successful.
This will mean initialising CharCodeToUnicode::make8BitToUnicode with an empty
Unicode array (because the main pass needs to use ::setMapping), but the
pre-filled array is just a performance hack and ::setMapping is fast for the
single-character case anyway.
Hopefully the code will be clearer than the above.
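A rough sketch of the main pass described above, not the actual patch. Only the "component plus hex bool, write into uBuf" shape comes from the comment. The names parseDigits, parseComponent and parseGlyphName, the tiny kNames table, and the rule that a numeric code is one letter followed by digits are assumptions made for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Unicode = std::uint32_t;  // stand-in for the project's Unicode typedef

// Hypothetical miniature of nameToUnicodeTab.
static const std::map<std::string, Unicode> kNames = {
    {"A", 0x41}, {"T", 0x54}, {"f", 0x66}, {"h", 0x68}, {"j", 0x6A}, {"t", 0x74},
};

// Parse s[pos..] as a number in base 10 or 16; reject any stray character.
static bool parseDigits(const std::string &s, std::size_t pos, bool hex, Unicode &out) {
    if (pos >= s.size() || s.size() - pos > 6) return false;
    Unicode v = 0;
    for (std::size_t i = pos; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        Unicode d;
        if (std::isdigit(c)) d = c - '0';
        else if (hex && std::isxdigit(c)) d = std::tolower(c) - 'a' + 10;
        else return false;
        v = v * (hex ? 16 : 10) + d;
    }
    out = v;
    return true;
}

// Handle one component: known name, "uniXXXX", or (assumed) letter + number.
// Appends to uBuf and returns true on success.
static bool parseComponent(const std::string &comp, bool hex, std::vector<Unicode> &uBuf) {
    Unicode u = 0;
    auto it = kNames.find(comp);
    if (it != kNames.end()) {
        u = it->second;
    } else if (comp.size() == 7 && comp.compare(0, 3, "uni") == 0) {
        if (!parseDigits(comp, 3, true, u)) return false;
    } else if (comp.size() >= 2 && std::isalpha(static_cast<unsigned char>(comp[0]))) {
        if (!parseDigits(comp, 1, hex, u)) return false;
    } else {
        return false;
    }
    uBuf.push_back(u);
    return true;
}

// Split a glyph name on '_' and parse every component; all must succeed.
static bool parseGlyphName(const std::string &name, bool hex, std::vector<Unicode> &uBuf) {
    uBuf.clear();
    std::size_t start = 0;
    for (;;) {
        std::size_t us = name.find('_', start);
        std::size_t len = (us == std::string::npos) ? std::string::npos : us - start;
        if (!parseComponent(name.substr(start, len), hex, uBuf)) {
            uBuf.clear();
            return false;
        }
        if (us == std::string::npos) return true;
        start = us + 1;
    }
}
```

With this shape, a plain name, a numeric code and a ligature such as T_h or A_uni030B all go through the same function. A caller would hand a one-element uBuf to the fast single-character path and anything longer to ::setMapping.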
Created attachment 7796
Right, that's better.
Addresses comment 3 and comment 4.
The suggestion in comment 5 can't work exactly as stated, because that runs the
risk of false positives (e.g. "ae" pushing the parser into hex mode where
decimal is intended). The supplied patch avoids any change to behaviour except
in parsing ligatures, also cutting down on patch size.
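To make the false-positive risk concrete, here is a minimal sketch of the kind of hex-detection heuristic a preliminary pass might use. The function name looksLikeHexCode and the exact rule (a two-character name made only of hex digits, with at least one letter) are assumptions for illustration, not the real parser's logic.

```cpp
#include <cctype>
#include <string>

// Assumed heuristic: a two-character name consisting only of hex digits, at
// least one of which is a letter, is taken as evidence of hexadecimal codes.
static bool looksLikeHexCode(const std::string &name) {
    if (name.size() != 2) return false;
    bool sawLetter = false;
    for (unsigned char c : name) {
        if (!std::isxdigit(c)) return false;
        if (std::isalpha(c)) sawLetter = true;
    }
    return sawLetter;
}
```

Under this rule "3f" is flagged as intended, but so is "ae", which is also the ordinary glyph name for U+00E6. A font that contains "ae" alongside decimal codes would be pushed into hex mode by mistake.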
The warnings to stderr should help in future diagnosing why text isn't being
copied properly and in driving future improvements to the mapping parser.
Created attachment 7867
Also handles bug 9128: glyph variants.
Instead of printing to stderr it would probably be better to use the error()
function like the other users in GfxFont.cc. Using error() means that
applications like KPDF can display the error messages in the user interface.
Created attachment 8042
Use error() instead of fprintf.
Created attachment 10380
Hi Ed, sorry for taking so long to react to this. I'm not sure I understand what this patch is about. As I understand it, A_a is something like an "Aa" ligature in the Minion font, and because this "Aa" ligature does not exist in the Unicode standard, our current code has problems rendering those ligatures, and your patch fixes that?
There is an Adobe document that contains the procedure for mapping glyph names to a sequence of Unicode characters to support text extraction. The document specifies some additional forms of glyph naming that the patch does not appear to handle.
@comment 11: Yes, precisely; high-end fonts will often contain ligatures that are not present in Unicode.
@comment 12: Thanks, I didn't know about that specification. I'll work on getting the whole of the spec into the patch.
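For readers following along, the glyph-name-to-Unicode procedure can be sketched roughly as below. This assumes the Adobe document referred to is the Adobe Glyph List Specification and reflects my reading of it: drop any suffix from the first period, split on underscores, then map each component through the glyph list, the "uni" form or the "u" form. The identifiers kGlyphList, mapComponent and glyphNameToUnicode are hypothetical, and the table is a tiny stand-in for the real list.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Unicode = std::uint32_t;

// Hypothetical miniature of the glyph list; the real one has thousands of entries.
static const std::map<std::string, Unicode> kGlyphList = {
    {"A", 0x41}, {"a", 0x61}, {"f", 0x66}, {"i", 0x69},
};

// Parse exactly len uppercase hex digits starting at pos.
static bool parseUpperHex(const std::string &s, std::size_t pos, std::size_t len, Unicode &out) {
    if (pos + len > s.size()) return false;
    Unicode v = 0;
    for (std::size_t i = pos; i < pos + len; ++i) {
        char c = s[i];
        if (c >= '0' && c <= '9') v = v * 16 + (c - '0');
        else if (c >= 'A' && c <= 'F') v = v * 16 + (c - 'A' + 10);
        else return false;
    }
    out = v;
    return true;
}

// True for code points that are not surrogates and are within range.
static bool isScalar(Unicode u) {
    return u <= 0x10FFFF && !(u >= 0xD800 && u <= 0xDFFF);
}

// Map one component; a component that matches no rule contributes nothing.
static void mapComponent(const std::string &comp, std::vector<Unicode> &out) {
    auto it = kGlyphList.find(comp);
    if (it != kGlyphList.end()) { out.push_back(it->second); return; }
    // "uni" followed by one or more groups of four uppercase hex digits.
    if (comp.size() > 3 && comp.compare(0, 3, "uni") == 0 && (comp.size() - 3) % 4 == 0) {
        std::vector<Unicode> tmp;
        for (std::size_t pos = 3; pos < comp.size(); pos += 4) {
            Unicode u = 0;
            if (!parseUpperHex(comp, pos, 4, u) || !isScalar(u)) return;
            tmp.push_back(u);
        }
        out.insert(out.end(), tmp.begin(), tmp.end());
        return;
    }
    // "u" followed by four to six uppercase hex digits.
    if (comp.size() >= 5 && comp.size() <= 7 && comp[0] == 'u') {
        Unicode u = 0;
        if (parseUpperHex(comp, 1, comp.size() - 1, u) && isScalar(u)) out.push_back(u);
    }
}

static std::vector<Unicode> glyphNameToUnicode(const std::string &name) {
    std::vector<Unicode> out;
    const std::string base = name.substr(0, name.find('.'));  // drop ".alt" etc.
    std::size_t start = 0;
    for (;;) {
        std::size_t us = base.find('_', start);
        std::size_t len = (us == std::string::npos) ? std::string::npos : us - start;
        mapComponent(base.substr(start, len), out);
        if (us == std::string::npos) break;
        start = us + 1;
    }
    return out;
}
```

The suffix stripping is what covers glyph variants such as f_f_i.alt, and the multi-group "uni" form lets a single component carry several code points.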
Created attachment 13172
OK, this should work.
The patch is in for the next non-bugfix release.
Created attachment 13303
Supplementary patch to fix issues discussed on-list: