http://bugzilla.gnome.org/show_bug.cgi?id=341947 and bug 7002 have attached a PDF where the font (Minion) has entries of the form A_a in its mapping tables. Some of these (e.g. f_l, f_f_i) are in Unicode and would be easy to add to nameToUnicodeTab, but others (T_h, f_f_t, f_j) are not in Unicode and can only be represented as strings. This will probably require some rearchitecting.
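To illustrate the kind of rearchitecting this implies, here is a minimal sketch of an 8-bit table that can hold either one code point or a string of code points per character code. The class name `ToUnicodeTable`, its members, and the `Unicode` typedef are hypothetical stand-ins for illustration only; this is not poppler's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

typedef unsigned int Unicode; // assumption: one code point per value

// Hypothetical 8-bit mapping table. The fast path stores a single code
// point per char code; ligatures such as T_h, which have no single
// Unicode code point, go into a side table of strings.
class ToUnicodeTable {
public:
    ToUnicodeTable() : single_(256, 0) {}

    // Store a mapping of any length for one character code.
    void setMapping(unsigned code, const std::vector<Unicode> &u) {
        assert(code < 256);
        multi_.erase(code);
        single_[code] = 0;
        if (u.size() == 1) {
            single_[code] = u[0];   // single-character case
        } else if (!u.empty()) {
            multi_[code] = u;       // multi-character (ligature) case
        }
    }

    // Return the mapped sequence, or an empty vector if unmapped.
    std::vector<Unicode> lookup(unsigned code) const {
        assert(code < 256);
        std::map<unsigned, std::vector<Unicode> >::const_iterator it =
            multi_.find(code);
        if (it != multi_.end()) {
            return it->second;
        }
        if (single_[code] != 0) {
            return std::vector<Unicode>(1, single_[code]);
        }
        return std::vector<Unicode>();
    }

private:
    std::vector<Unicode> single_;
    std::map<unsigned, std::vector<Unicode> > multi_;
};
```

With this shape, f_l can still map to the single code point U+FB02, while T_h maps to the two-element sequence U+0054 U+0068.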
Created attachment 7750 mapping_tables.patch Patch, also fixes bug 8985.
See also bug 9001.
Hm. That patch sucks; it can't handle ligated codepoint references (e.g. A_uni030B for A with U+030B COMBINING DOUBLE ACUTE ACCENT). I'll put together a better patch when I have time.
Also, we should probably complain to stderr when we can't understand a mapping table character name. It's only polite.
Idea for improved handling: Change the 2-pass flow (currently, recognised names then numeric codes, writing into a Unicode[256] for initialising in CharCodeToUnicode::make8BitToUnicode) to have a preliminary pass deciding whether to use hex or decimal, and a single main pass which reads character names, numeric codes and ligatures, by passing components to a small function which takes the component and "hex" bool and writes into uBuf if successful. This will mean initialising CharCodeToUnicode::make8BitToUnicode with an empty Unicode[256] (because the main pass needs to use ::setMapping), but that's just a performance hack and ::setMapping is fast for the single-character case anyway. Hopefully the code will be clearer than the above.
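A minimal sketch of the single main pass described above, under simplifying assumptions: numeric codes are taken to be bare digit strings (the real parser accepts more forms), the name table is a three-entry stand-in, and `parseComponent`, `parseCharName` and `kNameTable` are hypothetical names rather than anything in poppler.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef unsigned int Unicode; // assumption: one code point per value

// Tiny stand-in for the real name table (hypothetical contents).
static const std::map<std::string, Unicode> kNameTable = {
    {"A", 0x0041}, {"f", 0x0066}, {"i", 0x0069},
};

// Parse a bare numeric code as hex or decimal, depending on `hex`.
static bool parseNumber(const std::string &s, bool hex, Unicode *out) {
    if (s.empty() || s.size() > 6) {
        return false; // the length cap keeps the value from overflowing
    }
    Unicode v = 0;
    for (char c : s) {
        unsigned d;
        if (c >= '0' && c <= '9') {
            d = static_cast<unsigned>(c - '0');
        } else if (hex && c >= 'a' && c <= 'f') {
            d = static_cast<unsigned>(c - 'a') + 10;
        } else if (hex && c >= 'A' && c <= 'F') {
            d = static_cast<unsigned>(c - 'A') + 10;
        } else {
            return false;
        }
        v = v * (hex ? 16u : 10u) + d;
    }
    *out = v;
    return true;
}

// Takes one component and the "hex" flag; appends to uBuf on success.
// Recognised names win over numeric interpretation.
bool parseComponent(const std::string &comp, bool hex,
                    std::vector<Unicode> &uBuf) {
    std::map<std::string, Unicode>::const_iterator it = kNameTable.find(comp);
    if (it != kNameTable.end()) {
        uBuf.push_back(it->second);
        return true;
    }
    Unicode u;
    if (parseNumber(comp, hex, &u)) {
        uBuf.push_back(u);
        return true;
    }
    return false;
}

// Main pass for one character name: split on '_' and parse each part.
// uBuf is only modified if every component is understood.
bool parseCharName(const std::string &name, bool hex,
                   std::vector<Unicode> &uBuf) {
    std::vector<Unicode> tmp;
    std::size_t start = 0;
    for (;;) {
        std::size_t end = name.find('_', start);
        std::string comp = name.substr(
            start, end == std::string::npos ? std::string::npos : end - start);
        if (!parseComponent(comp, hex, tmp)) {
            return false;
        }
        if (end == std::string::npos) {
            break;
        }
        start = end + 1;
    }
    uBuf.insert(uBuf.end(), tmp.begin(), tmp.end());
    return true;
}
```

A caller would then pass the contents of uBuf to the table's mapping setter for the current character code, and could emit a warning whenever `parseCharName` returns false.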
Created attachment 7796 mapping_tables.patch Right, that's better. Addresses comment 3 and comment 4. The suggestion in comment 5 can't work exactly as stated, because that runs the risk of false positives (e.g. "ae" pushing the parser into hex mode where decimal is intended). The supplied patch avoids any change to behaviour except in parsing ligatures, also cutting down on patch size. The warnings to stderr should help in future diagnosing why text isn't being copied properly and in driving future improvements to the mapping parser.
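To make the false-positive risk concrete, here is a deliberately naive detector of the kind a preliminary pass might use. The function `looksLikeHexCode` is hypothetical and is not taken from the patch; it only shows why a glyph name such as "ae" is ambiguous.

```cpp
#include <string>

// Naive, hypothetical check: treat a name as a hex code if every
// character is a hex digit and at least one of them is a letter.
bool looksLikeHexCode(const std::string &name) {
    if (name.empty()) {
        return false;
    }
    bool sawLetter = false;
    for (char c : name) {
        bool digit = (c >= '0' && c <= '9');
        bool letter = (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
        if (!digit && !letter) {
            return false;
        }
        sawLetter = sawLetter || letter;
    }
    return sawLetter;
}
```

Under this check the ordinary glyph name "ae" is indistinguishable from the hex code 0xAE, so a font that uses decimal codes alongside a glyph named "ae" would be switched into hex mode by mistake.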
Created attachment 7867 mapping_tables.patch Updated patch. Also handles bug 9128: glyph variants.
Instead of printing to stderr it would probably be better to use the error() function like the other users in GfxFont.cc. Using error() means that applications like KPDF can display the error messages in the user interface.
Created attachment 8042 mapping_tables.patch Use error() instead of fprintf.
Created attachment 10380 mapping_tables.patch for 0.5.9
Hi Ed, sorry for taking too long to react to this. I'm not sure I understand what this patch is about. I understand that A_a is something like an "Aa" ligature in the Minion font, and since this "Aa" ligature does not "exist" in the Unicode standard, our current code has problems rendering those ligatures, and your patch fixes that?
There is an Adobe document [1] that contains the procedure for mapping glyph names to a sequence of Unicode characters to support text extraction. The document specifies some additional forms of glyph naming that the patch does not appear to handle. [1] http://www.adobe.com/devnet/opentype/archives/glyph.html
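A rough sketch of the procedure that document describes, as commonly summarised: drop everything from the first '.', split the rest on '_', then map each component through the glyph list, the "uni" form (groups of four uppercase hex digits), or the "u" form (four to six uppercase hex digits). The glyph list here is a tiny hypothetical stand-in, the function names are made up for illustration, and details such as the separate ZapfDingbats list are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef unsigned int Unicode; // assumption: one code point per value

// Tiny stand-in for the Adobe Glyph List (hypothetical contents).
static const std::map<std::string, Unicode> kGlyphList = {
    {"A", 0x0041}, {"T", 0x0054}, {"f", 0x0066},
    {"h", 0x0068}, {"i", 0x0069},
};

// Parse `len` uppercase hex digits of `s` starting at `pos`.
static bool parseUpperHex(const std::string &s, std::size_t pos,
                          std::size_t len, Unicode *out) {
    assert(pos + len <= s.size());
    Unicode v = 0;
    for (std::size_t i = pos; i < pos + len; ++i) {
        char c = s[i];
        if (c >= '0' && c <= '9') {
            v = v * 16 + static_cast<unsigned>(c - '0');
        } else if (c >= 'A' && c <= 'F') {
            v = v * 16 + static_cast<unsigned>(c - 'A') + 10;
        } else {
            return false;
        }
    }
    *out = v;
    return true;
}

// Valid code point that is not a surrogate.
static bool isScalarValue(Unicode u) {
    return u <= 0x10FFFF && !(u >= 0xD800 && u <= 0xDFFF);
}

// Map one component; unknown components contribute nothing.
static void mapComponent(const std::string &c, std::vector<Unicode> &out) {
    std::map<std::string, Unicode>::const_iterator it = kGlyphList.find(c);
    if (it != kGlyphList.end()) {
        out.push_back(it->second);
        return;
    }
    // "uni" + one or more groups of four uppercase hex digits.
    if (c.size() > 3 && c.compare(0, 3, "uni") == 0 &&
        (c.size() - 3) % 4 == 0) {
        std::vector<Unicode> tmp;
        for (std::size_t i = 3; i < c.size(); i += 4) {
            Unicode u;
            if (!parseUpperHex(c, i, 4, &u) || !isScalarValue(u)) {
                return; // malformed: component maps to nothing
            }
            tmp.push_back(u);
        }
        out.insert(out.end(), tmp.begin(), tmp.end());
        return;
    }
    // "u" + four to six uppercase hex digits.
    if (c.size() >= 5 && c.size() <= 7 && c[0] == 'u') {
        Unicode u;
        if (parseUpperHex(c, 1, c.size() - 1, &u) && isScalarValue(u)) {
            out.push_back(u);
        }
    }
}

// Glyph name -> sequence of Unicode code points.
std::vector<Unicode> glyphNameToUnicode(const std::string &name) {
    std::vector<Unicode> out;
    std::string base = name.substr(0, name.find('.')); // drop ".suffix"
    std::size_t start = 0;
    for (;;) {
        std::size_t end = base.find('_', start);
        mapComponent(base.substr(start, end == std::string::npos
                                            ? std::string::npos
                                            : end - start),
                     out);
        if (end == std::string::npos) {
            break;
        }
        start = end + 1;
    }
    return out;
}
```

For example, "A_uni030B" yields U+0041 U+030B, and "f_f_i.alt" yields U+0066 U+0066 U+0069 because the ".alt" variant suffix is dropped before splitting.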
@comment 11: Yes, precisely; high-end fonts will often contain ligatures that are not present in Unicode. @comment 12: Thanks, I didn't know about that specification. I'll work on getting the whole of the spec into the patch.
Great :-)
Created attachment 13172 mapping_tables.patch OK, this should work.
Patch is in for the next non-bugfix release.
Created attachment 13303 mapping_tables_r2.patch Supplementary patch to fix issues discussed on-list: http://lists.freedesktop.org/archives/poppler/2007-December/003236.html http://lists.freedesktop.org/archives/poppler/2007-December/003238.html