Summary: | Incorrect extraction of /ToUnicode CMaps for ligatures. | ||
---|---|---|---|
Product: | poppler | Reporter: | Vasile Gaburici <gaburici> |
Component: | general | Assignee: | poppler-bugs <poppler-bugs> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | normal | ||
Priority: | medium | ||
Version: | unspecified | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
Attachments: |
Test document.
Acrobat text extraction (search results).
Fixes the CMap ligature search bug.
Fixes the CMap ligature search bug.
For debugging only. Do not commit! |
Created attachment 18541 [details]
Acrobat text extraction (search results).
pdftotext (and evince) extract the text as "This A.", which is incorrect.
I found the bug. The ligatures are being set twice in the sMap array: once by poppler's built-in "smart" algorithm, and then by the CMap. According to the comments in GfxFont.cc, the CMap should take precedence. But it doesn't, not for ligatures! The bug is that CMap ligatures get added at the end of the sMap array, but lookup happens linearly from the front. E.g. the Th ligature (char 0) is first set to 00540068 (Th) at index 0 in sMap by the built-in algorithm, and then to 00410068 (Ah) at index 1 by the CMap. But lookup returns the entry at index 0. The fix is to scan the sMap backwards.

$ pdftotext liga-cmap-bug.pdf
Setting @0 00[0] -> 0054
Setting @0 00[1] -> 0068
Adding @1 00: 00410068 + 0
Adding @2 02: 00660066006A + 0
Adding @3 03: 01620068 + 0
Adding @4 0B: 00660066 + 0
Adding @5 0C: 00660069 + 0
Adding @6 0D: 0066006C + 0
Adding @7 0E: 006600660069 + 0
Adding @8 0F: 00660066006C + 0
Adding @9 9C: 0049004A + 0
Adding @10 A0: 0066006A + 0
Adding @11 BC: 0069006A + 0
Returning @0 00[0] -> 0054
Returning @0 00[1] -> 0068

Created attachment 18546 [details] [review]
Fixes the CMap ligature search bug.

I fixed it. Two patches coming up; one is the one-line fix. The other is the debugging patch, in case you want to validate this with other PDFs before committing.

$ pdftotext liga-cmap-bug.pdf
Setting @0 00[0] -> 0054
Setting @0 00[1] -> 0068
Adding @1 00: 00410068 + 0
Adding @2 02: 00660066006A + 0
Adding @3 03: 01620068 + 0
Adding @4 0B: 00660066 + 0
Adding @5 0C: 00660069 + 0
Adding @6 0D: 0066006C + 0
Adding @7 0E: 006600660069 + 0
Adding @8 0F: 00660066006C + 0
Adding @9 9C: 0049004A + 0
Adding @10 A0: 0066006A + 0
Adding @11 BC: 0069006A + 0
Returning @1 00[0] -> 0041
Returning @1 00[1] -> 0068
$ cat liga-cmap-bug.txt
Ahis A.

Created attachment 18547 [details] [review]
Fixes the CMap ligature search bug.
Created attachment 18548 [details] [review]
For debugging only. Do not commit!

Here's the debugging patch, if you need it...

Thanks for the patch, just for the record, do you allow to relicense it under GPL2 or later?

(In reply to comment #6)
> Thanks for the patch, just for the record, do you allow to relicense it under
> GPL2 or later?

Yes.

Will be released with poppler 0.9.0. Thanks for the patch and sorry for the late answer; keep patches coming. |
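
For anyone reading this later, the lookup-order problem discussed in this report can be reproduced in a few lines. This is only a minimal sketch, not poppler's code: `LigatureEntry`, `lookupForward` and `lookupBackward` are hypothetical names made up for illustration, and the table is a simplified stand-in for the sMap array. It assumes, as described above, that the built-in guess is stored first and the CMap entry for the same character code is appended afterwards.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one multi-character (ligature) mapping entry.
struct LigatureEntry {
    std::uint32_t code;      // character code in the font
    std::u16string unicode;  // Unicode text the code should extract to
};

// Scans from the front: the first entry added for a code wins,
// so a later CMap entry for the same code is never reached.
const std::u16string *lookupForward(const std::vector<LigatureEntry> &table,
                                    std::uint32_t code) {
    for (const auto &entry : table) {
        if (entry.code == code) {
            return &entry.unicode;
        }
    }
    return nullptr;  // no multi-character mapping for this code
}

// Scans from the back: the most recently added entry wins,
// which gives appended CMap entries precedence over earlier guesses.
const std::u16string *lookupBackward(const std::vector<LigatureEntry> &table,
                                     std::uint32_t code) {
    for (auto it = table.rbegin(); it != table.rend(); ++it) {
        if (it->code == code) {
            return &it->unicode;
        }
    }
    return nullptr;  // no multi-character mapping for this code
}
```

With a table holding `{0x00, u"Th"}` (the built-in guess) followed by `{0x00, u"Ah"}` (the CMap entry), the forward scan yields "Th" and the backward scan yields "Ah", matching the before and after debug output in this report.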
Created attachment 18540 [details]
Test document.
Attached is a PDF with a CMap that maps T to A and the ligature Th to Ah (as separate characters). If you search it with Acrobat for "A", you find, as you'd expect, two As. If you search it with evince, or extract the text with pdftotext (from poppler-utils), you only find the 2nd CMapped A; the first one is extracted, incorrectly, as T. This example is contrived for the sake of keeping it simple and restricted to English letters, but there are good reasons to want ligatures in CMaps to work properly in poppler, as they do in Acrobat.