Bug 17321 - Incorrect extraction of /ToUnicode CMaps for ligatures.
Summary: Incorrect extraction of /ToUnicode CMaps for ligatures.
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
Reported: 2008-08-27 02:47 UTC by Vasile Gaburici
Modified: 2008-08-30 03:44 UTC

Attachments
Test document. (29.53 KB, application/pdf)
2008-08-27 02:47 UTC, Vasile Gaburici
Acrobat text extraction (search results). (7.91 KB, image/png)
2008-08-27 02:50 UTC, Vasile Gaburici
Fixes the CMap ligature search bug. (587 bytes, patch)
2008-08-27 07:49 UTC, Vasile Gaburici
Fixes the CMap ligature search bug. (587 bytes, patch)
2008-08-27 07:50 UTC, Vasile Gaburici
For debugging only. Do not commit! (1.02 KB, patch)
2008-08-27 07:52 UTC, Vasile Gaburici

Description Vasile Gaburici 2008-08-27 02:47:52 UTC
Created attachment 18540
Test document.

Attached is a PDF with a CMap that maps T to A and the ligature Th to Ah (as separate characters). If you search it with Acrobat for "A", you find, as you'd expect, two As. If you search it with evince, or extract the text with pdftotext (from poppler-utils), you only find the second CMapped A; the first one is extracted, incorrectly, as T.

This example is contrived for the sake of keeping it simple and restricted to English letters, but there are good reasons to want ligatures in CMaps to work properly in poppler, as they do in Acrobat.
Comment 1 Vasile Gaburici 2008-08-27 02:50:27 UTC
Created attachment 18541
Acrobat text extraction (search results).

pdftotext (and evince) extract the text as "This A.", which is incorrect; the text should come out as "Ahis A.".
Comment 2 Vasile Gaburici 2008-08-27 07:16:40 UTC
I found the bug. The ligatures are being set twice in the sMap array. Once by poppler's built-in "smart" algorithm, and then by the CMap. According to the comments in GfxFont.cc, the CMap should take precedence. But it doesn't, not for ligatures! The bug is that CMap ligatures get added at the end of the sMap array, but lookup happens linearly from the front! E.g. the Th ligature (char 0) is first set to 00540068 (Th) at index 0 in sMap by the built-in algorithm, and then to 00410068 at index 1 by the CMap. But lookup returns the entry at index 0. The fix is to scan the sMap backwards.
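
The precedence problem described here can be illustrated with a minimal C++ sketch. This is not poppler's code: LigatureEntry, lookupForward, lookupBackward and buildExampleMap are hypothetical names chosen for illustration, and only the two entries for char 0 are modelled, on the assumption that later entries are appended to the end of the array.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one multi-character mapping entry;
// not poppler's actual data structure.
struct LigatureEntry {
    std::uint32_t code;      // character code in the font
    std::u16string unicode;  // Unicode text the code expands to
};

// Forward scan: returns the first (oldest) entry for `code`.
// With duplicates, an entry appended later can never win.
std::u16string lookupForward(const std::vector<LigatureEntry>& sMap,
                             std::uint32_t code) {
    for (std::size_t i = 0; i < sMap.size(); ++i) {
        if (sMap[i].code == code) return sMap[i].unicode;
    }
    return {};
}

// Backward scan: returns the last (most recently appended) entry
// for `code`, so a later mapping takes precedence.
std::u16string lookupBackward(const std::vector<LigatureEntry>& sMap,
                              std::uint32_t code) {
    for (std::size_t i = sMap.size(); i-- > 0;) {
        if (sMap[i].code == code) return sMap[i].unicode;
    }
    return {};
}

// The two entries for char 0 reported in this comment.
std::vector<LigatureEntry> buildExampleMap() {
    return {
        {0x00, u"Th"},  // index 0: set by the built-in algorithm
        {0x00, u"Ah"},  // index 1: appended from the /ToUnicode CMap
    };
}
```

With these two entries, the forward scan yields "Th" for char 0 and the backward scan yields "Ah". Scanning backwards is only one way to give the CMap precedence; overwriting the existing entry on insert would be another, though that is not what is proposed here.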

$ pdftotext liga-cmap-bug.pdf
Setting @0 00[0] -> 0054
Setting @0 00[1] -> 0068
Adding @1 00: 00410068 + 0
Adding @2 02: 00660066006A + 0
Adding @3 03: 01620068 + 0
Adding @4 0B: 00660066 + 0
Adding @5 0C: 00660069 + 0
Adding @6 0D: 0066006C + 0
Adding @7 0E: 006600660069 + 0
Adding @8 0F: 00660066006C + 0
Adding @9 9C: 0049004A + 0
Adding @10 A0: 0066006A + 0
Adding @11 BC: 0069006A + 0
Returning @0 00[0] -> 0054
Returning @0 00[1] -> 0068
Comment 3 Vasile Gaburici 2008-08-27 07:49:02 UTC
Created attachment 18546
Fixes the CMap ligature search bug.

I fixed it. Two patches coming up: one is the one-line fix; the other is the debugging patch, in case you want to validate this with other PDFs before committing.

$ pdftotext liga-cmap-bug.pdf
Setting @0 00[0] -> 0054
Setting @0 00[1] -> 0068
Adding @1 00: 00410068 + 0
Adding @2 02: 00660066006A + 0
Adding @3 03: 01620068 + 0
Adding @4 0B: 00660066 + 0
Adding @5 0C: 00660069 + 0
Adding @6 0D: 0066006C + 0
Adding @7 0E: 006600660069 + 0
Adding @8 0F: 00660066006C + 0
Adding @9 9C: 0049004A + 0
Adding @10 A0: 0066006A + 0
Adding @11 BC: 0069006A + 0
Returning @1 00[0] -> 0041
Returning @1 00[1] -> 0068

$ cat liga-cmap-bug.txt 
Ahis A.
Comment 4 Vasile Gaburici 2008-08-27 07:50:27 UTC
Created attachment 18547
Fixes the CMap ligature search bug.

Comment 5 Vasile Gaburici 2008-08-27 07:52:12 UTC
Created attachment 18548
For debugging only. Do not commit!

Here's the debugging patch, if you need it...
Comment 6 Albert Astals Cid 2008-08-29 11:41:05 UTC
Thanks for the patch. Just for the record, do you allow it to be relicensed under GPL2 or later?
Comment 7 Vasile Gaburici 2008-08-29 15:34:57 UTC
(In reply to comment #6)
> Thanks for the patch. Just for the record, do you allow it to be relicensed
> under GPL2 or later?

Yes.
Comment 8 Albert Astals Cid 2008-08-30 03:44:58 UTC
This will be released with poppler 0.9.0. Thanks for the patch, and sorry for the late answer. Keep the patches coming.

