Bug 7002 - Text extraction should expand ligatures to their normal form
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: x86 (IA32) Linux (All)
Importance: medium normal
Assignee: poppler-bugs
QA Contact:
Depends on:
Reported: 2006-05-23 08:05 UTC by Kristian Høgsberg
Modified: 2012-02-21 13:20 UTC

Sample PDF file with ligatures (73.62 KB, application/pdf)
2006-11-10 01:47 UTC, Wouter Bolsterlee
Alternate sample (18.86 KB, application/pdf)
2006-11-11 08:31 UTC, Ed Catmur
expand ligatures to normal form (1.36 KB, patch)
2012-02-19 03:06 UTC, Adrian Johnson
expand ligatures in alphabetic presentation block (2.59 KB, patch)
2012-02-20 12:35 UTC, Adrian Johnson

Description Kristian Høgsberg 2006-05-23 08:05:57 UTC
pdftotext and copy-and-paste from a document should expand ligatures such as ﬁ to
the letters f and i.  See bug #2929.
Comment 1 Wouter Bolsterlee 2006-11-10 01:39:01 UTC
See also http://bugzilla.gnome.org/show_bug.cgi?id=341947
Comment 2 Wouter Bolsterlee 2006-11-10 01:47:07 UTC
Created attachment 7724
Sample PDF file with ligatures
Comment 3 Ed Catmur 2006-11-11 08:31:16 UTC
Created attachment 7745
Alternate sample

The original attachment cannot work in poppler until bug 8985 and bug 8986 are
fixed. The PDF attached here is simpler to fix.
Comment 4 Ed Catmur 2006-11-20 08:06:31 UTC
Also note that attachment 7724 (to comment 2) doesn't work in Adobe Reader
(7.0.8) either, so for feature parity getting attachment 7745 to work is more
Comment 5 Keenan Pepper 2010-06-07 16:13:36 UTC
Isn't it about time for this to get fixed??
Comment 6 Brad Hards 2010-06-07 16:26:14 UTC

I'm sure a patch that addressed the issue would be appreciated.

Comment 7 Adrian Johnson 2012-02-19 03:06:06 UTC
Created attachment 57272
expand ligatures to normal form

This patch makes the test case in comment 3 work. The test case in comment 2 has already been fixed.
Comment 8 Albert Astals Cid 2012-02-19 12:08:16 UTC
To be honest, I'm not sure doing this unconditionally is a good idea, but I don't know where to ask either. What do you think of having a pdftotext command-line switch?

BTW, there's a tabs-vs-spaces issue in the patch.
Comment 9 Adrian Johnson 2012-02-20 03:24:37 UTC
If we were to add a command-line option for normalizing Unicode, it should normalize all of the text, as findText() does, not just the characters from one code path in the glyph-to-Unicode code.

Thinking about this again, I agree it is probably not a good idea to unconditionally normalize all glyphs. But outputting "ﬁ"-style ligatures causes problems when searching the text. Maybe it would be better to normalize only glyphs in the Alphabetic Presentation Forms range, U+FB00–U+FB4F, since the Unicode Consortium discourages the use of these presentation forms.

I tried the save as text function of acroread on the second test case and it expanded the ligatures.
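
To illustrate the range-restricted idea, here is a minimal Python sketch. It is not the attached patch (which changes poppler's C++ code); it assumes that Unicode compatibility normalization (NFKC) is an acceptable way to expand a character, and the function name expand_presentation_forms is hypothetical.

```python
import unicodedata

# Alphabetic Presentation Forms block, as named in the comment above.
APF_FIRST = 0xFB00
APF_LAST = 0xFB4F


def expand_presentation_forms(text: str) -> str:
    """Expand only characters in U+FB00..U+FB4F; leave everything else alone.

    Assumption: NFKC normalization of a single character is used as the
    expansion. Characters outside the block are copied through unchanged,
    so other compatibility characters (e.g. superscripts) are not touched.
    """
    pieces = []
    for ch in text:
        if APF_FIRST <= ord(ch) <= APF_LAST:
            pieces.append(unicodedata.normalize("NFKC", ch))
        else:
            pieces.append(ch)
    return "".join(pieces)


if __name__ == "__main__":
    # U+FB01 is the "fi" ligature, U+FB03 is the "ffi" ligature.
    print(expand_presentation_forms("\ufb01nd the o\ufb03ce"))
```

With this approach a search for "office" would match text that was extracted with the U+FB03 ligature, while characters outside the block, such as U+00B2 (superscript two), pass through unmodified.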
Comment 10 Albert Astals Cid 2012-02-20 09:47:02 UTC
That'd make more sense. How difficult is it to expand only that range?
Comment 11 Adrian Johnson 2012-02-20 12:35:01 UTC
Created attachment 57357
expand ligatures in alphabetic presentation block

Updated patch.
Comment 12 Albert Astals Cid 2012-02-21 13:20:48 UTC
