7002 – Text extraction should expand ligatures to their normal form

Status:	RESOLVED FIXED

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	general
Version:	unspecified
Hardware:	x86 (IA32) Linux (All)

Importance:	medium normal
Assignee:	poppler-bugs
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2006-05-23 08:05 UTC by Kristian Høgsberg
Modified:	2012-02-21 13:20 UTC
CC List:	4 users

Attachments
Sample PDF file with ligatures (73.62 KB, application/pdf) 2006-11-10 01:47 UTC, Wouter Bolsterlee
Alternate sample (18.86 KB, application/pdf) 2006-11-11 08:31 UTC, Ed Catmur
expand ligatures to normal form (1.36 KB, patch) 2012-02-19 03:06 UTC, Adrian Johnson
expand ligatures in alphabetic presentation block (2.59 KB, patch) 2012-02-20 12:35 UTC, Adrian Johnson

Description Kristian Høgsberg 2006-05-23 08:05:57 UTC

pdftotext and copy-and-paste from a document should expand ligatures such as ﬁ (U+FB01) to
the letters f and i.  See bug #2929.

Comment 1 Wouter Bolsterlee 2006-11-10 01:39:01 UTC

See also http://bugzilla.gnome.org/show_bug.cgi?id=341947

Comment 2 Wouter Bolsterlee 2006-11-10 01:47:07 UTC

Created attachment 7724
Sample PDF file with ligatures

Comment 3 Ed Catmur 2006-11-11 08:31:16 UTC

Created attachment 7745
Alternate sample

The original attachment cannot work in poppler until bug 8985 and bug 8986 are
fixed. The PDF attached here is simpler to fix.

Comment 4 Ed Catmur 2006-11-20 08:06:31 UTC

Also note that attachment 7724 (from comment 2) doesn't work in Adobe Reader
(7.0.8) either, so for feature parity, getting attachment 7745 to work is more
urgent.

Comment 5 Keenan Pepper 2010-06-07 16:13:36 UTC

Isn't it about time for this to get fixed??

Comment 6 Brad Hards 2010-06-07 16:26:14 UTC

Keenan,

I'm sure a patch that addressed the issue would be appreciated.

Brad

Comment 7 Adrian Johnson 2012-02-19 03:06:06 UTC

Created attachment 57272
expand ligatures to normal form

This patch makes the test case in comment 3 work. The test case in comment 2 has already been fixed.

Comment 8 Albert Astals Cid 2012-02-19 12:08:16 UTC

To be honest, I'm not sure doing this unconditionally is a good idea, but I don't know where to ask either. What do you think of having a pdftotext command-line switch?

BTW, there's a tabs vs. spaces issue in the patch.

Comment 9 Adrian Johnson 2012-02-20 03:24:37 UTC

If we were to add a command-line option for normalizing Unicode, it should normalize all of the text like findText() does, not just the characters from one code path in the glyph-to-Unicode code.

Thinking about this again, I agree it is probably not a good idea to unconditionally normalize all glyphs. But outputting "fi"-style ligatures causes problems when searching the text. Maybe it would be better to normalize only glyphs in the Alphabetic Presentation Forms range, U+FB00–U+FB4F, since the Unicode Consortium discourages the use of these presentation forms.

I tried the "Save as Text" function of acroread on the second test case, and it expanded the ligatures.
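
The range-limited idea above can be illustrated with a short sketch. This is not the poppler patch; it is a stand-alone Python illustration that assumes NFKC compatibility normalization is an acceptable way to expand a presentation form, and the function name expand_presentation_forms is hypothetical. Only code points in U+FB00–U+FB4F are touched, so other compatibility characters (superscripts, fractions, and so on) pass through unchanged.

```python
import unicodedata

# Alphabetic Presentation Forms block, as given in the comment above.
APF_START = 0xFB00
APF_END = 0xFB4F


def expand_presentation_forms(text: str) -> str:
    """Expand only characters in U+FB00-U+FB4F; leave all other text as-is.

    Assumption: NFKC normalization of the single character is used as the
    expansion. This is an illustrative choice, not confirmed poppler behavior.
    """
    out = []
    for ch in text:
        if APF_START <= ord(ch) <= APF_END:
            # e.g. U+FB01 (fi ligature) becomes the two letters "f" and "i"
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    extracted = "\ufb01nd the o\ufb03ce"   # text containing fi and ffi ligatures
    expanded = expand_presentation_forms(extracted)
    # A plain-text search succeeds only after expansion.
    print("find" in extracted, "find" in expanded)
```

Restricting the expansion to one block keeps the change conservative: a search for "find" matches the expanded text, while characters outside the block, such as a superscript two, are not rewritten the way a whole-string NFKC pass would rewrite them.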

Comment 10 Albert Astals Cid 2012-02-20 09:47:02 UTC

That'd make more sense. How difficult is it to expand only that range?

Comment 11 Adrian Johnson 2012-02-20 12:35:01 UTC

Created attachment 57357
expand ligatures in alphabetic presentation block

Updated patch.

Comment 12 Albert Astals Cid 2012-02-21 13:20:48 UTC

Committed.
