pdftotext and copy-n-paste from a document should expand ligatures such as fi to
the letters f and i. See bug #2929.
See also http://bugzilla.gnome.org/show_bug.cgi?id=341947
Created attachment 7724
Sample PDF file with ligatures
Created attachment 7745
The original attachment cannot work in poppler until bug 8985 and bug 8986 are
fixed. The PDF attached here is simpler to fix.
Also note that attachment 7724 (to comment 2) doesn't work in Adobe Reader
(7.0.8) either, so for feature parity getting attachment 7745 to work is more
Isn't it about time for this to get fixed??
I'm sure a patch that addressed the issue would be appreciated.
Created attachment 57272
expand ligatures to normal form
This patch makes the test case in comment 3 work. The test case in comment 2 has already been fixed.
To be honest, I'm not sure doing this unconditionally is a good idea, but I don't know where to ask either. What do you think of having a pdftotext command-line switch?
BTW, there's a tab vs. spacing issue in the patch.
If we were to add a command-line option for normalizing Unicode, it should normalize all of the text like findText() does, not just the characters from one code path in the glyph-to-Unicode code.
Thinking about this again, I agree it is probably not a good idea to unconditionally normalize all glyphs. But outputting "fi"-style ligatures causes problems when searching the text. Maybe it would be better to normalize only glyphs in the Alphabetic Presentation Forms range, U+FB00–U+FB4F, since the Unicode Consortium discourages the use of these presentation forms.
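To make the range-limited idea concrete, here is a minimal sketch in Python (not poppler's C++, and not the attached patch). It assumes that NFKC compatibility normalization is an acceptable way to expand a presentation form; the function name `expand_presentation_forms` and the two range constants are hypothetical, chosen only for illustration.

```python
import unicodedata

# Alphabetic Presentation Forms block, as named in the comment above.
# These constant names are illustrative, not taken from poppler.
APF_FIRST = 0xFB00
APF_LAST = 0xFB4F


def expand_presentation_forms(text: str) -> str:
    """Expand only characters in U+FB00..U+FB4F; leave everything else as-is."""
    out = []
    for ch in text:
        if APF_FIRST <= ord(ch) <= APF_LAST:
            # Assumption: NFKC yields the plain-letter form for these
            # code points (for example U+FB01 becomes "f" followed by "i").
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            # Characters outside the block are not normalized at all.
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(expand_presentation_forms("\ufb01nd the o\ufb03ce"))
```

Other compatibility characters, such as U+00BD (the one-half fraction), fall outside the block and pass through unchanged. That is the difference from normalizing all of the extracted text.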
I tried the save as text function of acroread on the second test case and it expanded the ligatures.
That'd make more sense. How difficult is it to expand only that range?
Created attachment 57357
expand ligatures in alphabetic presentation block