Bug 38456 - Handling of small caps typographic variants
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: All All
Importance: medium enhancement
Assignee: poppler-bugs
Duplicates: 72753
Reported: 2011-06-18 13:20 UTC by thomas
Modified: 2015-04-17 23:31 UTC

Attachments
PDF which illustrates the issue with text in small caps (19.52 KB, application/pdf)
2011-06-18 13:20 UTC, thomas
Don't parse hex/decimal from character names (6.62 KB, patch)
2014-01-12 19:48 UTC, Jason Crain
Limit numeric parsing of character names (7.46 KB, patch)
2014-03-01 10:18 UTC, Jason Crain
Limit numeric parsing of character names (7.66 KB, patch)
2014-03-03 05:56 UTC, Jason Crain
Limit numeric parsing of character names (7.60 KB, patch)
2014-03-06 06:19 UTC, Jason Crain

Description thomas 2011-06-18 13:20:36 UTC
Created attachment 48144
PDF which illustrates the issue with text in small caps

When I copy the text "SmallCapsText", set in small capitals with pdflatex, to the clipboard, I get "SćûĆĆCûĊčTÿĒĎ". As a result, applications that use poppler, such as okular and evince, cannot find text set in genuine typographic small caps.

See attached PDF for example.

Compare the bug reports I filed with okular and evince (https://bugs.kde.org/show_bug.cgi?id=276001 and https://bugzilla.gnome.org/show_bug.cgi?id=652909 respectively).
Comment 1 Jason Crain 2014-01-12 19:48:52 UTC
Created attachment 91907
Don't parse hex/decimal from character names

This document has type3 fonts with character names like /BD /BC /CD etc.  Poppler is using these names as hex code Unicode values.

The document in bug #38456 is similar. It's using names like /c251, /c255, /c262.  Poppler is using these numbers as the Unicode values.

Poppler and Xpdf are the only programs I've found that use the character name this way.  Others just use the charcode.  This patch removes the decimal and hex parsing and uses the charcode as fallback.

The side effects are mostly spacing differences from pdftotext due to adding charcode values that were previously left out.  The only document I've found that really breaks is the "Another pdf" attached to bug #16032, file name "FAO_Nutri_goodnutrition in Crisis.pdf".  It's using names /g84, /g104 and expects those names to be used as decimal Unicode values.  I don't know of a way to get both sets of these files to work at the same time, but maybe that's OK because the other programs I've tried can't extract text from this FAO document either.
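
A minimal sketch of the two strategies discussed in this comment, written in illustrative Python rather than poppler's C++. The helper names (guess_from_name, old_mapping, patched_mapping), the hex_names flag, and the charcodes 97 and 1 are assumptions made for the example; the real name-matching rules in poppler are more involved.

```python
import string


def guess_from_name(name, hex_names=False):
    """Read the digits of a glyph name as a Unicode code point (simplified)."""
    if hex_names:
        # "BD" -> "BD", "G41" -> "41"; anything else is left unparsed
        tail = name[1:] if len(name) == 3 else name
        if len(tail) == 2 and all(c in string.hexdigits for c in tail):
            return int(tail, 16)
        return None
    tail = name.lstrip(string.ascii_letters)  # "c251" -> "251"
    if tail.isascii() and tail.isdigit():
        return int(tail)
    return None


def old_mapping(charcode, name, hex_names=False):
    """Pre-patch idea: trust the number in the name, else use the charcode."""
    guess = guess_from_name(name, hex_names)
    return chr(guess if guess is not None else charcode)


def patched_mapping(charcode, name):
    """First patch: ignore numbers in the name and use the charcode."""
    return chr(charcode)


# Hypothetical Differences entry: charcode 97 drawn with glyph /c251.
print(old_mapping(97, "c251"))      # û
print(patched_mapping(97, "c251"))  # a
# Hypothetical FAO-style entry: charcode 1 drawn with glyph /g84.
print(old_mapping(1, "g84"))        # T
```

Under these assumptions the small-caps name /c251 turns into "û" when the name is parsed, which matches the garbled clipboard text in the description, while a /g84-style document only yields readable text when the name is parsed.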
Comment 2 Jason Crain 2014-01-12 19:51:04 UTC
well, meant to post this to bug #72753, but I guess this works too.
Comment 3 Albert Astals Cid 2014-01-26 15:34:51 UTC
After applying this patch, pdftotext on https://bugs.kde.org/attachment.cgi?id=10851 shows this change:

-dense grid size (54 ϫ 21 ϫ 21), the PBM code consumes
+dense grid size (54 3 21 3 21), the PBM code consumes

The old extraction is not "totally exact" but is much closer to the real thing than a "3". Can you have a look to see if you can fix this "regression"?
Comment 4 Jason Crain 2014-01-26 21:57:11 UTC
(In reply to comment #3)
> The old extraction is not "totally exact" but is much closer to the real
> thing than a "3". Can you have a look to see if you can fix this
> "regression"?

Some other math documents have names that could be parsed, eg. summationdisplay, angbracketleft, producttext, so many of the math documents could be improved.  But this character is just named H11003 and poppler is somehow interpreting it as U+03EB COPTIC SMALL LETTER GANGIA (decimal code 1003).

So no, I can't fix this character.  I think it's just coincidence that the character looks a little like a multiplication sign.
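
The arithmetic behind that observation can be checked quickly. This is only an illustration in Python; that the trailing digits "1003" of the name were the part read as a decimal number is an assumption, since the comment itself only says "somehow".

```python
import unicodedata

name = "H11003"
# Assumption: only the trailing digits "1003" were read, as a decimal number.
code = int(name[2:])

print(code, hex(code))              # 1003 0x3eb
print(unicodedata.name(chr(code)))  # COPTIC SMALL LETTER GANGIA
print(unicodedata.name("\u00d7"))   # MULTIPLICATION SIGN, the likely intended glyph
```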
Comment 5 Albert Astals Cid 2014-02-12 20:47:26 UTC
Can you have a look at https://bugs.kde.org/attachment.cgi?id=23655 ?

There's a whole lot of text missing in pdftotext with your patch, the part that says

Type the password, up to six characters in length, and press <Enter>. The
password typed now will clear any previously set password from CMOS
memory. You will be prompted to confirm the password. Retype the password
and press <Enter>. You may also press <Esc> to abort the selection and not
enter a password.
To clear a set password, just press <Enter> when you are prompted to enter the
password. A message will show up confirming the password will be disabled.
Once the password is disabled, the system will boot and you can enter Setup
without entering any password.
When a password has been set, you will be prompted to enter it every time you
try to enter Setup. This prevents an unauthorized person from changing any
part of your system configuration.

is gone
Comment 6 Jason Crain 2014-02-14 09:22:19 UTC
(In reply to comment #5)
> Can you have a look at https://bugs.kde.org/attachment.cgi?id=23655 ?
> 
> There's a whole lot of text missing in pdftotext with your path, the part
> that says

This is another one I can't fix.  The document is using character names in the form of /GXX, where each X is a hex digit and the pair specifies a Unicode code point.  I can't think of a way to reliably guess whether the document intends the name to be parsed or the character code to be used.

It looks like it will be a choice between supporting the documents in bugs #38456 and #72753 (use character code) or this document and the FAO document (parse name).
Comment 7 Albert Astals Cid 2014-02-17 23:14:32 UTC
OK, Adobe Acrobat does indeed fail on the document I mentioned and not on the one in this bug, so I think it makes sense to match what Adobe does, bug for bug. On the other hand, how hard would it be to provide a command line switch for pdftotext so that https://bugs.kde.org/attachment.cgi?id=23655 still works? And if we do, can you think of a name for the switch?
Comment 8 Jason Crain 2014-02-19 12:17:13 UTC
I've downloaded more PDFs from bug trackers.  After looking through them, I think I might be able to get both sets of documents working without a command line switch.
Comment 9 Jason Crain 2014-03-01 10:18:10 UTC
Created attachment 94930
Limit numeric parsing of character names

The documents that rely on hex/decimal name parsing have Differences arrays that start between character codes 0 and 2.  This updated patch parses these names only if the codes start in this range (plus a small margin so I hopefully won't miss some documents) and if all the names in the array can be parsed numerically.  It moves the numeric parsing code into a new function, rewritten to better detect errors.  It also changes the meaning of the 'numeric' parameter of the parseCharName function slightly so I don't have to add a new and probably confusingly named parameter.
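
The rule described in this comment can be sketched roughly as follows. This is illustrative Python, not the patch itself: the names numeric_value and should_parse_names, the max_start margin of 5, and the sample Differences arrays are all made up for the example, and only decimal names are handled.

```python
import string


def numeric_value(name):
    """Decimal value of a name like 'g84' or 'c251', else None (simplified)."""
    tail = name.lstrip(string.ascii_letters)
    return int(tail) if tail.isascii() and tail.isdigit() else None


def should_parse_names(differences, max_start=5):
    """Decide whether glyph names should be read as numbers.

    differences maps charcode -> glyph name; max_start is a made-up margin
    standing in for "0-2 plus a bit".
    """
    if not differences:
        return False
    starts_low = min(differences) <= max_start
    all_numeric = all(numeric_value(n) is not None for n in differences.values())
    return starts_low and all_numeric


fao_like = {1: "g84", 2: "g104"}             # hypothetical: starts at code 1
small_caps_like = {97: "c251", 101: "c255"}  # hypothetical: starts at code 97
mixed = {1: "g84", 2: "space"}               # one name is not numeric

print(should_parse_names(fao_like))         # True
print(should_parse_names(small_caps_like))  # False
print(should_parse_names(mixed))            # False
```

Requiring both conditions keeps the numeric interpretation for documents that appear to depend on it while letting everything else fall back to the character code.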
Comment 10 Albert Astals Cid 2014-03-01 20:35:37 UTC
Looks good to me. If no one disagrees I'll commit it sometime next week.
Comment 11 Jason Crain 2014-03-03 05:56:53 UTC
Created attachment 95001
Limit numeric parsing of character names

Fixes memory leak in previous patch.  I had assumed that Objects free memory when destructed, but they don't.  Adds some .free() calls.
Comment 12 Albert Astals Cid 2014-03-04 22:47:00 UTC
You are now free'ing obj one time too many, no?
Comment 13 Jason Crain 2014-03-05 02:58:26 UTC
(In reply to comment #12)
> You are now free'ing obj one time too many, no?

I needed to make sure that obj is freed if it breaks out of the loop early.  Freeing it twice won't have an effect.  The first free sets its type to objNone.  A second free is a no-op if the type is objNone.
Comment 14 Albert Astals Cid 2014-03-05 09:36:09 UTC
It's not a no-op if you compile with DEBUG_MEM and want to run the free/copy check. Please fix it.
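
A toy model of the disagreement, in illustrative Python. ToyObject is not poppler's Object class, and the free_calls counter is an assumption about what a debug build might track; the point is only that a second free can be harmless for the object's state and still be visible to bookkeeping.

```python
class ToyObject:
    """Toy stand-in for a manually freed value; not poppler's Object class."""

    free_calls = 0  # assumed debug counter, incremented on every free()

    def __init__(self, payload):
        self.kind = "string"
        self.payload = payload

    def free(self):
        ToyObject.free_calls += 1  # debug bookkeeping sees every call
        if self.kind == "none":
            return                 # nothing left to release
        self.payload = None
        self.kind = "none"


obj = ToyObject("c251")
obj.free()
obj.free()  # state is unchanged, but the counter moves again
print(obj.kind, ToyObject.free_calls)  # none 2
```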
Comment 15 Jason Crain 2014-03-06 06:19:45 UTC
Created attachment 95208
Limit numeric parsing of character names
Comment 16 Albert Astals Cid 2014-03-11 23:35:37 UTC
Pushed. Thanks :-)
Comment 17 Jason Crain 2015-04-17 23:31:26 UTC
*** Bug 72753 has been marked as a duplicate of this bug. ***

