Created attachment 48144
PDF which illustrates the issue with text in small caps
When I copy the text "SmallCapsText", set in small capitals (using pdflatex), to the clipboard, I get "SćûĆĆCûĊčTÿĒĎ". Correspondingly, applications that use poppler, such as okular and evince, cannot find text set in genuine typographic small caps. See the attached PDF for an example. Compare the bug reports I filed against okular and evince (https://bugs.kde.org/show_bug.cgi?id=276001 and https://bugzilla.gnome.org/show_bug.cgi?id=652909, respectively).
Created attachment 91907
Don't parse hex/decimal from character names
This document has Type 3 fonts with character names like /BD, /BC, /CD, etc. Poppler is interpreting these names as hexadecimal Unicode values. The document in bug #38456 is similar: it uses names like /c251, /c255, /c262, and poppler is using these numbers as the Unicode values. Poppler and Xpdf are the only programs I've found that use the character name this way; the others just use the charcode.
This patch removes the decimal and hex parsing and uses the charcode as a fallback. The side effects are mostly spacing differences in pdftotext output, caused by charcode values that were previously left out now being added.
The only document I've found that really breaks is the "Another pdf" attached to bug #16032, file name "FAO_Nutri_goodnutrition in Crisis.pdf". It uses the names /g84, /g104 and expects them to be read as decimal Unicode values. I don't know of a way to get both sets of files to work at the same time, but maybe that's OK, because the other programs I've tried can't extract text from this FAO document either.
well, meant to post this to bug #72753, but I guess this works too.
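To make the difference described in the patch comment concrete, here is a minimal sketch of the two strategies for glyph names that are not found in any glyph list. It is not poppler's code: the function names and the exact name patterns are assumptions chosen only to cover the examples mentioned above (/BD as hex, /c251 and /g84 as decimal).

```python
import re

def old_style_guess(name: str, charcode: int) -> int:
    """Hypothetical model of name-based guessing for unknown glyph names."""
    # Two hex digits, e.g. "BD" -> 0xBD.
    if re.fullmatch(r"[0-9A-Fa-f]{2}", name):
        return int(name, 16)
    # One letter followed by decimal digits, e.g. "c251" -> 251.
    m = re.fullmatch(r"[A-Za-z]([0-9]+)", name)
    if m:
        return int(m.group(1))
    # Nothing recognisable in the name: fall back to the character code.
    return charcode

def charcode_fallback(name: str, charcode: int) -> int:
    """Hypothetical model of the patched behaviour: ignore the name entirely."""
    return charcode

# The same glyph name yields different Unicode guesses under the two strategies.
print(hex(old_style_guess("BD", 0x41)), hex(charcode_fallback("BD", 0x41)))
```

Under the first strategy the FAO document's /g84 becomes 84 ("T"), which is what that file expects; under the second it becomes whatever character code the glyph happens to sit at.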
After applying this patch, the pdftotext output for https://bugs.kde.org/attachment.cgi?id=10851 has this change:
-dense grid size (54 ϫ 21 ϫ 21), the PBM code consumes
+dense grid size (54 3 21 3 21), the PBM code consumes
The old extraction is not "totally exact", but it is much closer to the real thing than a "3". Can you have a look to see if you can fix this "regression"?
(In reply to comment #3)
> The old extraction is not "totally exact" but is much closer to the real
> thing than a "3". Can you have a look to see if you can fix this
> "regression"?
Some other math documents have names that could be parsed, e.g. summationdisplay, angbracketleft, producttext, so many of the math documents could be improved. But this character is just named H11003, and poppler is somehow interpreting it as U+03EB COPTIC SMALL LETTER GANGIA (decimal code 1003). So no, I can't fix this character. I think it's just a coincidence that the character looks a little like a multiplication sign.
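The decimal-to-code-point correspondence mentioned in this reply can be checked with Python's standard unicodedata module. This only verifies the arithmetic and the character names; it says nothing about how poppler arrives at 1003 from the name H11003.

```python
import unicodedata

code_point = 1003  # decimal value quoted in the reply above
old_char = chr(code_point)
old_label = f"U+{code_point:04X} {unicodedata.name(old_char)}"

# For comparison, the character the reader would expect between the grid dimensions.
times_label = f"U+{ord('×'):04X} {unicodedata.name('×')}"

print(old_label)
print(times_label)
```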
Can you have a look at https://bugs.kde.org/attachment.cgi?id=23655 ?
With your patch, a whole lot of text is missing from the pdftotext output. The part that says
Type the password, up to six characters in length, and press <Enter>. The password typed now will clear any previously set password from CMOS memory. You will be prompted to confirm the password. Retype the password and press <Enter>. You may also press <Esc> to abort the selection and not enter a password. To clear a set password, just press <Enter> when you are prompted to enter the password. A message will show up confirming the password will be disabled. Once the password is disabled, the system will boot and you can enter Setup without entering any password. When a password has been set, you will be prompted to enter it every time you try to enter Setup. This prevents an unauthorized person from changing any part of your system configuration.
is gone.
(In reply to comment #5)
> Can you have a look at https://bugs.kde.org/attachment.cgi?id=23655 ?
>
> There's a whole lot of text missing in pdftotext with your path, the part
> that says
This is another one I can't fix. The document uses character names of the form /GXX, where XX are hex digits specifying a Unicode code point. I can't think of a way to reliably guess whether the document intends the name to be parsed or the character code to be used. It looks like it will be a choice between supporting the documents in bugs #38456 and #72753 (use the character code) or this document and the FAO document (parse the name).
OK, Adobe Acrobat does indeed fail on the document from the bug I mentioned, and not on the one here, so I think it makes sense to match Adobe's behaviour bug for bug. On the other hand, how hard would it be to provide a command line switch for pdftotext so that https://bugs.kde.org/attachment.cgi?id=23655 still works? And if we do, can you think of a name for the switch?
I've downloaded more PDFs from bug trackers. After looking through them, I think I might be able to get both sets of documents working without a command line switch.
Created attachment 94930
Limit numeric parsing of character names
The documents that rely on hex/decimal name parsing have Differences arrays that start at character codes 0-2. This updated patch only parses these names if the codes start in this range (plus a bit, so I hopefully won't miss some documents) and if all the names in the array can be parsed numerically. It moves the numeric parsing code into a new function, rewritten to better detect errors. It also changes the meaning of the 'numeric' parameter of the parseCharName function slightly, so I don't have to add a new and probably confusingly named parameter.
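The rule described in this patch comment can be sketched roughly as follows. This is an illustration, not the patch itself: the cutoff MAX_START_CODE is an assumed value (the comment only says "0-2 plus a bit"), the name patterns are guesses that cover the examples in this thread, and a real Differences array can contain several runs, whereas the sketch handles a single run.

```python
import re
from typing import Dict, List, Optional

MAX_START_CODE = 5  # assumed cutoff; the actual value in the patch is not stated here

def parse_numeric_name(name: str) -> Optional[int]:
    """Hypothetical numeric parser; returns None if the name is not numeric."""
    if re.fullmatch(r"[0-9A-Fa-f]{2}", name):           # "BD"   -> 0xBD
        return int(name, 16)
    m = re.fullmatch(r"[A-Za-z]([0-9]+)", name)         # "g84"  -> 84
    if m:
        return int(m.group(1))
    m = re.fullmatch(r"[A-Za-z]([0-9A-Fa-f]{2})", name) # "GA1"  -> 0xA1
    if m:
        return int(m.group(1), 16)
    return None

def map_differences(first_code: int, names: List[str]) -> Dict[int, int]:
    """Map character codes to Unicode guesses for one run of a Differences array."""
    # Parse names only if the run starts low AND every name parses numerically.
    numeric = first_code <= MAX_START_CODE and all(
        parse_numeric_name(n) is not None for n in names
    )
    mapping = {}
    for offset, name in enumerate(names):
        code = first_code + offset
        mapping[code] = parse_numeric_name(name) if numeric else code
    return mapping

print(map_differences(1, ["g84", "g104"]))      # low start, all numeric: names are parsed
print(map_differences(128, ["c251", "c255"]))   # high start: character codes are used
```

Requiring both conditions is what lets the two sets of documents coexist: one unparseable name, or an array that starts at a high code, switches the whole array back to the character-code fallback.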
Looks good to me. If no one disagrees, I'll commit it sometime next week.
Created attachment 95001
Limit numeric parsing of character names
Fixes a memory leak in the previous patch. I had assumed that Objects free their memory when destructed, but they don't. Adds some .free() calls.
You are now free'ing obj one time too many, no?
(In reply to comment #12)
> You are now free'ing obj one time too many, no?
I needed to make sure that obj is freed if it breaks out of the loop early. Freeing it twice won't have an effect: the first free sets its type to objNone, and the second free is a no-op if the type is objNone.
It's not a noop if you compile with DEBUG_MEM and want to check the frees/copy check. Please fix it.
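One way to see why a "harmless" second free can still matter is a toy model of per-type allocation accounting. Everything below is hypothetical (the class, the counter, and the assumption that bookkeeping runs on every free call); it is not poppler's Object class or its DEBUG_MEM implementation, only a sketch of how such a check could be thrown off.

```python
from collections import Counter

OBJ_NONE = "objNone"
alloc_count = Counter()  # hypothetical debug counter: live objects per type

class ToyObject:
    def __init__(self):
        self.type = OBJ_NONE
        self.value = None

    def init_string(self, value):
        self.type = "objString"
        self.value = value
        alloc_count[self.type] += 1

    def free(self):
        if self.type != OBJ_NONE:
            self.value = None         # release the payload; nothing to do for objNone
        alloc_count[self.type] -= 1   # assumed: bookkeeping runs on every call
        self.type = OBJ_NONE

obj = ToyObject()
obj.init_string("abc")
obj.free()   # balances the init
obj.free()   # releases nothing, but still touches the counter

balanced = all(n == 0 for n in alloc_count.values())
print(dict(alloc_count), balanced)
```

In this model the second call frees no memory, yet the counters no longer balance, which is the kind of discrepancy a debug build would report.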
Created attachment 95208
Limit numeric parsing of character names
Pushed. Thanks :-)
*** Bug 72753 has been marked as a duplicate of this bug. ***