Using pdftotext and the Greek support package. Capital omega and capital delta, and possibly other characters too, appear in the output as symbols (the ohm sign and the increment/delta symbol). To the eye the difference is hardly noticeable, but searches in the text using strings containing one or more of those letters fail because of the different character codes.
Attach a file?
Created attachment 82214 [details] output from a sample PDF (containing Greek text)

Illustrates what happens with some Greek letters which also exist as symbols (e.g. Delta, Omega). For instance, the word ΥΠΟΔΟΜΕΣ in the input PDF (decimal codes 933 928 927 916 927 924 917 931) becomes ΥΠΟ∆ΟΜΕΣ (decimal codes 933 928 927 8710 927 924 917 931) in the output. The same happens with, for instance, ΟΡΓΑΝΩΣΗΣ in the input PDF: it becomes ΟΡΓΑΝΩΣΗΣ in the output, which contains the omega symbol (decimal 8486), not the letter (937).
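The confusion comes down to distinct Unicode code points that render almost identically. A quick Python check (illustrative only, not part of any patch) shows the pairs involved:

```python
import unicodedata

# The visually identical forms have different code points:
letter_omega = "\u03a9"  # GREEK CAPITAL LETTER OMEGA (decimal 937)
ohm_sign     = "\u2126"  # OHM SIGN (decimal 8486)
letter_delta = "\u0394"  # GREEK CAPITAL LETTER DELTA (decimal 916)
increment    = "\u2206"  # INCREMENT (decimal 8710)

print(unicodedata.name(ohm_sign))   # OHM SIGN
print(unicodedata.name(increment))  # INCREMENT

# A plain string comparison fails even though the words render alike:
print(letter_omega == ohm_sign)                    # False
print("ΥΠΟΔΟΜΕΣ" == "ΥΠΟ\u2206ΟΜΕΣ")               # False
```

This is exactly why a byte-for-byte search through the pdftotext output finds nothing.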
Created attachment 82215 [details] the sample input PDF
Created attachment 83995 [details] [review] Add mappings to UnicodeDecompTables.h This patch should fix the searching. It adds some unicode mappings to gen-unicode-tables.py and UnicodeDecompTables.h so these greek letters are treated as equivalent for searches.
Jason, can you please provide a minimal patch, not one full of whitespace changes? It makes review harder if, for every line, we have to check whether it's just a whitespace change or something else changed.
Created attachment 84036 [details] [review] Add Unicode mappings to gen-unicode-tables.py

(In reply to comment #5)
> Jason, can you please provide a minimal patch, not one full of whitespace
> changes, makes it harder to review if for every line we have to see if it's
> just a whitespace change or something else changed in there

Due to the way gen-unicode-tables.py generates UnicodeDecompTables.h (basically, run `python gen-unicode-tables.py > UnicodeDecompTables.h` in a shell), that's not going to make the patch much smaller. I'll split it into separate patches for gen-unicode-tables.py and UnicodeDecompTables.h. Though I did find that two of the characters I was adding to gen-unicode-tables.py aren't necessary, because they are already produced by Python's unicodedata.normalize function.
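For illustration, the generation step can be thought of as standard NFKC decomposition plus a hand-maintained table of extra mappings. The names below are hypothetical and only sketch the idea; the real gen-unicode-tables.py may be organized quite differently:

```python
import unicodedata

# Extra mappings that unicodedata.normalize does not produce
# (hypothetical table; OHM SIGN is already handled by NFKC, which is
# why two of the originally added characters turned out unnecessary):
EXTRA_DECOMP = {
    0x2206: "\u0394",  # INCREMENT -> GREEK CAPITAL LETTER DELTA
}

def decomp(cp):
    """Decompose one code point: NFKC first, then the extra table."""
    s = unicodedata.normalize("NFKC", chr(cp))
    if s == chr(cp):
        s = EXTRA_DECOMP.get(cp, s)
    return s

print(decomp(0x2126))  # 'Ω' (U+03A9) via standard NFKC
print(decomp(0x2206))  # 'Δ' (U+0394) via the extra table
```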
Created attachment 84037 [details] [review] Regenerate UnicodeDecompTables.h from gen-unicode-tables.py
Created attachment 84138 [details] [review] Normalize more characters in font Unicode map

The previous patches were to fix searches; this one should fix pdftotext output. For this document, poppler is just using the Unicode values from the ToUnicode CMap, but acroread modifies or decomposes some of these characters, so I guess poppler should too. The attached patch moves the normalization pass to run after the CMap is read, and additionally normalizes some characters that I noticed acroread changes: some Greek letters, OE ligatures, and the presentation forms blocks. All three patches should be applied, preferably in this order:

1. Add Unicode mappings to gen-unicode-tables.py
2. Regenerate UnicodeDecompTables.h from gen-unicode-tables.py
3. Normalize more characters in font Unicode map
Created attachment 84140 [details] PDF with all BMP chars

The attached PDF contains every character in the UTF-16 basic multilingual plane. You can use it to compare the text output differences between Adobe Reader and pdftotext. You should copy and paste from Adobe Reader rather than save as text, because I think saving as text converts the text to an encoding where most of the characters are replaced with periods. Copying from Adobe Reader on Windows and pasting into Notepad works better than acroread on Linux, which gets strange results.
Hmmmm, with these patches I get diffs in pdftotext output like:

-äöüÄÖÜáâàfifl©®@ŒÆØæƒÿ‡‰$£çABCD…XYZ (T1 / Garamond Bold)
+äöüÄÖÜáâàfifl©®@OEÆØæƒÿ‡‰$£çABCD…XYZ (T1 / Garamond Bold)

So we are converting Œ to OE; why are we doing that?
(In reply to comment #10)
> So we are converting Œ to OE, why are we doing that?

Because that's what Adobe Reader does; the same reason it converts Greek symbols and alphabetic/Arabic presentation forms. If you don't think that's a good reason, then feel free to skip the "Normalize more characters" patch. The other two by themselves should fix searching on Greek letters.
I don't think it makes sense, but even if I did, why would we convert Œ to OE but not Æ to AE? And would applying the other two patches actually fix this bug? You say they will fix searching, but the bug is about pdftotext.
(In reply to comment #12)
> I don't think it makes sense, but even if i did, why would we do
> Œ to OE
> but not
> Æ to AE
> ?

According to Wikipedia, Œ is a ligature, while Æ was originally a ligature but has since been promoted to the status of a letter in some languages.
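Unicode itself draws a similar line: the fi ligature carries a compatibility decomposition, while Œ and Æ have none, so converting Œ to OE goes beyond what standard NFKC normalization does. This is easy to check in Python:

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI has a compatibility decomposition:
print(unicodedata.normalize("NFKC", "\ufb01"))  # 'fi'

# U+0152 LATIN CAPITAL LIGATURE OE and U+00C6 LATIN CAPITAL LETTER AE
# have no decomposition, so NFKC leaves them unchanged:
print(unicodedata.normalize("NFKC", "\u0152"))  # 'Œ'
print(unicodedata.normalize("NFKC", "\u00c6"))  # 'Æ'
```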
Comment on attachment 84138 [details] [review] Normalize more characters in font Unicode map

Review of attachment 84138 [details] [review]:
-----------------------------------------------------------------

::: poppler/GfxFont.cc
@@ +1438,5 @@
> +      || u[i] == 0x220F // ∏ (N-ARY PRODUCT)
> +      || u[i] == 0x2211 // ∑ (N-ARY SUMMATION)
> +      || (u[i] >= 0xFB00 && u[i] <= 0xFB4F) // Alphabetic Presentation Forms
> +      || (u[i] >= 0xFB50 && u[i] <= 0xFDFF) // Arabic Presentation Forms-A
> +      || (u[i] >= 0xFE70 && u[i] <= 0xFEFF) // Arabic Presentation Forms-B

I don't like the way all the characters to normalize have been shoved into an if statement like this. Could they be put in a table or something, so that there is a separation between the list of characters and the code that performs the normalization? It would also be good if you could include the removed comment that provided examples of the alphabetic presentation forms (e.g. "fi", "ffi"), as not everyone who reads this code is familiar with the various Unicode ranges.
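The table-driven layout the review asks for could look roughly like this (a Python sketch of the C++ idea; the names are hypothetical, not poppler's actual code):

```python
# Characters to normalize, kept separate from the normalization logic.
NORMALIZE_SINGLES = {
    0x220F,  # N-ARY PRODUCT
    0x2211,  # N-ARY SUMMATION
}
NORMALIZE_RANGES = [
    (0xFB00, 0xFB4F),  # Alphabetic Presentation Forms, e.g. "fi", "ffi"
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
]

def needs_normalization(u):
    """Check a code point against the singles set and the range table."""
    return u in NORMALIZE_SINGLES or any(
        lo <= u <= hi for lo, hi in NORMALIZE_RANGES)

print(needs_normalization(0xFB01))  # True: U+FB01 is the "fi" ligature
print(needs_normalization(0x0041))  # False: plain letter 'A'
```

In C++ the same shape would be a static array of singles plus an array of range pairs, with one small lookup function replacing the long if chain.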
(In reply to comment #12) > I don't think it makes sense, but even if i did, why would we do > Œ to OE > but not > Æ to AE > ? Because, for whatever reason, Adobe Reader doesn't touch Æ or æ. You could also argue that the current pdftotext behavior for these math symbols is correct because even if Reader changes them into Greek letters, the characters are actually encoded in the document as math symbols. I'm ambivalent because I expect someone will complain either way. > Would applying the other two patches actually fix this bug? Because you say > they will fix searching but the bug is about pdftotext The original description says that search doesn't work because of the symbol/letter confusion. I assumed Govert meant using the search feature in, for example, Evince. The way search works, TextPage::findText calls unicodeNormalizeNFKC and searches through the normalized text. These two patches cause unicodeNormalizeNFKC to convert the math symbols to letters and the search matches. This works for Evince, anyway. I haven't tested it with Okular.
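The search mechanism described here can be sketched as normalizing both the page text and the search string before matching (a simplified Python analogue of what TextPage::findText does with unicodeNormalizeNFKC, not poppler's actual code):

```python
import unicodedata

def normalized_find(haystack, needle):
    """Match after NFKC-normalizing both sides."""
    nfkc = lambda s: unicodedata.normalize("NFKC", s)
    return nfkc(needle) in nfkc(haystack)

# U+2126 OHM SIGN normalizes to the letter omega (U+03A9),
# so the normalized search matches while a direct search does not:
page_text = "ΟΡΓΑΝ\u2126ΣΗΣ"                       # contains OHM SIGN
print(normalized_find(page_text, "ΟΡΓΑΝ\u03a9ΣΗΣ"))  # True
print("ΟΡΓΑΝ\u03a9ΣΗΣ" in page_text)                 # False
```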
I am not going to recommend patch 3 because I am not sure if it is a good idea to change all of these characters. I still do recommend the other two patches for gen-unicode-tables.py and UnicodeDecompTables.h because they improve search in Evince for these letters. Govert: can you clarify what you mean when you say searches fail? Are you searching through pdftotext output? Or do you mean the search in Evince or Okular or some document viewer?
Comment # 16 on bug 66693 from Jason Crain

Re your question...
> ... can you clarify what you mean when you say searches fail? Are you
> searching through pdftotext output? Or do you mean the search in Evince or
> Okular or some document viewer?

Using pdftotext text output.

Regards, Govert
In that case, I don't think this will be changed. We'll have to see whether Albert would prefer to leave them as math symbols, as they are encoded in the document (which is what Foxit, Firefox, and Google Chrome do), or change them to Greek letters like Adobe Reader does.
To be honest, I don't see why pdftotext should output a symbol as another symbol, unless it's obvious that the first symbol is there *exclusively* for a typographical reason, like the "fl" and "fi" ligatures. OTOH, if the code is not a lot to maintain, I would not be opposed to adding a non-default option that did that conversion. About searching: yes, I agree that if you search for Symbol1 and what's in the PDF is Symbol2 (but it is "technically" the same thing), it makes sense for the search algorithm to try to match it, but I would still want the getPageText() methods to give me Symbol2 (i.e. what was really in the PDF file). So as far as I can see, there are two things happening in this bug: a) pdftotext converting some symbols to others b) search handling symbol mappings Am I right in that analysis? Now my question: how much is a) related to b)? Can they be handled in different bugs, or does it make more sense to handle them together here?
My apologies for this very late reaction. I just found this message (comment #19) in my "unwanted mail" folder. I am not a specialist in this field, I'm not even Greek... The problem I experienced is not about symbols being output as other symbols, or even the need for that; it's about certain letters in the PDF being output as symbols in the text. Example: a PDF containing text with a word that is displayed as ΑΓΝΩΣΤΗ in Adobe Reader (copied/pasted here from the Reader) is output as ΑΓΝΩΣΤΗ by pdftotext (copied/pasted here from the text output). Those words look equal, but they are not, because the original Ω and the Ω in the text output are different: the Ω in the text output is the omega symbol (as used, among other things, in electronics), not the letter Ω. Searching for "ΑΓΝΩΣΤΗ" in the text output finds nothing. I am using a (stupid?) workaround for the time being: I convert all occurrences of the symbols 'µ', '∆' and 'Ω' in the text output to the letters 'μ', 'Δ' and 'Ω' before starting the search.
--- Comment #20 from Govert <noliturbarecom@gmail.com> --- > I am not a specialist in this field, I’m not even Greek... The problem that I > experienced is not about symbols being output as other symbols or even the need > for that, it’s about certain letters in the PDF being output as symbols in the > text. The document specifies that these characters are math symbols and poppler is doing what the document requests. If you want it fixed, you probably should complain to whoever created the document. If there is still an interest I'll see if I can add an option to pdftotext (does -convchars sound OK? I'm terrible with names). I haven't had much free time, so it won't be until this weekend that I can look at it.
> If there is still an interest I'll see if I can add an option to pdftotext
> (does -convchars sound OK? I'm terrible with names). I haven't had much free
> time, so it won't be until this weekend that I can look at it.

Thanks. Don't bother, unless it is seen as a general improvement; I can live with my own workaround.
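Govert's workaround can be sketched in a few lines of Python: replace the three symbol code points with the corresponding Greek letters before searching (illustrative only, not the actual script in use):

```python
# Map the symbol code points to the corresponding Greek letters.
SYMBOL_TO_LETTER = str.maketrans({
    "\u00b5": "\u03bc",  # MICRO SIGN -> GREEK SMALL LETTER MU
    "\u2206": "\u0394",  # INCREMENT  -> GREEK CAPITAL LETTER DELTA
    "\u2126": "\u03a9",  # OHM SIGN   -> GREEK CAPITAL LETTER OMEGA
})

text = "ΑΓΝ\u2126ΣΤΗ"                # pdftotext output with OHM SIGN
fixed = text.translate(SYMBOL_TO_LETTER)

print("ΑΓΝ\u03a9ΣΤΗ" in fixed)  # True: search works after the conversion
print("ΑΓΝ\u03a9ΣΤΗ" in text)   # False: search fails on the raw output
```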
I'm with Jason here, we are doing the right thing by giving back what the document says it has.
As an aside: IIRC, Adobe at some point changed its recommendation for the mappings of certain PostScript glyph names to/from UCS character codes. That may have something to do with *why* this document uses the wrong ToUnicode mapping for those characters. If so, that probably also explains why Adobe added code to prefer the text characters when the symbols appear within running Ελληνικά (Greek) text.