Bug 66693 - Greek support package - some characters output as symbols not letters
Summary: Greek support package - some characters output as symbols not letters
Status: RESOLVED INVALID
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: All Windows (All)
Importance: medium normal
Assignee: poppler-bugs
 
Reported: 2013-07-08 11:48 UTC by Govert
Modified: 2013-09-26 00:01 UTC


Attachments
output from a sample PDF (containing Greek text) (33.30 KB, application/octet-stream)
2013-07-09 08:21 UTC, Govert
the sample input PDF (123.58 KB, application/pdf)
2013-07-09 08:22 UTC, Govert
Add mappings to UnicodeDecompTables.h (497.58 KB, patch)
2013-08-13 04:46 UTC, Jason Crain
Add Unicode mappings to gen-unicode-tables.py (1.46 KB, patch)
2013-08-14 02:26 UTC, Jason Crain
Regenerate UnicodeDecompTables.h from gen-unicode-tables.py (487.67 KB, patch)
2013-08-14 02:29 UTC, Jason Crain
Normalize more characters in font Unicode map (4.57 KB, patch)
2013-08-16 11:31 UTC, Jason Crain
PDF with all BMP chars (1.66 MB, application/pdf)
2013-08-16 11:54 UTC, Jason Crain

Description Govert 2013-07-08 11:48:58 UTC
Using pdftotext and the Greek support package.

Capital omega and capital delta, and possibly other characters, appear in the output as symbols (Ohm, Delta). To the eye the difference is hardly noticeable, but searches in the text using strings that contain one or more of those letters fail because the character codes differ.
Comment 1 Albert Astals Cid 2013-07-08 12:57:56 UTC
Attach a file?
Comment 2 Govert 2013-07-09 08:21:12 UTC
Created attachment 82214
output from a sample PDF (containing Greek text)

Illustrates what happens with some Greek letters which also exist as symbols (e.g. Delta, Omega). 

For instance the word ΥΠΟΔΟΜΕΣ in the input PDF (decimal codes 933 928 927 916 927 924 917 931)

becomes

ΥΠΟ∆ΟΜΕΣ (decimal codes 933 928 927 8710 927 924 917 931)

in the output

Same for instance with ΟΡΓΑΝΩΣΗΣ (in the input PDF). This becomes ΟΡΓΑΝΩΣΗΣ, which contains the omega symbol (decimal 8486), not the letter (937)
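
The difference described above is invisible on screen but easy to show by listing code points. A minimal sketch in Python using only the standard `unicodedata` module; the helper name `describe` is made up for illustration and is not part of poppler:

```python
import unicodedata


def describe(text):
    """Return (decimal code point, Unicode name) for each character."""
    return [(ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in text]


# Letters and their look-alike symbols, written as escapes so they
# cannot be confused in an editor.
pairs = [
    ("\u0394", "\u2206"),  # capital delta vs. the symbol seen in the output
    ("\u03a9", "\u2126"),  # capital omega vs. the symbol seen in the output
]

for letter, symbol in pairs:
    print(describe(letter), describe(symbol))
```

Running this on a word copied from the PDF and on the same word copied from the pdftotext output shows which positions differ.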
Comment 3 Govert 2013-07-09 08:22:16 UTC
Created attachment 82215
the sample input PDF
Comment 4 Jason Crain 2013-08-13 04:46:21 UTC
Created attachment 83995
Add mappings to UnicodeDecompTables.h

This patch should fix the searching.  It adds some Unicode mappings to gen-unicode-tables.py and UnicodeDecompTables.h so that these Greek letters and their symbol look-alikes are treated as equivalent for searches.
Comment 5 Albert Astals Cid 2013-08-13 17:45:31 UTC
Jason, can you please provide a minimal patch, not one full of whitespace changes? It makes review harder if for every line we have to check whether it's just a whitespace change or something else changed in there.
Comment 6 Jason Crain 2013-08-14 02:26:43 UTC
Created attachment 84036
Add Unicode mappings to gen-unicode-tables.py

(In reply to comment #5)
> Jason, can you please provide a minimal patch, not one full of whitespace
> changes, makes it harder to review if for every line we have to see if it's
> just a whitespace change or something else changed in there

Due to the way gen-unicode-tables.py generates UnicodeDecompTables.h (basically, run `python gen-unicode-tables.py > UnicodeDecompTables.h' in a shell), that's not going to make the patch much smaller.  I'll split it into separate patches for gen-unicode-tables and UnicodeDecompTables.  Though I did find that two of the characters I was adding to gen-unicode-tables.py aren't necessary because they are already pulled from the python unicodedata.normalize function.
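
The comment does not say which two characters turned out to be unnecessary. As a rough check of what Python's `unicodedata.normalize` already covers, the sketch below runs NFKC over three of the symbols discussed in this bug; the dictionary and the function name `nfkc_target` are illustrative only and are not taken from gen-unicode-tables.py:

```python
import unicodedata

# Symbols mentioned in this bug report, keyed by their Unicode names.
candidates = {
    "OHM SIGN": "\u2126",
    "MICRO SIGN": "\u00b5",
    "INCREMENT": "\u2206",
}


def nfkc_target(ch):
    """Return what NFKC turns ch into (ch itself if nothing changes)."""
    return unicodedata.normalize("NFKC", ch)


for name, ch in candidates.items():
    target = nfkc_target(ch)
    status = "already handled by NFKC" if target != ch else "needs an explicit mapping"
    print(f"{name}: U+{ord(ch):04X} -> U+{ord(target):04X} ({status})")
```

Characters that NFKC leaves unchanged are the ones that would need an extra entry if they are to match their Greek-letter counterparts.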
Comment 7 Jason Crain 2013-08-14 02:29:52 UTC
Created attachment 84037
Regenerate UnicodeDecompTables.h from gen-unicode-tables.py
Comment 8 Jason Crain 2013-08-16 11:31:49 UTC
Created attachment 84138
Normalize more characters in font Unicode map

The previous patches were to fix searches.  This one should fix pdftotext output.  For this document, poppler is just using the Unicode values from the ToUnicode CMap, but acroread modifies or decomposes some of these characters, so I guess poppler should too.

The attached patch moves the normalization pass to run after the CMap is read and additionally normalizes some characters that I noticed acroread changes: some Greek letters, the OE ligatures, and the presentation forms blocks.

All three patches should be applied, preferably in this order:

Add Unicode mappings to gen-unicode-tables.py
Regenerate UnicodeDecompTables.h from gen-unicode-tables.py
Normalize more characters in font Unicode map
Comment 9 Jason Crain 2013-08-16 11:54:30 UTC
Created attachment 84140
PDF with all BMP chars

The attached PDF contains every character in the Unicode Basic Multilingual Plane (BMP).  You can use it to compare the text output differences between Adobe Reader and pdftotext.  You should copy and paste from Adobe Reader rather than save as text, because I think saving as text converts it to an encoding where most of the characters are replaced with periods.  Copying from Adobe Reader on Windows and pasting into Notepad works better than acroread on Linux, which gets strange results.
Comment 10 Albert Astals Cid 2013-08-16 22:36:42 UTC
Hmmmm, with these patches I get diffs in pdftotext output like

-äöüÄÖÜáâàfifl©®@ŒÆØæƒÿ‡‰$£çABCD…XYZ (T1 / Garamond Bold)
+äöüÄÖÜáâàfifl©®@OEÆØæƒÿ‡‰$£çABCD…XYZ (T1 / Garamond Bold)

So we are converting Œ to OE, why are we doing that?
Comment 11 Jason Crain 2013-08-16 23:16:53 UTC
(In reply to comment #10)
> So we are converting Œ to OE, why are we doing that?

Because that's what Adobe Reader does.  It's the same reason it converts Greek symbols and alphabetic/Arabic presentation forms.  If you don't think that's a good reason, then feel free to skip the "Normalize more characters" patch.  The other two by themselves should fix searching on Greek letters.
Comment 12 Albert Astals Cid 2013-08-16 23:37:48 UTC
I don't think it makes sense, but even if I did, why would we do
Œ to OE
but not
Æ to AE
?

Would applying the other two patches actually fix this bug? Because you say they will fix searching but the bug is about pdftotext
Comment 13 Adrian Johnson 2013-08-17 12:29:39 UTC
(In reply to comment #12)
> I don't think it makes sense, but even if I did, why would we do
> Œ to OE
> but not
> Æ to AE
> ?

According to wikipedia Œ is a ligature while Æ was originally a ligature but has since been promoted to the status of a letter in some languages.
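
As a side note, the Unicode character database itself gives neither Œ nor Æ a decomposition, whereas the fi and fl ligatures carry compatibility mappings. The sketch below prints what Python's `unicodedata.decomposition` reports; the wrapper name `decomposition_of` is only for illustration:

```python
import unicodedata


def decomposition_of(ch):
    """Return the raw decomposition field, or '(none)' if it is empty."""
    return unicodedata.decomposition(ch) or "(none)"


for ch in ("\u0152", "\u00c6", "\ufb01", "\ufb02"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {decomposition_of(ch)}")
```

So expanding Œ to OE is not something standard NFKC normalization does by itself; it would have to come from an additional, explicitly listed mapping.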
Comment 14 Adrian Johnson 2013-08-17 12:40:24 UTC
Comment on attachment 84138
Normalize more characters in font Unicode map

Review of attachment 84138:
-----------------------------------------------------------------

::: poppler/GfxFont.cc
@@ +1438,5 @@
> +	  || u[i] == 0x220F // ∏
> +	  || u[i] == 0x2211 // ∑
> +	  || (u[i] >= 0xFB00 && u[i] <= 0xFB4F) // Alphabetic Presentation Forms
> +	  || (u[i] >= 0xFB50 && u[i] <= 0xFDFF) // Arabic Presentation Forms-A
> +	  || (u[i] >= 0xFE70 && u[i] <= 0xFEFF) // Arabic Presentation Forms-B

I don't like the way all the characters to normalize have been shoved into an if statement like this. Could they be put in a table or something where there is a separation between the list of characters and the code to perform the normalization?

It would also be good if you could include the removed comment that provided examples of the alphabetic presentation forms 'eg "fi", "ffi"' as not everyone who will read this code is familiar with the various unicode ranges.
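
One way to read this request is a small table of ranges kept apart from the code that consults it. The sketch below is written in Python for brevity rather than in the C++ of GfxFont.cc, covers only the five entries visible in the quoted hunk (the patch lists more characters above them), and uses the made-up names `NORMALIZE_RANGES` and `should_normalize`:

```python
# Hypothetical table: (first code point, last code point, description).
# Only the entries visible in the quoted hunk are listed here.
NORMALIZE_RANGES = [
    (0x220F, 0x220F, "N-ary product"),
    (0x2211, 0x2211, "N-ary summation"),
    (0xFB00, 0xFB4F, "Alphabetic Presentation Forms, e.g. the fi and ffi ligatures"),
    (0xFB50, 0xFDFF, "Arabic Presentation Forms-A"),
    (0xFE70, 0xFEFF, "Arabic Presentation Forms-B"),
]


def should_normalize(code_point):
    """True if code_point falls inside any range in the table."""
    return any(first <= code_point <= last for first, last, _ in NORMALIZE_RANGES)


print(should_normalize(0xFB01))  # fi ligature
print(should_normalize(0x0041))  # plain 'A'
```

With this shape, adding or removing a character is a one-line change to the table, and the description column can carry the examples asked for above.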
Comment 15 Jason Crain 2013-08-17 13:49:09 UTC
(In reply to comment #12)
> I don't think it makes sense, but even if I did, why would we do
> Œ to OE
> but not
> Æ to AE
> ?

Because, for whatever reason, Adobe Reader doesn't touch Æ or æ.

You could also argue that the current pdftotext behavior for these math symbols is correct because even if Reader changes them into Greek letters, the characters are actually encoded in the document as math symbols.  I'm ambivalent because I expect someone will complain either way.

> Would applying the other two patches actually fix this bug? Because you say
> they will fix searching but the bug is about pdftotext

The original description says that search doesn't work because of the symbol/letter confusion.  I assumed Govert meant using the search feature in, for example, Evince.  The way search works, TextPage::findText calls unicodeNormalizeNFKC and searches through the normalized text.  These two patches cause unicodeNormalizeNFKC to convert the math symbols to letters and the search matches.  This works for Evince, anyway.  I haven't tested it with Okular.
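
The idea of searching through normalized text can be modelled in a few lines. This is only a Python sketch of the concept, not poppler's TextPage::findText or unicodeNormalizeNFKC; the names `normalize_for_search` and `find_text` and the single extra mapping are assumptions made for illustration:

```python
import unicodedata

# Assumed extra mapping for a symbol that plain NFKC leaves untouched.
EXTRA_MAPPINGS = str.maketrans({"\u2206": "\u0394"})  # INCREMENT -> capital delta


def normalize_for_search(text):
    """Apply NFKC, then the extra symbol-to-letter mappings."""
    return unicodedata.normalize("NFKC", text).translate(EXTRA_MAPPINGS)


def find_text(haystack, needle):
    """Return the match index in the normalized haystack, or -1."""
    return normalize_for_search(haystack).find(normalize_for_search(needle))


page_text = "\u03a5\u03a0\u039f\u2206\u039f\u039c\u0395\u03a3"  # word with the symbol
query = "\u03a5\u03a0\u039f\u0394\u039f\u039c\u0395\u03a3"      # word with the letter

print(page_text.find(query))        # plain search
print(find_text(page_text, query))  # normalized search
```

Note that the returned index refers to the normalized text, which can differ in length from the original when a character expands to several.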
Comment 16 Jason Crain 2013-08-21 06:11:09 UTC
I am not going to recommend patch 3 because I am not sure if it is a good idea to change all of these characters.  I still do recommend the other two patches for gen-unicode-tables.py and UnicodeDecompTables.h because they improve search in Evince for these letters.

Govert: can you clarify what you mean when you say searches fail?  Are you searching through pdftotext output?  Or do you mean the search in Evince or Okular or some document viewer?
Comment 17 Govert 2013-08-21 13:42:07 UTC
(In reply to comment #16)
> ... can you clarify what you mean when you say searches fail? Are you
> searching through pdftotext output? Or do you mean the search in Evince or
> Okular or some document viewer?

Using pdftotext text output.

Regards, Govert
Comment 18 Jason Crain 2013-08-21 16:15:30 UTC
In that case, I don't think this will be changed.  We'll have to see
if Albert would prefer to leave them as math symbols as they are
encoded in the document (as Foxit, Firefox, and Google Chrome do), or
change them to Greek letters like Adobe Reader does.
Comment 19 Albert Astals Cid 2013-08-25 19:58:10 UTC
To be honest, I don't see why pdftotext should output a symbol as another symbol, unless it's obvious that the first symbol is *exclusively* there for a typographical reason, like the "fl" and "fi" ligatures.

OTOH if the code is not a lot to maintain I would not be opposed to adding a non-default option that did that conversion.

About searching, yes, I agree that if you search for Symbol1 and what's in the PDF is Symbol2 (but that is "technically" the same thing), it would make sense for the search algorithm to try to match it, but I would still want the "getPageText()" methods to give me Symbol2 (i.e. what was really in the PDF file).

So as far as I can see there are two things happening in this bug:
 a) pdftotext doing conversion of some symbols to others
 b) search handling symbol mappings

Am I right in the analysis?

Now my question: how much is a) related to b)? Can they be handled in different bugs, or does it make more sense to handle them together here?
Comment 20 Govert 2013-09-18 10:00:50 UTC
My apologies for this very late reaction. I just found this message (comment #19) in my “unwanted mail” folder.

I am not a specialist in this field, I’m not even Greek... The problem that I experienced is not about symbols being output as other symbols or even the need for that, it’s about certain letters in the PDF being output as symbols in the text.

Example: a PDF containing text with a word that is displayed as ΑΓΝΩΣΤΗ in Adobe Reader (copied/pasted here from the Reader) is output as ΑΓΝΩΣΤΗ by pdftotext (copied/pasted here from the text output). Those words look equal, but they are not, because the original Ω and the Ω in the text output are different. The Ω in the text output is the omega symbol (U+2126, the ohm sign used in electronics, among other places), not the letter Ω (U+03A9). Searching for “ΑΓΝΩΣΤΗ” in the text output finds nothing.

I am using a (stupid?) workaround for the time being: I convert all symbols 'µ', '∆' and 'Ω' in the text output to the letters 'μ', 'Δ' and 'Ω' before starting the search.
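
The workaround described above could look roughly like this in Python. It is a sketch of the idea only; the names `SYMBOL_TO_LETTER` and `symbols_to_letters` are invented here and the table holds just the three characters mentioned:

```python
# Map each look-alike symbol to the corresponding Greek letter.
SYMBOL_TO_LETTER = str.maketrans({
    "\u00b5": "\u03bc",  # MICRO SIGN -> GREEK SMALL LETTER MU
    "\u2206": "\u0394",  # INCREMENT  -> GREEK CAPITAL LETTER DELTA
    "\u2126": "\u03a9",  # OHM SIGN   -> GREEK CAPITAL LETTER OMEGA
})


def symbols_to_letters(text):
    """Replace the three symbols with letters; leave everything else alone."""
    return text.translate(SYMBOL_TO_LETTER)


extracted = "\u0391\u0393\u039d\u2126\u03a3\u03a4\u0397"  # word containing the ohm sign
wanted = "\u0391\u0393\u039d\u03a9\u03a3\u03a4\u0397"     # same word with the letter omega

print(wanted in extracted)                      # before conversion
print(wanted in symbols_to_letters(extracted))  # after conversion
```

The conversion is applied to the extracted text before searching, so the PDF and the pdftotext output themselves stay untouched.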



Comment 21 Jason Crain 2013-09-18 17:25:39 UTC
(In reply to comment #20)
> I am not a specialist in this field, I’m not even Greek... The problem that I
> experienced is not about symbols being output as other symbols or even the need
> for that, it’s about certain letters in the PDF being output as symbols in the
> text.

The document specifies that these characters are math symbols and
poppler is doing what the document requests.  If you want it fixed,
you probably should complain to whoever created the document.

If there is still an interest I'll see if I can add an option to
pdftotext (does -convchars sound OK?  I'm terrible with names).  I
haven't had much free time, so it won't be until this weekend that I
can look at it.
Comment 22 Govert 2013-09-19 09:46:54 UTC
> If there is still an interest I'll see if I can add an option to pdftotext (does -convchars sound OK?  I'm terrible with names).  I haven't had much free time, so it won't be until this weekend that I can look at it.

Thanks. Don't bother, unless it is seen as a general improvement; I can live with my own workaround.
Comment 23 Albert Astals Cid 2013-09-25 18:26:49 UTC
I'm with Jason here, we are doing the right thing by giving back what the document says it has.
Comment 24 James Cloos 2013-09-26 00:01:52 UTC
As an aside:

IIRC, Adobe at some point changed its recommendation for the mappings of
certain PostScript glyph names to/from UCS character codes.

That may have something to do with /why/ this document uses the wrong
ToUnicode mapping for those characters.

If so, that probably also is why Adobe added code to prefer the text
characters when the symbols are specified w/in running Ελληνικά.

