60243 – Evince cannot copy text from PDF documents with Computer Modern fonts

Status:	RESOLVED FIXED

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	general
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	poppler-bugs
QA Contact:

URL:
Whiteboard:
Keywords:

Duplicates (1):	72127
Depends on:
Blocks:

Reported:	2013-02-03 16:59 UTC by Gökçen Eraslan
Modified:	2013-11-30 16:32 UTC
CC List:	2 users

See Also:

Attachments
Problematic PDF file using CM fonts (3.80 KB, application/pdf) 2013-02-03 16:59 UTC, Gökçen Eraslan
Copying works fine in the file using LM fonts (24.05 KB, application/pdf) 2013-02-03 17:00 UTC, Gökçen Eraslan
Use ZapfDingbats names to locate glyphs only (14.71 KB, patch) 2013-08-06 02:15 UTC, Jason Crain
The file where square changes to I (84.37 KB, application/octet-stream) 2013-11-03 22:55 UTC, Albert Astals Cid
Limit use of ZapfDingbats character names (16.29 KB, patch) 2013-11-26 12:05 UTC, Jason Crain

Description Gökçen Eraslan 2013-02-03 16:59:18 UTC

Created attachment 74145
Problematic PDF file using CM fonts

When I open a PDF file created with LyX, which uses Computer Modern fonts by default, I cannot copy text out of Evince. For example, when I create a simple PDF containing the text "poppler test page" and try to copy all of it, the result is:

♣♦♣♣❧❡r t❡st ♣❛❣❡

However, when I tell LyX to use Latin Modern fonts, the result is OK:

poppler test page

Both files render fine, but copying is impossible if CM is used.

I have attached the problematic file that makes use of Computer Modern fonts. When I uncompress the files, I can see the same text in both of them:

BT
 /F15 9.9626 Tf  148.712 657.235 Td  [(p)-28(oppler)-333(test)-333(page)]TJ
 154.422 -567.87 Td  [(1)]TJ
ET
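
For illustration, the text in this content stream can be recovered directly from the string operands of the TJ array. The Python sketch below is not poppler code: the function name and the -200 threshold for treating an adjustment as a word space are assumptions made for this example, and escaped parentheses inside strings are not handled.

```python
import re

# Matches either a (string) operand or a numeric position adjustment.
TOKEN = re.compile(r"\(([^()]*)\)|(-?\d+(?:\.\d+)?)")

def tj_text(array, space_threshold=-200.0):
    """Join the string operands of a TJ array into plain text.

    Assumption: an adjustment below space_threshold (in thousandths of
    a text space unit) is wide enough to count as a word space.
    """
    parts = []
    for match in TOKEN.finditer(array):
        if match.group(1) is not None:
            parts.append(match.group(1))      # literal string operand
        elif float(match.group(2)) < space_threshold:
            parts.append(" ")                 # large gap, treat as a space
    return "".join(parts)

print(tj_text("[(p)-28(oppler)-333(test)-333(page)]TJ"))
# prints: poppler test page
```

The small -28 adjustment is ordinary kerning and is dropped, while the two -333 adjustments become spaces.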

Comment 1 Gökçen Eraslan 2013-02-03 17:00:16 UTC

Created attachment 74146
Copying works fine in the file using LM fonts

Comment 2 Gökçen Eraslan 2013-02-03 17:01:28 UTC

By the way I'm using poppler 0.20.4-0ubuntu1.1, evince 3.6.0-0ubuntu2 and okular 4:4.9.4-0ubuntu0.1 in Ubuntu. Results are the same in both okular and evince.

Comment 3 Gökçen Eraslan 2013-02-03 17:15:14 UTC

I should also note that acroread can copy text from both files correctly.

Comment 4 James Cloos 2013-02-04 03:37:10 UTC

One significant difference is that the LM fonts are Type 1, whereas the
version of the CM fonts LyX embeds is Type 3, built from the bitmaps
generated by Metafont.

Neither includes a ToUnicode map, but the LM fonts have typical
character names whereas the embedded version of CM has generic glyph
names.

Acroread, in that case, must trust that the ()-encoded text passed to TJ
is ASCII, or perhaps Latin-1 or Adobe standard encoding.

Comment 5 politza 2013-04-12 20:27:58 UTC

This seems to be a regression. Copying and searching text in the
attached document (copy-bug-CM.pdf) works in Evince 2.30.3/poppler
0.12.4.

-ap

Comment 6 Albert Astals Cid 2013-04-14 23:17:23 UTC

Thanks, politza, for the info. I did a git bisect, and the commit to blame is:

126bf08105e319f9216654782e5a63f99f1d1825 is the first bad commit
commit 126bf08105e319f9216654782e5a63f99f1d1825
Author: Albert Astals Cid <aacid@kde.org>
Date:   Sun Feb 19 23:18:25 2012 +0100

    Update glyph names to Unicode values mapping
    
    Added Zapf Dingbat names and fixed copyrightsans, copyrightserif, registersans, registerserif, trademarksans, trademarkserif
    Kudos to Adrian Johnson for find what was missing :-)
    Bug #13131

:040000 040000 7cbaf40a93f8e49c086f80ad7d1054a10ae8a060 c4e3427944390237da8ec2c1e53e545ac815d667 M      poppler



Adrian, any clue what may be going wrong in there?

Comment 7 Adrian Johnson 2013-04-16 14:12:11 UTC

The Type 3 font is using the same glyph names as Zapf Dingbats (/a112, /a111, /a108 etc). As the PDF does not provide a ToUnicode map, poppler uses the glyph names. Before the Zapf Dingbats glyph names were added, poppler assumed the glyph codes were text.
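
The effect can be reproduced with a small Python sketch. It is only an illustration, not poppler's implementation: the table is a hypothetical subset whose values are inferred from the garbled output in the description, and the "a<character code>" naming scheme is an assumption based on the names /a112, /a111 and /a108.

```python
# Hypothetical subset of a glyph-name table; values are inferred from the
# garbled output in the bug description, not copied from poppler.
ZAPF_NAMES = {
    "a97": "❛",
    "a101": "❡",
    "a103": "❣",
    "a108": "❧",
    "a111": "♦",
    "a112": "♣",
}

def glyph_name(code):
    # Assumption: the Type 3 font names each glyph "a<character code>".
    return "a%d" % code

def extract(text, name_table):
    """Map each character code via its glyph name, else use the code itself."""
    out = []
    for code in text.encode("ascii"):
        mapped = name_table.get(glyph_name(code))
        out.append(mapped if mapped is not None else chr(code))
    return "".join(out)

print(extract("poppler test page", ZAPF_NAMES))  # names known: dingbats
print(extract("poppler test page", {}))          # names unknown: plain text
```

With the Zapf Dingbats names in the table, the first call yields the same string reported in the description; with an empty table, the character codes are used and the text comes out unchanged.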

Comment 8 Albert Astals Cid 2013-04-25 22:43:39 UTC

So the great question here is: how does Adobe get it right if they are supposedly using the same mappings as we are?

Should we not use that mapping for Type 3 fonts like the one in this bug?

Comment 9 Jason Crain 2013-08-06 02:15:43 UTC

Created attachment 83687
Use ZapfDingbats names to locate glyphs only

(In reply to comment #8)
> So the great question here is, how does Adobe do it right if they are
> supposedly using the same mappings as we are?
> 
> Should we not use that mapping for Type 3 fonts like the one in this bug?

I've created a few test PDFs to see how acroread uses character names.  The short answer is they aren't using the same mappings.

I've found that acroread ignores most character names for text extraction. This includes ZapfDingbats (a1-a206), but also many others. In total, acroread only uses about 700 of the 4k names in NameToUnicodeTable.h. For the names it ignores, it uses the character code as the text. As stated in comment #7, this bug occurs because poppler uses ZapfDingbats names to find the text mapping, while acroread doesn't.

But acroread *does* use character names when finding the glyph to display (in most cases - there seems to be some special treatment if the base font is ZapfDingbats).  I think it has less trouble finding the correct glyph because it brings along its own fonts.

For text extraction: if we try to emulate acroread too closely, some PDFs show regressions with text extraction.  Mostly documents with mathematical symbols and a couple which include names like f.alt, uniFB00, or g84.  Poppler parses these names, changes "f.alt" into "f" and looks it up through NameToUnicodeTable.h, and parses the other two as hex or decimal Unicode values.  As far as I can tell, acroread ignores these names and just uses the character code.
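
As a rough illustration of the name handling described above, here is a Python sketch. It is not poppler's code: the function name, the tiny table, the order of the rules, and the exact patterns accepted are all assumptions.

```python
import re

# Hypothetical stand-in for the entries of NameToUnicodeTable.h.
NAME_TABLE = {"f": "f", "omega": "ω"}

def name_to_text(name):
    """Return the text for a glyph name, or None if the name is not understood."""
    base = name.split(".", 1)[0]            # "f.alt" -> "f"
    if base in NAME_TABLE:
        return NAME_TABLE[base]
    m = re.fullmatch(r"uni([0-9A-Fa-f]{4})", base)
    if m:                                   # "uniFB00" -> U+FB00 (hex)
        return chr(int(m.group(1), 16))
    m = re.fullmatch(r"[A-Za-z](\d{1,5})", base)
    if m:                                   # "g84" -> U+0054 (decimal 84)
        return chr(int(m.group(1)))
    return None

for name in ("f.alt", "uniFB00", "g84", "a102"):
    print(name, repr(name_to_text(name)))
```

Under these assumptions "a102" falls into the decimal rule and yields Unicode value 102, which is the ambiguity discussed in the next paragraph.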

The ZapfDingbats names are problematic because they are so generic. It is unlikely that a PDF producer would choose "omega" as a name unless it really wants U+03C9 GREEK SMALL LETTER OMEGA, but I can see a producer generating a ZapfDingbats name like a102 and expecting a reader to use the character code like acroread, or Unicode value 102 like poppler.

From bug #13131, it looks like the ZapfDingbats mappings are useful for locating glyphs, but this bug shows they shouldn't be used for text extraction.  The attached patch moves the ZapfDingbats names in NameToUnicodeTable.h into a separate table and separates looking up Unicode values for text and for glyph IDs.
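
The idea of the patch can be sketched in Python as follows. This is a simplified model rather than the patch itself: the table and function names are hypothetical, and the entries are a small subset chosen for the example.

```python
# Hypothetical subsets; the real entries come from NameToUnicodeTable.h.
NAME_TO_UNICODE_TEXT = {"omega": "ω"}
NAME_TO_UNICODE_ZAPF = {"a112": "♣", "a111": "♦", "a108": "❧"}

def unicode_for_text(name):
    """Text extraction: consult only the general table."""
    return NAME_TO_UNICODE_TEXT.get(name)

def unicode_for_glyph(name):
    """Glyph location: consult the general table, then the ZapfDingbats one."""
    value = NAME_TO_UNICODE_TEXT.get(name)
    if value is None:
        value = NAME_TO_UNICODE_ZAPF.get(name)
    return value

print(unicode_for_text("a112"))   # None, so the caller can use the character code
print(unicode_for_glyph("a112"))  # the club symbol, still usable to find a glyph
```

Keeping the two lookups separate lets text extraction fall back to the character code for /a112 without losing the glyph mapping that bug #13131 needed.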

Comment 10 Carlos Garcia Campos 2013-10-22 11:15:09 UTC

(In reply to comment #9)

It makes sense to me. Albert, could you run the tests with the patch?

Comment 11 Albert Astals Cid 2013-10-22 16:54:20 UTC

Sorry, somehow I missed the patch. I'll run the rendering and pdftotext regression tests as soon as possible, which may not be this week since I have quite a busy week.

Comment 12 Carlos Garcia Campos 2013-10-23 09:08:59 UTC

(In reply to comment #11)
> Sorry, somehow I missed the patch. I'll run the rendering and pdftotext
> regression tests as soon as possible, which may not be this week since I
> have quite a busy week.

Sure, no hurry. Thanks!

Comment 13 Albert Astals Cid 2013-11-03 22:52:16 UTC

This doesn't look good. In the file I'll attach, the pdftotext output changes as follows:
-■ Guido Westerwelles Beitrag zur aktuellen Debatte: „Wer Deutschland für kapitalistisch hält,
+I Guido Westerwelles Beitrag zur aktuellen Debatte: „Wer Deutschland für kapitalistisch hält,

The square is the correct character.

Comment 14 Albert Astals Cid 2013-11-03 22:55:41 UTC

Created attachment 88583
The file where square changes to I

Comment 15 Jason Crain 2013-11-08 06:51:59 UTC

Acroread copy & paste shows "n Guido Westerwelles" for that file, so I'm not sure poppler has any real obligation to show a square.

The PDF has font "KJPQRP+ZapfDingbats", which uses character name /a73.  I suppose you could treat a font with ZapfDingbats in the name specially and allow these /aXXX characters.  Otherwise I don't think you can have both of these files work right.

What do you say?  It feels kludgy, but won't be hard to add a special case for ZapfDingbats fonts.

Comment 16 Albert Astals Cid 2013-11-14 22:25:28 UTC

If that special casing allows us to not regress and it's not loooooooots of lines of code, I'd say that we should give it a go.

Comment 17 Jason Crain 2013-11-26 12:05:33 UTC

Created attachment 89833
Limit use of ZapfDingbats character names

Updated patch.  Only use ZapfDingbats names to locate glyphs or for text extraction with ZapfDingbats fonts.
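
The rule in the updated patch can be summarised by a small decision function. The Python sketch below is only a model of that rule: the function name, the "glyph"/"text" purpose strings, and the substring test on the font name are assumptions, and "ExampleType3Font" is a made-up name.

```python
def use_zapf_names(purpose, font_name):
    """Decide whether ZapfDingbats character names may be consulted.

    purpose is "glyph" (locating a glyph) or "text" (text extraction).
    """
    if purpose == "glyph":
        return True
    # Text extraction: only for fonts with ZapfDingbats in the name,
    # which also covers subset names such as "KJPQRP+ZapfDingbats".
    return "ZapfDingbats" in font_name

print(use_zapf_names("text", "KJPQRP+ZapfDingbats"))  # True
print(use_zapf_names("text", "ExampleType3Font"))     # False
print(use_zapf_names("glyph", "ExampleType3Font"))    # True
```

This keeps the square in the file from comment 14 while letting the Computer Modern file from the description fall back to character codes.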

Comment 18 Albert Astals Cid 2013-11-28 23:42:24 UTC

*** Bug 72127 has been marked as a duplicate of this bug. ***

Comment 19 Albert Astals Cid 2013-11-30 16:32:27 UTC

Committed to master; it will be in poppler >= 0.25.0.

I have not committed it to stable because, even if it makes total sense, it's a bit too big for me to be totally comfortable with it.

Thanks a lot for the patch!
