Summary: | Evince cannot copy text from PDF documents with Computer Modern fonts | ||
---|---|---|---|
Product: | poppler | Reporter: | Gökçen Eraslan <gokcen.eraslan> |
Component: | general | Assignee: | poppler-bugs <poppler-bugs> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | normal | ||
Priority: | medium | CC: | jason, seinsvergessen |
Version: | unspecified | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
Attachments: |
Problematic PDF file using CM fonts
Copying works fine in the file using LM fonts
Use ZapfDingbats names to locate glyphs only
The file where square changes to I
Limit use of ZapfDingbats character names |
Created attachment 74146 [details]
Copying works fine in the file using LM fonts
By the way, I'm using poppler 0.20.4-0ubuntu1.1, evince 3.6.0-0ubuntu2 and okular 4:4.9.4-0ubuntu0.1 in Ubuntu. Results are the same in both okular and evince. I must also state that acroread can copy text from both files correctly.

One significant difference is that the LM fonts are Type 1, while the version of the CM fonts that LyX embeds is Type 3, built from the bitmaps generated by Metafont. Neither includes a ToUnicode map, but the LM fonts have typical character names whereas the embedded version of CM has generic glyph names. Acroread, in that case, must trust that the ()-encoded text passed to TJ is ASCII. Or perhaps Latin-1 or Adobe Standard Encoding.

This seems to be a regression. Copying and searching text in the attached document (copy-bug-CM.pdf) works in Evince 2.30.3/poppler 0.12.4.
-ap

Thanks politza for the info. I did a git bisect and the commit to blame is:

126bf08105e319f9216654782e5a63f99f1d1825 is the first bad commit
commit 126bf08105e319f9216654782e5a63f99f1d1825
Author: Albert Astals Cid <aacid@kde.org>
Date: Sun Feb 19 23:18:25 2012 +0100
Update glyph names to Unicode values mapping
Added Zapf Dingbat names and fixed copyrightsans, copyrightserif, registersans, registerserif, trademarksans, trademarkserif
Kudos to Adrian Johnson for find what was missing :-)
Bug #13131
:040000 040000 7cbaf40a93f8e49c086f80ad7d1054a10ae8a060 c4e3427944390237da8ec2c1e53e545ac815d667 M poppler

Adrian, any clue what may be going wrong in there?

The Type 3 font is using the same glyph names as Zapf Dingbats (/a112, /a111, /a108 etc). As the PDF does not provide a ToUnicode map, poppler uses the glyph names. Before the Zapf glyph names were added, poppler assumed the glyph codes were text.

So the great question here is, how does Adobe do it right if they are supposedly using the same mappings as we are?

Should we not use that mapping for Type 3 fonts like the one in this bug?
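To make the regression concrete, here is a minimal Python sketch of the two behaviours. It is not poppler's code. It assumes the embedded Type 3 font names each glyph "a" plus the decimal character code (which is consistent with /a112, /a111, /a108), and `ZAPF_EXCERPT` is a hypothetical six-entry excerpt holding only the mappings needed to reproduce the output reported in this bug.

```python
# Hypothetical excerpt of a ZapfDingbats name -> Unicode table. Only the
# entries needed to reproduce the garbled output from this report are listed.
ZAPF_EXCERPT = {
    "a97": "❛",
    "a101": "❡",
    "a103": "❣",
    "a108": "❧",
    "a111": "♦",
    "a112": "♣",
}


def glyph_name(code: int) -> str:
    """Assumed naming scheme of the Type 3 font: 'a' + decimal code."""
    return f"a{code}"


def text_by_code(codes: bytes) -> str:
    """Behaviour before the Zapf names were added: codes are the text."""
    return codes.decode("latin-1")


def text_by_name(codes: bytes) -> str:
    """Behaviour after: a glyph name found in the table wins over the code."""
    return "".join(ZAPF_EXCERPT.get(glyph_name(c), chr(c)) for c in codes)


words = [b"poppler", b"test", b"page"]
print(" ".join(text_by_code(w) for w in words))
print(" ".join(text_by_name(w) for w in words))
```

Letters such as r, s and t survive in the second line because no table entry matches their assumed glyph names, so the sketch falls back to the character code.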
Created attachment 83687 [details] [review]
Use ZapfDingbats names to locate glyphs only

(In reply to comment #8)
> So the great question here is, how does Adobe do it right if they are
> supposedly using the same mappings as we are?
>
> Should we not use that mapping for Type 3 fonts like the one in this bug?

I've created a few test PDFs to see how acroread uses character names. The short answer is they aren't using the same mappings.

I've found that acroread ignores most character names for text extraction. This includes ZapfDingbats (a1-a206), but also many others. In total, acroread only uses about 700 of the 4k names in NameToUnicodeTable.h. In this case, it uses the character code for text. As stated in comment #7, this bug is because poppler uses ZapfDingbats names to find the text mapping, while acroread doesn't.

But acroread *does* use character names when finding the glyph to display (in most cases; there seems to be some special treatment if the base font is ZapfDingbats). I think it has less trouble finding the correct glyph because it brings along its own fonts.

For text extraction: if we try to emulate acroread too closely, some PDFs show regressions with text extraction. These are mostly documents with mathematical symbols, plus a couple which include names like f.alt, uniFB00, or g84. Poppler parses these names, changes "f.alt" into "f" and looks it up through NameToUnicodeTable.h, and parses the other two as hex or decimal Unicode values. As far as I can tell, acroread ignores these names and just uses the character code.

The ZapfDingbats names are problematic because they are so generic. It is unlikely that a PDF producer would choose "omega" as a name unless it really wants U+03C9 GREEK SMALL LETTER OMEGA, but I can see a producer generating a ZapfDingbats name like a102 and expecting a reader to use the character code like acroread, or Unicode value 102 like poppler.
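The name handling described above can be sketched roughly as follows. This is an illustration in Python, not poppler's implementation: `NAME_TO_UNICODE` is a hypothetical two-entry stand-in for NameToUnicodeTable.h, and the exact patterns accepted for hex and decimal names are assumptions made for the example.

```python
import re

# Hypothetical stand-in for NameToUnicodeTable.h (the real table has
# thousands of entries).
NAME_TO_UNICODE = {"f": "f", "omega": "\u03c9"}


def name_to_text(name: str, code: int) -> str:
    """Map a glyph name to text, falling back to the character code."""
    base = name.split(".", 1)[0]  # drop a variant suffix: "f.alt" -> "f"
    if base in NAME_TO_UNICODE:
        return NAME_TO_UNICODE[base]
    match = re.fullmatch(r"uni([0-9A-Fa-f]{4})", base)
    if match:  # "uniFB00" -> U+FB00, read as hex
        return chr(int(match.group(1), 16))
    match = re.fullmatch(r"[A-Za-z]{1,2}(\d{1,5})", base)
    if match:  # "g84" -> U+0054, read as decimal
        return chr(int(match.group(1)))
    return chr(code)  # nothing usable in the name: use the code


for glyph in ("f.alt", "uniFB00", "g84", "a102", "omega"):
    print(glyph, "->", name_to_text(glyph, 0))
```

With no ZapfDingbats entries in the table, a102 is read as decimal 102, which matches the "Unicode value 102" interpretation mentioned above. A reader that ignores the name would return `chr(code)` for all five.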
From bug #13131, it looks like the ZapfDingbats mappings are useful for locating glyphs, but this bug shows they shouldn't be used for text extraction. The attached patch moves the ZapfDingbats names in NameToUnicodeTable.h into a separate table and separates looking up Unicode values for text and for glyph IDs.

(In reply to comment #9)
> From bug #13131, it looks like the ZapfDingbats mappings are useful for
> locating glyphs, but this bug shows they shouldn't be used for text
> extraction. The attached patch moves the ZapfDingbats names in
> NameToUnicodeTable.h into a separate table and separates looking up Unicode
> values for text and for glyph IDs.

It makes sense to me. Albert, could you pass the tests with the patch?

Sorry, somehow I missed the patch. I'll run rendering and pdftotext regtests as soon as possible, which may not be this week since I have quite a busy week.

(In reply to comment #11)
> Sorry, somehow i missed the patch, i'll run rendering and pdftotext regtests
> as soon as possible, which may not be this week since i have quite a busy
> week.

Sure, no hurry. Thanks!

Doesn't look good. In the file I'll attach, the pdftotext output changes from

-■ Guido Westerwelles Beitrag zur aktuellen Debatte: „Wer Deutschland für kapitalistisch hält,
+I Guido Westerwelles Beitrag zur aktuellen Debatte: „Wer Deutschland für kapitalistisch hält,

where the square is the correct character.

Created attachment 88583 [details]
The file where square changes to I
Acroread copy & paste shows "n Guido Westerwelles" for that file, so I'm not sure poppler has any real obligation to show a square. The PDF has font "KJPQRP+ZapfDingbats", which uses character name /a73. I suppose you could treat a font with ZapfDingbats in the name specially and allow these /aXXX characters. Otherwise I don't think you can have both of these files work right.

What do you say?

It feels kludgy, but it won't be hard to add a special case for ZapfDingbats fonts.

If that special casing allows us to not regress and it doesn't take lots of lines of code, I'd say that we should give it a go.

Created attachment 89833 [details] [review]
Limit use of ZapfDingbats character names

Updated patch. Only use ZapfDingbats names to locate glyphs or for text extraction with ZapfDingbats fonts.

*** Bug 72127 has been marked as a duplicate of this bug. ***

Committed to master, will be in poppler >= 0.25.0.

I have not committed to stable because, even if it makes total sense, it's a bit too big for me to be totally comfortable with it.

Thanks a lot for the patch!
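For readers who want the rule from the updated patch in one place, here is a minimal Python sketch of the decision it describes. It is an illustration only, not the committed code: the table contents are hypothetical excerpts, the substring test on the font name is an assumption, and "CMR10" is a made-up name for the Computer Modern font.

```python
# Hypothetical excerpts; the real tables are far larger.
NAME_TO_UNICODE = {"omega": "\u03c9"}
ZAPF_NAME_TO_UNICODE = {"a73": "■", "a112": "♣"}


def lookup(name: str, code: int, font_name: str, for_text: bool) -> str:
    """Resolve a glyph name for text extraction or for locating a glyph."""
    if name in NAME_TO_UNICODE:
        return NAME_TO_UNICODE[name]
    # ZapfDingbats names always help to locate a glyph, but for text
    # extraction they are only trusted when the font is a ZapfDingbats font.
    zapf_allowed = (not for_text) or ("ZapfDingbats" in font_name)
    if zapf_allowed and name in ZAPF_NAME_TO_UNICODE:
        return ZAPF_NAME_TO_UNICODE[name]
    return chr(code)  # otherwise fall back to the character code


# Text extraction from the Westerwelle file keeps the square.
print(lookup("a73", 0x6E, "KJPQRP+ZapfDingbats", for_text=True))
# Text extraction from the CM file now yields the plain letter.
print(lookup("a112", 112, "CMR10", for_text=True))
# Locating a glyph may still use the ZapfDingbats name.
print(lookup("a112", 112, "CMR10", for_text=False))
```

Under these assumptions both test files behave as wanted: the ZapfDingbats font still produces the square, and the Type 3 CM font no longer turns "p" into a club symbol.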
Created attachment 74145 [details]
Problematic PDF file using CM fonts

When I open a PDF file created with LyX, which uses Computer Modern fonts by default, I cannot copy text out of Evince. For example, when I create a simple PDF containing the text "poppler test page" and try to copy all of it, the result is:

♣♦♣♣❧❡r t❡st ♣❛❣❡

However, when I tell LyX to use Latin Modern fonts, the result is OK:

poppler test page

Both files render fine, but copying is impossible if CM is used. I have attached the problematic file that makes use of Computer Modern fonts.

When I uncompress the files, I can see the same text in both of them:

BT
/F15 9.9626 Tf 148.712 657.235 Td [(p)-28(oppler)-333(test)-333(page)]TJ
154.422 -567.87 Td [(1)]TJ
ET
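The TJ array in that content stream already carries the plain character codes, which is why a reader that trusts the codes can recover the text. As a rough illustration, the Python sketch below pulls the string operands out of the array and treats a large negative adjustment as a word gap. The function name, the -200 threshold and the simplified tokenizer (no escape sequences or nested parentheses) are assumptions for this example, not how any particular PDF reader works.

```python
import re

# Matches either a simple (...) string operand or a numeric adjustment.
TOKEN = re.compile(r"\(([^()\\]*)\)|(-?\d+(?:\.\d+)?)")


def tj_to_text(array: str, gap_threshold: float = -200.0) -> str:
    """Join the strings of a TJ array, inserting spaces at large gaps."""
    out = []
    for literal, number in TOKEN.findall(array):
        if number:
            # Negative adjustments move the next glyph further right; a
            # large one is assumed to be a word space (-333 here), a small
            # one is just kerning (-28 here).
            if float(number) < gap_threshold:
                out.append(" ")
        else:
            out.append(literal)
    return "".join(out)


print(tj_to_text("[(p)-28(oppler)-333(test)-333(page)]"))
```

Run on the array from the report, this prints "poppler test page": the -28 between "p" and "oppler" is treated as kerning, and each -333 becomes a space.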