Bug 66569 - Wrong selection with umlauts
Summary: Wrong selection with umlauts
Status: RESOLVED DUPLICATE of bug 87215
Alias: None
Product: poppler
Classification: Unclassified
Component: glib frontend
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
URL: https://bugzilla.gnome.org/show_bug.c...
Reported: 2013-07-04 03:14 UTC by Germán Poo-Caamaño
Modified: 2015-04-17 04:32 UTC

Attachments
PDF test with umlauts (6.96 KB, application/pdf)
2013-07-04 03:14 UTC, Germán Poo-Caamaño

Description Germán Poo-Caamaño 2013-07-04 03:14:15 UTC
Created attachment 82002
PDF test with umlauts

Forwarded from GNOME Bugzilla. The following bug appears when using evince or poppler-glib-demo to select, copy, and paste text:

When selecting text in PDF files, I often end up with wrong selections when,
for example, German umlauts are involved. An example:


---
    a                                              a            Danach kann die
Be-
nutzerin aus allen Gruppen-IDs eine ausw¨hlen. Daraufhin wird ein Textfeld
angezeigt,
                                        a
---

The correct result would be:

---
Danach kann die Be-nutzerin aus allen Gruppen-IDs eine auswählen. Daraufhin
wird ein Textfeld angezeigt,
---


The attached PDF provides a minimal (German) test case. Pasting its first line results in:

---
ao ̈
 ̈ ̈uß
---

The fonts are Type 1, and the GNOME bug report contains the TeX source, which was compiled with pdflatex.

https://bugzilla.gnome.org/show_bug.cgi?id=397650
Comment 1 gadelat+freedesktop 2013-08-07 20:34:47 UTC
Selection is OK for me, but pasting the copied text gives:

a ̈
o u
 ̈
 ̈ ß
Comment 2 kurt.pfeifle 2013-11-07 22:34:17 UTC
This PDF uses fonts with a non-standard encoding that is built into the fonts.

This makes it rather tricky to convert the PDF to text, to extract or copy and paste from it, or to get a screen reader to read the document aloud! Not even Apple or Adobe get it completely right:

Pasting from Preview.app on a Mac, I get this:

  a ̈ o ̈ u ̈ ß • a ̈
  • o ̈ • u ̈ •ß
  1

Pasting from Adobe Acrobat Pro on a Mac, I get this:

  a o u 
   a
   o
   u
   
  1

Acrobat doesn't even display the bullet in front of the first item in the list (in front of the 'ä')!

Using Poppler's pdftotext -layout produces this:

  a¨
  ¨ ou
     ¨ß
  
  • ¨
    a
  • ¨
    o
  • u
    ¨
  • ß

So even these tools have trouble pasting text from LaTeX-originating PDFs correctly!

The reason is this: very frequently, LaTeX builds composite characters from two separate glyphs (an accent plus a base letter), *NOT* from the real umlaut glyphs (named 'adieresis', 'odieresis' and 'udieresis' in PDF parlance) that other fonts provide.

To show you this, I uncompressed the page content stream and saw this:

  5 0 obj
  <<
    /Length 570
  >>
  stream
  BT
  /F8 9.9626 Tf 121.577 726.257 Td [<7f>]TJ 0 0.434 Td [(a)]TJ 8.302 -0.434 Td [<7f>]TJ 0 0.434 Td [(o)]TJ 8.579 -0.434 Td [<7f>]TJ -0.277 0.434 Td [(u)-333<19>]TJ/F14 9.9626 Tf -11.623 -21.918 Td [<0f>]TJ/F8 9.9626 Tf 9.963 -0.434 Td [<7f>]TJ 0 0.434 Td [(a)]TJ/F14 9.9626 Tf -9.963 -19.925 Td [<0f>]TJ/F8 9.9626 Tf 9.963 -0.434 Td [<7f>]TJ 0 0.434 Td [(o)]TJ/F14 9.9626 Tf -9.963 -19.925 Td [<0f>]TJ/F8 9.9626 Tf 10.24 -0.434 Td [<7f>]TJ -0.277 0.434 Td [(u)]TJ/F14 9.9626 Tf -9.963 -19.926 Td [<0f>]TJ/F8 9.9626 Tf 9.963 0 Td [<19>]TJ 158.626 -486.177 Td [(1)]TJ
  ET
  endstream
  endobj

What you can see here is the frequent occurrence of...

 ... the hex character codes <7f>, <19> and <0f>,
 ... which are '\177', '\031' and '\017' in octal, and
 ... which correspond to 'DEL', 'EM' and 'SI' in ASCII.
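
The same observation can be reproduced mechanically. The following is a minimal sketch, not a real PDF parser: it assumes a simple regular expression is enough for this particular stream, tracks the font selected by each Tf operator, and tallies the hex string codes per font. The names STREAM, TOKEN, ASCII_NAMES and codes_by_font are made up for this illustration.

```python
import re
from collections import defaultdict

# Text-showing part of the content stream from object 5 above.
STREAM = (
    "/F8 9.9626 Tf 121.577 726.257 Td [<7f>]TJ 0 0.434 Td [(a)]TJ "
    "8.302 -0.434 Td [<7f>]TJ 0 0.434 Td [(o)]TJ "
    "8.579 -0.434 Td [<7f>]TJ -0.277 0.434 Td [(u)-333<19>]TJ"
    "/F14 9.9626 Tf -11.623 -21.918 Td [<0f>]TJ"
    "/F8 9.9626 Tf 9.963 -0.434 Td [<7f>]TJ 0 0.434 Td [(a)]TJ"
    "/F14 9.9626 Tf -9.963 -19.925 Td [<0f>]TJ"
    "/F8 9.9626 Tf 9.963 -0.434 Td [<7f>]TJ 0 0.434 Td [(o)]TJ"
    "/F14 9.9626 Tf -9.963 -19.925 Td [<0f>]TJ"
    "/F8 9.9626 Tf 10.24 -0.434 Td [<7f>]TJ -0.277 0.434 Td [(u)]TJ"
    "/F14 9.9626 Tf -9.963 -19.926 Td [<0f>]TJ"
    "/F8 9.9626 Tf 9.963 0 Td [<19>]TJ 158.626 -486.177 Td [(1)]TJ"
)

# Either a font selection ("/F8 9.9626 Tf") or a one-byte hex string ("<7f>").
# Assumption: good enough for this stream only, not for PDF syntax in general.
TOKEN = re.compile(r"/(?P<font>F\d+)\s+[\d.]+\s+Tf|<(?P<hex>[0-9A-Fa-f]{2})>")

# ASCII control-character names for the three codes discussed here.
ASCII_NAMES = {0x0F: "SI", 0x19: "EM", 0x7F: "DEL"}


def codes_by_font(stream):
    """Return {font name: [hex codes in order of appearance]}."""
    current_font = None
    seen = defaultdict(list)
    for match in TOKEN.finditer(stream):
        if match.group("font"):
            current_font = match.group("font")
        else:
            seen[current_font].append(int(match.group("hex"), 16))
    return dict(seen)


if __name__ == "__main__":
    for font, codes in codes_by_font(STREAM).items():
        for code in sorted(set(codes)):
            print(
                f"{font}: <{code:02x}> hex, {code:03o} octal, "
                f"ASCII {ASCII_NAMES[code]}, used {codes.count(code)} times"
            )
```

In this stream, <0f> only ever appears while /F14 is selected, whereas <7f> and <19> only appear with /F8.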

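I suspect that, in the fonts' built-in encodings, one of these codes represents the 'ß', another the 'bullet' and the third the 'dieresis' used to construct the umlauts. (Judging from the stream, <7f> precedes every 'a', 'o' and 'u', so it is presumably the dieresis; <0f> is the only code drawn with /F14, so it is presumably the bullet; that leaves <19> for the 'ß'.) To investigate further, I used this command: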

  grep -a CharSet poppler-bug#66569.pdf

It gave this output:

  /CharSet (/a/dieresis/germandbls/o/one/u)
  /CharSet (/bullet)

This confirms my suspicion: the embedded font 'CMR10' is subsetted to include only
the glyphs for

 * 'a'
 * 'o'
 * 'u'
 * 'dieresis'
 * 'germandbls' (German double s == ß)
 * 'one'  (at the bottom of the page the page number is shown)

the other, CMSY10, only has one glyph:

 * 'bullet'
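
For anyone who prefers a script to grep, the CharSet lookup above can be approximated in Python. This is only a sketch under the same assumption the grep command relies on, namely that the font descriptors are stored uncompressed in the file; the helper name charsets and the sample bytes are made up for the illustration.

```python
import re

# Matches "/CharSet (/a/dieresis/...)" in the raw bytes of a PDF file.
CHARSET_RE = re.compile(rb"/CharSet\s*\(([^)]*)\)")


def charsets(pdf_bytes):
    """Return one list of glyph names per /CharSet entry found."""
    return [
        [name.decode("ascii") for name in match.group(1).split(b"/") if name]
        for match in CHARSET_RE.finditer(pdf_bytes)
    ]


if __name__ == "__main__":
    # Stand-in for open(path, "rb").read(); the two entries are the ones
    # quoted in the grep output above.
    sample = (
        b"<< /CharSet (/a/dieresis/germandbls/o/one/u) >>\n"
        b"<< /CharSet (/bullet) >>\n"
    )
    for glyphs in charsets(sample):
        print(glyphs)
```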

LaTeX is good for preparing print- and read-ready PDF files. It is bad for creating PDFs which you want to make accessible: people who need accessibility features in their documents (e.g. to enable a screen reader) have the same problems as people who want to copy and paste from the documents.
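
To illustrate what a text extractor would have to do with such a file, here is a rough sketch that recombines an accent glyph with the base letter that follows it, using Unicode normalization. It is not how Poppler or any other viewer is known to work: the glyph-name tables and the recombine helper are assumptions made for this example, and the sketch ignores glyph positions, which a real extractor would have to take into account.

```python
import unicodedata

# Illustrative tables for the glyph names found in this PDF's CharSet entries.
GLYPH_TO_TEXT = {
    "a": "a",
    "o": "o",
    "u": "u",
    "germandbls": "\u00df",  # ß
    "one": "1",
    "bullet": "\u2022",
}
ACCENT_TO_COMBINING = {"dieresis": "\u0308"}  # COMBINING DIAERESIS


def recombine(glyph_names):
    """Attach each accent to the base glyph drawn after it, then compose (NFC)."""
    pieces = []
    pending = ""  # combining marks waiting for their base letter
    for name in glyph_names:
        if name in ACCENT_TO_COMBINING:
            pending += ACCENT_TO_COMBINING[name]
        else:
            # Unicode wants the base letter first, then the combining mark.
            pieces.append(GLYPH_TO_TEXT[name] + pending)
            pending = ""
    pieces.append(pending)  # a trailing accent without a base is kept as-is
    return unicodedata.normalize("NFC", "".join(pieces))


if __name__ == "__main__":
    # Glyph order of the first line, as drawn in the content stream.
    first_line = ["dieresis", "a", "dieresis", "o", "dieresis", "u", "germandbls"]
    print(recombine(first_line))  # prints: äöüß
```

The accent has to be reordered because the stream draws the dieresis before its base letter, while Unicode expects the combining mark after it.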

----

Poppler may have many problems with copying and pasting text from PDFs. This issue is not one of them...
Comment 3 Jason Crain 2015-04-17 04:32:37 UTC

*** This bug has been marked as a duplicate of bug 87215 ***

