Summary: | Wrong selection with umlauts | ||
---|---|---|---|
Product: | poppler | Reporter: | Germán Poo-Caamaño <gpoo+bfdo> |
Component: | glib frontend | Assignee: | poppler-bugs <poppler-bugs> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | normal | ||
Priority: | medium | ||
Version: | unspecified | ||
Hardware: | Other | ||
OS: | All | ||
URL: | https://bugzilla.gnome.org/show_bug.cgi?id=397650 | ||
Attachments: | PDF test with umlauts |
Description
Germán Poo-Caamaño
2013-07-04 03:14:15 UTC
Selection is OK for me, but after copying, the clipboard contains:

    a ̈ o u ̈ ̈ ß

This PDF uses fonts with a non-standard encoding that is built into the fonts. This makes it rather tricky to convert the PDF to text, to extract or copy'n'paste from it, or to get a screen reader to read the document aloud!

Not even Apple or Adobe get it completely right. Pasting from Preview.app on a Mac, I get this:

    a ̈ o ̈ u ̈ ß • a ̈ • o ̈ • u ̈ •ß 1

Pasting from Adobe Acrobat Pro on a Mac, I get this:

    a o u a o u 1

Acrobat doesn't even display the bullet in front of the first item in the list (in front of the 'ä')!

Using Poppler's pdftotext -layout gives this:

    a¨ ¨ ou ¨ß • ¨ a • ¨ o • u ¨ • ß

So even these tools have problems pasting correctly from LaTeX-originating PDFs!

The reason is this: very frequently, LaTeX uses "digraphs" (a base letter plus a separately drawn accent glyph) to create composite characters, *NOT* the real umlaut glyphs (named 'adieresis', 'udieresis' and 'odieresis' in PDF parlance) which are provided by non-LaTeX fonts.

To show you this, I uncompressed the page content stream and see this:

    5 0 obj
    <<
    /Length 570
    >>
    stream
    BT
    /F8 9.9626 Tf 121.577 726.257 Td [<7f>]TJ 0 0.434 Td [(a)]TJ
    8.302 -0.434 Td [<7f>]TJ 0 0.434 Td [(o)]TJ
    8.579 -0.434 Td [<7f>]TJ -0.277 0.434 Td [(u)-333<19>]TJ/F14 9.9626 Tf
    -11.623 -21.918 Td [<0f>]TJ/F8 9.9626 Tf
    9.963 -0.434 Td [<7f>]TJ 0 0.434 Td [(a)]TJ/F14 9.9626 Tf
    -9.963 -19.925 Td [<0f>]TJ/F8 9.9626 Tf
    9.963 -0.434 Td [<7f>]TJ 0 0.434 Td [(o)]TJ/F14 9.9626 Tf
    -9.963 -19.925 Td [<0f>]TJ/F8 9.9626 Tf
    10.24 -0.434 Td [<7f>]TJ -0.277 0.434 Td [(u)]TJ/F14 9.9626 Tf
    -9.963 -19.926 Td [<0f>]TJ/F8 9.9626 Tf
    9.963 0 Td [<19>]TJ
    158.626 -486.177 Td [(1)]TJ
    ET
    endstream
    endobj

What you can see here is the frequent occurrence of the hex character codes <7f>, <19> and <0f>. These translate to '\177', '\031' and '\017' in octal, and to 'DEL', 'EM' and 'SI' in ASCII.
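The hex-to-octal-to-ASCII conversion of the three codes above can be double-checked with a few lines of Python. This is only a sketch: the `CONTROL_NAMES` table and the `describe` helper are names made up for this illustration, and the table lists just the three control characters that occur in the stream.

```python
# ASCII control-character names for the three codes seen in the content stream.
# Hypothetical lookup table; only the entries needed here are listed.
CONTROL_NAMES = {0x0F: "SI", 0x19: "EM", 0x7F: "DEL"}


def describe(hex_code):
    """Return (octal escape, ASCII name) for a two-digit hex code such as '7f'."""
    value = int(hex_code, 16)
    octal_escape = "\\" + format(value, "03o")  # e.g. 0x7f -> \177
    return octal_escape, CONTROL_NAMES.get(value, "?")


for code in ("7f", "19", "0f"):
    octal, name = describe(code)
    print(f"<{code}> -> {octal} -> {name}")
```

All three codes fall in the ASCII control range, so a viewer that treats them as plain ASCII has nothing printable to paste.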
I suspect one of these signs is meant to represent the 'ß' in the builtin font encoding, the next one is a 'bullet' and the last a 'dieresis' to construct the umlauts. To investigate further, I used this command:

    grep -a CharSet poppler-bug#66569.pdf

It gave this output:

    /CharSet (/a/dieresis/germandbls/o/one/u)
    /CharSet (/bullet)

This confirms my suspicion: the embedded font 'CMR10' is subsetted to include only the glyphs for

  * 'a'
  * 'o'
  * 'u'
  * 'dieresis'
  * 'germandbls' (German double s == ß)
  * 'one' (the page number shown at the bottom of the page)

The other font, CMSY10, has only one glyph:

  * 'bullet'

LaTeX is good for preparing print- and read-ready PDF files. It is bad for creating PDFs which you want to make accessible: people who need accessibility features in their documents (e.g. to enable a screen reader) have the same problems as people who want to copy'n'paste from the documents.

----

Poppler may have many problems with copy'n'pasting text from PDFs. This issue is not one of them...
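To illustrate what a text extractor would have to do with such digraphs, here is a minimal Python sketch. It assumes, as the drawing order in the content stream suggests, that each 'dieresis' glyph is emitted immediately before the base letter it sits on. The `GLYPH_TO_TEXT` table and the `compose` function are hypothetical names for this example, not part of Poppler, and the sketch ignores glyph positions, which a real extractor would need in order to decide whether an accent actually overlaps a letter.

```python
import unicodedata

COMBINING_DIAERESIS = "\u0308"

# Glyph names taken from the /CharSet output, mapped to Unicode by hand.
# This table is an assumption for illustration; it is not read from the PDF.
GLYPH_TO_TEXT = {
    "a": "a", "o": "o", "u": "u", "one": "1",
    "germandbls": "ß", "bullet": "•",
}


def compose(glyph_names):
    """Merge each 'dieresis' with the glyph that follows it in drawing order."""
    out = []
    pending_accent = False
    for name in glyph_names:
        if name == "dieresis":
            pending_accent = True  # remember the accent, emit nothing yet
            continue
        text = GLYPH_TO_TEXT[name]
        if pending_accent:
            # NFC turns base letter + combining mark into the precomposed umlaut.
            text = unicodedata.normalize("NFC", text + COMBINING_DIAERESIS)
            pending_accent = False
        out.append(text)
    return "".join(out)


# Drawing order of the first text line, as inferred from the stream.
first_line = ["dieresis", "a", "dieresis", "o", "dieresis", "u", "germandbls"]
print(compose(first_line))  # prints äöüß
```

Without a step like this, the accent and the base letter are pasted as two separate characters, which matches the outputs quoted in this report.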