Bug 72127

Summary: Mojibake when converting a pdf created with miktex on win
Product: poppler
Component: utils
Reporter: scriabin <seinsvergessen>
Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED DUPLICATE
Severity: normal
Priority: medium
CC: jason
Version: unspecified
Hardware: x86 (IA32)
OS: Linux (All)
Attachments: pdf that results in Mojibake

Description scriabin 2013-11-28 15:35:30 UTC
I received a pdf file created by pdflatex / miktex on windows.

Running pdftotext 0.24.3 on it produces very strange results:

original text:
Die Organisation der Flugrettung in
Österreich 

pdftotext output:
❉✐❡ ❖r❣❛♥✐s❛t✐♦♥ ❞❡r ❋❧✉❣r❡tt✉♥❣ ✐♥
Öst❡rr❡✐❝❤ 

Some sort of glyph conversion problem?

The problem is reproducible. I installed miktex on my windows VM, ran pdflatex on the tex file and got the same results.

I did a number of checks on the pdf with adobe acrobat and various tools and it seems valid. Unfortunately, I do not own the file and cannot attach it here.

pdfinfo says: 
Creator:        LaTeX with hyperref package
Producer:       pdfTeX-1.40.12

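pdffonts output:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
[none]                               Type 3            Custom           yes no  no     303  0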
[none]                               Type 3            Custom           yes no  no     304  0
[none]                               Type 3            Custom           yes no  no     305  0
[none]                               Type 3            Custom           yes no  no     306  0
UEIZYW+CMSY10                        Type 1            Builtin          yes yes no     307  0
FRNIHB+CMSY8                         Type 1            Builtin          yes yes no     308  0
[none]                               Type 3            Custom           yes no  no     348  0
[none]                               Type 3            Custom           yes no  no     349  0
[none]                               Type 3            Custom           yes no  no     350  0
[none]                               Type 3            Custom           yes no  no     435  0
[none]                               Type 3            Custom           yes no  no     436  0
[none]                               Type 3            Custom           yes no  no     470  0
[none]                               Type 3            Custom           yes no  no     480  0
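
For triage, a listing like the one above can be scanned for Type 3 fonts whose "uni" column is "no", i.e. fonts without a ToUnicode map, which are the ones text extraction has to guess at. The following Python sketch does that on pdffonts output passed in as a string. The function name and the regular expression are illustrative assumptions, not part of poppler, and the pattern assumes single-token encoding names.

```python
import re

# One pdffonts row, matched from the right-hand side:
# <name and type> <encoding> <emb> <sub> <uni> <object number> <generation>
ROW = re.compile(
    r"^(?P<head>.+?)\s+(?P<encoding>\S+)\s+"
    r"(?P<emb>yes|no)\s+(?P<sub>yes|no)\s+(?P<uni>yes|no)\s+"
    r"(?P<obj>\d+)\s+(?P<gen>\d+)\s*$"
)

# Three rows copied from the report.
SAMPLE = """\
[none]                               Type 3            Custom           yes no  no     303  0
[none]                               Type 3            Custom           yes no  no     304  0
UEIZYW+CMSY10                        Type 1            Builtin          yes yes no     307  0
"""


def type3_fonts_without_tounicode(pdffonts_output):
    """Return object numbers of Type 3 fonts that have no ToUnicode map."""
    hits = []
    for line in pdffonts_output.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header, separator, or blank line
        is_type3 = match["head"].rstrip().endswith("Type 3")
        if is_type3 and match["uni"] == "no":
            hits.append(int(match["obj"]))
    return hits


print(type3_fonts_without_tounicode(SAMPLE))
```

To run it on a real file, the text could come from something like `subprocess.run(["pdffonts", path], capture_output=True, text=True).stdout`.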



Nothing unusual in the tex file either.

\usepackage[german]{babel} 
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amssymb, amsmath} %arithmetic symbols, math enhancement package
\usepackage[round]{natbib} %bibliography
\usepackage{graphicx} %figures
\usepackage{hyperref} %hyperlinks
%% additional packages
\usepackage{tipa} %phonetics

Strangely this problem does not occur if I create the pdf on Linux.
Comment 1 Albert Astals Cid 2013-11-28 15:39:58 UTC
> Strangely this problem does not occur if I create the pdf on Linux.

And why did you open against poppler and not against what creates the pdf file?
Comment 2 scriabin 2013-11-28 15:45:53 UTC
(In reply to comment #1)
> > Strangely this problem does not occur if I create the pdf on Linux.
> 
> And why did you open against poppler and not against what creates the pdf
> file?

Analysis of the pdf did not reveal anything weird. Adobe tools and pdf2txt.py have no trouble converting it to text, so I concluded that this is more likely a poppler issue.
Comment 3 Albert Astals Cid 2013-11-28 15:48:48 UTC
Ok, if you have a pdf from which Adobe Reader can extract the text and poppler can't, please attach it.
Comment 4 scriabin 2013-11-28 16:04:19 UTC
Created attachment 89960
pdf that results in Mojibake

This is a pdf created by miktex on windows. I have no rights to its content and have removed most of it.
Comment 5 Jason Crain 2013-11-28 20:30:50 UTC
bug #60243 again with ZapfDingbats character names :)

These producers really need to start including CMaps.
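
The sample in the description already shows the pattern: each affected letter is replaced consistently by one dingbat. The following Python sketch (not poppler code) derives that substitution table from the two sample strings and uses it to repair similar text. The function names are made up for illustration. The likely cause, stated here as an assumption rather than something confirmed in this report, is that the Type 3 glyphs carry names of the form a<char code>, which collide with ZapfDingbats glyph names; for example, U+2749 is ZapfDingbats a68, and 68 is the character code of D.

```python
# Sample pair taken from the bug description (line break replaced by a space).
ORIGINAL = "Die Organisation der Flugrettung in Österreich"
EXTRACTED = "❉✐❡ ❖r❣❛♥✐s❛t✐♦♥ ❞❡r ❋❧✉❣r❡tt✉♥❣ ✐♥ Öst❡rr❡✐❝❤"


def build_substitution_table(original, extracted):
    """Map each garbled character to the character it replaced."""
    if len(original) != len(extracted):
        raise ValueError("samples must align character by character")
    table = {}
    for plain, garbled in zip(original, extracted):
        if plain == garbled:
            continue  # character survived extraction unchanged
        # setdefault keeps the first mapping seen; a different later value
        # would mean the substitution is not consistent.
        if table.setdefault(garbled, plain) != plain:
            raise ValueError(f"inconsistent mapping for U+{ord(garbled):04X}")
    return table


def repair(text, table):
    """Replace every garbled character that appears in the table."""
    return text.translate(str.maketrans(table))


TABLE = build_substitution_table(ORIGINAL, EXTRACTED)

# Print code points only, so the listing is safe on any terminal encoding.
for garbled, plain in sorted(TABLE.items(), key=lambda item: item[1]):
    print(f"U+{ord(garbled):04X} -> {plain!r} (char code {ord(plain)})")
```

A table built this way only covers letters that occur in the sample, so it is a diagnostic aid rather than a fix; the proper fix is on the extraction side.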
Comment 6 Adrian Johnson 2013-11-28 20:55:07 UTC
(In reply to comment #5)
> These producers really need to start including CMaps.

And stop using Type 3 fonts.
Comment 7 scriabin 2013-11-28 21:28:21 UTC
(In reply to comment #5)
> bug #60243 again with ZapfDingbats character names :)
> 
> These producers really need to start including CMaps.

Thanks!

I applied your patch and recompiled poppler 0.24.4. Works like a charm.

f ligatures don't seem to be displayed correctly (like fl in flying; I cannot paste the resulting char here), but I can live with that. I only need the pdf's content for language detection + indexing, and I strip all non-printable chars before passing the content to the indexer.
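
A minimal Python sketch of such a cleanup step, under the assumption that it runs on the pdftotext output as a Unicode string; the function name is hypothetical. NFKC normalization expands ligatures only when they arrive as Unicode compatibility characters (such as U+FB02). If a ligature comes out as a control character instead, this sketch drops it rather than expanding it.

```python
import unicodedata


def clean_for_indexing(text):
    """Expand compatibility ligatures and drop non-printable characters."""
    # NFKC folds compatibility characters such as U+FB02 into plain letters.
    normalized = unicodedata.normalize("NFKC", text)
    # Keep ordinary whitespace; drop every code point whose general category
    # starts with "C" (control, format, surrogate, private use, unassigned).
    return "".join(
        ch
        for ch in normalized
        if ch in "\n\t " or not unicodedata.category(ch).startswith("C")
    )


print(clean_for_indexing("\ufb02ying\x02 high"))
```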
Comment 8 Albert Astals Cid 2013-11-28 23:42:24 UTC

*** This bug has been marked as a duplicate of bug 60243 ***
