I received a PDF file created by pdflatex / MiKTeX on Windows. Running pdftotext 0.24.3 on it produces very strange results:

original text:
Die Organisation der Flugrettung in Österreich

pdftotext output:
❉✐❡ ❖r❣❛♥✐s❛t✐♦♥ ❞❡r ❋❧✉❣r❡tt✉♥❣ ✐♥ Öst❡rr❡✐❝❤

Some sort of glyph conversion problem?

The problem is reproducible: I installed MiKTeX on my Windows VM, ran pdflatex on the tex file and got the same results. I did a number of checks on the PDF with Adobe Acrobat and various other tools, and it seems valid. Unfortunately, I do not own the file and cannot attach it here.

pdfinfo says:
Creator:  LaTeX with hyperref package
Producer: pdfTeX-1.40.12

pdffonts output:
[none]         Type 3  Custom   yes  no   no  303  0
[none]         Type 3  Custom   yes  no   no  304  0
[none]         Type 3  Custom   yes  no   no  305  0
[none]         Type 3  Custom   yes  no   no  306  0
UEIZYW+CMSY10  Type 1  Builtin  yes  yes  no  307  0
FRNIHB+CMSY8   Type 1  Builtin  yes  yes  no  308  0
[none]         Type 3  Custom   yes  no   no  348  0
[none]         Type 3  Custom   yes  no   no  349  0
[none]         Type 3  Custom   yes  no   no  350  0
[none]         Type 3  Custom   yes  no   no  435  0
[none]         Type 3  Custom   yes  no   no  436  0
[none]         Type 3  Custom   yes  no   no  470  0
[none]         Type 3  Custom   yes  no   no  480  0

Nothing unusual in the tex file either:
\usepackage[german]{babel}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amssymb, amsmath} %arithmetic symbols, math enhancement package
\usepackage[round]{natbib} %bibliography
\usepackage{graphicx} %figures
\usepackage{hyperref} %hyperlinks
%% additional packages
\usepackage{tipa} %phonetics

Strangely, this problem does not occur if I create the PDF on Linux.
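For anyone triaging similar files, here is a minimal sketch that parses pdffonts rows like the ones above and flags fonts that are Type 3 with no ToUnicode map. It assumes the usual pdffonts column order (name, type, encoding, emb, sub, uni, object ID), which is not shown in the report, and that font names contain no spaces. The names `FontRow`, `parse_pdffonts_row` and `risky_fonts` are made up for illustration.

```python
from typing import List, NamedTuple


class FontRow(NamedTuple):
    name: str
    type: str
    encoding: str
    embedded: bool
    subset: bool
    to_unicode: bool
    object_id: str


def parse_pdffonts_row(line: str) -> FontRow:
    """Parse one data row of pdffonts output (header lines excluded).

    Assumption: the last six tokens are encoding, emb, sub, uni and the
    two-part object ID; everything between the name and those tokens is
    the font type, which may contain a space (e.g. "Type 3").
    """
    tokens = line.split()
    name = tokens[0]
    font_type = " ".join(tokens[1:-6])
    encoding, emb, sub, uni = tokens[-6:-2]
    object_id = " ".join(tokens[-2:])
    return FontRow(name, font_type, encoding,
                   emb == "yes", sub == "yes", uni == "yes", object_id)


def risky_fonts(rows: List[FontRow]) -> List[FontRow]:
    """Type 3 fonts without a ToUnicode map: text extraction has to guess."""
    return [r for r in rows if r.type == "Type 3" and not r.to_unicode]


# Two rows copied from the pdffonts output in this report.
SAMPLE = """\
[none] Type 3 Custom yes no no 303 0
UEIZYW+CMSY10 Type 1 Builtin yes yes no 307 0
"""

rows = [parse_pdffonts_row(line) for line in SAMPLE.splitlines() if line.strip()]
for font in risky_fonts(rows):
    print("Type 3 font without ToUnicode, object", font.object_id)
```

This is only a heuristic: a flagged font does not prove that extraction will fail, it just marks where the extractor has no explicit character mapping to rely on.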
> Strangely this problem does not occur if I create the pdf on Linux.

And why did you open against poppler and not against what creates the pdf file?
(In reply to comment #1)
> > Strangely this problem does not occur if I create the pdf on Linux.
>
> And why did you open against poppler and not against what creates the pdf
> file?

Analysis of the pdf did not reveal anything weird. Adobe's tools and pdf2txt.py have no trouble converting it to text, so I concluded it was more likely a poppler issue.
Ok, if you have a pdf from which Adobe Reader can extract the text and poppler can't, please attach it.
Created attachment 89960
pdf that results in Mojibake

This is a pdf created by MiKTeX on Windows. I have no rights to its content and had most of it removed.
bug #60243 again with ZapfDingbats character names :) These producers really need to start including CMaps.
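Until a fixed poppler is available, the substitution could in principle be undone after extraction. The following is a rough sketch of that idea, not the patch from bug #60243: it derives a repair table purely from the sample pair quoted in the first comment, so it only covers the letters that happen to occur there. `build_repair_table` and `repair` are hypothetical helper names.

```python
# Sample pair taken from the first comment of this report.
ORIGINAL = "Die Organisation der Flugrettung in Österreich"
GARBLED = "❉✐❡ ❖r❣❛♥✐s❛t✐♦♥ ❞❡r ❋❧✉❣r❡tt✉♥❣ ✐♥ Öst❡rr❡✐❝❤"


def build_repair_table(garbled: str, original: str) -> dict:
    """Map each wrongly extracted symbol back to the letter it replaced."""
    if len(garbled) != len(original):
        raise ValueError("sample strings must align character by character")
    table = {}
    for bad, good in zip(garbled, original):
        if bad == good:
            continue  # this character was extracted correctly
        # Refuse to build a table if one symbol maps to two different letters.
        if table.setdefault(ord(bad), good) != good:
            raise ValueError(f"conflicting mapping for {bad!r}")
    return table


REPAIR = build_repair_table(GARBLED, ORIGINAL)


def repair(text: str) -> str:
    """Undo the substitution for every symbol present in the table."""
    return text.translate(REPAIR)


print(repair(GARBLED))
```

Note that a table like this would also rewrite any dingbat that the document uses intentionally, so it is a stopgap for indexing at best.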
(In reply to comment #5) > These producers really need to start including CMaps. And stop using Type 3 fonts.
(In reply to comment #5)
> bug #60243 again with ZapfDingbats character names :)
>
> These producers really need to start including CMaps.

Thanks! I applied your patch and recompiled poppler 0.24.4. Works like a charm.

f ligatures don't seem to be extracted correctly (like the fl in "flying"; I cannot paste the resulting char here), but I can live with that. I only need the pdf's content for language detection and indexing, and I strip all non-printable chars before passing the content to the indexer.
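A minimal sketch of that kind of cleanup step, in case it is useful to others. It assumes the ligature arrives as one of the Unicode presentation forms (U+FB00 to U+FB06), which this report does not confirm; if the extractor emits a control character instead, the ligature is simply dropped along with the other non-printable characters. `clean_for_indexing` is a made-up name.

```python
import unicodedata


def clean_for_indexing(text: str) -> str:
    """Expand compatibility ligatures, then drop non-printable characters."""
    # NFKC replaces compatibility characters such as U+FB02 (fl ligature)
    # with their plain-letter equivalents.
    expanded = unicodedata.normalize("NFKC", text)
    # Keep newlines and tabs so word and line boundaries survive.
    return "".join(ch for ch in expanded if ch.isprintable() or ch in "\n\t")


print(clean_for_indexing("\ufb02ying"))
```

NFKC also folds other compatibility characters (full-width forms, superscript digits and so on), which is usually harmless for language detection and indexing but does change the text.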
*** This bug has been marked as a duplicate of bug 60243 ***