pdftotext fails to extract text from several PDF files in my electronic library. It works flawlessly on the other 400 files of the same type, but about 5 files produce either no output at all (example 1) or gibberish (example 2). I'm using "pdftotext version 0.10.6" on Gentoo Linux (x86-64 arch).
PDF examples are under: http://92.43.104.34/pdf/
Not a bug. The PDF doesn't contain the correct font -> text mappings; note that Adobe Reader can't extract the text either.
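For anyone wanting to find such files in a large library, here is a minimal sketch of a check. It assumes that `pdftotext file.pdf -` writes the extracted text to stdout. The helper names `extract_text` and `looks_unextractable` and the 0.6 threshold are hypothetical, chosen for illustration. The heuristic only catches output that is empty or dominated by control or private-use characters; gibberish made of ordinary letters would slip through.

```python
import string
import subprocess


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return whatever it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        text=True,
        errors="replace",  # tolerate bytes that are not valid in the locale encoding
    )
    return result.stdout


def looks_unextractable(text: str, min_ratio: float = 0.6) -> bool:
    """Heuristic (an assumption, not a pdftotext feature): flag text that is
    empty or mostly made of characters that are not letters, digits,
    whitespace, or ASCII punctuation."""
    stripped = text.strip()
    if not stripped:
        return True
    plausible = sum(
        1
        for ch in stripped
        if ch.isalnum() or ch.isspace() or ch in string.punctuation
    )
    return plausible / len(stripped) < min_ratio
```

Calling `looks_unextractable(extract_text(path))` for each file in the library would then list candidates to inspect by hand.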
I see. My belief (perhaps naive, given my ignorance of PDF internals) was that if acroread/xpdf/... can render the document (deterministically, every time ;-)) and select text, the information must be in it. So is there no chance to reconstruct that broken doc? It would be nice if pdftotext had some --tryrealhard switch. Anyway, thanks for your consideration; keep up the good work.
No chance, unless you go the OCR route.
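A rough sketch of what the OCR route could look like, not something pdftotext does itself. It assumes a poppler build whose `pdftoppm` supports `-png` (likely newer than 0.10.6) and an installed `tesseract` with the chosen language data. The function names, the 300 dpi default, and the `eng` language are illustrative assumptions.

```python
import subprocess
import tempfile
from pathlib import Path


def rasterize_cmd(pdf, prefix, dpi=300):
    """Command that renders each PDF page to <prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image, lang="eng"):
    """Command that OCRs one image and prints the text to stdout."""
    return ["tesseract", str(image), "stdout", "-l", lang]


def page_number(path: Path) -> int:
    """Page index taken from the numeric suffix pdftoppm appends."""
    return int(path.stem.rsplit("-", 1)[1])


def ocr_pdf(pdf, dpi=300, lang="eng"):
    """Render the PDF to images, OCR each page, and join the results."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_cmd(pdf, prefix, dpi), check=True)
        texts = []
        # Sort numerically so page-10 does not come before page-2.
        for image in sorted(Path(tmp).glob("page-*.png"), key=page_number):
            result = subprocess.run(
                ocr_cmd(image, lang), check=True, capture_output=True, text=True
            )
            texts.append(result.stdout)
    return "\n".join(texts)
```

The recovered text is only as good as the OCR engine, so expect recognition errors that a correct font -> text mapping would not produce.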