Bug 21775

Summary: pdftotext text extraction failure
Product: poppler
Reporter: Rico <risanecek>
Component: general
Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED NOTABUG
QA Contact:
Severity: major    
Priority: medium    
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:

Description Rico 2009-05-17 03:24:38 UTC
pdftotext fails to extract text from several PDF files in my electronic library. It works flawlessly on the other 400 files of the same type, but about 5 files produce either no output at all (example 1) or gibberish (example 2).

I'm using "pdftotext version 0.10.6" on Gentoo Linux (x86-64 arch).
Comment 1 Rico 2009-05-17 05:10:53 UTC
PDF examples are under:

http://92.43.104.34/pdf/
Comment 2 Albert Astals Cid 2009-05-17 10:56:54 UTC
Not a bug: the PDF doesn't contain correct font -> text mappings. Note that Adobe Reader can't extract the text either.
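
To illustrate the point about font -> text mappings, here is a toy sketch of why a viewer can draw a page that still cannot be extracted. It is not poppler's code. The tables `GLYPHS` and `TO_UNICODE` and both functions are hypothetical, and the fallback to the raw code is a simplification. The idea is that drawing only needs a character code -> glyph shape lookup, while extraction needs a separate character code -> Unicode lookup (in real PDFs typically a ToUnicode CMap or a standard encoding), which these files lack or get wrong.

```python
# Toy model (assumption, not poppler's implementation): a font carries a
# code -> glyph table for drawing and, optionally, a code -> Unicode table
# for text extraction.

GLYPHS = {1: "<outline of H>", 2: "<outline of i>"}  # hypothetical embedded font
TO_UNICODE = {1: "H", 2: "i"}                          # hypothetical mapping table


def render(codes, glyphs):
    """Drawing only needs the glyph shape for each character code."""
    return [glyphs[c] for c in codes]


def extract(codes, to_unicode):
    """Extraction needs code -> Unicode; without it, the raw code leaks out."""
    return "".join(to_unicode.get(c, chr(c)) for c in codes)


codes = [1, 2]
drawn = render(codes, GLYPHS)        # works with or without TO_UNICODE
good = extract(codes, TO_UNICODE)    # readable text
broken = extract(codes, {})          # control characters, i.e. gibberish
```

In this model the page renders identically in both cases, which matches the observation that the viewers display the documents fine while the extracted text is empty or unreadable.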
Comment 3 Rico 2009-05-17 23:02:43 UTC
I see. My belief (perhaps naive, given my ignorance of PDF internals) was that if acroread/xpdf/... can render it (deterministically, every time ;-)) and select text, the information must be in that document.

So no chance to reconstruct that broken doc? Would be nice if pdftotext had some --tryrealhard switch.

Anyway thanks for your consideration, keep up the good work.

Comment 4 Albert Astals Cid 2009-05-18 11:09:51 UTC
no chance, unless you go the OCR route
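
For a library of a few hundred files, the handful that need the OCR route can be found automatically. The following is a minimal sketch under stated assumptions: `pdftotext` is on the PATH, and output that is empty or mostly non-alphanumeric counts as failed. The helper names and the 0.5 threshold are made up for illustration. The heuristic will miss gibberish that happens to consist of ordinary letters.

```python
import subprocess


def alnum_ratio(text):
    """Fraction of non-whitespace characters that are letters or digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalnum() for c in chars) / len(chars)


def needs_ocr(text, threshold=0.5):
    """Heuristic (assumption): empty or mostly non-alphanumeric output failed."""
    return alnum_ratio(text) < threshold


def extract_text(pdf_path):
    """Run pdftotext; "-" as the output file sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"], capture_output=True, check=False
    )
    return result.stdout.decode("utf-8", errors="replace")


def files_needing_ocr(pdf_paths):
    """Return the subset of paths whose extracted text looks unusable."""
    return [p for p in pdf_paths if needs_ocr(extract_text(p))]
```

Calling `files_needing_ocr` with the list of PDF paths in the library would return the candidates to hand to an OCR tool of your choice.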
