Bug 21775

Summary:	pdftotext text extraction failure
Product:	poppler	Reporter:	Rico <risanecek>
Component:	general	Assignee:	poppler-bugs <poppler-bugs>
Status:	RESOLVED NOTABUG	QA Contact:
Severity:	major
Priority:	medium
Version:	unspecified
Hardware:	x86-64 (AMD64)
OS:	Linux (All)
Whiteboard:

Description Rico 2009-05-17 03:24:38 UTC

pdftotext fails to extract text from several PDF files in my electronic library. It works flawlessly on the other 400 files of the same type, but about 5 files produce either no output at all (example 1) or gibberish (example 2).

I'm using "pdftotext version 0.10.6" on Gentoo Linux (x86-64 arch).
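For anyone triaging a similar library, the failing files can be picked out automatically. The following is a minimal sketch, not part of the original report: it assumes pdftotext is on the PATH (invoked as `pdftotext FILE -` to write to stdout), and the function names and thresholds are hypothetical. The heuristic only catches output made of control, symbol, or replacement characters; it will not notice text that decodes to the wrong letters.

```python
import subprocess
import unicodedata


def classify_text(text, min_chars=20, min_readable=0.7):
    """Return 'empty', 'gibberish' or 'ok' for a block of extracted text.

    The thresholds are illustrative assumptions, not values from poppler.
    """
    visible = [c for c in text if not c.isspace()]
    if len(visible) < min_chars:
        return "empty"
    # Count letters (L*), digits (N*) and punctuation (P*) as readable.
    readable = sum(
        1 for c in visible if unicodedata.category(c)[0] in ("L", "N", "P")
    )
    if readable / len(visible) < min_readable:
        return "gibberish"
    return "ok"


def extract_text(pdf_path):
    """Run pdftotext on one file and return its stdout as a string."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"], capture_output=True, check=False
    )
    # Undecodable bytes become U+FFFD, which the classifier treats as unreadable.
    return result.stdout.decode("utf-8", errors="replace")
```

Usage would be along the lines of `classify_text(extract_text(path))` for each file in the library, collecting the paths that do not come back as "ok".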

Comment 1 Rico 2009-05-17 05:10:53 UTC

PDF examples are under:

http://92.43.104.34/pdf/

Comment 2 Albert Astals Cid 2009-05-17 10:56:54 UTC

Not a bug: the PDF doesn't contain the correct font -> text mappings. Note that Adobe Reader can't extract the text either.
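A toy model may help show why a file can render correctly and still yield no usable text. This is a deliberately simplified sketch, not how poppler is implemented: the tables `GLYPHS` and `TO_UNICODE` and both functions are hypothetical. Rendering only needs a code -> glyph table, while extraction needs a separate code -> text table; if the second is missing or wrong, all the extractor has is the raw codes.

```python
# Hypothetical embedded font: the codes are arbitrary numbers chosen by
# whatever tool produced the PDF.
GLYPHS = {1: "<outline shaped like H>", 2: "<outline shaped like i>"}

# The optional code -> text table. In a well-formed PDF this role is played
# by a standard encoding or a ToUnicode mapping.
TO_UNICODE = {1: "H", 2: "i"}


def render(codes, glyphs):
    """Drawing needs only the glyph outlines, so it always works."""
    return [glyphs[c] for c in codes]


def extract(codes, to_unicode):
    """Text extraction needs the code -> text table.

    Without it, this sketch falls back to the raw codes as characters
    (an assumption made for illustration), which is meaningless output.
    """
    if to_unicode is None:
        return "".join(chr(c) for c in codes)
    return "".join(to_unicode[c] for c in codes)


page = [1, 2]
good = extract(page, TO_UNICODE)   # readable text
broken = extract(page, None)       # control characters, i.e. gibberish
```

In both cases `render(page, GLYPHS)` produces the same outlines, which is why a viewer displays the page identically every time even though the text cannot be recovered.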

Comment 3 Rico 2009-05-17 23:02:43 UTC

I see. My belief (perhaps naive, given my ignorance of PDF internals) was that if acroread/xpdf/... can render it (deterministically, every time ;-)) and select text, the information must be in the document.

So there's no chance of reconstructing that broken doc? It would be nice if pdftotext had some --tryrealhard switch.

Anyway, thanks for your consideration; keep up the good work.

Comment 4 Albert Astals Cid 2009-05-18 11:09:51 UTC

No chance, unless you go the OCR route.
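One possible shape of the OCR route is sketched below. The thread does not name any tools, so everything here is an assumption: it uses poppler's pdftoppm to rasterize pages and the Tesseract command-line program for recognition, both expected on the PATH, and the helper names are hypothetical. OCR output quality depends heavily on resolution and fonts.

```python
import glob
import os
import subprocess
import tempfile


def rasterize_cmd(pdf_path, prefix, dpi=300):
    """Command that renders every page to PREFIX-<n>.png at the given DPI."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, prefix]


def ocr_cmd(image_path):
    """Command that makes tesseract print the recognized text to stdout."""
    return ["tesseract", image_path, "stdout"]


def ocr_pdf(pdf_path, dpi=300):
    """Rasterize a PDF and OCR each page, returning the concatenated text."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, "page")
        subprocess.run(rasterize_cmd(pdf_path, prefix, dpi), check=True)
        # Assumes the page numbers in the file names are zero-padded,
        # so a plain lexicographic sort keeps the pages in order.
        pages = sorted(glob.glob(prefix + "-*.png"))
        texts = []
        for page in pages:
            done = subprocess.run(ocr_cmd(page), capture_output=True, check=True)
            texts.append(done.stdout.decode("utf-8", errors="replace"))
    return "\n".join(texts)
```

Calling `ocr_pdf("broken.pdf")` (file name hypothetical) would return the recognized text for the whole document, which could then replace the pdftotext output for the handful of affected files.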
