21775 – pdftotext text extraction failure

Summary: pdftotext text extraction failure

Status:	RESOLVED NOTABUG

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	general
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	poppler-bugs

Reported:	2009-05-17 03:24 UTC by Rico
Modified:	2009-05-18 11:09 UTC
CC List:	0 users

Description Rico 2009-05-17 03:24:38 UTC

pdftotext fails to extract text from several PDF files in my electronic library. It works flawlessly on the other 400 files of the same type, but about 5 files produce either no output at all (example 1) or gibberish (example 2).

I'm using "pdftotext version 0.10.6" on Gentoo Linux (x86-64 arch).

Comment 1 Rico 2009-05-17 05:10:53 UTC

PDF examples are under:

http://92.43.104.34/pdf/

Comment 2 Albert Astals Cid 2009-05-17 10:56:54 UTC

Not a bug: the PDF doesn't contain correct font -> text mappings. Note that Adobe Reader can't extract the text either.
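
For readers who want to check a file themselves, here is a minimal sketch that lists fonts for which poppler's `pdffonts` tool reports no ToUnicode map (the "uni" column). The function names `parse_pdffonts` and `fonts_without_tounicode` are made up for illustration, and the parser assumes the usual table layout with "emb sub uni" followed by the two object ID numbers at the end of each row. Treat the result as a hint only: a font without a ToUnicode map can still extract fine through a standard encoding, and a font with one can still carry wrong mappings, which may well be the case for the files in this report.

```python
import subprocess


def parse_pdffonts(output):
    """Parse the table printed by `pdffonts` into a list of dicts."""
    fonts = []
    for line in output.splitlines()[2:]:  # skip the header and the dashed rule
        tokens = line.split()
        if len(tokens) < 7:
            continue  # not a complete font row
        # The type column may contain spaces ("CID TrueType"), so read the
        # fixed columns from the right: emb, sub, uni, object number, generation.
        emb, sub, uni = tokens[-5], tokens[-4], tokens[-3]
        fonts.append({
            "name": tokens[0],
            "emb": emb == "yes",
            "sub": sub == "yes",
            "uni": uni == "yes",
        })
    return fonts


def fonts_without_tounicode(pdf_path):
    """Return the names of fonts that pdffonts reports without a ToUnicode map."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return [font["name"] for font in parse_pdffonts(result.stdout) if not font["uni"]]
```

Calling `fonts_without_tounicode("example.pdf")` (a placeholder file name) requires `pdffonts` to be on the PATH.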

Comment 3 Rico 2009-05-17 23:02:43 UTC

I see. My belief (perhaps naive, given my ignorance of PDF internals) was that if acroread/xpdf/... can render it (deterministically, every time ;-)) and select text, the information must be in the document.

So there is no chance of reconstructing that broken doc? It would be nice if pdftotext had some --tryrealhard switch.

Anyway, thanks for your consideration; keep up the good work.

Comment 4 Albert Astals Cid 2009-05-18 11:09:51 UTC

No chance, unless you go the OCR route.
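
As a rough sketch of what the OCR route could look like: render each page to an image with poppler's `pdftoppm`, then run an OCR engine over the images. The example below assumes the Tesseract command-line tool is installed and that the installed `pdftoppm` supports `-png`; the helper names `render_command`, `ocr_command` and `ocr_pdf` are hypothetical, and nothing in this report says which OCR tool, if any, was used.

```python
import subprocess
import tempfile
from pathlib import Path


def render_command(pdf_path, prefix, dpi=300):
    """Build the pdftoppm call that writes one PNG per page as <prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def ocr_command(image_path, lang="eng"):
    """Build the tesseract call that prints the recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_pdf(pdf_path, lang="eng", dpi=300):
    """Render every page of the PDF and return the OCR text, pages joined by form feeds."""
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(render_command(pdf_path, prefix, dpi), check=True)
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(
                ocr_command(image, lang), capture_output=True, text=True, check=True
            )
            pages.append(result.stdout)
    return "\f".join(pages)
```

OCR output is only as good as the rendering and the engine's language model, so expect recognition errors that text extraction from a well-formed PDF would not produce.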