Bug 35468

Summary:	pdftotext cannot extract text from specific pdf
Product:	poppler	Reporter:	Ulrich Leodolter <ulrich.leodolter>
Component:	utils	Assignee:	poppler-bugs <poppler-bugs>
Status:	RESOLVED FIXED	QA Contact:
Severity:	major
Priority:	medium
Version:	unspecified
Hardware:	x86-64 (AMD64)
OS:	Linux (All)
Whiteboard:
Attachments:	support identity-h ToUnicode

Description Ulrich Leodolter 2011-03-20 08:47:49 UTC

hi,

pdftotext fails to extract text from specific pdf (see attachment).
exit status is 0 and no warnings or errors are reported.
the output file contains only 99 page break characters (0x0c).
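
A quick way to confirm this symptom is to count the form feed bytes in the output and check that nothing else is present. The snippet below is only a sketch: the helper name `summarize_output` is made up for illustration, and the sample bytes stand in for real pdftotext output.

```python
def summarize_output(data: bytes) -> tuple[int, int]:
    """Return (page_breaks, other_bytes) for raw pdftotext output.

    page_breaks counts form feed bytes (0x0c); other_bytes counts
    everything else, so (N, 0) means the output has no text at all.
    """
    page_breaks = data.count(b"\x0c")
    other_bytes = len(data) - page_breaks
    return page_breaks, other_bytes


# Stand-in for the output described above: 99 page breaks, no text.
sample = b"\x0c" * 99
print(summarize_output(sample))  # (99, 0)
```

For a real run, read the text file in binary mode and pass its contents to the helper.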

i am sure the pdf contains text because when i save the document
using acrobat reader as text then plenty of text is extracted and saved.

i can also view the document on linux (centos 5.5 and fedora core 14)
using evince without problems.

i tried the following versions, all gave the same result.

poppler 0.5.4   centos 5.5   x86_64
poppler 0.14.5  fedora fc14  x86_64
poppler git     fedora fc14  x86_64

best regards
ulrich

Comment 1 Ulrich Leodolter 2011-03-20 08:54:08 UTC

PDF upload failed, so please use this url:

http://share.obvsg.at/Bug-35468.pdf

Comment 2 Adrian Johnson 2012-02-22 04:41:17 UTC

Created attachment 57451
support identity-h ToUnicode

The problem is that all the ToUnicode maps in this PDF are /Identity-H. The attached patch should fix this.
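
As a rough illustration of the idea only (this is not the attached patch and not poppler code), an identity ToUnicode mapping can be read as "each 2-byte character code is already the Unicode code point". That interpretation is an assumption made for this sketch, and the function name `decode_identity_h` is hypothetical.

```python
def decode_identity_h(codes: bytes) -> str:
    """Decode a string of 2-byte character codes with an identity mapping.

    Assumption for this sketch: every big-endian 2-byte code is taken
    directly as a Unicode code point, with no lookup table involved.
    """
    if len(codes) % 2 != 0:
        raise ValueError("expected an even number of bytes (2 per code)")
    chars = []
    for i in range(0, len(codes), 2):
        code = int.from_bytes(codes[i:i + 2], "big")
        chars.append(chr(code))
    return "".join(chars)


# Codes 0x0048 and 0x0069 map straight to 'H' and 'i'.
print(decode_identity_h(b"\x00H\x00i"))  # Hi
```

Without some handling of this kind, an extractor that expects a ToUnicode CMap stream has nothing to map the codes with, which matches the empty output reported above.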

Comment 3 Albert Astals Cid 2012-02-23 13:57:51 UTC

Fixed with a different commit, but similar in idea: http://cgit.freedesktop.org/poppler/poppler/commit/?id=30446bdd7e202eed88d131e04477c76861fd145c
