35468 – pdftotext cannot extract text from specific pdf

Status:	RESOLVED FIXED

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	utils
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	poppler-bugs

Reported:	2011-03-20 08:47 UTC by Ulrich Leodolter
Modified:	2012-02-23 13:57 UTC
CC List:	0 users

Attachments
support identity-h ToUnicode (2.71 KB, patch) 2012-02-22 04:41 UTC, Adrian Johnson

Description Ulrich Leodolter 2011-03-20 08:47:49 UTC

hi,

pdftotext fails to extract text from specific pdf (see attachment).
exit status is 0 and no warnings or errors are reported.
the output file contains only 99 page break characters (0x0c).

i am sure the pdf contains text because when i save the document
as text using acrobat reader, plenty of text is extracted and saved.

i can also view the document on linux (centos 5.5 and fedora core 14)
using evince without problems.

i tried the following versions, all gave the same result.

poppler 0.5.4   centos 5.5   x86_64
poppler 0.14.5  fedora fc14  x86_64
poppler git     fedora fc14  x86_64

best regards
ulrich
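
A minimal sketch of how the symptom described above (output consisting only of page break characters, 0x0c) could be checked. The function name `summarize_output` and the file name in the usage note are hypothetical and not part of poppler.

```python
def summarize_output(data: bytes) -> dict:
    """Count form feed (0x0c) bytes versus all other bytes in extracted text."""
    page_breaks = data.count(b"\x0c")
    other_bytes = len(data) - page_breaks
    return {
        "page_breaks": page_breaks,
        "other_bytes": other_bytes,
        # True only when the output is non-empty and holds nothing but form feeds
        "only_page_breaks": len(data) > 0 and other_bytes == 0,
    }
```

Assuming the output was written with `pdftotext Bug-35468.pdf out.txt`, calling `summarize_output(open("out.txt", "rb").read())` should, per the description, report 99 page breaks and no other bytes for the affected versions.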

Comment 1 Ulrich Leodolter 2011-03-20 08:54:08 UTC

PDF upload failed, so please use this url:

http://share.obvsg.at/Bug-35468.pdf

Comment 2 Adrian Johnson 2012-02-22 04:41:17 UTC

Created attachment 57451
support identity-h ToUnicode

The problem is that all the ToUnicode maps are /Identity-H. The attached patch should fix this.
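
A rough illustration of the idea, not the actual patch: assuming an /Identity-H ToUnicode entry is handled as an identity mapping, each 2-byte big-endian character code in a text string is taken directly as the Unicode code point. The function name `decode_identity_h` is hypothetical; the real change lives in poppler's C++ font code and may differ in detail.

```python
def decode_identity_h(data: bytes) -> str:
    """Treat each 2-byte big-endian code as the Unicode code point itself."""
    if len(data) % 2 != 0:
        raise ValueError("expected an even number of bytes (2-byte codes)")
    chars = []
    for i in range(0, len(data), 2):
        code = int.from_bytes(data[i:i + 2], "big")
        chars.append(chr(code))  # identity: character code == Unicode value
    return "".join(chars)
```

Under that assumption, a string such as `b"\x00H\x00i"` would decode to `"Hi"`. Without any mapping, an extractor has no way to turn the codes into text, which is consistent with the empty pages reported above.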

Comment 3 Albert Astals Cid 2012-02-23 13:57:51 UTC

Fixed with a different commit that is similar in idea: http://cgit.freedesktop.org/poppler/poppler/commit/?id=30446bdd7e202eed88d131e04477c76861fd145c
