Bug 35468 - pdftotext cannot extract text from specific pdf
Summary: pdftotext cannot extract text from specific pdf
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: utils
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: poppler-bugs
 
Reported: 2011-03-20 08:47 UTC by Ulrich Leodolter
Modified: 2012-02-23 13:57 UTC

Attachments
support identity-h ToUnicode (2.71 KB, patch)
2012-02-22 04:41 UTC, Adrian Johnson

Description Ulrich Leodolter 2011-03-20 08:47:49 UTC
hi,

pdftotext fails to extract text from a specific pdf (see attachment).
exit status is 0 and no warnings or errors are reported.
the output file contains only 99 page break characters (0x0c).
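
a rough way to confirm this symptom is to count the form feed bytes in the output and check whether anything other than whitespace is present. this is only a sketch: the helper name `summarize_extraction` and the file name `out.txt` are made up for illustration.

```python
def summarize_extraction(data: bytes) -> dict:
    """Count page breaks (0x0c) and report whether any real text is present."""
    page_breaks = data.count(b"\x0c")
    # bytes.strip() removes ASCII whitespace, which includes the form feed,
    # so an output made only of page breaks and newlines strips to empty.
    has_text = len(data.strip()) > 0
    return {"page_breaks": page_breaks, "has_text": has_text}


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python check.py out.txt
    # where out.txt was produced by running pdftotext on the PDF.
    if len(sys.argv) > 1:
        with open(sys.argv[1], "rb") as fh:
            print(summarize_extraction(fh.read()))
```

an output with page breaks but `has_text` false matches what is described here: the pages are walked, but no characters come out.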

i am sure the pdf contains text because when i save the document
as text using acrobat reader, plenty of text is extracted and saved.

i can also view the document on linux (centos 5.5 and fedora core 14)
using evince without problems.

i tried the following versions, all gave the same result.

poppler 0.5.4   centos 5.5   x86_64
poppler 0.14.5  fedora fc14  x86_64
poppler git     fedora fc14  x86_64

best regards
ulrich
Comment 1 Ulrich Leodolter 2011-03-20 08:54:08 UTC
PDF upload failed, so please use this url:

http://share.obvsg.at/Bug-35468.pdf
Comment 2 Adrian Johnson 2012-02-22 04:41:17 UTC
Created attachment 57451
support identity-h ToUnicode

The problem is all the ToUnicode maps are /Identity-H. The attached patch should fix this.
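
To illustrate the idea, here is a minimal sketch of what an identity ToUnicode mapping could mean for text extraction, assuming each 2-byte big-endian character code in a string is taken directly as the Unicode code point. The function name `decode_identity_h` is hypothetical, and this is not the code from the attached patch or from poppler itself.

```python
def decode_identity_h(codes: bytes) -> str:
    """Map 2-byte big-endian character codes straight to Unicode code points.

    Assumption: under an identity ToUnicode mapping, code == code point.
    """
    if len(codes) % 2 != 0:
        raise ValueError("expected a sequence of 2-byte character codes")
    chars = []
    for i in range(0, len(codes), 2):
        # Read one 2-byte code and use it unchanged as the code point.
        code = int.from_bytes(codes[i:i + 2], "big")
        chars.append(chr(code))
    return "".join(chars)
```

For example, the bytes `00 48 00 69` would come out as "Hi". A text extractor that has no rule for this case has no way to turn the codes into characters, which would be consistent with the empty output in the description.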
Comment 3 Albert Astals Cid 2012-02-23 13:57:51 UTC
Fixed with a different commit that is similar in idea: http://cgit.freedesktop.org/poppler/poppler/commit/?id=30446bdd7e202eed88d131e04477c76861fd145c

