Bug 35468

Summary: pdftotext cannot extract text from specific pdf
Product: poppler Reporter: Ulrich Leodolter <ulrich.leodolter>
Component: utils Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED FIXED QA Contact:
Severity: major    
Priority: medium    
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
Attachments: support identity-h ToUnicode

Description Ulrich Leodolter 2011-03-20 08:47:49 UTC
hi,

pdftotext fails to extract text from a specific pdf (see attachment).
exit status is 0 and no warnings or errors are reported.
the output file contains only 99 page break characters (0x0c).
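
The symptom above (exit status 0, output consisting only of form feed characters) can be checked mechanically. A minimal sketch in Python, assuming the text output has already been read into a byte string; the function name `summarize_text_output` is made up for illustration and is not part of poppler:

```python
def summarize_text_output(data):
    """Return (page_break_count, has_text) for extracted text output.

    data: the raw bytes of the text output (assumed to be read by the caller).
    """
    # 0x0c is the form feed character used as a page break.
    page_breaks = data.count(b"\x0c")
    # Anything left after dropping page breaks and whitespace counts as text.
    has_text = bool(data.replace(b"\x0c", b"").strip())
    return page_breaks, has_text
```

For an output file like the one described here, this would report 99 page breaks and no text.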

i am sure the pdf contains text, because when i save the document
as text using acrobat reader, plenty of text is extracted and saved.

i can also view the document on linux (centos 5.5 and fedora core 14)
using evince without problems.

i tried the following versions, all gave the same result.

poppler 0.5.4   centos 5.5   x86_64
poppler 0.14.5  fedora fc14  x86_64
poppler git     fedora fc14  x86_64

best regards
ulrich
Comment 1 Ulrich Leodolter 2011-03-20 08:54:08 UTC
PDF upload failed, so please use this url:

http://share.obvsg.at/Bug-35468.pdf
Comment 2 Adrian Johnson 2012-02-22 04:41:17 UTC
Created attachment 57451
support identity-h ToUnicode

The problem is that all the ToUnicode maps are /Identity-H. The attached patch should fix this.
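
To illustrate the idea only (this is not the code from the attached patch): assuming an identity ToUnicode mapping means each 2-byte character code is taken directly as the Unicode code point, decoding a string could be sketched in Python as follows. The function name `decode_identity_h` is hypothetical.

```python
def decode_identity_h(data):
    """Decode bytes under an assumed identity mapping.

    Each big-endian 2-byte character code is used directly as a
    Unicode code point. This is an illustrative assumption, not
    poppler's implementation.
    """
    if len(data) % 2 != 0:
        raise ValueError("expected a sequence of 2-byte character codes")
    chars = []
    for i in range(0, len(data), 2):
        # Combine high and low byte into one 16-bit code.
        code = (data[i] << 8) | data[i + 1]
        chars.append(chr(code))
    return "".join(chars)
```

Without such a fallback, a text extractor that finds no usable mapping for a font has nothing to emit for its characters, which would be consistent with the empty output reported above.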
Comment 3 Albert Astals Cid 2012-02-23 13:57:51 UTC
Fixed with a different commit that is similar in idea: http://cgit.freedesktop.org/poppler/poppler/commit/?id=30446bdd7e202eed88d131e04477c76861fd145c
