Bug 97234

Summary: Output text/html is unreadable for some PDFs
Product: poppler
Reporter: clark
Component: pdftohtml
Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED INVALID
QA Contact:
Severity: critical    
Priority: medium    
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:

Description clark 2016-08-07 17:31:24 UTC
These files return unreadable text/HTML output:

http://docdro.id/jrmCLNK
http://docdro.id/gj4xuvQ
http://docdro.id/3UfY0oz

pdftohtml -s -i input.pdf /output
Comment 1 Albert Astals Cid 2016-08-07 22:15:26 UTC
That's because the files are broken, not much we can do.
Comment 2 clark 2016-08-07 22:17:34 UTC
What do you mean by broken? If you open the files in a PDF reader everything seems fine? :)
Comment 3 clark 2016-08-07 22:21:59 UTC
If these files are broken, then a lot of files out there are broken. I experience this with about 5% of all PDF files.
Comment 4 Albert Astals Cid 2016-08-07 22:41:25 UTC
PDF is a display format. The fact that you can "see" text doesn't mean you can "extract" the text: PDF creators need to do stuff correctly for text extraction to work, and lots of them are broken as hell.

To convince me this is a bug on our side, you'll have to show me a PDF tool that can extract text from the files (that is not using OCR, of course).
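
For anyone triaging similar reports, here is a rough sketch of how one might compare extractors on the same file. It is not part of poppler. The function names `readable_ratio` and `extract_text` are made up for illustration, and the heuristic is an assumption: it only flags control, private-use, unassigned, and U+FFFD replacement characters, so it will miss files whose glyphs map to valid but wrong letters. The only external call is `pdftotext FILE -`, which writes the extracted text to stdout.

```python
import subprocess
import unicodedata


def readable_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look like real text.

    Heuristic (an assumption, not a poppler feature): letters, numbers,
    punctuation, and symbols count as readable; characters in the Unicode
    "C" categories (control, private use, unassigned) and U+FFFD do not.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    good = 0
    for c in chars:
        if c == "\ufffd":
            continue  # replacement character: decoding failed somewhere
        if unicodedata.category(c)[0] in "LNPS":
            good += 1
    return good / len(chars)


def extract_text(pdf_path: str) -> str:
    """Run poppler's pdftotext and return what it extracted as a string."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `readable_ratio(extract_text("input.pdf"))` and then running the same check on the output of another non-OCR tool gives a crude side-by-side comparison. A low ratio from every tool suggests the file itself lacks usable text mappings.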
Comment 5 clark 2016-08-07 22:51:53 UTC
ok.. thanks for the answer :)
