Bug 97234 - Output text/html is unreadable for some PDFs
Status: RESOLVED INVALID
Alias: None
Product: poppler
Classification: Unclassified
Component: pdftohtml
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium critical
Assignee: poppler-bugs
Reported: 2016-08-07 17:31 UTC by clark
Modified: 2016-08-07 22:51 UTC

Description clark 2016-08-07 17:31:24 UTC
These files produce unreadable text/HTML output:

http://docdro.id/jrmCLNK
http://docdro.id/gj4xuvQ
http://docdro.id/3UfY0oz

Command used: pdftohtml -s -i input.pdf /output
Comment 1 Albert Astals Cid 2016-08-07 22:15:26 UTC
That's because the files are broken, not much we can do.
Comment 2 clark 2016-08-07 22:17:34 UTC
What do you mean by broken? If you open the files in a PDF reader, everything looks fine. :)
Comment 3 clark 2016-08-07 22:21:59 UTC
If these files are broken, then a lot of files are broken. I see this with about 5% of all PDF files.
Comment 4 Albert Astals Cid 2016-08-07 22:41:25 UTC
PDF is a display format. The fact that you can "see" text doesn't mean you can "extract" it: PDF creators need to do things correctly for text extraction to work, and lots of them are broken as hell.

To convince me this is a bug on our side, you'll have to show me a PDF tool that can extract text from these files (without using OCR, of course).
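
A rough way to check whether a given file falls into this category is to look at what a non-OCR extractor actually returns. The following is only a sketch and not part of poppler: it assumes poppler's `pdftotext` is on the PATH, and the helper names and the 0.3 threshold are hypothetical choices made for illustration. It counts control, private-use, unassigned and U+FFFD characters, which tend to show up when a font has no usable mapping from glyphs to Unicode. It will miss files whose glyphs map to wrong but otherwise valid letters.

```python
import shutil
import subprocess
import unicodedata


def suspicious_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are unlikely in readable text."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = 0
    for c in chars:
        category = unicodedata.category(c)
        # Cc = control, Co = private use, Cn = unassigned;
        # U+FFFD is the Unicode replacement character.
        if category in ("Cc", "Co", "Cn") or c == "\ufffd":
            bad += 1
    return bad / len(chars)


def looks_unextractable(text: str, threshold: float = 0.3) -> bool:
    """Heuristic only: the threshold is an arbitrary assumption."""
    return suspicious_ratio(text) > threshold


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return its output ("-" sends the text to stdout)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

For example, `looks_unextractable(extract_text("input.pdf"))` gives a first hint for one file. If another non-OCR tool yields clean text from the same file, that would be the kind of evidence asked for above.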
Comment 5 clark 2016-08-07 22:51:53 UTC
ok.. thanks for the answer :)