Bug 97234

Summary:	Output text/html is unreadable for some PDFs
Product:	poppler	Reporter:	clark
Component:	pdftohtml	Assignee:	poppler-bugs <poppler-bugs>
Status:	RESOLVED INVALID	QA Contact:
Severity:	critical
Priority:	medium
Version:	unspecified
Hardware:	x86-64 (AMD64)
OS:	Linux (All)
Whiteboard:

Description clark 2016-08-07 17:31:24 UTC

These files return unreadable text/HTML output:

http://docdro.id/jrmCLNK
http://docdro.id/gj4xuvQ
http://docdro.id/3UfY0oz

pdftohtml -s -i input.pdf /output

Comment 1 Albert Astals Cid 2016-08-07 22:15:26 UTC

That's because the files are broken, not much we can do.

Comment 2 clark 2016-08-07 22:17:34 UTC

What do you mean by broken? If you open the files in a PDF reader, everything seems fine. :)

Comment 3 clark 2016-08-07 22:21:59 UTC

If these files are broken, then there are really a lot of broken files. I experience this with about 5% of all PDF files.

Comment 4 Albert Astals Cid 2016-08-07 22:41:25 UTC

PDF is a display format. The fact that you can "see" text doesn't mean you can "extract" it: PDF creators need to do things correctly for text extraction to work, and lots of them are broken as hell.

To convince me this is a bug on our side, you'll have to show me a PDF tool that can extract text from the files (without using OCR, of course).
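
For anyone hitting this on a large batch of PDFs, a rough way to flag files whose text layer is probably unusable is to extract the text and measure how much of it looks like garbage. The sketch below is only a heuristic, not anything poppler itself does: the function names `unreadable_ratio` and `extract_text` are made up for illustration, and it assumes `pdftotext` is installed and on the PATH. It counts replacement, private-use, unassigned and control characters, so it will miss files whose glyphs map to wrong but otherwise valid letters.

```python
import subprocess
import unicodedata


def unreadable_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look like extraction garbage."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = 0
    for c in chars:
        category = unicodedata.category(c)
        # Co = private use, Cn = unassigned, Cc = control;
        # U+FFFD is the Unicode replacement character.
        if c == "\ufffd" or category in ("Co", "Cn", "Cc"):
            bad += 1
    return bad / len(chars)


def extract_text(pdf_path: str) -> str:
    """Run pdftotext on pdf_path and return what it writes to stdout ("-")."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `unreadable_ratio(extract_text("input.pdf"))` on each file gives a number between 0 and 1. What counts as "too high" is a judgment call and would need tuning against files known to be good and bad.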

Comment 5 clark 2016-08-07 22:51:53 UTC

ok.. thanks for the answer :)
