97234 – Output text/html is unreadable for some PDFs

Status:	RESOLVED INVALID

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	pdftohtml
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium critical
Assignee:	poppler-bugs
Reported:	2016-08-07 17:31 UTC by clark
Modified:	2016-08-07 22:51 UTC
CC List:	0 users

Description clark 2016-08-07 17:31:24 UTC

These files return unreadable text/HTML output:

http://docdro.id/jrmCLNK
http://docdro.id/gj4xuvQ
http://docdro.id/3UfY0oz

pdftohtml -s -i input.pdf /output

Comment 1 Albert Astals Cid 2016-08-07 22:15:26 UTC

That's because the files are broken, not much we can do.

Comment 2 clark 2016-08-07 22:17:34 UTC

What do you mean by broken? If you open the files in a PDF reader everything seems fine? :)

Comment 3 clark 2016-08-07 22:21:59 UTC

If these files are broken, then there are really a lot of broken files out there. I experience this with about 5% of all PDF files.

Comment 4 Albert Astals Cid 2016-08-07 22:41:25 UTC

PDF is a display format. The fact that you can "see" text doesn't mean you can "extract" the text. PDF creators need to do stuff correctly for text extraction to work, and lots of them are broken as hell.

To convince me this is a bug on our side, you'll have to show me a PDF tool that can extract text from the files (without using OCR, of course).
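
To illustrate the difference between "seeing" and "extracting" text, here is a toy model in Python. It is only a sketch under simplifying assumptions and is not how poppler works internally: the names `render`, `extract`, `glyphs`, and the three maps are hypothetical. The idea is that a page stores character codes, the font turns each code into a shape for display, and extraction depends on a separate code-to-Unicode mapping (in real PDFs, typically the font's encoding or a ToUnicode CMap) that the producer may have omitted or written incorrectly.

```python
# Toy model only: hypothetical names, not poppler's implementation.
REPLACEMENT = "\ufffd"  # Unicode replacement character for unmapped codes


def render(codes, glyphs):
    """What a viewer shows: each character code selects a glyph shape."""
    return "".join(glyphs[c] for c in codes)


def extract(codes, to_unicode):
    """What an extractor returns: each code is looked up in a code-to-Unicode map."""
    return "".join(to_unicode.get(c, REPLACEMENT) for c in codes)


# The font draws the right shapes for these arbitrary codes.
glyphs = {1: "H", 2: "e", 3: "l", 4: "o"}
page = [1, 2, 3, 3, 4]

good_map = dict(glyphs)                       # producer wrote a correct mapping
missing_map = {}                              # producer wrote no mapping
wrong_map = {1: "#", 2: "$", 3: "%", 4: "&"}  # producer wrote a wrong mapping

print(render(page, glyphs))        # Hello (looks fine in a viewer)
print(extract(page, good_map))     # Hello
print(extract(page, missing_map))  # five replacement characters
print(extract(page, wrong_map))    # #$%%&
```

In all three cases the rendered page looks identical, but only the first mapping gives readable extracted text.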

Comment 5 clark 2016-08-07 22:51:53 UTC

OK, thanks for the answer :)
