101770 – pdftohtml: don't put control characters in output

Status:	RESOLVED FIXED

Product:	poppler
Classification:	Unclassified
Component:	pdftohtml
Version:	unspecified
Hardware:	All Mac OS X (All)

Importance:	medium normal
Assignee:	poppler-bugs

Reported:	2017-07-12 21:26 UTC by Mdia
Modified:	2017-07-31 12:52 UTC
CC List:	0 users

Attachments
Character (876 bytes, image/png) 2017-07-12 21:26 UTC, Mdia
Document example (64.98 KB, application/pdf) 2017-07-13 09:05 UTC, Mdia
skip-control-characters.patch (853 bytes, patch) 2017-07-16 17:35 UTC, Jason Crain
skip-control-characters.patch (853 bytes, patch) 2017-07-16 19:11 UTC, Jason Crain

Description Mdia 2017-07-12 21:26:13 UTC

Created attachment 132647
Character

My document contains several "triangle" characters. When I copy one of them, it pastes as an invisible character (""). When I try to convert the PDF to HTML, I get the following error message:

DOMDocument::loadHTML(): Invalid char in CDATA 0x1 in Entity, line: ...

It is definitely caused by this triangle character, because if the document doesn't contain it, the conversion works.

If I output the HTML before passing it to DOMDocument, the content is duplicated four times wherever this character appears.

Is it possible to replace this character with an empty string, or to avoid the duplicated content?

This happens on both Windows and macOS.

Comment 1 Jason Crain 2017-07-12 23:05:44 UTC

Hard to tell what's wrong without the PDF, but I'd guess that your document doesn't provide a good encoding table so you end up with a character 0x01 or some other invisible character. I don't know how you are converting to HTML but your conversion utility should probably be removing or escaping those characters.
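
As a rough way to confirm this guess, the converted HTML can be scanned for control characters before it is handed to a parser. The following is a minimal sketch in Python, not part of poppler or of the reporter's PHP tool; the function name `find_control_characters` and the set of allowed whitespace characters are assumptions made for illustration.

```python
import unicodedata

# Assumption: these whitespace characters are acceptable in HTML text,
# so only the remaining control characters are reported.
ALLOWED_WHITESPACE = {" ", "\t", "\n", "\f", "\r"}


def find_control_characters(html: str):
    """Return (line, column, code point) for each disallowed control character."""
    hits = []
    # Split on "\n" only, so that other control characters stay inside their line.
    for line_number, line in enumerate(html.split("\n"), start=1):
        for column, ch in enumerate(line, start=1):
            # Unicode general category "Cc" covers the C0 and C1 control characters.
            if unicodedata.category(ch) == "Cc" and ch not in ALLOWED_WHITESPACE:
                hits.append((line_number, column, f"0x{ord(ch):x}"))
    return hits
```

Any hit reported as `0x1` would match the character named in the error message from the description.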

Comment 2 Mdia 2017-07-13 09:05:11 UTC

Created attachment 132659
Document example

Here is an example PDF document.

Comment 3 Jason Crain 2017-07-13 15:33:25 UTC

Yes, the ODNMDG+TT1AFt00 font, where the triangle comes from, doesn't have a charmap or known name mapping so poppler is guessing what the character should be and it guesses 0x01.  I don't see why that would really be a problem because AFAIK 0x01 is not a forbidden HTML character.

How are you converting to HTML?  The error message "DOMDocument::loadHTML(): Invalid char in CDATA 0x1 in Entity, line:" is not from any poppler utility.

Comment 4 Mdia 2017-07-13 17:26:10 UTC

I know what you mean. I use this tool: https://packagist.org/packages/gufy/pdftohtml-php. If I go into this library and output the converted content, the content is duplicated four times wherever there is a triangle.

Comment 5 Jason Crain 2017-07-13 18:07:25 UTC

I'm not an expert on PHP, but it looks like that tool is calling out to poppler's pdftohtml, and PHP seems not to like control characters in HTML.  I also found this section in a W3C working draft:

https://www.w3.org/TR/2011/WD-html5-20110405/syntax.html#text-0
Text must not contain U+0000 characters. Text must not contain permanently undefined Unicode characters (noncharacters). Text must not contain control characters other than space characters.

So pdftohtml should probably not be putting control characters in its output.
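
Until pdftohtml does this itself, the output could be filtered after conversion. The sketch below is in Python and only illustrates the rule quoted above; it is not the patch attached to this bug. The function name `strip_control_characters` is hypothetical, the list of "space characters" is an assumption based on the HTML definition (space, tab, line feed, form feed, carriage return), and noncharacters are not handled.

```python
import unicodedata

# Assumption: the HTML "space characters" that the quoted rule still permits.
HTML_SPACE_CHARACTERS = {" ", "\t", "\n", "\f", "\r"}


def strip_control_characters(text: str) -> str:
    """Drop control characters (including U+0000) except HTML space characters."""
    return "".join(
        ch
        for ch in text
        # Unicode general category "Cc" covers the C0 and C1 control characters.
        if ch in HTML_SPACE_CHARACTERS or unicodedata.category(ch) != "Cc"
    )
```

For example, `strip_control_characters("\x01 some title")` returns `" some title"`, so the 0x01 character that the PHP parser rejects would no longer be present.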

Comment 6 Jason Crain 2017-07-14 21:12:55 UTC

Wait.  What do you mean by 4x duplicated content?  I worry that you are conflating two completely separate issues.

Comment 7 Mdia 2017-07-16 14:56:06 UTC

Wherever there is a triangle, the content is duplicated four times after conversion. For example, the original:
---
icon - some title
---

after conversion I see:
---
some title
some title
some title
some title
---

Comment 8 Jason Crain 2017-07-16 17:35:12 UTC

Created attachment 132715
skip-control-characters.patch

Does this patch fix it for you? Does it just fix the PHP warnings?

Comment 9 Jason Crain 2017-07-16 19:11:32 UTC

Created attachment 132717
skip-control-characters.patch

Sorry, I had attached the wrong patch.  This one should remove control characters.

Comment 10 Jason Crain 2017-07-16 19:44:39 UTC

I've created a new bug #101807 for the duplicated text because I'm pretty sure this patch is not going to fix that for you.  I don't have the time or interest in pdftohtml to fix it myself, but someone else could use TextOutputDev's fakebold and dropshadow detection and implement something like that for pdftohtml.
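
To illustrate the general idea only: fake bold and drop shadows are typically produced by drawing the same text several times at almost the same position, so duplicates can be recognised by comparing text and coordinates. The Python sketch below is a simplified illustration under that assumption, not poppler's TextOutputDev code. The `Fragment` type, the function name, and the `tolerance` value are all hypothetical.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """A piece of text with the position where it was drawn (hypothetical type)."""
    x: float
    y: float
    text: str


def drop_overlapping_duplicates(
    fragments: List[Fragment], tolerance: float = 1.0
) -> List[Fragment]:
    """Keep the first of any fragments with equal text drawn at nearly the same spot."""
    kept: List[Fragment] = []
    for frag in fragments:
        is_duplicate = any(
            frag.text == prev.text
            and abs(frag.x - prev.x) <= tolerance
            and abs(frag.y - prev.y) <= tolerance
            for prev in kept
        )
        if not is_duplicate:
            kept.append(frag)
    return kept
```

With four copies of "some title" drawn within a fraction of a unit of each other, only the first copy would be kept, while the same text drawn elsewhere on the page would be left alone.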

Comment 11 Mdia 2017-07-17 14:18:06 UTC

(In reply to Jason Crain from comment #9)
> Created attachment 132717
> skip-control-characters.patch
> 
> Sorry, I had attached the wrong patch.  This one should remove control
> characters.

Is your patch already included somewhere, or how can I use it?

Comment 12 Jason Crain 2017-07-17 14:54:08 UTC

(In reply to Mdia from comment #11)
> Is your patch already included somewhere, or how can I use it?

It's linked to in comment #9 or in the attachments section near the top of the Bugzilla page.  You apply it to the poppler source code and recompile.

Comment 13 Albert Astals Cid 2017-07-31 12:52:08 UTC

Pushed
