Summary: | pdftohtml: don't put control characters in output | ||
---|---|---|---|
Product: | poppler | Reporter: | Mdia <media-x> |
Component: | pdftohtml | Assignee: | poppler-bugs <poppler-bugs> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | normal | ||
Priority: | medium | ||
Version: | unspecified | ||
Hardware: | All | ||
OS: | Mac OS X (All) | ||
Whiteboard: | |||
Attachments: | Character (132647); Document example (132659); skip-control-characters.patch (132715); skip-control-characters.patch (132717) |
Hard to tell what's wrong without the PDF, but I'd guess that your document doesn't provide a good encoding table, so you end up with a character 0x01 or some other invisible character. I don't know how you are converting to HTML, but your conversion utility should probably be removing or escaping those characters.

Created attachment 132659
Document example

Here is a PDF document example.
Yes, the ODNMDG+TT1AFt00 font, where the triangle comes from, doesn't have a charmap or known name mapping, so poppler is guessing what the character should be, and it guesses 0x01. I don't see why that would really be a problem because AFAIK 0x01 is not a forbidden HTML character. How are you converting to HTML? The error message "DOMDocument::loadHTML(): Invalid char in CDATA 0x1 in Entity, line:" is not from any poppler utility.

I know what you mean. I use this tool: https://packagist.org/packages/gufy/pdftohtml-php. If I go into this library and output the converted content, then wherever there is a triangle the content is duplicated 4x...

I'm not an expert on PHP, but it looks like that is calling out to poppler's pdftohtml, and PHP seems to not like control characters in HTML. I also found this section in a W3C working draft: https://www.w3.org/TR/2011/WD-html5-20110405/syntax.html#text-0

> Text must not contain U+0000 characters.
> Text must not contain permanently undefined Unicode characters (noncharacters).
> Text must not contain control characters other than space characters.

So pdftohtml should probably not be putting control characters in its output.

Wait. What do you mean by 4x duplicated content? I worry that you are conflating two completely separate issues.

Everywhere there is a triangle, the content is duplicated 4x after conversion. For example:
---
icon - some title
---
after conversion I see:
---
some title some title some title some title
---

Created attachment 132715
skip-control-characters.patch

Does this patch fix it for you? Does it just fix the PHP warnings?

Created attachment 132717
skip-control-characters.patch

Sorry, I had attached the wrong patch. This one should remove control characters.

I've created a new bug #101807 for the duplicated text because I'm pretty sure this patch is not going to fix that for you.
I don't have the time or interest in pdftohtml to fix it myself, but someone else could use TextOutputDev's fakebold and dropshadow detection and implement something like that for pdftohtml.

(In reply to Jason Crain from comment #9)
> Created attachment 132717
> skip-control-characters.patch
>
> Sorry, I had attached the wrong patch. This one should remove control
> characters.

Is your patch available somewhere to apply, or how can I use it?

(In reply to Mdia from comment #11)
> Is your patch available somewhere to apply, or how can I use it?

It's linked to in comment #9 or in the attachments section near the top of the Bugzilla page. You apply it to the poppler source code and recompile.

Pushed
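For anyone who cannot rebuild poppler with the patch, the rule quoted from the W3C draft in this thread can be approximated as a post-processing step on pdftohtml's output. The following is only a minimal sketch in Python, not the attached patch: the function names `is_forbidden_in_html_text` and `strip_forbidden` are hypothetical, and it assumes "control characters" means the C0 and C1 ranges plus DEL, and that "space characters" means space, tab, LF, FF and CR.

```python
# HTML "space characters" (assumed set): space, tab, LF, FF, CR.
HTML_SPACE_CHARS = {0x20, 0x09, 0x0A, 0x0C, 0x0D}


def is_forbidden_in_html_text(cp: int) -> bool:
    """Return True if code point `cp` should not appear in HTML text
    under the three rules quoted from the W3C draft (hypothetical helper)."""
    if cp in HTML_SPACE_CHARS:
        return False
    # C0 controls (this includes U+0000 and the 0x01 from this bug),
    # DEL, and C1 controls.
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:
        return True
    # Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return True
    return False


def strip_forbidden(text: str) -> str:
    """Drop every forbidden character from `text` (hypothetical helper)."""
    return "".join(ch for ch in text if not is_forbidden_in_html_text(ord(ch)))
```

Running the converted HTML through `strip_forbidden` before handing it to a DOM parser should avoid the "Invalid char in CDATA" warning, but it does nothing about the duplicated text tracked in bug #101807.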
Created attachment 132647
Character

My document contains several "triangle" characters. After copying one I see "", and when I try to convert the PDF to HTML I get the following error message:

DOMDocument::loadHTML(): Invalid char in CDATA 0x1 in Entity, line: ...

It is 100% caused by this triangle character, because if the document doesn't contain it the conversion works. If I output the HTML before passing it to "DOMDocument", the content is duplicated 4x wherever this character is placed... Is it possible to replace this character with an empty string, or to avoid the duplicated content? This happens on both Windows and macOS.
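To confirm which characters trigger a warning like "Invalid char in CDATA 0x1", it can help to list every control character in the converted HTML with its position. This is a rough diagnostic sketch in Python, not part of poppler or of the PHP wrapper: `find_control_chars` is a hypothetical name, and it assumes the HTML has already been read into a string.

```python
def find_control_chars(html: str):
    """Yield (line, column, code point) for each C0/C1 control character
    or DEL, other than tab, LF, FF and CR. Line and column are 1-based."""
    allowed = {"\t", "\n", "\x0c", "\r"}
    line, col = 1, 1
    for ch in html:
        cp = ord(ch)
        if ch not in allowed and (cp <= 0x1F or 0x7F <= cp <= 0x9F):
            yield line, col, cp
        # Track the position; a line feed starts a new line.
        if ch == "\n":
            line, col = line + 1, 1
        else:
            col += 1


# Example: report each offending character in a converted document.
for line, col, cp in find_control_chars("<p>\x01 some title</p>\n"):
    print(f"line {line}, column {col}: U+{cp:04X}")
```

If the only hits are U+0001 at the places where the triangle appears, that matches the guess in this report that the character comes from the font without a usable character map.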