Bug 50002

Summary: pdftohtml writes invalid HTML
Product: poppler
Component: pdftohtml
Status: RESOLVED FIXED
Severity: normal
Priority: medium
Version: unspecified
Hardware: All
OS: All
Reporter: geralds <solahcin>
Assignee: poppler-bugs <poppler-bugs>
CC: solahcin
Attachments: Patch for utils/HtmlOutputDev.cc (attachment 61713)
patch for invalid xhtml output (attachment 62006)
patch for invalid xhtml output (attachment 62007)

Description geralds 2012-05-16 06:05:59 UTC
Created attachment 61713
Patch for utils/HtmlOutputDev.cc

Patch against r0.20.0 is attached.

The element names output by pdftohtml are in upper case, which is not valid under the DTD, so the output is rejected by epubcheck and other downstream tools.

The <hr> and <frame> elements are written with neither a closing tag nor the abbreviated empty-element notation (<hr/>, <frame/>).
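
For anyone who wants to check generated output for these two problems, here is a minimal sketch in Python. It is not part of the attached patch and not poppler code. The function name find_xhtml_problems and the sample markup are hypothetical. It accepts only the "/>" form as closed, so an explicit <hr></hr> pair would be reported as a false positive, and it does not handle attribute values that contain ">".

```python
import re

# Elements named in this report as lacking a closing tag or "/>" form.
VOID_ELEMENTS = {"hr", "frame"}

# Captures the optional leading slash, the element name, and the rest of the tag.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")


def find_xhtml_problems(markup):
    """Return (kind, tag_text) tuples for the two problems described above.

    Hypothetical helper for illustration only; a real validator such as
    epubcheck checks far more than this.
    """
    problems = []
    for match in TAG_RE.finditer(markup):
        closing, name, rest = match.groups()
        # XHTML element names must be lower case.
        if name != name.lower():
            problems.append(("uppercase-name", match.group(0)))
        # An opening <hr> or <frame> must end in "/>" to be treated as closed here.
        is_void = name.lower() in VOID_ELEMENTS
        if is_void and not closing and not rest.rstrip().endswith("/"):
            problems.append(("unclosed-empty-element", match.group(0)))
    return problems


if __name__ == "__main__":
    # Hypothetical sample, not actual pdftohtml output.
    sample = '<HTML><body><hr><frame src="a.html"/></body></HTML>'
    for kind, tag in find_xhtml_problems(sample):
        print(kind, tag)
```

Running it over the HTML files written by pdftohtml before and after applying the patch should show whether the two classes of error are gone.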

These errors are fixed by patch.txt applied to utils/HtmlOutputDev.cc.

Tested on CentOS Linux server against source built from r0.20.0 tarball.
Comment 1 Albert Astals Cid 2012-05-21 15:02:15 UTC
Can you please attach the "diff -u" output? It's much easier to read.
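
For reference, "diff -u" produces the unified format, where removed lines are prefixed with "-" and added lines with "+", surrounded by a few lines of context. A minimal sketch of generating the same format with Python's standard difflib module follows. The helper name unified_patch and the sample strings are hypothetical and are not taken from the actual patch.

```python
import difflib


def unified_patch(old_text, new_text, path):
    """Return a unified diff between two versions of a file as one string.

    Hypothetical helper for illustration; `path` is only used to label
    the "---" and "+++" header lines.
    """
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    diff = difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile="a/" + path,
        tofile="b/" + path,
    )
    return "".join(diff)


if __name__ == "__main__":
    # Made-up one-line example of the kind of change discussed in this bug.
    before = "<HR>\n"
    after = "<hr/>\n"
    print(unified_patch(before, after, "utils/HtmlOutputDev.cc"), end="")
```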
Comment 2 geralds 2012-05-23 02:25:21 UTC
Created attachment 62006
patch for invalid xhtml output

diff -u output attached
Comment 3 geralds 2012-05-23 02:44:29 UTC
Created attachment 62007
patch for invalid xhtml output

-u patch v2
Comment 4 Albert Astals Cid 2012-05-23 13:14:19 UTC
Hi, I'm going to need your full name so I can record it correctly when committing the patch.
Comment 5 geralds 2012-05-25 03:03:50 UTC
Thanks Albert, it's Gerald Schmidt.

(In reply to comment #4)
> Hi, i'm going to need your full name so i can put it correctly when commiting
> the patch.
Comment 6 Albert Astals Cid 2012-05-26 08:47:24 UTC
Pushed it to master; it will be part of poppler 0.22.
