Created attachment 113678 [details]
Source PDF

When converting the attached PDF to XML using version 0.29.0:

  $ pdftohtml -xml in.pdf out.xml

it produces invalidly nested tags in the index portion near the end:

  $ xmllint out.xml
  out.xml:16770: parser error : Opening and ending tag mismatch: a line 16770 and b
  font="11"><b>Abrüstung<a href="out.xml#314"> </b><i>314, </i></a>

Looking at the source document (page 320/327), the closing </b> should occur before the opening <a>: the numbers are linked, not bold. Oddly, the error doesn't occur for all entries.
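To make the reported mismatch concrete, here is a minimal sketch of a nesting check in C++. It is not part of poppler or xmllint. The function name `is_well_nested` is hypothetical, and the parser is deliberately simplified: it handles only plain opening and closing tags, with no comments, CDATA, self-closing tags, or `>` inside attribute values.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns true if every closing tag matches the most recently opened tag
// that is still open, and nothing is left open at the end.
// Simplified on purpose: only <name ...> and </name> are understood.
bool is_well_nested(const std::string &xml) {
    std::vector<std::string> open;  // stack of currently open element names
    std::size_t pos = 0;
    while ((pos = xml.find('<', pos)) != std::string::npos) {
        const std::size_t end = xml.find('>', pos);
        if (end == std::string::npos) return false;  // unterminated tag
        const bool closing = (pos + 1 < end && xml[pos + 1] == '/');
        const std::size_t name_start = pos + (closing ? 2 : 1);
        // The name ends at the first whitespace or at the closing '>'.
        const std::size_t name_end = xml.find_first_of(" \t>", name_start);
        const std::string name = xml.substr(name_start, name_end - name_start);
        if (closing) {
            if (open.empty() || open.back() != name) return false;  // mismatch
            open.pop_back();
        } else {
            open.push_back(name);
        }
        pos = end + 1;
    }
    return open.empty();
}

// Usage example: the fragment from the report versus the expected nesting.
void demo_nesting() {
    assert(!is_well_nested("<b>Abr\u00fcstung<a href=\"out.xml#314\"> </b><i>314, </i></a>"));
    assert(is_well_nested("<b>Abr\u00fcstung</b><a href=\"out.xml#314\"><i>314, </i></a>"));
}
```

In the reported fragment, </b> arrives while <a> is still the innermost open element, which is the condition xmllint flags as an opening and ending tag mismatch.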
Created attachment 113679 [details] Smaller testcase produced using `pdfseparate -f 320 -l 320`
Created attachment 113680 [details] Invalid output Output of `pdftohtml -xml`
Created attachment 114252 [details] Patch to fix problems with xml wellformedness This fix is available at https://github.com/albbas/poppler in the fix_xml_wellformedness branch.
Created attachment 114253 [details] Diff showing the output of pdftohtml 0.30.0 versus the patched pdftohtml on out.pdf
Created attachment 114254 [details] 656 lines long diff showing the output of pdftohtml 0.30.0 versus the patched pdftohtml on 1078 pdf docs
Created attachment 114282 [details] pdf that produces xml with illegal xml chars
Created attachment 114283 [details]
xmllint output for xml-doc produced by pdftohtml 0.30.0

This is the output of `xmllint doc-with-strange-chars.xml`. doc-with-strange-chars.xml was produced with pdftohtml version 0.30.0. The errors reported include illegal char values, opening and ending tag mismatches, and premature end of data in tag.
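For reference, the "illegal char value" errors correspond to code points outside the Char production of XML 1.0. The following is a minimal sketch of that check in C++. The helper name `is_xml10_char` is hypothetical and this is not code from pdftohtml; it only encodes the ranges given in the XML 1.0 specification.

```cpp
#include <cassert>
#include <cstdint>

// True if the Unicode code point `cp` is allowed by the XML 1.0 "Char"
// production: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
// [#x10000-#x10FFFF].
constexpr bool is_xml10_char(std::uint32_t cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Usage example: most C0 control characters are rejected, tab is allowed.
void demo_xml_chars() {
    assert(is_xml10_char(0x09));   // tab
    assert(!is_xml10_char(0x00));  // NUL
    assert(!is_xml10_char(0x1F));  // unit separator
}
```

A producer could run such a check on each code point before writing text into the XML output. Whether to drop or replace the offending characters is a separate design decision that this sketch does not address.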
Created attachment 114284 [details]
Fix opening and ending tag mismatch when resulting xml document contains invalid xml chars

The difference between this patch and patch 114242 is this:

diff --git a/utils/HtmlOutputDev.cc b/utils/HtmlOutputDev.cc
index d725578..4915030 100644
--- a/utils/HtmlOutputDev.cc
+++ b/utils/HtmlOutputDev.cc
@@ -480,14 +480,16 @@ static bool tag_exists( std::list<std::string> tags, std::string tag )
 static void CloseTag(GooString *htext, std::list<std::string> &tags, std::string tag)
 {
+    size_t index = strlen(htext->getCString());
     while( !tags.empty() && tags.back() != tag ) {
         std::string current_tag = tags.back();
-        htext->append(current_tag.c_str(), current_tag.length());
+        htext->insert(index, current_tag.c_str());
+        index += current_tag.length();
         tags.pop_back();
     }
     if( !tags.empty()) {
         std::string current_tag = tags.back();
-        htext->append(current_tag.c_str(), current_tag.length());
+        htext->insert(index, current_tag.c_str());
         tags.pop_back();
     }
 }

When the htext variable contains the characters that trigger the "PCDATA invalid Char value" errors in xmllint, the append function does not work as it should. To force the ending tags onto the GooString, the insert function is used instead. This produces output that no longer has the "opening and ending tag mismatch" and "premature end of data in tag" errors when run through xmllint.
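The tag-closing logic in the diff can be sketched standalone as follows. This is an illustration, not the patch itself. It uses std::string in place of GooString so that it compiles without poppler, and the name `close_tag` is hypothetical. It assumes, as the diff suggests, that the list holds ready-made closing-tag strings such as "</b>", with the most recently opened element at the back. It does not reproduce the append versus insert difference described above.

```cpp
#include <cassert>
#include <list>
#include <string>

// Append pending closing tags to `out`, innermost first, until `tag`
// itself has been emitted. If `tag` is not in the list, every pending
// tag is closed, which matches the control flow of the loop in the diff.
void close_tag(std::string &out, std::list<std::string> &tags,
               const std::string &tag) {
    while (!tags.empty()) {
        const std::string current = tags.back();
        tags.pop_back();
        out += current;
        if (current == tag) break;  // the requested tag is now closed
    }
}

// Usage example: closing </b> while <a> is still open closes <a> first.
void demo_close_tag() {
    std::string out = "<b>x<a>y";
    std::list<std::string> tags = {"</b>", "</a>"};
    close_tag(out, tags, "</b>");
    assert(out == "<b>x<a>y</a></b>");
    assert(tags.empty());
}
```

Closing inner elements before the requested one keeps the output well nested, but any element closed early this way (a link, for example) has to be reopened by the caller if it should continue to cover the text that follows.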
Sorry for the delay in review. The patch fixes the tag order, but it breaks links. In your own document, with the unpatched pdftohtml, the 144 in Alleinerziehende is a link; with your patch it is not. That is wrong, since it is a link in the original. Please fix that.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/577.