Bug 89239 - pdftohtml produces wrongly nested tags
Status: RESOLVED MOVED
Alias: None
Product: poppler
Classification: Unclassified
Component: pdftohtml
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
 
Reported: 2015-02-20 01:39 UTC by pascal
Modified: 2018-08-21 11:14 UTC



Attachments
Source PDF (1.99 MB, text/plain)
2015-02-20 01:39 UTC, pascal
Smaller testcase (76.08 KB, application/pdf)
2015-02-20 01:43 UTC, pascal
Invalid output (5.79 KB, application/xml)
2015-02-20 01:44 UTC, pascal
Patch to fix problems with xml wellformedness (6.75 KB, text/plain)
2015-03-12 10:36 UTC, albbas
Diff showing the output of pdftohtml 0.30.0 versus the patched pdftohtml on out.pdf (7.76 KB, text/plain)
2015-03-12 10:41 UTC, albbas
656 lines long diff showing the output of pdftohtml 0.30.0 versus the patched pdftohtml on 1078 pdf docs (46.91 KB, text/plain)
2015-03-12 10:49 UTC, albbas
pdf that produces xml with illegal xml chars (115.74 KB, text/plain)
2015-03-13 10:54 UTC, albbas
xmllint output for xml-doc produced by pdftohtml 0.30.0 (28.42 KB, text/plain)
2015-03-13 11:10 UTC, albbas
Fix opening and ending tag mismatch when resulting xml document contains invalid xml chars (6.82 KB, text/plain)
2015-03-13 11:29 UTC, albbas

Description pascal 2015-02-20 01:39:05 UTC
Created attachment 113678
Source PDF

When converting the attached PDF to XML using version 0.29.0
$ pdftohtml -xml in.pdf out.xml 

It produces invalidly nested tags in the index portion near the end:
$ xmllint out.xml
out.xml:16770: parser error : Opening and ending tag mismatch: a line 16770 and b
font="11"><b>Abrüstung<a href="out.xml#314"> </b><i>314, </i></a>

Looking at the source document (page 320 of 327), the closing </b> should occur before the opening <a>: the numbers are linked and not bold.
Oddly, the error doesn't occur for all entries.
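
For readers without xmllint at hand, the mismatch can be illustrated with a small stack-based check. This is only a sketch: `tags_nest_properly` is a hypothetical helper, not part of poppler or libxml2, and it assumes simplified input with no comments, CDATA sections, or `>` characters inside attribute values.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not poppler or libxml2 code).
// Returns true if every closing tag in `xml` matches the most recently
// opened element and nothing is left open at the end.
static bool tags_nest_properly(const std::string &xml)
{
    std::vector<std::string> open; // names of currently open elements
    std::size_t pos = 0;
    while ((pos = xml.find('<', pos)) != std::string::npos) {
        std::size_t end = xml.find('>', pos);
        if (end == std::string::npos) {
            return false; // unterminated tag
        }
        std::string body = xml.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        if (body.empty()) {
            return false; // "<>" is not a tag
        }
        if (body.back() == '/') {
            continue; // self-closing element, nothing to track
        }
        if (body[0] == '/') {
            // Closing tag: must match the innermost open element.
            if (open.empty() || open.back() != body.substr(1)) {
                return false;
            }
            open.pop_back();
        } else {
            // Opening tag: remember the element name without attributes.
            open.push_back(body.substr(0, body.find_first_of(" \t\n")));
        }
    }
    return open.empty();
}
```

Applied to the fragment above, the check fails at `</b>` because `<a>` is still the innermost open element; moving `</b>` in front of `<a href=...>` makes the same fragment pass.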
Comment 1 pascal 2015-02-20 01:43:19 UTC
Created attachment 113679
Smaller testcase

Produced using `pdfseparate -f 320 -l 320`
Comment 2 pascal 2015-02-20 01:44:54 UTC
Created attachment 113680
Invalid output

Output of `pdftohtml -xml`
Comment 3 albbas 2015-03-12 10:36:07 UTC
Created attachment 114252
Patch to fix problems with xml wellformedness

This fix is available at https://github.com/albbas/poppler in the fix_xml_wellformedness branch.
Comment 4 albbas 2015-03-12 10:41:13 UTC
Created attachment 114253
Diff showing the output of pdftohtml 0.30.0 versus the patched pdftohtml on out.pdf
Comment 5 albbas 2015-03-12 10:49:02 UTC
Created attachment 114254
656 lines long diff showing the output of pdftohtml 0.30.0 versus the patched pdftohtml on 1078 pdf docs
Comment 6 albbas 2015-03-13 10:54:38 UTC
Created attachment 114282
pdf that produces xml with illegal xml chars
Comment 7 albbas 2015-03-13 11:10:20 UTC
Created attachment 114283
xmllint output for xml-doc produced by pdftohtml 0.30.0

This is the output of `xmllint doc-with-strange-chars.xml`.

doc-with-strange-chars.xml was produced with pdftohtml version 0.30.0.

The reported errors are illegal char values, opening and ending tag mismatches, and premature end of data in tag.
Comment 8 albbas 2015-03-13 11:29:01 UTC
Created attachment 114284
Fix opening and ending tag mismatch when resulting xml document contains invalid xml chars

The difference between this patch and patch 114252 is this:
diff --git a/utils/HtmlOutputDev.cc b/utils/HtmlOutputDev.cc
index d725578..4915030 100644
--- a/utils/HtmlOutputDev.cc
+++ b/utils/HtmlOutputDev.cc
@@ -480,14 +480,16 @@ static bool tag_exists( std::list<std::string> tags, std::string tag )
 
 static void CloseTag(GooString *htext, std::list<std::string> &tags, std::string tag)
 {
+    size_t index = strlen(htext->getCString());
     while( !tags.empty() && tags.back() != tag ) {
         std::string current_tag = tags.back();
-        htext->append(current_tag.c_str(), current_tag.length());
+        htext->insert(index, current_tag.c_str());
+        index += current_tag.length();
         tags.pop_back();
     }
     if( !tags.empty()) {
       std::string current_tag = tags.back();
-      htext->append(current_tag.c_str(), current_tag.length());
+      htext->insert(index, current_tag.c_str());
       tags.pop_back();
     }
 }

When the htext variable contains the characters that trigger the "PCDATA invalid Char value" errors in xmllint, the append function does not work as it should.

To force the ending tags onto the end of the GooString, the patch uses the insert function instead.

The resulting output no longer triggers the "opening and ending tag mismatch" and "premature end of data in tag" errors when run through xmllint.
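
The closing order that CloseTag implements can be sketched independently of poppler. In the minimal sketch below, `close_tag` is a hypothetical stand-in: it uses `std::string` instead of `GooString`, and it assumes, as the diff suggests, that the list stores the closing-tag text of each open element with the innermost element last.

```cpp
#include <list>
#include <string>

// Hypothetical stand-in for CloseTag (not the poppler implementation).
// `out` plays the role of htext; `open_tags` holds closing-tag text such
// as "</b>" for every element that is currently open, innermost last.
static void close_tag(std::string &out, std::list<std::string> &open_tags,
                      const std::string &tag)
{
    // Close every element that was opened after `tag`, innermost first.
    while (!open_tags.empty() && open_tags.back() != tag) {
        out += open_tags.back();
        open_tags.pop_back();
    }
    // Close `tag` itself, if it was open at all.
    if (!open_tags.empty()) {
        out += open_tags.back();
        open_tags.pop_back();
    }
}
```

With `out` holding `<b>Abrüstung<a href="out.xml#314"> ` and the list holding `</b>` then `</a>`, asking for `</b>` appends `</a></b>`, which is well formed. In this sketch the `<a>` element is closed early and never reopened, so any text emitted afterwards is no longer inside the link.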
Comment 9 Albert Astals Cid 2015-03-27 21:53:04 UTC
Sorry for the delay in review.

The patch fixes the tag order, but it breaks links. In your own document, with the unpatched pdftohtml, the 144 in Alleinerziehende is a link; with your patch, it is not. That is wrong, since it is a link in the original.

Please fix that.
Comment 10 GitLab Migration User 2018-08-21 11:14:30 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/577.