Bug 98305

Summary: -xml outputs malformed xml
Product: poppler
Component: pdftohtml
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: RESOLVED MOVED
Severity: normal
Priority: medium
Reporter: daniel.van.den.ouden
Assignee: poppler-bugs
CC: daniel.van.den.ouden

Description daniel.van.den.ouden 2016-10-18 09:15:03 UTC
Overview:

    The following pdf causes pdftohtml to output malformed xml:
    http://www.atmel.com/images/Atmel-8284-8-bit-AVR-microcontroller-ATmega169A_PA_329A_PA_3290A_PA_649A_P_6490A_P_datasheet.pdf 
    The resulting xml file has multiple similar errors, the first one on line 71641:
    <text top="180" left="71" width="101" height="15" font="11"><b>Sp<a href="Atmel-8284-8-bit-AVR-microcontroller-ATmega169A_PA_329A_PA_3290A_PA_649A_P_6490A_P_datasheet.html#876">eed [MHz] </b>(3)</a></text>
    (the closing b and a tags are not in the correct order)
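
    The nesting problem in that line can be reproduced in isolation with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the helper name well_formedness_error is made up for illustration, the href is shortened for readability, and the "nested" variant is only one possible well-formed arrangement, not a claim about what pdftohtml should emit.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(fragment):
    """Return None if the fragment parses as XML, else the parser's message."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Same tag nesting as the reported line; href shortened (assumption for brevity).
reported = (
    '<text top="180" left="71" width="101" height="15" font="11">'
    '<b>Sp<a href="datasheet.html#876">eed [MHz] </b>(3)</a></text>'
)

# Hypothetical well-formed variant: <b> is closed before <a> is opened.
nested = (
    '<text top="180" left="71" width="101" height="15" font="11">'
    '<b>Speed [MHz] </b><a href="datasheet.html#876">(3)</a></text>'
)

print(well_formedness_error(reported))  # a parser error message
print(well_formedness_error(nested))    # None
```

    The parser rejects the reported line because </b> arrives while <a> is still open, which matches the observation above about the closing tag order.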

Steps to Reproduce: 

    1) wget http://www.atmel.com/images/Atmel-8284-8-bit-AVR-microcontroller-ATmega169A_PA_329A_PA_3290A_PA_649A_P_6490A_P_datasheet.pdf 

    2) pdftohtml -q -i -xml Atmel-8284-8-bit-AVR-microcontroller-ATmega169A_PA_329A_PA_3290A_PA_649A_P_6490A_P_datasheet.pdf output.xml

Actual Results: 

    malformed xml
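
    One way to confirm this on the generated file is to run it through an XML parser and look at where it stops. The sketch below assumes Python's standard xml.etree.ElementTree; the function name first_xml_error is hypothetical. The parser gives up at the first error, so it reports only that one location, not all of the similar errors in the file.

```python
import xml.etree.ElementTree as ET


def first_xml_error(path):
    """Return (line, column, message) for the first parse error, or None.

    The parser stops at the first well-formedness error, so later errors
    in the same file are not reported.
    """
    try:
        ET.parse(path)
    except ET.ParseError as exc:
        line, column = exc.position
        return line, column, str(exc)
    return None
```

    Calling first_xml_error("output.xml") after step 2 should return a tuple whose first element is the line number of the first offending line, or None if the file is well-formed.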

Expected Results: 

    well-formed xml. And I'm not quite sure if the link is placed on the correct piece of text. In the pdf only the text "(3)" is clickable and none of it is bold.

Build Date & Hardware: 

    Built on 2016-10-18 from source (0.48.0) on Ubuntu 14.04 LTS

Additional Builds and Platforms: 

    Also occurred in the version of pdftohtml that was installed using apt-get (0.28 if I recall correctly)


Cheers,


Daniel
Comment 1 daniel.van.den.ouden 2016-10-18 18:03:34 UTC
Possibly (probably) a duplicate of https://bugs.freedesktop.org/show_bug.cgi?id=89239
Comment 2 GitLab Migration User 2018-08-21 11:11:03 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/556.
