Bug 24890 - pdftohtml -xml produces invalid XML markup
Status: RESOLVED MOVED
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
URL: http://www.tml.tkk.fi/Studies/T-110.5...
Reported: 2009-11-04 00:44 UTC by Piotr Findeisen
Modified: 2018-08-21 12:21 UTC

Attachments
PDF that causes pdftohtml to produce invalid XML (342.08 KB, application/pdf)
2009-11-04 00:44 UTC, Piotr Findeisen

Description Piotr Findeisen 2009-11-04 00:44:16 UTC
Created attachment 30952
PDF that causes pdftohtml to produce invalid XML

On certain PDF files, the pdftohtml utility run with the '-xml' option produces XML that is not well-formed and cannot be parsed by conforming XML parsers.

Test case:
# wget -q http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf && \
    pdftohtml -xml -i -c -f 1 -l 1 -noframes burlacu_mihai.pdf x && \
    python -c 'from xml.parsers.expat import ParserCreate;
ParserCreate().ParseFile(open("x.xml"))'

With pdftohtml versions 0.6, 0.10 and 0.12 and Python 2.5, this produces:
Page-1
Traceback (most recent call last):
    File "<string>", line 2, in <module>
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 45, column 63

The offending byte is an ASCII control byte (0x11). It's followed by other ASCII control bytes.
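A minimal sketch for locating such bytes in the generated file, assuming only that the problem bytes are C0 control characters other than tab, line feed and carriage return (the three that XML 1.0 allows). The helper name find_invalid_xml_bytes is hypothetical and not part of poppler:

```python
def find_invalid_xml_bytes(data):
    """Return (offset, byte) pairs for C0 control bytes that XML 1.0 forbids."""
    allowed = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return
    return [(i, b) for i, b in enumerate(data) if b < 0x20 and b not in allowed]


# Illustrative input only; this is not the actual pdftohtml output.
sample = b'<text font="7">\x11\x0c ok</text>\n'
for offset, byte in find_invalid_xml_bytes(sample):
    print("offset %d: 0x%02x" % (offset, byte))
```

To check the real output, the same function can be applied to open("x.xml", "rb").read() under Python 3.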

I have attached burlacu_mihai.pdf here for reference in case it becomes unavailable on the original server. I'll contact the copyright owner for permission and will remove the file if the author doesn't allow it to be redistributed.
Comment 1 Reece H. Dunn 2009-11-04 00:51:20 UTC
From my reply to the original email:

I'm not sure what the fix is, but the line with the error is:
   <text top="632" left="152" width="58" height="0"
font="7">¥§¦©¨   ¥§    ¦ ¨   ¦</text>
and firefox gives:
   <text top="632" left="152" width="58" height="0"
font="7">¥§¦©¨   ¥§    ¦ ¨   ¦</text>
   ---------------------------------------------------------------^
(that is -- it is choking on the [00|11] character; there are also
other characters in the Latin-1 control character range (c < 0x20)).

This will cause any xml parser to choke, as the characters are
invalid. What I don't know is why/how these are appearing in
pdftohtml.

Looking at the PDF in okular (which appears to render the PDF
correctly there), shows a mathematical equation for the faulty lines,
specifically:

<text top="606" left="101" width="173" height="10" font="6">Digital
signal processing basic formula:</text>
<text top="632" left="101" width="25" height="10" font="6">y(t) =</text>
<text top="626" left="133" width="0" height="0" font="7"> </text>
<text top="631" left="133" width="0" height="0" font="7">¡</text>
<text top="647" left="128" width="0" height="0" font="7">¢</text>
<text top="646" left="134" width="11" height="0" font="7"> ¤£</text>
<text top="632" left="152" width="58" height="0"
font="7">¥§¦©¨   ¥§    ¦ ¨   ¦</text>

should be (in the proper math layout for this formula):

   y(t) = integral [above: inf, below: -inf] h(u)x(t - u)du

where the h(u)x(t - u)du bit is in the stylised script used in maths.

My initial thought is that the characters are referencing the Unicode
codepoints (e.g. in the U+2100 range). However, these all appear to be
in the ASCII range (i.e. not multi-byte UTF-8 as the encoding
suggests, but I may be wrong as there look to be more characters than
what is displayed).

Instead, they look like they are codepoints into a special
mathematical font (e.g. Symbol(?) in Windows (I don't have a Windows
box to hand at the moment, so can't verify the font name)). This would
make sense given the font="7" attribute and the seemingly random
characters. And given the greater number of characters, this looks to
be using a non-UTF-8 multi-byte encoding.
Comment 2 Reece H. Dunn 2009-11-04 00:58:25 UTC
In the generated html and/or xml file (the html output is similarly affected), control characters (and other invalid xml characters) are being written out to the file.

The XML version will cause an XML parser to generate an error when it encounters these characters (it first chokes on a 0x11 control character).

The HTML version loads the page, but displays garbage instead of the integral equation (same as viewing either file in a text editor).

a/ My initial thought was that the characters (such as the integral sign) in the integral part of the equation were being written as Unicode. For example, U+222B is the Unicode code point for the integral sign (∫) [1]. If that were the case, the UTF-8 forms would be written out and it would form valid XML and HTML output.

b/ My next thought was that the control characters in the ASCII and Unicode code pages corresponded to the correct glyphs when using font 7 (e.g. if the font was a special mathematical font). This is not right, as the font is the Times font (looking at the font definition), and the number of characters in that element doesn't match the number of characters in the equation (thus indicating a multi-byte encoding is being used).

c/ Another possibility that has occurred to me is that what is shown is the raw UTF-8 byte sequence, and that is being UTF-8 encoded! Removing the UTF-8 encoding in the html version, I get " ¤£" instead of " ¤£" for one of the outputted text nodes.

My current thinking (without looking at the code yet, just from examining the behaviour) is that (c) is the most probable cause of this error.
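To illustrate what hypothesis (c) would look like, here is a small sketch of double UTF-8 encoding, using the integral sign as the example. It assumes the intermediate bytes are misread as Latin-1; that is a guess made for the demonstration, not something verified against the pdftohtml code, and the function names are hypothetical:

```python
def double_encode(text):
    """Encode as UTF-8, misread the bytes as Latin-1, then encode as UTF-8 again."""
    once = text.encode("utf-8")
    misread = once.decode("latin-1")  # every byte becomes one code point
    return misread.encode("utf-8")


def undo_double_encode(data):
    """Reverse the two steps above, recovering the original text."""
    return data.decode("utf-8").encode("latin-1").decode("utf-8")


integral = "\u222b"  # the integral sign
twice = double_encode(integral)
print(len(integral.encode("utf-8")), len(twice))  # byte counts before and after
print(undo_double_encode(twice) == integral)
```

If (c) were the cause, reversing the second encoding as in undo_double_encode should recover readable text from the output. If it does not, the bytes are more likely raw character codes from the font.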

[1] http://en.wikipedia.org/wiki/Integral_symbol
Comment 3 Reece H. Dunn 2009-11-04 01:36:44 UTC
pdftotext (and likely the other utilities) is affected. It gives:

    £ ¤  ¢ ¦ §©§¦¥  ¡ ¨¦  ¥ ¨

for the equation (it even misses out the `y(t) =` part).
Comment 4 Reece H. Dunn 2009-11-04 01:45:39 UTC
The text is being written out on line 651 of utils/HtmlOutputDev.cc for xml (there are other cases for the simple and complex html outputs, not to mention the places in the other output formatters). The line is:

      fputs(str1->getCString(),f);

where str1 is a GooString. This assumes that what is returned from getCString() is in the correct encoding (i.e. the one that has been requested by the output stream).
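One possible mitigation at this point is to filter the text before it is written. The sketch below is in Python rather than the C++ of HtmlOutputDev.cc and only illustrates the idea: replace every code point outside the XML 1.0 Char production with U+FFFD, then escape markup characters. The name sanitize_xml_text and the choice of U+FFFD are assumptions, not poppler behaviour:

```python
import re
from xml.parsers.expat import ParserCreate
from xml.sax.saxutils import escape

# Matches anything outside the XML 1.0 "Char" production.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def sanitize_xml_text(text, replacement="\ufffd"):
    """Replace characters XML 1.0 forbids, then escape &, < and >."""
    return escape(_INVALID_XML_CHARS.sub(replacement, text))


# Illustrative text node containing the 0x11 control character.
node = "<text>" + sanitize_xml_text("y(t) = \x11 h(u) < 1") + "</text>"
ParserCreate().Parse(node, True)  # raises ExpatError if not well-formed
print(node)
```

This only makes the output parseable. It does not recover the intended glyphs, so the garbage shown for the equation would still need a separate fix.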
Comment 5 Michael Blais Younkin 2012-07-06 06:40:46 UTC
Has any progress been made on this issue? I am attempting to extract the text with position from a large number of PDF files, and I have encountered this issue several times so far.
Comment 6 Ori Avtalion 2015-11-24 19:59:24 UTC
Confirming the issue with the latest Poppler, 0.38.

For example, with the following file:
http://www.ikea.com/us/en/assembly_instructions/dioder-led-multi-use-light__AA-241740-1_pub.PDF

The output for page 2 includes the ASCII control characters 0x01, 0x12, 0x0c, 0x0d, 0x0e.
Comment 7 Ori Avtalion 2015-11-28 13:13:44 UTC
The file mentioned in Comment 6 exhibits another bug:
The xml for page 4 opens some <b> tags and doesn't close them.
Comment 8 Ori Avtalion 2015-11-28 13:52:33 UTC
(In reply to Ori Avtalion from comment #7)
> The file mentioned in Comment 6 exhibits another bug:
> The xml for page 4 opens some <b> tags and doesn't close them.

Another problem in the code I'm noticing:
When text is both bold and italic, the code outputs <b><i> in some cases and <i><b> in others. The order should be consistent, and the CloseTags function should close the tags in the reverse of the order in which they were opened.
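A rough sketch of the ordering I have in mind, written in Python for brevity rather than in the C++ of the actual code. The function wrap_with_styles is hypothetical: tags are always opened in one fixed order and closed by walking the same list backwards:

```python
def wrap_with_styles(text, bold=False, italic=False):
    """Open tags in a fixed order (b, then i) and close them in reverse."""
    stack = []
    if bold:
        stack.append("b")
    if italic:
        stack.append("i")
    opening = "".join("<%s>" % tag for tag in stack)
    closing = "".join("</%s>" % tag for tag in reversed(stack))
    return opening + text + closing


print(wrap_with_styles("bold italic", bold=True, italic=True))
print(wrap_with_styles("italic only", italic=True))
```

Keeping the open tags on a stack also makes it harder to leave a <b> unclosed, which is the problem from comment #7.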
Comment 9 Daniel Stone 2018-08-21 12:09:37 UTC
Sorry, but for some reason Bugzilla failed to generate valid XML to export this bug when migrating to GitLab. As a result we've been unable to duplicate it there. Please file a new issue on https://gitlab.freedesktop.org/poppler/poppler/issues/new/ referencing this one. Thanks, and sorry again for the inconvenience.
Comment 10 Ori Avtalion 2018-08-21 12:21:24 UTC
Oh, the irony. So now I need to open a bug about Bugzilla producing invalid XML.

