Bug 50739

Summary: pdftohtml -xml fails to extract text that is extracted in pdftotext
Product: poppler
Reporter: Petter Reinholdtsen <pere>
Component: pdftohtml
Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED MOVED
QA Contact:
Severity: normal    
Priority: medium    
Version: unspecified   
Hardware: Other   
OS: All   

Description Petter Reinholdtsen 2012-06-05 09:21:14 UTC
When I convert
http://nrk.no/contentfile/file/1.8116520!offentligjournal02052012.pdf
to XML using

  pdftohtml -xml -noframes 1.8116520\!offentligjournal02052012.pdf

I get the following content-less XML file.  I find this rather strange,
as the PDF is searchable using xpdf, okular and evince.  Any idea where
the text went?  Anything I can do to get access to the text as XML?

This is the output I get:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="792" width="612">
        <fontspec id="0" size="18" family="Helvetica" color="#000000"/>
        <fontspec id="1" size="5" family="Helvetica" color="#000000"/>
        <fontspec id="2" size="5" family="Helvetica" color="#000000"/>
        <fontspec id="3" size="7" family="Helvetica" color="#000000"/>
</page>
<page number="2" position="absolute" top="0" left="0" height="792" width="612">
        <fontspec id="4" size="6" family="Helvetica" color="#000000"/>
</page>
<page number="3" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="4" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="5" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="6" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="7" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="8" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="9" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="10" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="11" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="12" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="13" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="14" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="15" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="16" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="17" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="18" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="19" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="20" position="absolute" top="0" left="0" height="792" width="612">
</page>
</pdf2xml>

This problem has also been reported to Debian as http://bugs.debian.org/676238
Comment 1 Pino Toscano 2012-06-21 03:43:01 UTC
This bug was reported against poppler 0.12.4 (old), but it can also be reproduced with the newer poppler 0.18.4. I have not tried 0.20.x, though.

Note that also adding -hidden to the arguments makes the text show up in the XML output.
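
To check whether a given run actually produced any text, the XML output can be inspected programmatically. The following is a minimal sketch, not part of poppler: it assumes that extracted text appears as <text> elements inside each <page> element, as pdftohtml -xml normally emits. The function name and the sample document are hypothetical and only serve as illustration.

```python
import xml.etree.ElementTree as ET


def count_text_per_page(xml_data):
    """Return {page number: number of <text> elements} for pdf2xml output.

    Assumes the layout shown in this report: a <pdf2xml> root containing
    <page number="..."> elements, with extracted text in <text> children.
    """
    root = ET.fromstring(xml_data)
    counts = {}
    for page in root.iter("page"):
        number = int(page.get("number"))
        counts[number] = sum(1 for _ in page.iter("text"))
    return counts


# Hypothetical sample: page 1 has one text element, page 2 has none.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="792" width="612">
        <fontspec id="0" size="18" family="Helvetica" color="#000000"/>
        <text top="10" left="10" width="50" height="18" font="0">Hello</text>
</page>
<page number="2" position="absolute" top="0" left="0" height="792" width="612">
</page>
</pdf2xml>
"""

if __name__ == "__main__":
    counts = count_text_per_page(SAMPLE)
    empty_pages = [n for n, c in counts.items() if c == 0]
    print("text elements per page:", counts)
    print("pages without text:", empty_pages)
```

Running this over the file generated with and without -hidden should make the difference visible: for the output quoted in the description, every page would be expected to report zero text elements.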
Comment 2 GitLab Migration User 2018-08-21 10:53:54 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/417.
