Bug 50739 - pdftohtml -xml fails to extract text that is extracted in pdftotext
Status: RESOLVED MOVED
Product: poppler
Classification: Unclassified
Component: pdftohtml
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
Reported: 2012-06-05 09:21 UTC by Petter Reinholdtsen
Modified: 2018-08-21 10:53 UTC

Description Petter Reinholdtsen 2012-06-05 09:21:14 UTC
When I convert
http://nrk.no/contentfile/file/1.8116520!offentligjournal02052012.pdf
to XML using

  pdftohtml -xml -noframes 1.8116520\!offentligjournal02052012.pdf

I get the following XML file, which contains no text.  I find this
rather strange, as the PDF is searchable using xpdf, okular and evince.
Any idea where the text went?  Is there anything I can do to get access
to the text as XML?

This is the output I get:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="792" width="612">
        <fontspec id="0" size="18" family="Helvetica" color="#000000"/>
        <fontspec id="1" size="5" family="Helvetica" color="#000000"/>
        <fontspec id="2" size="5" family="Helvetica" color="#000000"/>
        <fontspec id="3" size="7" family="Helvetica" color="#000000"/>
</page>
<page number="2" position="absolute" top="0" left="0" height="792" width="612">
        <fontspec id="4" size="6" family="Helvetica" color="#000000"/>
</page>
<page number="3" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="4" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="5" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="6" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="7" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="8" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="9" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="10" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="11" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="12" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="13" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="14" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="15" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="16" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="17" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="18" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="19" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="20" position="absolute" top="0" left="0" height="792" width="612">
</page>
</pdf2xml>

This problem is also reported to Debian as http://bugs.debian.org/676238
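
To check the symptom without reading the XML by eye, here is a minimal Python sketch that counts the text elements on each page. It assumes that pdftohtml -xml writes each text run as a `<text>` child of `<page>`, which is what it normally emits when text is found. The function name `text_counts_per_page` is made up for illustration, and the sample is an abbreviated copy of the output above.

```python
import xml.etree.ElementTree as ET


def text_counts_per_page(xml_data):
    """Return {page number: number of <text> children} for pdf2xml output."""
    root = ET.fromstring(xml_data)
    counts = {}
    for page in root.iter("page"):
        # fontspec elements are ignored; only direct <text> children count.
        counts[int(page.get("number"))] = len(page.findall("text"))
    return counts


# Abbreviated copy of the output quoted in this report (two pages only).
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="792" width="612">
        <fontspec id="0" size="18" family="Helvetica" color="#000000"/>
</page>
<page number="2" position="absolute" top="0" left="0" height="792" width="612">
</page>
</pdf2xml>
"""

counts = text_counts_per_page(sample)
empty_pages = [number for number, count in counts.items() if count == 0]
print(empty_pages)
```

A page that lists fontspec entries but has a count of zero, like page 1 here, suggests that the fonts were seen but the text itself was dropped.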
Comment 1 Pino Toscano 2012-06-21 03:43:01 UTC
This bug was reported against poppler 0.12.4 (old), but it can also be reproduced with the newer poppler 0.18.4. I have not tried 0.20.x, though.

Note that also adding -hidden to the arguments makes the text show up in the XML output.
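
For anyone scripting the conversion, the workaround could be sketched in Python roughly as follows. The helper names `pdftohtml_xml_command` and `convert` are hypothetical; the flags are the ones already used in this report plus -hidden. This is only a sketch and assumes pdftohtml is installed and on PATH.

```python
import shutil
import subprocess


def pdftohtml_xml_command(pdf_path, hidden=True):
    """Build the pdftohtml argument list used in this report."""
    cmd = ["pdftohtml", "-xml", "-noframes"]
    if hidden:
        # Workaround from this comment: also extract hidden text.
        cmd.append("-hidden")
    cmd.append(pdf_path)
    return cmd


def convert(pdf_path, hidden=True):
    """Run the conversion; raises if pdftohtml is missing or fails."""
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml not found on PATH")
    subprocess.run(pdftohtml_xml_command(pdf_path, hidden), check=True)


# Passing a list to subprocess needs no shell escaping of the "!".
print(" ".join(pdftohtml_xml_command("1.8116520!offentligjournal02052012.pdf")))
```

Calling `convert(...)` with `hidden=False` should reproduce the original, text-less output for comparison.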
Comment 2 GitLab Migration User 2018-08-21 10:53:54 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/417.

