When I convert http://nrk.no/contentfile/file/1.8116520!offentligjournal02052012.pdf to XML using

    pdftohtml -xml -noframes 1.8116520\!offentligjournal02052012.pdf

I get the following content-less XML file. I find this rather strange, as the PDF is searchable using xpdf, okular and evince. Any idea where the text went? Anything I can do to get access to the text as XML?

This is the output I get:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
    <pdf2xml>
    <page number="1" position="absolute" top="0" left="0" height="792" width="612">
        <fontspec id="0" size="18" family="Helvetica" color="#000000"/>
        <fontspec id="1" size="5" family="Helvetica" color="#000000"/>
        <fontspec id="2" size="5" family="Helvetica" color="#000000"/>
        <fontspec id="3" size="7" family="Helvetica" color="#000000"/>
    </page>
    <page number="2" position="absolute" top="0" left="0" height="792" width="612">
        <fontspec id="4" size="6" family="Helvetica" color="#000000"/>
    </page>
    <page number="3" position="absolute" top="0" left="0" height="792" width="612">
    </page>
    <page number="4" position="absolute" top="0" left="0" height="792" width="612">
    </page>
    <page number="5" position="absolute" top="0" left="0" height="792" width="612">
    </page>
    <page number="6" position="absolute" top="0" left="0" height="792" width="612">
    </page>
    <page number="7" position="absolute" top="0" left="0" height="792" width="612">
    </page>
    <page number="8" position="absolute" top="0" left="0" height="792" width="612">
    </page>
    <page number="9" position="absolute" top="0" left="0" height="792" width="612">
    </page>
    <page number="10" position="absolute" top="0" left="0" height="792" width="612">
    </page>
    <page number="11" position="absolute" top="0" left="0" height="792" width="612">
    </page>
    <page number="12" position="absolute" top="0" left="0" height="792" width="612">
    </page>
    <page number="13" position="absolute" top="0" left="0" height="792" width="612">
    </page>
    <page number="14" position="absolute" top="0" left="0" height="792" width="612">
    </page>
    <page number="15" position="absolute" top="0" left="0" height="792" width="612">
    </page>
    <page number="16" position="absolute" top="0" left="0" height="792" width="612">
    </page>
    <page number="17" position="absolute" top="0" left="0" height="792" width="612">
    </page>
    <page number="18" position="absolute" top="0" left="0" height="792" width="612">
    </page>
    <page number="19" position="absolute" top="0" left="0" height="792" width="612">
    </page>
    <page number="20" position="absolute" top="0" left="0" height="792" width="612">
    </page>
    </pdf2xml>

This problem is also reported to Debian as http://bugs.debian.org/676238
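For anyone wanting to check quickly which pages of such a file carry no text, here is a minimal Python sketch. It assumes that text runs appear as `<text>` children of each `<page>` element, which is how pdftohtml's XML output normally looks. The helper name `pages_without_text` and the sample data are made up for illustration; the `<text>` element in the sample does not come from the PDF in this report.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for pdftohtml -xml output. The <text> element on page 1 is
# invented for illustration; the real file in this report has none at all.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="792" width="612">
<fontspec id="0" size="18" family="Helvetica" color="#000000"/>
<text top="40" left="50" width="100" height="18" font="0">Example</text>
</page>
<page number="2" position="absolute" top="0" left="0" height="792" width="612">
<fontspec id="4" size="6" family="Helvetica" color="#000000"/>
</page>
<page number="3" position="absolute" top="0" left="0" height="792" width="612">
</page>
</pdf2xml>"""


def pages_without_text(xml_source):
    """Return the numbers of <page> elements that have no <text> child."""
    root = ET.fromstring(xml_source)
    return [
        int(page.get("number"))
        for page in root.iter("page")
        if page.find("text") is None  # fontspec-only or empty pages match
    ]


if __name__ == "__main__":
    print(pages_without_text(SAMPLE))  # prints [2, 3]
```

Run against the output quoted above, a check like this would list every page from 1 to 20, since none of them contains a `<text>` element.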
This bug was reported against poppler 0.12.4 (old), but it can also be reproduced with the newer poppler 0.18.4. I have not tried 0.20.x, though. Note that adding -hidden to the arguments as well makes the text show up in the XML output.
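As a sketch of how the -hidden workaround could be scripted, the Python snippet below builds the same pdftohtml invocation as in the report, with -hidden added. The function names `pdftohtml_xml_command` and `convert` are hypothetical, and the snippet assumes a `pdftohtml` binary is available on the PATH. It only uses the flags mentioned in this thread.

```python
import subprocess


def pdftohtml_xml_command(pdf_path, include_hidden=True):
    """Build the pdftohtml argument list used in this report.

    With include_hidden=True the -hidden flag is added, which according to
    the comment above makes the text show up in the XML output.
    """
    cmd = ["pdftohtml", "-xml", "-noframes"]
    if include_hidden:
        cmd.append("-hidden")
    cmd.append(pdf_path)
    return cmd


def convert(pdf_path, include_hidden=True):
    """Run pdftohtml; raises CalledProcessError if it exits non-zero."""
    subprocess.run(pdftohtml_xml_command(pdf_path, include_hidden), check=True)


if __name__ == "__main__":
    # Only prints the command; call convert() to actually run pdftohtml.
    print(pdftohtml_xml_command("1.8116520!offentligjournal02052012.pdf"))
```

Because the arguments are passed as a list rather than through a shell, the `!` in the file name needs no backslash escaping here.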
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/417.