I've been trying to incorporate pdftohtml into my frontend renderer and have had success with some documents. Other, more complex documents are causing problems, though.

My test document is the Nikon D3s brochure:

    wget http://imaging.nikon.com/products/imaging/lineup/digitalcamera/slr/d3s/pdf/d3s_16p.pdf

Rendering with the following produces a pretty accurate representation of the document:

    pdftohtml -c d3s_16p.pdf

However, when I output to XML using -xml, some of the images that worked previously are missing: they are neither extracted nor referenced in the XML output. In addition, the images that are extracted are listed with the wrong dimensions, so the resulting page layout is badly off. All of the text is rendered correctly, though.

I also tried the latest version from git, with the same results.
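To make the missing images and wrong dimensions easier to check, the image entries in the XML output could be listed with a small script. This is only a sketch: it assumes the -xml output nests `<image>` elements (with `src`, `width` and `height` attributes) inside `<page number="...">` elements, the helper name `list_images` is hypothetical, and the sample XML below is made up for illustration rather than taken from the real brochure.

```python
import xml.etree.ElementTree as ET


def list_images(xml_text):
    """Return one dict per <image> element found inside each <page>.

    Assumption: element and attribute names follow the layout described
    in the lead-in; adjust them if the actual output differs.
    """
    root = ET.fromstring(xml_text)
    images = []
    for page in root.iter("page"):
        for img in page.iter("image"):
            images.append({
                "page": page.get("number"),
                "src": img.get("src"),
                "width": float(img.get("width", 0)),
                "height": float(img.get("height", 0)),
            })
    return images


# Made-up sample input, not real output for d3s_16p.pdf.
SAMPLE = """<pdf2xml>
<page number="1" top="0" left="0" height="1262" width="892">
<image top="10" left="20" width="300" height="200" src="sample-1_1.jpg"/>
<text top="5" left="5" width="50" height="12" font="0">D3S</text>
</page>
</pdf2xml>"""

if __name__ == "__main__":
    for entry in list_images(SAMPLE):
        print(entry)
```

Running the same function over the XML file written by `pdftohtml -xml d3s_16p.pdf` and comparing the result with the images placed by `pdftohtml -c` would show which images are dropped and which dimensions differ.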
Not critical
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/127.