In pdftohtml, both in the old pre-poppler version and in the (current?) version used by calibre, images are sometimes flipped vertically when a PDF file is converted to HTML: mirrored, that is, not rotated 180 degrees. This can be reproduced by running pdftohtml on http://www.14ers.com/pdf/yale1_print.pdf; similarly, if you import this file into calibre and convert it to EPUB or another ebook format, the images are flipped. This was reported against pdftohtml years ago but apparently never acted on, and I presume the bug report was lost in the transition. As far as I can tell, there is no report on it in the poppler bugzilla. More information may be found in the following old bug reports, or I can be contacted on IRC.

http://bugs.calibre-ebook.com/ticket/227 <-- original report against calibre
http://sourceforge.net/tracker/index.php?func=detail&aid=1808775&group_id=45839&atid=444239 <-- original report against pdftohtml
http://bugs.calibre-ebook.com/ticket/7874 <-- duplicate report against calibre
By chance I have found that pdf2xml (based on Xpdf) internally calculates the flip of the images. In the file http://pdf2xml.cvs.sourceforge.net/viewvc/pdf2xml/pdf2xml/src/XmlOutputDev.cc?revision=1.13&view=markup search for "Flip" to see how the flips are identified.

Flipping the image itself is probably too much to ask of pdftohtml. But it would be nice if pdftohtml at least recorded somewhere that the image requires an x or y flip, e.g. in the file name using a special tag (for example: xflip, yflip, xyflip). That would at least allow further (automated) processing of the output to correct the problem.
I see that in the development branch of poppler, `pdftohtml -xml` already extracts some(*) of the images and also saves the orientation:

<image top="842" left="56" width="513" height="-782" src="left-1_1.jpg"/>

Note the negative height, which denotes a vertical flip.

(*) It seems that pdftohtml at the moment can't handle monochrome images. In my case, `pdftohtml -xml` built from git extracted all JPEGs, but the monochrome images used at chapter openings were not extracted. It looks to me like the monochrome images are painted via HtmlOutputDev::drawImageMask(), which at the moment handles only verbatim JPEGs.
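For anyone post-processing the XML, the negative-size hint noted above can be read back with a few lines of Python. This is only a sketch: the function name `image_flips` is made up for illustration, the `<pdf2xml>`/`<page>` wrapper in the sample is an assumption about the surrounding document, and treating a negative width as a horizontal flip is assumed by analogy with the negative height.

```python
import xml.etree.ElementTree as ET


def image_flips(xml_text):
    """Return (src, flip_x, flip_y) for every <image> element.

    Assumption: a negative width marks a horizontal flip and a
    negative height marks a vertical flip.
    """
    root = ET.fromstring(xml_text)
    flips = []
    for img in root.iter("image"):
        width = int(img.get("width", "0"))
        height = int(img.get("height", "0"))
        flips.append((img.get("src"), width < 0, height < 0))
    return flips


# Sample built around the <image> element quoted above; the wrapper
# elements are an assumption, not copied from real output.
sample = """<pdf2xml><page number="1">
<image top="842" left="56" width="513" height="-782" src="left-1_1.jpg"/>
</page></pdf2xml>"""

for src, flip_x, flip_y in image_flips(sample):
    print(src, "xflip" if flip_x else "", "yflip" if flip_y else "")
```

A converter could then pass each flagged file to whatever image tool it already uses.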
Created attachment 58269
patch for HtmlOutputDev.cc to extract monochrome mask images as PNGs

I have put together an experimental patch to extract the monochrome images as PNGs. The patch applies to the git version of poppler, as cloned a few hours ago. Compiled and tested on Linux/AMD64 with several of my PDF books in -xml and -noframe modes.
Comment on attachment 58269
patch for HtmlOutputDev.cc to extract monochrome mask images as PNGs

For the fix to the monochrome issue, please see bug 47186, where there is a new version of the patch.
Created attachment 58294
the patch to flip image h/v if PNG support is present

A patch for pdftohtml to flip the images vertically and/or horizontally.

Gotchas:
(1) Image pixels are read fully into memory; in other words, the image must fit into RAM.
(2) The flip can be done on PNGs, the format we have control over, so flipped JPEGs are internally converted into PNGs (if PNG support is enabled; otherwise they are dumped as is).

Please note that the patch applies on top of the patch from bug 47186.
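To illustrate gotcha (1): a vertical flip has to emit the last pixel row first, so the simplest approach holds every row in memory before writing anything. The sketch below is not the code from the patch; it is a minimal Python illustration on a list of pixel rows, and the name `flip_pixels` is hypothetical.

```python
def flip_pixels(rows, flip_x=False, flip_y=False):
    """Return a flipped copy of an image given as a list of pixel rows.

    All rows must already be in memory: a vertical flip needs the
    last row first.
    """
    if flip_y:
        rows = rows[::-1]  # reverse the row order (vertical flip)
    if flip_x:
        rows = [row[::-1] for row in rows]  # reverse each row (horizontal flip)
    return [list(row) for row in rows]  # always hand back a fresh copy


image = [[1, 2],
         [3, 4]]
print(flip_pixels(image, flip_y=True))
print(flip_pixels(image, flip_x=True, flip_y=True))
```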
Created attachment 58299
image flip patch, v2

New version of the patch. Difference from the previous one: if JPEG support is present, it tries to preserve the fact that the images are JPEGs and, after flipping them, writes them to disk as JPEGs.
Wondering if it shouldn't just be easier to output some css to do the flipping?
(In reply to comment #7)
> Wondering if it shouldn't just be easier to output some css to do the flipping?

CSS can be nice. But keep in mind that most users of pdftohtml in fact use it via calibre.

Honestly, I think that calibre should switch to `pdftohtml -xml`. That would also allow them to implement better paragraph recognition. And the XML already contains the hint about the image flip (width and/or height are negative), which hopefully is sufficient.

I can't be sure, but if Debian's dependencies do not lie, calibre already uses ImageMagick. Thus, on the poppler side it would be sufficient to give some easily accessible hint in the HTML output that the image is flipped. For that, one needs feedback from the calibre developers. I pinged them with a new ticket, as I couldn't find the original one: https://bugs.launchpad.net/calibre/+bug/952646
Feedback from calibre (https://bugs.launchpad.net/calibre/+bug/952646):

> A hint in the html/css is fine for calibre's purposes. Note that calibre's next
> gen PDF engine handles image rotation automatically using ImageMagick (you
> might want to use that code for a simpler patch). See pdf/images.cpp. Of
> course, that means poppler will have to depend on imagemagick, which may be a
> problem for them.

So for HTML output, CSS it is. I will try to make a patch for that.

--

I guess that's the way to do it in CSS: http://stackoverflow.com/questions/1309055/cross-browser-way-to-flip-html-image-via-javascript-css

Unfortunately the CSS is a bit too heavy for an inline style attribute. Albert, would you prefer the CSS being added to every HTML page (that would be simpler) or an extra CSS file being generated?
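For illustration, the per-page CSS idea can be sketched as a small generator that emits one flip rule per case. This is not what any patch in this report emits: the helper names are hypothetical, the class names xflip/yflip/xyflip are borrowed from the file-name tags suggested earlier, and the use of the `transform: scale()` property with a `-webkit-` prefix is an assumption about which browsers matter.

```python
def flip_css_rule(class_name, flip_x, flip_y):
    """Build one CSS rule that mirrors an element via a negative scale."""
    sx = -1 if flip_x else 1
    sy = -1 if flip_y else 1
    value = f"scale({sx}, {sy})"
    # Prefixed and unprefixed forms; other vendor prefixes are omitted.
    return f".{class_name} {{ -webkit-transform: {value}; transform: {value}; }}"


def page_style_block():
    """Return a <style> element meant to be embedded in each HTML page."""
    rules = [
        flip_css_rule("xflip", True, False),
        flip_css_rule("yflip", False, True),
        flip_css_rule("xyflip", True, True),
    ]
    return '<style type="text/css">\n' + "\n".join(rules) + "\n</style>"


print(page_style_block())
```

An image would then be marked up as, say, `<img class="yflip" src="left-1_1.jpg"/>`, which keeps the per-image markup short while the heavier rules live once per page.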
> CSS can be nice. But keep in mind that most users of pdftohtml in fact use
> it via calibre.

Do you have facts for that other than your belief?

> Albert, would you prefer CSS being added to every HTML page (that would be
> simpler) or an extra CSS file generated?

I think that the CSS on each HTML file is fine.
Created attachment 58347
flip images using CSS

Flipping of images in HTML output using CSS. Applies on top of git master.

P.S.
> Do you have facts for that other than your belief?

I make it up as I go ;)

Seriously though, I have seen several server-side solutions, and they use some sort of patched-up pdf2html or pdf2xml based on Xpdf (and feed the output into their own converters). Googling pdf2(whatever) also pretty much never shows poppler. calibre was the first thing I have ever seen (and in its source code at that) using poppler for PDF conversion. Conjecture at best, but the best one I've got.

Overall pdftohtml, even with the -xml switch, still lags in many areas behind the alternatives, which offer e.g. cropping support and more detailed output, better suited for post-processing. And with the rise of e-books, post-processing is a must: paragraph detection at the very least, scaling down images to save space, detecting/adding chapters, removing page numbers, and so on.
Pushed
Created attachment 133760
The file for future reference