Bug 48270

Summary: pdftohtml converts images to garbled "solarized" jpegs
Product: poppler
Reporter: skierpage <info>
Component: pdftohtml
Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED FIXED
QA Contact:
Severity: normal
Priority: medium
CC: sdp
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
Attachments: garbled-color image produced by pdftohtml
correct image produced by pdfimages
first three pages of problem PDF
Proposed simple patch

Description skierpage 2012-04-03 16:03:11 UTC
Created attachment 59456 [details]
garbled-color image produced by pdftohtml

pdftohtml version 0.18.4 from Kubuntu 12.04 beta amd64.

I downloaded the 14MB PDF at http://www.swanyretail.com/SwanySkiCatalog_final-LO.pdf and ran it through pdftohtml with no options.
All the resulting jpegs are the right size but have garbled colors. They look like "solarized" negatives: mostly black, little color. 

To reproduce the problem,
  mkdir bugtemp
  cd bugtemp
  wget http://www.swanyretail.com/SwanySkiCatalog_final-LO.pdf
  pdftohtml -f 1 -l 3 SwanySkiCatalog_final-LO.pdf swany.html
to convert just the first three pages.  Then look at the resulting swanys.html and/or the individual jpegs.

Here's the first bad image in the original PDF:

4618 0 obj
<</Intent/RelativeColorimetric/Subtype/Image/Length 88781/Filter/DCTDecode/Name/X/BitsPerComponent 8/ColorSpace/DeviceCMYK/Width 629/Height 814/Type/XObject>>stream
ÿØÿî^@^NAdobe^@d<80>^@^@^@^BÿÛ^@<84>^@^L^H^H^H^H^H^L^H^H^L^P^K^K^K^P^T^N^M^M^N^T^X^R^S^S^S^R^X^T^R^T^T^T^T^R^T^T^[^^^^^^^[^T$''''$25552;;;;;;;;;;^A^M^

I'm guessing that perhaps pdftohtml doesn't handle ColorSpace DeviceCMYK?
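One way to check that guess outside of poppler is to read the JPEG header and look at the component count and the Adobe APP14 segment. The following is only a minimal sketch: the function name `jpeg_color_info` and the sample header are hypothetical, and the sample bytes are hand-built to mirror the APP14 bytes visible at the start of the stream quoted above plus the Width/Height from the image dictionary. It assumes the usual JPEG segment layout (two-byte marker, two-byte big-endian length) and that the last byte of the Adobe segment is the color transform flag (0 = none, 1 = YCbCr, 2 = YCCK).

```python
import struct

# Start-of-frame markers (SOF0..SOF15, minus DHT, JPG and DAC, which share the range).
SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}


def jpeg_color_info(data: bytes):
    """Return (num_components, adobe_transform) read from a JPEG header.

    adobe_transform is None when no Adobe APP14 segment is present.
    """
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    pos = 2
    components = None
    adobe_transform = None
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        if marker == 0x01 or 0xD0 <= marker <= 0xD7:  # markers without a length
            pos += 2
            continue
        if marker in (0xDA, 0xD9):  # start of scan / end of image: header is over
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segment = data[pos + 4:pos + 2 + length]
        if marker == 0xEE and segment[:5] == b"Adobe" and len(segment) >= 12:
            adobe_transform = segment[11]
        elif marker in SOF_MARKERS and len(segment) >= 6:
            components = segment[5]
        pos += 2 + length
    return components, adobe_transform


# Hand-built header for illustration only (not the real image data).
SAMPLE_HEADER = (
    b"\xff\xd8"
    b"\xff\xee\x00\x0eAdobe\x00\x64\x80\x00\x00\x00\x02"
    b"\xff\xc0\x00\x14\x08" + struct.pack(">HH", 814, 629) + b"\x04"
    + bytes(12)
    + b"\xff\xda\x00\x08"
)

print(jpeg_color_info(SAMPLE_HEADER))  # (4, 2)
```

A result of four components would be consistent with the DeviceCMYK guess. To inspect a real file, pass the bytes of one of the generated jpegs instead of `SAMPLE_HEADER`.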

pdfimages extracts these as .ppm files that preview fine in Gwenview. pdfimages' -j option does nothing.

I'll attach the bad jpeg from pdftohtml and the good ppm from pdfimages, and pages 1-3 extracted with pdfseparate/pdfunite.
Comment 1 skierpage 2012-04-03 16:05:49 UTC
Created attachment 59457 [details]
correct image produced by pdfimages
Comment 2 skierpage 2012-04-03 16:07:48 UTC
Created attachment 59458 [details]
first three pages of problem PDF

pages 1-3 of http://www.swanyretail.com/SwanySkiCatalog_final-LO.pdf in case it goes away.
Comment 3 Loïc Corbasson 2013-01-07 14:09:33 UTC
*** Bug 35026 has been marked as a duplicate of this bug. ***
Comment 4 Johannes Brandstätter 2013-07-26 14:09:55 UTC
Created attachment 83039 [details] [review]
Proposed simple patch

I tried to fix the problem and this patch resolves the issue for all my tested PDFs.
This is the first patch I have ever submitted, so I hope it is okay to upload it here.
Comment 5 Albert Astals Cid 2013-07-26 17:30:50 UTC
Attaching the patch here is fine.

Can you explain why the extra code? I mean, why we need to test that?
Comment 6 Johannes Brandstätter 2013-07-26 18:42:00 UTC
Thank you for your response Albert.
I can't really answer your question, only tell you where I got the code from.

I took the extra code from pdfimages as it was said that utility works.
You can see the same code in ImageOutputDev.cc on line 269 to 272.
I don't really know what is behind GfxImageColorMap::getNumPixelComps().
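For readers unfamiliar with that check, here is a rough sketch of the decision it appears to make. This is not the poppler code: `choose_image_path` and its return values are hypothetical names, and the rule "copy the JPEG stream verbatim only for 1 or 3 pixel components" is an assumption about what the pdfimages code tests, based on the fact that it consults `GfxImageColorMap::getNumPixelComps()`. Under that assumption, a four-component DeviceCMYK image would no longer be copied out as a raw jpeg and would be decoded and re-encoded instead.

```python
def choose_image_path(is_dct_stream: bool, num_pixel_comps: int) -> str:
    """Pick how to write an embedded image (hypothetical helper).

    Assumption: a raw DCT (JPEG) stream is only safe to copy verbatim
    when it is grayscale (1 component) or RGB-like (3 components).
    Anything else, such as 4-component CMYK, is decoded through the
    color map and re-encoded.
    """
    if is_dct_stream and num_pixel_comps in (1, 3):
        return "copy-jpeg"
    return "decode-and-reencode"


print(choose_image_path(True, 3))   # RGB JPEG
print(choose_image_path(True, 4))   # CMYK JPEG, like the image in this report
print(choose_image_path(False, 3))  # non-JPEG stream
```

The design point is that the raw copy is a shortcut that skips the PDF color handling, so it is only taken when the stream can be displayed as-is.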

I just found out that you provided the original fix for ImageOutputDev.cc (commit: 2df6d530) in January 2009.
Comment 7 Albert Astals Cid 2013-08-16 23:13:43 UTC
sigh copied code everywhere :-/

Ok, I've committed the change and it will be in 0.24.1.

FWIW I did not write the code; as the commit log says, I just brought it over from xpdf.

Thanks for the investigation!
