Created attachment 137691 [details]
I noticed this while verifying that https://bugzilla.gnome.org/show_bug.cgi?id=757876 has been fixed in cairo. The 'missing spaces' problem is fixed, but I see a new issue now.
If I run `pdftocairo -pdf` on the attached PDF file, images in the output PDF show reversed black and white. I traced it to this commit in cairo:
commit b207a932a2d3740984319dffd58a0791580597cd (HEAD, refs/bisect/bad)
Author: Peter TB Brett <email@example.com>
Date: Fri Sep 9 22:35:55 2016 +0930
Correctly decode Adobe CMYK JPEGs in PDF export
Adobe PhotoShop generates CMYK JPEG files with inverted CMYK. When a
JPEG file with this format is included in a PDF file, a `/Decode`
array must be included to convert to "normal" CMYK.
These JPEG files can be detected via the presence of the APP14 "Adobe"
marker. However, PDF viewers are not required to detect and handle
this private marker, so it must be detected and handled (by adding a
`/Decode`) by the PDF generator.
Signed-Off-By: Peter TB Brett <firstname.lastname@example.org>
Created attachment 137692 [details]
output.pdf - generated by `pdftocairo -pdf`
File generated by running `pdftocairo -pdf doc_with_whitespace.pdf output.pdf`
Created attachment 137693 [details]
rendering of input pdf
generated by running `pdftocairo -png -singlefile doc_with_whitespace.pdf rendering-of-input-pdf`
Created attachment 137694 [details]
rendering of output pdf
generated by running `pdftocairo -png -singlefile output.pdf rendering-of-output-pdf` on file created by `pdftocairo -pdf`.
Just to link to the history, this came from bug: https://bugs.freedesktop.org/show_bug.cgi?id=97612
I tried deleting the line "/Decode [ 1 0 1 0 1 0 1 0 ]" from the pdf, which that patch added, and confirm that without it the image colors are no longer inverted.
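For anyone following along, here is a minimal sketch of what that `/Decode` array does to 8-bit CMYK samples, assuming the usual PDF rule that a sample `x` is mapped linearly from `[0, 2^n - 1]` onto `[Dmin, Dmax]`. The function name `apply_decode` is made up for illustration and is not cairo or poppler code.

```python
def apply_decode(samples, decode, bits_per_component=8):
    """Map raw integer samples to color values using a PDF-style /Decode array.

    decode holds one (Dmin, Dmax) pair per component, flattened.
    """
    max_sample = (1 << bits_per_component) - 1
    out = []
    for i, x in enumerate(samples):
        d_min, d_max = decode[2 * i], decode[2 * i + 1]
        # Linear interpolation: 0 -> Dmin, max_sample -> Dmax
        out.append(d_min + x * (d_max - d_min) / max_sample)
    return out


DEFAULT_CMYK = [0, 1, 0, 1, 0, 1, 0, 1]   # what a CMYK image gets with no /Decode
INVERTING = [1, 0, 1, 0, 1, 0, 1, 0]      # the array added by the patch

raw_white = (0, 0, 0, 0)                  # no ink on any channel
print(apply_decode(raw_white, DEFAULT_CMYK))  # [0.0, 0.0, 0.0, 0.0]
print(apply_decode(raw_white, INVERTING))     # [1.0, 1.0, 1.0, 1.0]
```

With the inverting array, every "no ink" sample becomes "full ink", which is consistent with the white background turning black.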
Here is a Stack Exchange discussion of problems with Adobe CMYK JPEGs, where the "Decode" fix is suggested: https://graphicdesign.stackexchange.com/a/15906
I don't yet understand well what is going on. It seems possible to me that there is some kind of double inversion going on: maybe cairo adds the Decode line to invert the JPEGs, but otherwise leaves the metadata alone. Then when you view the generated PDF again using cairo, it inverts twice: once because it detects an Adobe JPEG, and again because it finds the new Decode line. Looking inside the generated PDF, I can confirm that the JPEG still has the "Adobe" header and that there is an additional Decode line.
Also, for me this is a fairly serious bug: it has affected about half of the PDFs I have tried to print over the last couple of months, since I upgraded cairo. It makes evince/poppler effectively unusable for printing PDFs.
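In case it helps others check their own files, here is a rough sketch of how the "Adobe" header can be found in a raw JPEG stream once it has been copied out of the PDF. It assumes the usual APP14 layout (marker `0xFFEE`, the ASCII tag "Adobe", then version, two flag words, and a one-byte transform code where 0 means unknown/CMYK, 1 means YCbCr and 2 means YCCK). The function name `find_adobe_app14` is hypothetical, and the parser only walks the header segments before the scan data, so it is not robust against unusual files.

```python
import struct


def find_adobe_app14(jpeg_bytes):
    """Return the Adobe APP14 transform byte, or None if no Adobe marker is found."""
    if jpeg_bytes[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            break  # lost sync; give up rather than guess
        marker = jpeg_bytes[pos + 1]
        if marker == 0xFF:
            pos += 1  # skip fill byte
            continue
        if marker in (0xDA, 0xD9):
            break  # start of scan or end of image: headers are over
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        payload = jpeg_bytes[pos + 4:pos + 2 + length]
        if marker == 0xEE and payload[:5] == b"Adobe" and len(payload) >= 12:
            return payload[11]  # transform code is the last byte of the segment
        pos += 2 + length
    return None


# Tiny synthetic stream: SOI, an Adobe APP14 segment with transform=2, EOI.
demo = (b"\xff\xd8" + b"\xff\xee" + struct.pack(">H", 14) + b"Adobe"
        + b"\x00\x64" + b"\x00\x00" + b"\x00\x00" + b"\x02" + b"\xff\xd9")
print(find_adobe_app14(demo))  # 2
```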
(In reply to Jason Crain from comment #0)
> If I run `pdftocairo -pdf` on the attached PDF file, images in the output
> PDF show reversed black and white. I traced it to this commit in cairo:
> commit b207a932a2d3740984319dffd58a0791580597cd (HEAD, refs/bisect/bad)
> Author: Peter TB Brett <email@example.com>
> Date: Fri Sep 9 22:35:55 2016 +0930
> Correctly decode Adobe CMYK JPEGs in PDF export
Adrian, as you committed this change (and Peter does not seem to have a Bugzilla account), could you please revert the change? If it’s not that easy, please tell us what the correct workflow is to get this change reverted.
Maybe a solution to this would be to convert the color data with a Type 4 function from YCCK to the expected color space in the PDF and erase the flag from the JPEG stream (replacing "Adobe" by " ").
What happens if you leave the "Decode" in there, but remove the "Adobe"? That might be the right solution.
The "Adobe" tag seems to have no effect.
I tried replacing "Adobe" by " ", but leaving "Decode...". The image is inverted either way. Conversely, removing the "Decode" gives me the correct image whether or not "Adobe" is present.
I'd have to debug some more to figure out why the "Adobe" tag is apparently ignored. A grep through the cairo source code shows the variable "is_adobe_jpeg" is only used in a single place, when deciding whether to add the "Decode" line, and is not used anywhere else.
At this point it seems like the addition of "Decode" was just not right. I'm curious now what broken PDFs the addition of "Decode" was supposed to fix, given that cairo seems to ignore the Adobe header.
It seems quite possible there were some Adobe images that needed this Decode line. However, the test to see if it is one of these images is wrong, and is identifying other Adobe output that does not need it.
Somebody needs to find one of these actual broken images and figure out a more accurate test to identify them (i.e. the tag has to be "Adobe" but some other test also needs to be true).
If nobody can find one of these images, then I think the patch that adds the Decode should be removed.
I looked at the cairo/poppler/libjpeg code and I think I see why the "adobe" tag has no effect.
In poppler, in init() in DCTStream.cc, the colorXform is chosen as follows: if the Adobe header is present, it uses the transform specified in the header, which for these JPEGs is the CMYK transform. If the header is *not* present and there are 4 color components (as in these JPEGs), it also uses the CMYK transform. So for our JPEG the CMYK transform is used whether or not the Adobe header is present. After that point, the Adobe header seems to be ignored, and poppler treats Adobe CMYK and non-Adobe CMYK identically.
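To make the argument concrete, here is a small sketch of the decision as I read it. This is a paraphrase of the description above, not poppler's actual code: the function name, the parameter names and the string labels are all made up for illustration.

```python
def choose_color_transform(num_components, adobe_transform=None):
    """Pick a color transform the way the description above reads.

    adobe_transform is None when the JPEG has no Adobe header;
    otherwise it is whatever transform the header specifies.
    """
    if adobe_transform is not None:
        return adobe_transform      # header present: trust its transform
    if num_components == 4:
        return "cmyk"               # no header, 4 components: CMYK transform
    return "other"                  # remaining cases are not relevant here


# Our JPEGs: 4 components, and an Adobe header that specifies the CMYK transform.
with_header = choose_color_transform(4, adobe_transform="cmyk")
without_header = choose_color_transform(4)
print(with_header == without_header)  # True
```

Under that reading, blanking out the "Adobe" tag cannot change the result for these files, which matches what was observed when the tag was replaced by spaces.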
So possibly there is a poppler bug involved here. It seems like poppler inverts all CMYK images.
So, I've figured out that almost all image viewers invert all CMYK JPEGs, regardless of the Adobe header. This includes eog/GDK/ImageMagick/Pillow/GIMP. The exception is poppler, which does not invert. This is the opposite of what I expected. Here's how I figured it out:
I started with one of the cmyk jpegs we are discussing which appears with a white background when inside a PDF without the "Decode" line. When I copy the raw jpeg data from the pdf to a file and view it with eog/gimp/Pillow, it is inverted and has a black background.
Next I looked at the raw output buffer of libjpeg, which all the programs use. The libjpeg doc has a big warning that it absolutely never inverts cmyk, and it is up to the user to invert adobe cmyk jpegs. The raw libjpeg output for my image has values cmyk 0/0/0/0 which is plain white. So, surprisingly, my raw jpeg actually has non-inverted colors! However all the image viewers show it as black, so they must be inverting.
I was able to understand the GDK-pixbuf (used by eog) and Pillow (Python Image Library) code. It is clear in both cases that all CMYK jpegs are interpreted as "CMYK;I", that is CMYK inverted. GDK even has a comment in the code: /* We now assume that all CMYK JPEG files use inverted CMYK, as Photoshop does See https://bugzilla.gnome.org/show_bug.cgi?id=618096 */.
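Here is a toy illustration of why the same raw samples look white in one place and black in another. It uses a naive CMYK to RGB formula with no color management, which is an assumption for illustration only and not what GDK-pixbuf, Pillow or poppler actually compute; the function names are made up.

```python
def cmyk_to_rgb(c, m, y, k):
    """Naive 8-bit CMYK -> RGB, treating 0 as 'no ink' and 255 as 'full ink'."""
    r = (255 - c) * (255 - k) // 255
    g = (255 - m) * (255 - k) // 255
    b = (255 - y) * (255 - k) // 255
    return (r, g, b)


def invert(samples):
    """What a 'CMYK;I' interpretation does first: flip every sample."""
    return tuple(255 - v for v in samples)


raw = (0, 0, 0, 0)                    # raw decoder output for the background
print(cmyk_to_rgb(*raw))              # (255, 255, 255): white, no inversion
print(cmyk_to_rgb(*invert(raw)))      # (0, 0, 0): black, inverted reading
```

The first line corresponds to reading the samples without inversion, the second to a viewer that assumes every CMYK JPEG is inverted.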
Poppler is the odd one out because it does not invert, apparently. In other words cmyk jpegs embedded in pdfs have no inversion. When you extract them (eg using pdfimages), you obtain an uninverted jpeg, but when you view it in any jpeg viewer I have tried, the viewer inverts it! It is still mysterious to me why cmyk jpegs only in pdfs are uninverted, but for whatever reason, they are.
I think the lesson of all of this is: We should never invert based on the adobe header, since all other imaging programs I looked at ignore it (for the purpose of inversion, at least). Significantly this includes poppler. I think we should revert the added Decode line.
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/cairo/cairo/issues/156.