Bug 92449 - pdftohtml ignores the png format option and extracts inverted jpg images
Summary: pdftohtml ignores the png format option and extracts inverted jpg images
Status: RESOLVED MOVED
Alias: None
Product: poppler
Classification: Unclassified
Component: pdftohtml
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: poppler-bugs
 
Reported: 2015-10-13 16:07 UTC by c1tru55
Modified: 2018-08-20 22:00 UTC

Description c1tru55 2015-10-13 16:07:25 UTC
Hi all,

I use pdftohtml 0.37.0 on Ubuntu.

When I run
pdftohtml -xml -fmt png
some images are extracted as .jpg (all with inverted colors) and some as .png (all with normal colors).

When I run
pdfimages -all test.pdf test
I get the same result (inverted .jpg and normal .png images).

But when I run
pdfimages -png test.pdf test
I get only .png images, and all of them have normal colors.
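
For anyone scripting around this in the meantime, the last command above can be wrapped as follows. This is only a minimal sketch of the workaround the report implies, not a fix for `pdftohtml`: the helper names `pdfimages_png_command` and `extract_all_as_png` are made up for illustration, and it assumes `pdfimages` is on the PATH.

```python
import subprocess


def pdfimages_png_command(pdf_path, prefix):
    """Build the argument list for `pdfimages -png <pdf> <prefix>`.

    Hypothetical helper; the flag and argument order are the ones used
    in the report above.
    """
    return ["pdfimages", "-png", pdf_path, prefix]


def extract_all_as_png(pdf_path, prefix):
    """Run pdfimages so that every image is written as PNG.

    Raises subprocess.CalledProcessError if pdfimages exits with an error.
    """
    subprocess.run(pdfimages_png_command(pdf_path, prefix), check=True)
```

Calling `extract_all_as_png("test.pdf", "test")` is equivalent to the third command in the description; the HTML/XML output would still have to come from a separate `pdftohtml` run.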

Questions:
1. Is it possible to convert a PDF to HTML/XML with pdftohtml and export all images as .png, or at least get non-inverted .jpg images? Right now I have to run two different commands on the same PDF page to get a correct result. It seems that the `-fmt` option doesn't work.
2. If `pdfimages -all test.pdf test` extracts the first image as .jpg and the second as .png, does that mean the first image is actually stored in JPEG format in the PDF, and likewise for the second image?
3. Is it OK that an image exported via `pdftohtml -xml` has one resolution (width and height) as a file but a different one in the generated XML? For example, the file has width=145, height=145, but the XML says width=105, height=105.

PS: I can attach the PDF file if needed.

Thanks in advance,
Comment 1 Francisco 2016-10-10 17:03:12 UTC
I've stumbled upon the same bug with a recent version (September 2016) compiled from git master (poppler 0.47).

pdfimages -j test.pdf out
produces an inverted grayscale JPEG.

pdfimages -png test.pdf out
produces a normal grayscale image.

pdfimages -list test.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2817  1981  sep     1   8  jpeg   no        16  0   343   170  112K 2.1%


I've tested several PDFs (sorry, I can't share them) and found that only JPEG images with the colorspace 'sep' (csSeparation) produce an inverted grayscale image.
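
To check other files for the same combination, the `pdfimages -list` table can be scanned for rows whose `color` column is `sep` and whose `enc` column is `jpeg`. The sketch below is illustrative only: the function names are hypothetical, and it assumes the whitespace-separated column layout shown in the listing above (where `object ID` occupies two columns).

```python
def parse_pdfimages_list(text):
    """Parse the table printed by `pdfimages -list` into a list of dicts.

    Assumes the first non-empty line is the header and that every data
    row has one whitespace-separated field per header token. Rows that
    do not match (such as the dashed separator) are skipped.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != len(header):
            continue  # separator line or unexpected layout
        rows.append(dict(zip(header, fields)))
    return rows


def suspect_images(rows):
    """Return rows matching the pattern reported above: JPEG data in a
    Separation ('sep') colorspace."""
    return [r for r in rows if r["color"] == "sep" and r["enc"] == "jpeg"]


# The listing from this comment, used as sample input.
SAMPLE = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2817  1981  sep     1   8  jpeg   no        16  0   343   170  112K 2.1%
"""

if __name__ == "__main__":
    for row in suspect_images(parse_pdfimages_list(SAMPLE)):
        print(row["page"], row["num"], row["width"], row["height"])
```

Images flagged this way are candidates for extraction with `pdfimages -png` instead of `-j` or `-all`, since the PNG path produced normal colors in the tests described here.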
Comment 2 GitLab Migration User 2018-08-20 22:00:08 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/151.

