Bug 25625 - pdfimages does not extract inline jpeg images as jpeg
Summary: pdfimages does not extract inline jpeg images as jpeg
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
Reported: 2009-12-13 15:11 UTC by Laur
Modified: 2017-08-16 21:35 UTC

Attachments
example pdf (15.30 KB, application/pdf)
2009-12-14 17:06 UTC, Laur
extra inline images (10.23 KB, patch)
2013-08-29 11:38 UTC, Adrian Johnson
extract inline images v2 (10.56 KB, patch)
2017-08-16 11:33 UTC, Adrian Johnson

Description Laur 2009-12-13 15:11:34 UTC
pdfimages does not seem to be able to extract an inline JPEG image as a JPEG; instead it outputs the image as an uncompressed PPM file. This was tested with poppler-utils 0.12.0 on Ubuntu 9.10.

How to duplicate:

Create a PDF file containing an inline JPEG. This can be done with sam2p (http://code.google.com/p/sam2p/) using the command "sam2p image.jpg image.pdf". Then try to extract the JPEG with "pdfimages -j image.pdf image". This creates image-000.ppm instead of the expected image-000.jpg.

If you instead create a PDF that stores the image as an XObject, with "sam2p -pdf:2 image.jpg image.pdf", the JPEG is extracted correctly and the md5sums of image.jpg and image-000.jpg match.

Note: this bug is present in the latest xpdf package as well.
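
The reproduction steps above can be scripted. The following is a minimal sketch, assuming sam2p and pdfimages are installed and on PATH. The helper names (build_commands, md5sum, extracted_matches) are made up for illustration; only the command lines themselves come from this report.

```python
import hashlib
import subprocess
from pathlib import Path


def build_commands(jpeg, workdir, xobject=False):
    """Return the sam2p and pdfimages command lines quoted in the report."""
    pdf = Path(workdir) / "image.pdf"
    root = Path(workdir) / "image"
    # "-pdf:2" makes sam2p store the image as an XObject instead of inline.
    sam2p = ["sam2p"] + (["-pdf:2"] if xobject else []) + [str(jpeg), str(pdf)]
    pdfimages = ["pdfimages", "-j", str(pdf), str(root)]
    return sam2p, pdfimages


def md5sum(path):
    """Hex MD5 digest of a file, used to compare input and extracted JPEG."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def extracted_matches(jpeg, workdir, xobject=False):
    """Run both tools, then check that image-000.jpg exists and matches the input."""
    for cmd in build_commands(jpeg, workdir, xobject):
        subprocess.run(cmd, check=True)
    out = Path(workdir) / "image-000.jpg"
    return out.exists() and md5sum(out) == md5sum(jpeg)
```

On an affected version, extracted_matches("image.jpg", ".") would be expected to return False for the inline case (only image-000.ppm is written) and True with xobject=True.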
Comment 1 Albert Astals Cid 2009-12-14 12:33:07 UTC
Please provide a pdf with that problem.
Comment 2 Laur 2009-12-14 17:06:47 UTC
Created attachment 32076
example pdf

Attached is a sample pdf with an inline image created with sam2p.
Comment 3 Laur 2009-12-14 17:10:01 UTC
FYI, I also reported this bug to the author of xpdf. Here is his response:

"The issue is that there is no good way to find the end of an inline image data stream without parsing the data.  So I would need code that read the JPEG data without decompressing it - just doing enough parsing to find the end of the JPEG stream.  That's probably not too hard to do, but it's not high on my priority list."

Thanks.
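
For illustration, the kind of parsing described in the quoted reply could look roughly like the sketch below. It walks the JPEG marker segments and skips the entropy-coded scan data without decoding it. This is not the code from xpdf or poppler; find_jpeg_end is a hypothetical helper, and it assumes the whole data stream is already in memory as bytes.

```python
def find_jpeg_end(data, start=0):
    """Return the offset just past the JPEG EOI marker, or -1 if none is found."""
    n = len(data)
    if data[start:start + 2] != b"\xff\xd8":      # must begin with SOI
        return -1
    pos = start + 2
    while pos + 1 < n:
        if data[pos] != 0xFF:                      # expected a marker here
            return -1
        marker = data[pos + 1]
        if marker == 0xFF:                         # fill byte before a marker
            pos += 1
            continue
        pos += 2
        if marker == 0xD9:                         # EOI: end of the JPEG
            return pos
        if marker == 0x01 or 0xD0 <= marker <= 0xD8:
            continue                               # standalone marker, no length
        if pos + 2 > n:
            return -1
        seg_len = int.from_bytes(data[pos:pos + 2], "big")
        if seg_len < 2:                            # length includes its own 2 bytes
            return -1
        pos += seg_len
        if marker == 0xDA:                         # SOS: entropy-coded data follows
            while True:
                idx = data.find(b"\xff", pos)
                if idx < 0 or idx + 1 >= n:
                    return -1
                nxt = data[idx + 1]
                if nxt == 0x00 or 0xD0 <= nxt <= 0xD7:
                    pos = idx + 2                  # stuffed 0xFF or restart marker
                elif nxt == 0xFF:
                    pos = idx + 1                  # fill byte
                else:
                    pos = idx                      # next real marker
                    break
    return -1
```

Walking the segments matters because simply searching for the bytes FF D9 can stop too early, for example inside an application segment that happens to contain them.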
Comment 4 Albert Astals Cid 2009-12-16 15:09:09 UTC
Same answer. If anyone is interested in coding a patch, start by looking at ImageOutputDev.cc. The problem is that the drawImg call is flagged as "inlineImg", so you can't know the length of the stream and need to do a bit more work. Patches are welcome, since this is not high priority for us either.
Comment 5 Adrian Johnson 2013-08-29 11:38:45 UTC
Created attachment 84828
extra inline images

Patch to extract inline images.
Comment 6 Albert Astals Cid 2013-08-29 19:54:13 UTC
The patch looks a bit complicated, but I guess there's nothing more you can actually do.

Have you run this over all the files with and without the reusableA bit set?
Comment 7 Adrian Johnson 2013-08-29 21:45:13 UTC
(In reply to comment #6)
> Have you run this over all the files with and without the reusableA bit set?

yes
Comment 8 Adrian Johnson 2017-08-16 11:33:41 UTC
Created attachment 133545
extract inline images v2

Rebased to master
Comment 9 Albert Astals Cid 2017-08-16 21:26:42 UTC
Good, I guess.
Comment 10 Adrian Johnson 2017-08-16 21:35:29 UTC
Pushed

