Bug 99883 - pdfimages extracts lots of same images with the same object number.
Status: RESOLVED MOVED
Alias: None
Product: poppler
Classification: Unclassified
Component: utils
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
Reported: 2017-02-21 08:27 UTC by 石印
Modified: 2018-08-21 11:17 UTC

Attachments
problem file (7.71 MB, application/pdf)
2017-02-21 08:27 UTC, 石印
The patch that fix poppler utils pdfimages extract too many same pictures (1.47 KB, patch)
2017-03-01 01:35 UTC, 石印

Description 石印 2017-02-21 08:27:19 UTC
Created attachment 129787
problem file

I have a PDF file for which pdfimages lists a large number of images that share the same object number. These images are identical. There are only about a thousand pictures with different object numbers, but pdfimages lists more than 256,000 items. pdfimages then extracts every listed picture, so most of the extracted files are duplicates and their total size is huge. I have uploaded the PDF, and my simple patch is below (it may not be good, but it works :D).

From 237f4e0887eff2f22d5542dfed33fa94a8c7b0ff Mon Sep 17 00:00:00 2001
From: Ryan <ryanorz@126.com>
Date: Tue, 21 Feb 2017 16:11:53 +0800
Subject: [PATCH] Fix(poppler-utils): pdfimages extract too many same pictures
 with the same object number.

---
 utils/ImageOutputDev.cc | 8 ++++++++
 utils/ImageOutputDev.h  | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/utils/ImageOutputDev.cc b/utils/ImageOutputDev.cc
index 5de51ad..26bf95b 100644
--- a/utils/ImageOutputDev.cc
+++ b/utils/ImageOutputDev.cc
@@ -442,6 +442,14 @@ void ImageOutputDev::writeImageFile(ImgWriter *writer, ImageFormat format, const
 void ImageOutputDev::writeImage(GfxState *state, Object *ref, Stream *str,
                                int width, int height,
                                GfxImageColorMap *colorMap, GBool inlineImg) {
+  if (ref->isRef()) {
+    const Ref imageRef = ref->getRef();
+    if (refNums.find(imageRef.num) != refNums.end())
+      return;
+    else
+      refNums.insert(imageRef.num);
+  }
+
   ImageFormat format;
 
   if (dumpJPEG && str->getKind() == strDCT &&
diff --git a/utils/ImageOutputDev.h b/utils/ImageOutputDev.h
index a694bbc..89c67ac 100644
--- a/utils/ImageOutputDev.h
+++ b/utils/ImageOutputDev.h
@@ -35,6 +35,7 @@
 #endif
 
 #include <stdio.h>
+#include <set>
 #include "goo/gtypes.h"
 #include "goo/ImgWriter.h"
 #include "OutputDev.h"
@@ -173,6 +174,7 @@ private:
   int pageNum;                 // current page number
   int imgNum;                  // current image number
   GBool ok;                    // set up ok?
+  std::set<int> refNums;
 };
 
 #endif
-- 
2.10.2
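
The idea behind the patch can be tried outside poppler. The following is a minimal sketch, not poppler code: ImageRef is a hypothetical stand-in for poppler's Ref, and SeenImages and countUniqueImages are names made up for this illustration. It relies only on std::set::insert, which reports through the .second member of its return value whether the element was newly inserted.

```cpp
#include <set>
#include <vector>

// Hypothetical stand-in for poppler's Ref; only the fields used here.
struct ImageRef {
  int num;  // object number
  int gen;  // generation number (ignored below, as in the patch)
};

// Remembers which object numbers have already been offered.
class SeenImages {
public:
  // Returns true the first time an object number is offered,
  // false on every later repeat.
  bool markFirstUse(const ImageRef &ref) {
    // insert() returns {iterator, bool}; the bool is true only
    // when the value was not in the set before.
    return seen.insert(ref.num).second;
  }

private:
  std::set<int> seen;
};

// Counts how many images would be written if repeats are skipped.
int countUniqueImages(const std::vector<ImageRef> &draws) {
  SeenImages seen;
  int written = 0;
  for (const ImageRef &ref : draws) {
    if (seen.markFirstUse(ref)) {
      ++written;
    }
  }
  return written;
}
```

Using the result of insert() folds the find-then-insert pair from the patch into a single lookup. Like the patch, this sketch keys on the object number alone.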
Comment 1 Albert Astals Cid 2017-02-21 22:45:18 UTC
What real name should we use for copyright attribution? Is it "Shi Yin"?
Comment 2 石印 2017-02-22 01:44:17 UTC
(In reply to Albert Astals Cid from comment #1)
> What real name should we use for copyright attribution? Is it "Shi Yin"?

Either "Shi Yin" or "石印" is OK.
Comment 3 Albert Astals Cid 2017-02-28 22:36:36 UTC
Please attach the patch as a file instead of pasting it as text. Otherwise it's a pain to integrate: I basically have to type things by hand, and I'm not going to do that.
Comment 4 石印 2017-03-01 01:35:57 UTC
Created attachment 129990
The patch that fix poppler utils pdfimages extract too many same pictures
Comment 5 石印 2017-03-01 01:39:53 UTC
(In reply to Albert Astals Cid from comment #3)
> Please attach the patch as a file instead of pasting it as text. Otherwise
> it's a pain to integrate: I basically have to type things by hand, and I'm
> not going to do that.

Sorry for the inconvenience. When I reported the bug, the bug platform only permitted me to upload one file, so I pasted the patch content into my comment. I have now uploaded the patch.
Comment 6 Albert Astals Cid 2017-03-03 21:54:27 UTC
Now that I look at the patch, I don't think it's good enough.

For example, the colorMap parameter is used in some of the branches and affects the file that is written, but with your code the image would not be written again. I don't think that's correct.
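
One way to picture this objection: if the written file can depend on more than the object number, the deduplication key would need to carry that extra state too. The following is only a sketch under that assumption, not a proposed poppler change. ImageKey, SeenRenderings, and the colorSignature string are hypothetical names invented here; how such a signature would be derived from a GfxImageColorMap is left open.

```cpp
#include <set>
#include <string>
#include <tuple>

// Hypothetical composite key: the object reference plus a summary of
// whatever else influences the written file.
struct ImageKey {
  int num;                     // object number
  int gen;                     // generation number
  std::string colorSignature;  // assumed summary of the color map

  // Lexicographic ordering so the key can be stored in std::set.
  bool operator<(const ImageKey &other) const {
    return std::tie(num, gen, colorSignature) <
           std::tie(other.num, other.gen, other.colorSignature);
  }
};

// Remembers which (object, color map) combinations were already seen.
class SeenRenderings {
public:
  // Returns true only the first time a given combination is offered.
  bool markFirstUse(const ImageKey &key) {
    return seen.insert(key).second;
  }

private:
  std::set<ImageKey> seen;
};
```

With a key like this, the same object drawn with two different color signatures counts as two outputs, while exact repeats are still skipped.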
Comment 7 GitLab Migration User 2018-08-21 11:17:02 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/600.