Bug 56843

Summary: [PATCH] Feature : extract media in html format
Product: poppler Reporter: Allemand Corentin <corentin.allemand>
Component: pdftohtmlAssignee: poppler-bugs <poppler-bugs>
Status: RESOLVED MOVED QA Contact:
Severity: normal    
Priority: medium    
Version: unspecified   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments: Patch for extract media
Extract AnnotRichMedia
extract media
Extract Pictures
Extract RichMedia
Extract RichMedia
Extract RichMedia
Patch for extract media
Patch for extract media

Description Allemand Corentin 2012-11-07 16:15:27 UTC
Created attachment 69663 [details] [review]
Patch for extract media

I added an argument on pdftohtml to extract the images separately.
This argument also extracts videos and sounds from RichMedia
Comment 1 Albert Astals Cid 2012-11-11 23:50:27 UTC
Hi, thanks for the patches, i'm confused though, your "patch extract AnnotRichMedia.diff" contains patches by Fabio (not done by you)

Can you please clean them up?

Also can you please attach the patches in text form, not inside a zip?
Comment 2 Allemand Corentin 2012-11-13 08:43:03 UTC
Created attachment 69989 [details] [review]
Extract AnnotRichMedia
Comment 3 Allemand Corentin 2012-11-13 08:44:38 UTC
Created attachment 69990 [details] [review]
extract media
Comment 4 Allemand Corentin 2012-11-19 08:58:17 UTC
Created attachment 70245 [details] [review]
Extract Pictures
Comment 5 Allemand Corentin 2012-11-19 08:58:52 UTC
Created attachment 70246 [details] [review]
Extract RichMedia
Comment 6 Allemand Corentin 2012-11-19 08:59:23 UTC
Sorry, it's a first time I create a patch with git
Comment 7 Albert Astals Cid 2012-11-19 22:24:33 UTC
First of all thanks for the contribution, don't let the comments make you think your contribution is unwanted, it's just that we try to get the better code possible for all the features.

The patches still seem a big mixed https://bugs.freedesktop.org/attachment.cgi?id=70245 and https://bugs.freedesktop.org/attachment.cgi?id=70246 both seem to have the same first "patch" inside.

Anyway, could you explain in text what the patch does? I can read the code but i'd like something longer than the first comment, something like, it saves this there and bla bla.

Also it would be cool if you could send the patches with the least number of whitespace changes, it takes a while to have a look at 
-    HtmlImage(GooString *_fName, GfxState *state)
+  HtmlImage(GooString *_fName, GfxState *state)
or
-      Object obj1;
-      // valid 'F' key -> external file
-      kind = soundExternal;
-      if (getFileSpecNameForPlatform (&tmp, &obj1)) {
-        fileName = obj1.getString()->copy();
-        obj1.free();
-      }
+    	Object obj1;
+    	// valid 'F' key -> external file
+    	kind = soundExternal;
+    	if (getFileSpecNameForPlatform (&tmp, &obj1)) {
+    	  fileName = obj1.getString()->copy();
+    	  obj1.free();
+    	}
and realize the only thing that has changed is the leading space, so it helps making the review faster if these kind of change is not present

Also can you explain the code change in poppler/FileSpec.cc?

In AnnotRichMediaInstance:
 * you have the name variable that is never used
 * type and filespec are not set to NULL if the objects are not present
 * You want to strdup or create a goostring from type since once the Object runs out of references you'll have a pointer pointing to free'd memory
 * Please make subtype be an enum and not an int

In AnnotRichMediaConfiguration:
 * You never free the instances list
 * Same thing for the subtype enum thing
 * Same thing for the strduping of char* you get from a getName

Please fix these mistakes and post the new patches and i'll continue with the review.
Comment 8 Allemand Corentin 2012-11-20 14:04:45 UTC



(In reply to comment #7)
> First of all thanks for the contribution, don't let the comments make you
> think your contribution is unwanted, it's just that we try to get the better
> code possible for all the features.
> 
> The patches still seem a big mixed
> https://bugs.freedesktop.org/attachment.cgi?id=70245 and
> https://bugs.freedesktop.org/attachment.cgi?id=70246 both seem to have the
> same first "patch" inside.
> 
> Anyway, could you explain in text what the patch does? I can read the code
> but i'd like something longer than the first comment, something like, it
> saves this there and bla bla.


Changes in this patch are designed to transform PDF to HTML but keeping the structure of the PDF to find it in the HTML.
In the end, I have all gathered in the same patch

 
> Also it would be cool if you could send the patches with the least number of
> whitespace changes, it takes a while to have a look at 
> -    HtmlImage(GooString *_fName, GfxState *state)
> +  HtmlImage(GooString *_fName, GfxState *state)
> or
> -      Object obj1;
> -      // valid 'F' key -> external file
> -      kind = soundExternal;
> -      if (getFileSpecNameForPlatform (&tmp, &obj1)) {
> -        fileName = obj1.getString()->copy();
> -        obj1.free();
> -      }
> +    	Object obj1;
> +    	// valid 'F' key -> external file
> +    	kind = soundExternal;
> +    	if (getFileSpecNameForPlatform (&tmp, &obj1)) {
> +    	  fileName = obj1.getString()->copy();
> +    	  obj1.free();
> +    	}
> and realize the only thing that has changed is the leading space, so it
> helps making the review faster if these kind of change is not present




> Also can you explain the code change in poppler/FileSpec.cc?

I made ​​the change in the file "FileSpec.cc" because of all the test files I used the old method to read the filename does not work.
Looking at the struture of PDF, I noticed that the tag 'F' was still present.


> In AnnotRichMediaInstance:
>  * you have the name variable that is never used

Clean

>  * type and filespec are not set to NULL if the objects are not present

Change Done

>  * You want to strdup or create a goostring from type since once the Object
> runs out of references you'll have a pointer pointing to free'd memory

Change Done

>  * Please make subtype be an enum and not an int

Change Done

> 
> In AnnotRichMediaConfiguration:
>  * You never free the instances list

Change Done

>  * Same thing for the subtype enum thing

Change Done

>  * Same thing for the strduping of char* you get from a getName

Change Done

> 
> Please fix these mistakes and post the new patches and i'll continue with
> the review.
Comment 9 Allemand Corentin 2012-11-20 14:05:37 UTC
Created attachment 70311 [details] [review]
Extract RichMedia
Comment 10 Allemand Corentin 2012-11-20 14:06:02 UTC
Created attachment 70312 [details] [review]
Extract RichMedia
Comment 11 Albert Astals Cid 2012-11-21 23:05:07 UTC
(In reply to comment #8)
> > Also can you explain the code change in poppler/FileSpec.cc?
> 
> I made ​​the change in the file "FileSpec.cc" because of all the test files
> I used the old method to read the filename does not work.
> Looking at the struture of PDF, I noticed that the tag 'F' was still present.

Why it wouldn't work, it first checks for UF and if it's not there uses F, besides the specification clearly documents that F is deprecated and that UF should be used if both are present

Also i'm going to guess this is one of your first open source contributions and you are making the mistake of dumping everything in the same patch, i.e. 
   [PATCH 6/7] Remove "
does not have anything to do with "Feature : extract media in html format" and you should submit it separately because dumping everything in the same patch makes it harder to review.


-  if (ignore||(complexMode && !xml)) {
+  if (ignore|| !(complexMode && withMedia)) {

Why the !xml was removed?

delete assets; is not enough, check how other goolists are deleted

I'm also a bit confused with 

+  type = richMediaContentType;
   configurations = new GooList();
   assets = new GooList();
   //Type name (Optional; ExtensionLevel 3)
   if (dict->lookup("Type", &obj1)->isName()) {
-    type = obj1.getName();
+    const char *name = obj1.getName();
+    if(!strcmp("RichMediaContent", name)){
+      type = richMediaContentType;
+    }
   }

if the type will always be richMediaContentType, why bother asking the dict?

Also it would be cool if you could squash patch 7 into patch 3, this would make reviewing easier
Comment 12 Allemand Corentin 2012-12-06 15:54:50 UTC
Created attachment 71088 [details] [review]
Patch for extract media
Comment 13 Allemand Corentin 2012-12-06 15:57:14 UTC
Hello,

I fixed all the problems you found. The only problem is at the reading of the file name. When I use it on my PDFs, it is incappable to extract file names with variables "UF".

Is what I can provide you with a PDF that shows you the problem because I do not know what I'm doing wrong.

Thank you in advance
Comment 14 Albert Astals Cid 2012-12-08 17:53:13 UTC
Yes please, if you have a pdf which the code reading the UF field is failing please create a new bug and attach it there.
Comment 15 Albert Astals Cid 2012-12-08 20:04:49 UTC
+  Object obj1;
+  obj1.free();

You don't need to free the object just after creating it (it is not *wrong* per se but it's a bit weird)

AnnotRichMedia::getConfigurations() returns a new GooList everytime but HtmlOutputDev::doProcessRichMedia does not delete the list (i'd say you don't need to delete the contents, but better check it). Also it'd be good if in the header file you mention that the caller of AnnotRichMedia::getConfigurations is the one responsible for deleting the GooList but not the contents of it (in case i'm right)

//ok = gFalse;
Seems like it'd be a good idea to set that?

You have some commented out code in HtmlOutputDev::endPage, maybe remove it?

Besides that is looking quite good :-)
Comment 16 Allemand Corentin 2012-12-09 11:09:54 UTC
Created attachment 71228 [details] [review]
Patch for extract media

(In reply to comment #15)
> +  Object obj1;
> +  obj1.free();
> 
> You don't need to free the object just after creating it (it is not *wrong*
> per se but it's a bit weird)

It's just an oversight after you have cleaned the code

> 
> AnnotRichMedia::getConfigurations() returns a new GooList everytime but
> HtmlOutputDev::doProcessRichMedia does not delete the list (i'd say you
> don't need to delete the contents, but better check it). Also it'd be good
> if in the header file you mention that the caller of
> AnnotRichMedia::getConfigurations is the one responsible for deleting the
> GooList but not the contents of it (in case i'm right)
> 
> //ok = gFalse;
> Seems like it'd be a good idea to set that?

It's just an oversight after you have cleaned the code

> 
> You have some commented out code in HtmlOutputDev::endPage, maybe remove it?

It's just an oversight after you have cleaned the code

> Besides that is looking quite good :-)

Thank you for the time to incorporate changes
Comment 17 Allemand Corentin 2012-12-09 12:14:37 UTC
For problems filename, the problem appears when reading the file names of "rich media". I'll post a new bug featuer when this will be integrated. I think this is the best thing to do.
Comment 18 Albert Astals Cid 2012-12-09 21:48:22 UTC
You have a few TODO in the utils/ folder, do you want us to integrate this with the TODOs or are you planning on working on it "now"?
Comment 19 Allemand Corentin 2012-12-10 08:06:52 UTC
Hello,

It would be preferable for the integrated in your TODO list. I will work on these TODO in a few weeks but for now I do not have time to do so.
Comment 20 Albert Astals Cid 2012-12-12 21:40:02 UTC
Ok, we are in feature freeze (i.e. no more new features) for next version that will be released on December 29, i'll come back to you after that date, we are trying to focus on fixing small bugs until that date.
Comment 21 Albert Astals Cid 2013-01-22 19:14:50 UTC
Silly question, where are these "RichMediaContent" defined? I looked at the PDF 1.7 spec and at the ISO_PDF32000_2008.pdf spec and could not find them
Comment 22 Albert Astals Cid 2013-02-18 22:59:48 UTC
Ping
Comment 23 GitLab Migration User 2018-08-21 11:03:07 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/482.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.