Created attachment 69663 [details] [review] Patch for extract media I added an argument on pdftohtml to extract the images separately. This argument also extracts videos and sounds from RichMedia
Hi, thanks for the patches, i'm confused though, your "patch extract AnnotRichMedia.diff" contains patches by Fabio (not done by you) Can you please clean them up? Also can you please attach the patches in text form, not inside a zip?
Created attachment 69989 [details] [review] Extract AnnotRichMedia
Created attachment 69990 [details] [review] extract media
Created attachment 70245 [details] [review] Extract Pictures
Created attachment 70246 [details] [review] Extract RichMedia
Sorry, it's a first time I create a patch with git
First of all thanks for the contribution, don't let the comments make you think your contribution is unwanted, it's just that we try to get the better code possible for all the features. The patches still seem a big mixed https://bugs.freedesktop.org/attachment.cgi?id=70245 and https://bugs.freedesktop.org/attachment.cgi?id=70246 both seem to have the same first "patch" inside. Anyway, could you explain in text what the patch does? I can read the code but i'd like something longer than the first comment, something like, it saves this there and bla bla. Also it would be cool if you could send the patches with the least number of whitespace changes, it takes a while to have a look at - HtmlImage(GooString *_fName, GfxState *state) + HtmlImage(GooString *_fName, GfxState *state) or - Object obj1; - // valid 'F' key -> external file - kind = soundExternal; - if (getFileSpecNameForPlatform (&tmp, &obj1)) { - fileName = obj1.getString()->copy(); - obj1.free(); - } + Object obj1; + // valid 'F' key -> external file + kind = soundExternal; + if (getFileSpecNameForPlatform (&tmp, &obj1)) { + fileName = obj1.getString()->copy(); + obj1.free(); + } and realize the only thing that has changed is the leading space, so it helps making the review faster if these kind of change is not present Also can you explain the code change in poppler/FileSpec.cc? In AnnotRichMediaInstance: * you have the name variable that is never used * type and filespec are not set to NULL if the objects are not present * You want to strdup or create a goostring from type since once the Object runs out of references you'll have a pointer pointing to free'd memory * Please make subtype be an enum and not an int In AnnotRichMediaConfiguration: * You never free the instances list * Same thing for the subtype enum thing * Same thing for the strduping of char* you get from a getName Please fix these mistakes and post the new patches and i'll continue with the review.
(In reply to comment #7) > First of all thanks for the contribution, don't let the comments make you > think your contribution is unwanted, it's just that we try to get the better > code possible for all the features. > > The patches still seem a big mixed > https://bugs.freedesktop.org/attachment.cgi?id=70245 and > https://bugs.freedesktop.org/attachment.cgi?id=70246 both seem to have the > same first "patch" inside. > > Anyway, could you explain in text what the patch does? I can read the code > but i'd like something longer than the first comment, something like, it > saves this there and bla bla. Changes in this patch are designed to transform PDF to HTML but keeping the structure of the PDF to find it in the HTML. In the end, I have all gathered in the same patch > Also it would be cool if you could send the patches with the least number of > whitespace changes, it takes a while to have a look at > - HtmlImage(GooString *_fName, GfxState *state) > + HtmlImage(GooString *_fName, GfxState *state) > or > - Object obj1; > - // valid 'F' key -> external file > - kind = soundExternal; > - if (getFileSpecNameForPlatform (&tmp, &obj1)) { > - fileName = obj1.getString()->copy(); > - obj1.free(); > - } > + Object obj1; > + // valid 'F' key -> external file > + kind = soundExternal; > + if (getFileSpecNameForPlatform (&tmp, &obj1)) { > + fileName = obj1.getString()->copy(); > + obj1.free(); > + } > and realize the only thing that has changed is the leading space, so it > helps making the review faster if these kind of change is not present > Also can you explain the code change in poppler/FileSpec.cc? I made the change in the file "FileSpec.cc" because of all the test files I used the old method to read the filename does not work. Looking at the struture of PDF, I noticed that the tag 'F' was still present. > In AnnotRichMediaInstance: > * you have the name variable that is never used Clean > * type and filespec are not set to NULL if the objects are not present Change Done > * You want to strdup or create a goostring from type since once the Object > runs out of references you'll have a pointer pointing to free'd memory Change Done > * Please make subtype be an enum and not an int Change Done > > In AnnotRichMediaConfiguration: > * You never free the instances list Change Done > * Same thing for the subtype enum thing Change Done > * Same thing for the strduping of char* you get from a getName Change Done > > Please fix these mistakes and post the new patches and i'll continue with > the review.
Created attachment 70311 [details] [review] Extract RichMedia
Created attachment 70312 [details] [review] Extract RichMedia
(In reply to comment #8) > > Also can you explain the code change in poppler/FileSpec.cc? > > I made the change in the file "FileSpec.cc" because of all the test files > I used the old method to read the filename does not work. > Looking at the struture of PDF, I noticed that the tag 'F' was still present. Why it wouldn't work, it first checks for UF and if it's not there uses F, besides the specification clearly documents that F is deprecated and that UF should be used if both are present Also i'm going to guess this is one of your first open source contributions and you are making the mistake of dumping everything in the same patch, i.e. [PATCH 6/7] Remove " does not have anything to do with "Feature : extract media in html format" and you should submit it separately because dumping everything in the same patch makes it harder to review. - if (ignore||(complexMode && !xml)) { + if (ignore|| !(complexMode && withMedia)) { Why the !xml was removed? delete assets; is not enough, check how other goolists are deleted I'm also a bit confused with + type = richMediaContentType; configurations = new GooList(); assets = new GooList(); //Type name (Optional; ExtensionLevel 3) if (dict->lookup("Type", &obj1)->isName()) { - type = obj1.getName(); + const char *name = obj1.getName(); + if(!strcmp("RichMediaContent", name)){ + type = richMediaContentType; + } } if the type will always be richMediaContentType, why bother asking the dict? Also it would be cool if you could squash patch 7 into patch 3, this would make reviewing easier
Created attachment 71088 [details] [review] Patch for extract media
Hello, I fixed all the problems you found. The only problem is at the reading of the file name. When I use it on my PDFs, it is incappable to extract file names with variables "UF". Is what I can provide you with a PDF that shows you the problem because I do not know what I'm doing wrong. Thank you in advance
Yes please, if you have a pdf which the code reading the UF field is failing please create a new bug and attach it there.
+ Object obj1; + obj1.free(); You don't need to free the object just after creating it (it is not *wrong* per se but it's a bit weird) AnnotRichMedia::getConfigurations() returns a new GooList everytime but HtmlOutputDev::doProcessRichMedia does not delete the list (i'd say you don't need to delete the contents, but better check it). Also it'd be good if in the header file you mention that the caller of AnnotRichMedia::getConfigurations is the one responsible for deleting the GooList but not the contents of it (in case i'm right) //ok = gFalse; Seems like it'd be a good idea to set that? You have some commented out code in HtmlOutputDev::endPage, maybe remove it? Besides that is looking quite good :-)
Created attachment 71228 [details] [review] Patch for extract media (In reply to comment #15) > + Object obj1; > + obj1.free(); > > You don't need to free the object just after creating it (it is not *wrong* > per se but it's a bit weird) It's just an oversight after you have cleaned the code > > AnnotRichMedia::getConfigurations() returns a new GooList everytime but > HtmlOutputDev::doProcessRichMedia does not delete the list (i'd say you > don't need to delete the contents, but better check it). Also it'd be good > if in the header file you mention that the caller of > AnnotRichMedia::getConfigurations is the one responsible for deleting the > GooList but not the contents of it (in case i'm right) > > //ok = gFalse; > Seems like it'd be a good idea to set that? It's just an oversight after you have cleaned the code > > You have some commented out code in HtmlOutputDev::endPage, maybe remove it? It's just an oversight after you have cleaned the code > Besides that is looking quite good :-) Thank you for the time to incorporate changes
For problems filename, the problem appears when reading the file names of "rich media". I'll post a new bug featuer when this will be integrated. I think this is the best thing to do.
You have a few TODO in the utils/ folder, do you want us to integrate this with the TODOs or are you planning on working on it "now"?
Hello, It would be preferable for the integrated in your TODO list. I will work on these TODO in a few weeks but for now I do not have time to do so.
Ok, we are in feature freeze (i.e. no more new features) for next version that will be released on December 29, i'll come back to you after that date, we are trying to focus on fixing small bugs until that date.
Silly question, where are these "RichMediaContent" defined? I looked at the PDF 1.7 spec and at the ISO_PDF32000_2008.pdf spec and could not find them
Ping
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/482.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.