pdftohtml currently dumps the PDF outline into the generated HTML, but this only works properly when the outline text uses PDFDocEncoding; it breaks when the text is Unicode. Please let me know if you need me to attach a sample PDF that demonstrates this.
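For illustration, here is a minimal sketch of the distinction involved; it is not the code from any patch on this bug. A PDF text string is either PDFDocEncoding or UTF-16BE introduced by the byte order mark FE FF. The helper name `pdfTextToUtf8` is hypothetical, and the sketch treats PDFDocEncoding bytes as Latin-1, which is only an approximation (the two agree for printable ASCII but not for every byte value).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to `out` as UTF-8.
static void appendUtf8(std::string &out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert the raw bytes of a PDF text string to UTF-8.
std::string pdfTextToUtf8(const std::string &raw) {
    const auto byteAt = [&raw](std::size_t i) {
        return static_cast<std::uint32_t>(static_cast<unsigned char>(raw[i]));
    };
    std::string out;

    const bool isUtf16 = raw.size() >= 2 && byteAt(0) == 0xFE && byteAt(1) == 0xFF;
    if (!isUtf16) {
        // Assumption: approximate PDFDocEncoding with Latin-1 (one byte, one code point).
        for (std::size_t i = 0; i < raw.size(); ++i) {
            appendUtf8(out, byteAt(i));
        }
        return out;
    }

    // UTF-16BE after the two-byte BOM; a trailing odd byte is ignored.
    for (std::size_t i = 2; i + 1 < raw.size(); i += 2) {
        std::uint32_t cp = (byteAt(i) << 8) | byteAt(i + 1);
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 3 < raw.size()) {
            const std::uint32_t low = (byteAt(i + 2) << 8) | byteAt(i + 3);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                // Combine a surrogate pair into one code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 2;
            }
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate: emit the replacement character
        }
        appendUtf8(out, cp);
    }
    return out;
}
```

Code that assumes every outline title is single-byte text would emit the BOM and the interleaved high bytes verbatim, which matches the kind of breakage described above.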
Created attachment 56557 [details] patch
The tar contains three patches: 001*.patch fixes the Unicode handling in outlines by using the already existing Outline class rather than parsing the outline anew. 002*.patch fixes a memory leak when dumping the HTML encoding string (the leak happened once per generated .html file). 003*.patch fixes another leak (which happens once per pdftohtml run).
I have to reject 0002. Please don't add more static variables to HtmlOutputDev; it's already bad enough, and we want fewer static variables, not more. As for the others, I'm not having a look at them until you fix 0002. You should have opened 3 different bugs so one does not block the other two ;-)
Created attachment 56583 [details] [review] patch for the outlines unicode bug only
Splitting this is a good idea indeed :p I'm reattaching just the first patch, as it's the only one relevant to this bug per se. The only change compared to 0001*.patch from the original .tar attachment is that it now gracefully handles outline items with no destination. I'll rework 0002 to get rid of the static var and submit it and 0003 separately. As this will likely involve nearly all the work needed to get rid of the other static data members in HtmlOutputDev, would you mind if I get rid of them as well? I'm thinking about just making them all data members of HtmlOutputDev and then passing a pointer to an instance of it to whoever uses the currently static vars (which is just HtmlPage, really).
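A rough sketch of the refactoring proposed above, under stated assumptions: `OutputDevSketch` and `PageSketch` are hypothetical stand-ins for HtmlOutputDev and HtmlPage, and the single `encoding` field stands in for whichever static members actually exist. The idea is only that state moves from static storage into the device instance, and the page reads it through a non-owning pointer.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for HtmlOutputDev: settings live in the instance.
class OutputDevSketch {
public:
    explicit OutputDevSketch(std::string encoding) : encoding_(std::move(encoding)) {}
    const std::string &encoding() const { return encoding_; }

private:
    std::string encoding_;  // would previously have been a static data member
};

// Hypothetical stand-in for HtmlPage: it is handed the device it belongs to.
class PageSketch {
public:
    explicit PageSketch(const OutputDevSketch *dev) : dev_(dev) {
        assert(dev_ != nullptr);  // the page cannot work without its device
    }

    // Reads the setting through the pointer instead of a class-wide static.
    std::string metaTag() const {
        return "<meta charset=\"" + dev_->encoding() + "\">";
    }

private:
    const OutputDevSketch *dev_;  // non-owning; the device must outlive the page
};
```

With this shape, two devices configured differently can coexist in one process, and the setting is released together with its device rather than lingering as static storage.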
Just as a heads up, I'm planning to submit another small patch that closes the <li> tags in the generated HTML for the outlines, and then another one that generates outlines in -xml mode too. Please let me know if you have any preferences about submitting them through Bugzilla, the mailing list, etc., or if you think outright that they won't be accepted...
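To make the first of these concrete, here is a small sketch of emitting a nested outline in which every <li> is closed after its children. `OutlineNode`, `escapeHtml` and `renderOutline` are hypothetical names for illustration, not poppler's API, and titles are assumed to be UTF-8 already. A struct containing a vector of itself relies on C++17.

```cpp
#include <string>
#include <vector>

// Hypothetical outline item: a title plus nested child items.
struct OutlineNode {
    std::string title;  // assumed to be UTF-8 already
    std::vector<OutlineNode> children;
};

// Escape the characters that are significant in HTML text content.
static std::string escapeHtml(const std::string &text) {
    std::string out;
    for (char c : text) {
        switch (c) {
        case '&': out += "&amp;"; break;
        case '<': out += "&lt;"; break;
        case '>': out += "&gt;"; break;
        default: out += c; break;
        }
    }
    return out;
}

// Render one outline level as <ul>...</ul>; an empty level renders as nothing.
std::string renderOutline(const std::vector<OutlineNode> &items) {
    if (items.empty()) {
        return "";
    }
    std::string html = "<ul>";
    for (const OutlineNode &item : items) {
        html += "<li>" + escapeHtml(item.title);
        html += renderOutline(item.children);  // nested list sits inside its <li>
        html += "</li>";                       // closed only after the children
    }
    html += "</ul>";
    return html;
}
```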
We don't really support DISABLE_OUTLINE, but your patch is wrong: as you can see in PDFDoc.h, if DISABLE_OUTLINE is defined the getOutline() function is not declared, so you need an #else in your code. Bugzilla is fine for patches. Additionally, it'd be good if you attached a PDF file that gets fixed by this patch.
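The point about #else can be sketched as follows. `DocSketch` and `dumpOutline` are hypothetical stand-ins and the string return value is a placeholder; the only thing taken from the comment above is that getOutline() is not declared when DISABLE_OUTLINE is defined, so every call to it has to sit in the branch that is compiled only when the macro is absent.

```cpp
#include <string>

// Hypothetical stand-in for PDFDoc: the accessor exists only when
// outline support is compiled in.
class DocSketch {
public:
#ifndef DISABLE_OUTLINE
    std::string getOutline() const { return "outline"; }
#endif
};

// Returns the outline text, or an empty string when outline support
// has been compiled out.
std::string dumpOutline(const DocSketch &doc) {
#ifdef DISABLE_OUTLINE
    (void)doc;  // nothing to query in this configuration
    return "";
#else
    return doc.getOutline();  // only compiled when the accessor is declared
#endif
}
```

Building the same file with -DDISABLE_OUTLINE should still compile, because the call to getOutline() is then never seen by the compiler.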
Created attachment 56585 [details] [review] patch for the outlines unicode bug only (with #else for #ifdef DISABLE_OUTLINE)
(In reply to comment #5)
> <...> if DISABLE_OUTLINE is defined the getOutline() function is not
> declared, so you need an #else in your code.
Indeed, sorry about that. Fix attached.
Created attachment 56586 [details] pdf that demonstrates the bug with unicode in outlines
(In reply to comment #5)
> Additionally it'd be good if you attached a PDF file that gets fixed by this
> patch.
Here you go.
Pushed to the repo.