Bug 45572 - pdftohtml does not handle Unicode text in outlines
Summary: pdftohtml does not handle Unicode text in outlines
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: pdftohtml (show other bugs)
Version: unspecified
Hardware: Other All
: medium normal
Assignee: poppler-bugs
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-02-02 16:52 UTC by Igor Slepchin
Modified: 2012-02-05 07:00 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
patch (4.35 KB, application/x-gzip)
2012-02-02 17:01 UTC, Igor Slepchin
Details
patch for the outlines unicode bug only (8.32 KB, patch)
2012-02-03 13:43 UTC, Igor Slepchin
Details | Splinter Review
patch for the outlines unicode bug only (with #else for #ifdef DISABLE_OUTLINE) (8.34 KB, patch)
2012-02-03 15:05 UTC, Igor Slepchin
Details | Splinter Review
pdf that demonstrates the bug with unicode in outlines (421.66 KB, application/pdf)
2012-02-03 15:08 UTC, Igor Slepchin
Details

Description Igor Slepchin 2012-02-02 16:52:45 UTC
pdftohtml will currently dump the pdf outlines into generated HTML but that only works properly if the outline text is using PDFDocEncoding and breaks when Unicode is used. Please let me know if you need me to attach a sample pdf that demonstrates this.
Comment 1 Igor Slepchin 2012-02-02 17:01:50 UTC
Created attachment 56557 [details]
patch

the tar contains three patches:

001*.patch fixes the unicode handling in outlines by making use of already existing Outline class rather than parsing the outline anew.

002*.patch fixes a memory leak when dumping html encoding string (there is a leak (which happened once per generated .html file)

003*.patch fixes another leak (which happens once per pdftohtml run).
Comment 2 Albert Astals Cid 2012-02-03 10:50:40 UTC
I have to reject 0002, please don't add more static variables to HtmlOutputDev, it's already bad enough, we want less static variables, not more.

As for the others i'm not having a look at them until you fix 0002. You should have opened 3 different bugs so one does not block the other two ;-)
Comment 3 Igor Slepchin 2012-02-03 13:43:16 UTC
Created attachment 56583 [details] [review]
patch for the outlines unicode bug only

Splitting this is a good idea indeed :p

I'm reattaching just the first patch as it's the only one relevant to this bug per se. The only change compared to 0001*.patch from the original .tar attachment is that it now gracefully handles outline items with no destination.

I'll rework 0002 to get rid of the static var and submit it and 0003 separately. As this'll likely involve nearly all the work needed for getting rid of the other static data members in HtmlOutputDev, would you mind if I get rid of them as well? I'm thinking about just making them all data members of HtmlOutputDevice and then passing a pointer to an instance of it to whoever uses the currently static vars (which is just HtmlPage, really).
Comment 4 Igor Slepchin 2012-02-03 13:49:41 UTC
Just as a heads up, I'm planning to submit another small patch that closes <li> tags in the generated html for the outlines and then another one that generates outlines in -xml mode too. Please let me know if you have any preferences re submitting them through bugzilla/mailing list etc or just think outright they won't be accepted...
Comment 5 Albert Astals Cid 2012-02-03 14:23:20 UTC
We don't really support DISABLE_OUTLINE but your patch is wrong, as you can see in PDFDoc.h if DISABLE_OUTLINE is defined the getOutline() function is not declared, so you need and #else in your code.

Bugzilla is fine for patches.

Additionally it'd be good if you attached a PDF file that gets fixed by this patch.
Comment 6 Igor Slepchin 2012-02-03 15:05:48 UTC
Created attachment 56585 [details] [review]
patch for the outlines unicode bug only (with #else for #ifdef DISABLE_OUTLINE)

(In reply to comment #5)
> <...> if DISABLE_OUTLINE is defined the getOutline() function is not
> declared, so you need and #else in your code.

Indeed, sorry about that. Fix attached.
Comment 7 Igor Slepchin 2012-02-03 15:08:28 UTC
Created attachment 56586 [details]
pdf that demonstrates the bug with unicode in outlines

(In reply to comment #5)
> Additionally it'd be good if you attached a PDF file that gets fixed by this
> patch.

Here you go.
Comment 8 Albert Astals Cid 2012-02-05 07:00:09 UTC
Pushed to the repo.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.