Bug 56226 - Poppler does not guard against invalid utf-8
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: cairo backend
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
Reported: 2012-10-20 13:01 UTC by Benjamin Berg
Modified: 2012-11-09 18:30 UTC

Attachments
ugly workaround (1.26 KB, text/plain)
2012-10-20 13:01 UTC, Benjamin Berg
The broken page (302.51 KB, text/plain)
2012-10-21 09:46 UTC, Benjamin Berg
Don't allow invalid unicode to be passed to backends (2.52 KB, patch)
2012-10-28 01:50 UTC, Adrian Johnson

Description Benjamin Berg 2012-10-20 13:01:09 UTC
Created attachment 68850
ugly workaround

I have a PDF file that apparently contains the "unicode character" 0xffff. Obviously this is an invalid character, but poppler insists on passing it to cairo.

My guess is that the PDF file is broken in some way. Unfortunately, I am not able to provide the file in question because I don't have the rights to make it public. I am not even able to extract that single page, because pdftk refuses to open the file.

I am attaching a patch that works around the issue. Not a very nice patch in any way, but it gets the job done. The patch simply copies the validity check from cairo.
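
For readers without the attachment, a check of this kind is short. The following is only a sketch of the usual rule for code points that are safe to hand to a text backend; the function name `isValidUnicode` is hypothetical and is not taken from the attached patch, from poppler, or from cairo's public API:

```cpp
#include <cstdint>

// Hypothetical helper (assumption, not poppler or cairo API).
// Returns true if `c` may appear in well-formed text: it is inside the
// Unicode code space, is not a UTF-16 surrogate, and is not a noncharacter.
bool isValidUnicode(std::uint32_t c)
{
    return c < 0x110000                  // within the Unicode code space
        && (c & 0xFFFFF800u) != 0xD800   // not a surrogate (U+D800..U+DFFF)
        && (c < 0xFDD0 || c > 0xFDEF)    // not in the U+FDD0..U+FDEF noncharacter block
        && (c & 0xFFFE) != 0xFFFE;       // not U+xxFFFE or U+xxFFFF
}
```

With such a check, 0xffff is rejected because its low 16 bits match the U+xxFFFF noncharacter pattern.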

This is what pdftotext prints for the section in question. I think the U+FFFF characters are scaled braces {}, i.e. similar to what LaTeX would create for:
  Im\left\{ \frac{S_{Last}}{30kVA} \right\}

The Text:
"""
Im

<U+FFFF>

S Last
30kV A

<U+FFFF>
"""
Comment 1 James Cloos 2012-10-20 22:54:11 UTC
If pdftk fails to extract the page, try using mupdfclean from mupdf¹.

1] http://mupdf.com/
   git://git.ghostscript.com/mupdf.git
Comment 2 Thomas Freitag 2012-10-21 07:29:37 UTC
If the PDF is not password protected, you can also use pdfseparate, which is part of the poppler utils, to extract the page. Then we can be sure that the content of that page has not been changed.
If you are not able to share even that page publicly because you don't have the rights, perhaps you can provide it in private. Otherwise there is no way to check your patch and we have to wait until we get another sample.
Comment 3 Benjamin Berg 2012-10-21 09:46:28 UTC
Created attachment 68865
The broken page

Ah, did not know about pdfseparate.

I am fine with providing that particular page; just not the whole document.
Comment 4 Thomas Freitag 2012-10-27 09:24:17 UTC
Your patch doesn't apply to git master anymore, and as far as I can see it's already solved: the mapping is no longer done in an output device, but in UnicodeMap.cc.
Comment 5 Benjamin Berg 2012-10-27 20:31:26 UTC
The bug still exists on master, and the patch can easily be updated. I doubt, though, that the solution used in the patch is acceptable.


So, I just looked at the PDF a bit more. The image uses font 3.1 (and 7.1) and the characters are explicitly mapped to the unicode character uFFFF:

Excerpt from the unicode mapping object (stream 695 in the PDF) for the font in question:
2 beginbfrange
<21><21><ffff>
<22><22><ffff>
endbfrange
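
To make the excerpt concrete: each bfrange line `<lo><hi><dst>` maps the character codes lo..hi to consecutive values starting at dst. A simplified sketch of that expansion follows; the function `addBfRange` and the use of `std::map` are assumptions for illustration (this is not poppler's parser), and the destination is treated as a single code point:

```cpp
#include <cstdint>
#include <map>

// Hypothetical illustration, not poppler code: expand one bfrange entry
// <lo> <hi> <dst> so that lo maps to dst, lo+1 to dst+1, and so on up to hi.
void addBfRange(std::map<std::uint32_t, std::uint32_t> &toUnicode,
                std::uint32_t lo, std::uint32_t hi, std::uint32_t dst)
{
    if (lo > hi)
        return; // malformed range, ignore it
    for (std::uint32_t code = lo;; ++code) {
        toUnicode[code] = dst + (code - lo);
        if (code == hi)
            break; // stop here so that ++code can never wrap around
    }
}
```

Fed the two entries above, `addBfRange(m, 0x21, 0x21, 0xffff)` and `addBfRange(m, 0x22, 0x22, 0xffff)`, the table ends up mapping both codes 0x21 and 0x22 to 0xffff, which is exactly the value that later reaches the backend.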

This PDF is *broken*. Unicode 0xffff is an invalid character by specification, and poppler (including master) passes it through to cairo. And cairo does not like uFFFF.


So, poppler needs to be more careful, at least when passing the unicode string to cairo. It would also work to silently change the ffff to e.g. fffd (the replacement character) while loading the unicode map. I don't know what the best solution is here.
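
The second option, substituting the replacement character when the map is loaded, could look roughly like this. It is a sketch under assumptions: `replaceInvalidUnicode` and `isInvalidCodePoint` are hypothetical names, and the mapped text is modeled as a plain vector of code points rather than poppler's own types:

```cpp
#include <cstdint>
#include <vector>

constexpr std::uint32_t kReplacementChar = 0xFFFD; // U+FFFD REPLACEMENT CHARACTER

// Hypothetical predicate: true for values that are not usable code points
// (out of range, surrogates, or Unicode noncharacters such as U+FFFF).
bool isInvalidCodePoint(std::uint32_t c)
{
    return c >= 0x110000
        || (c >= 0xD800 && c <= 0xDFFF)
        || (c >= 0xFDD0 && c <= 0xFDEF)
        || (c & 0xFFFE) == 0xFFFE;
}

// Hypothetical sanitizer: rewrite every invalid entry in place so that
// nothing downstream (cairo, pdftotext output) ever sees it.
void replaceInvalidUnicode(std::vector<std::uint32_t> &text)
{
    for (std::uint32_t &c : text) {
        if (isInvalidCodePoint(c))
            c = kReplacementChar;
    }
}
```

Doing the substitution once, where the mapping is read, means each output device does not need its own guard.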
Comment 6 Adrian Johnson 2012-10-28 01:50:32 UTC
Created attachment 69171
Don't allow invalid unicode to be passed to backends

This patch is a more general solution that validates the unicode characters at the source.
Comment 7 Benjamin Berg 2012-10-28 13:38:42 UTC
The patch works perfectly. As expected, the output of pdftotext changes slightly as it now outputs the replacement character. But I guess that this can be considered a feature.
Comment 8 Adrian Johnson 2012-10-29 22:01:09 UTC
That feature is intentional. We should not be outputting invalid unicode characters from pdftotext.
Comment 9 Albert Astals Cid 2012-11-02 23:55:44 UTC
Adrian, looks good to me, are you committing?
Comment 10 Adrian Johnson 2012-11-03 00:17:40 UTC
Pushed.
Comment 11 Pacho Ramos 2012-11-09 13:48:39 UTC
(In reply to comment #10)
> Pushed.

Could this also be pushed to the 0.20 branch? Thanks
Comment 12 Albert Astals Cid 2012-11-09 18:30:44 UTC
I don't plan on doing any more 0.20 releases, so we can concentrate on getting 0.22 out the door.
