Bug 76971

Summary: Problem with non-BMP Unicode characters
Product: poppler Reporter: Behdad Esfahbod <freedesktop>
Component: general Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED WORKSFORME QA Contact:
Severity: normal    
Priority: medium CC: freedesktop
Version: unspecified   
Hardware: Other   
OS: All   
Whiteboard:
Attachments: Sample document

Description Behdad Esfahbod 2014-04-02 23:53:16 UTC
Created attachment 96814
Sample document

The attached PDF was generated by cairo by printing a gedit document containing a single character: U+1D780.  Here it is in text: "𝞀".  This is an example of what we call a "non-BMP" Unicode character, i.e. one whose code point is > 0xFFFF.  It doesn't fit in two bytes, which means it doesn't fit in one UTF-16 code unit and has to be encoded as a surrogate pair.
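
To make the point about code units concrete, here is a minimal Python sketch of how a non-BMP code point such as U+1D780 splits into a UTF-16 surrogate pair. The helper name `to_surrogates` is made up for illustration; it is not part of poppler or cairo.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a non-BMP code point into a UTF-16 (high, low) surrogate pair.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1D780)
print(f"U+1D780 -> {high:04X} {low:04X}")

# Python's own UTF-16 encoder produces the same two code units.
print("\U0001D780".encode("utf-16-be").hex())
```

The character takes two 16-bit code units in UTF-16 and four bytes in UTF-8, so any code that assumes one code unit per character will mishandle it.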

Printing the attached PDF from evince to a PDF file fails.  Evince generates the following cairo error:

  cairo context error: input string not valid UTF-8

I think what's happening is that someone somewhere in the poppler chain is not handling UTF-16 surrogate pairs.  Or some other mishandling.
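
As an illustration of how that kind of mishandling could lead to "not valid UTF-8", here is a small Python sketch. It is an assumption about one possible failure mode, not a confirmed diagnosis of poppler's code, and `naive_utf8` and `is_valid_utf8` are hypothetical helpers: if each UTF-16 code unit is converted to UTF-8 on its own, the two surrogates become two 3-byte sequences that a strict UTF-8 decoder rejects.

```python
def naive_utf8(text: str) -> bytes:
    """Convert UTF-16 code units to UTF-8 one at a time (deliberately wrong).

    Hypothetical helper mimicking code that ignores surrogate pairs.
    """
    utf16 = text.encode("utf-16-be")
    units = [int.from_bytes(utf16[i:i + 2], "big")
             for i in range(0, len(utf16), 2)]
    # "surrogatepass" lets Python emit bytes for lone surrogates,
    # which strict UTF-8 forbids.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


char = "\U0001D780"
print(is_valid_utf8(char.encode("utf-8")))  # correct encoding
print(is_valid_utf8(naive_utf8(char)))      # per-code-unit conversion
```

BMP-only text passes through such a conversion unchanged, which would explain why a bug of this kind only shows up with characters above 0xFFFF.
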
Comment 1 Behdad Esfahbod 2014-04-03 00:09:29 UTC
Humm.  I'm told by others that this is probably already fixed in the latest version.  I'm testing on Ubuntu 12.04.  Feel free to close if it works for you.
Comment 2 Albert Astals Cid 2014-04-03 22:09:35 UTC
Works for me, please try in something that is not 2 years old next time :-)
