Bug 23075

Summary: pdfinfo can produce invalid UTF-8
Product: poppler
Reporter: Jakub Wilk <jwilk>
Component: general
Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED FIXED
QA Contact:
Severity: normal    
Priority: medium    
Version: unspecified   
Hardware: All   
OS: Linux (All)   
Whiteboard:
Attachments: pdfinfo - decode surrogate pairs

Description Jakub Wilk 2009-08-01 06:34:38 UTC
(Tested with poppler 0.10.6.)

pdfinfo does not properly encode Unicode characters outside the BMP:

$ locale charmap
UTF-8

$ wget -q 'http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=5;att=1;bug=525309' -O utf16nonbmp.pdf

$ pdfinfo utf16nonbmp.pdf | iconv -f UTF-8 -t UTF-32 >/dev/null
iconv: illegal input sequence at position 16
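
Note: the iconv failure above is consistent with each UTF-16 code unit being converted to UTF-8 on its own, so that a surrogate pair becomes two 3-byte sequences instead of one 4-byte sequence. That this is what poppler 0.10.6 did is an assumption based on the attachment title, not something verified against the poppler source. A minimal Python sketch of the effect, where `naive_utf8` is a hypothetical helper and U+1D11E is an arbitrary non-BMP test character:

```python
def naive_utf8(unit: int) -> bytes:
    """Encode one 16-bit value as UTF-8 without checking for surrogates."""
    if unit < 0x80:
        return bytes([unit])
    if unit < 0x800:
        return bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


# U+1D11E (MUSICAL SYMBOL G CLEF) is stored in UTF-16 as the pair D834 DD1E.
units = [0xD834, 0xDD1E]

# Converting each code unit separately yields two encoded surrogates.
broken = b"".join(naive_utf8(u) for u in units)

# A strict UTF-8 decoder rejects encoded surrogates, as iconv does above.
try:
    broken.decode("utf-8")
    is_valid = True
except UnicodeDecodeError:
    is_valid = False

print(broken.hex(" "), "valid UTF-8:", is_valid)
```

Strict decoders refuse such bytes because UTF-8 does not permit the range U+D800..U+DFFF to be encoded at all.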

Comment 1 Adrian Johnson 2012-02-21 04:12:50 UTC
Created attachment 57386
pdfinfo - decode surrogate pairs

Patch to fix.

Comment 2 Albert Astals Cid 2012-02-21 15:04:54 UTC
Adrian, the math in your patch was wrong; I've committed a fixed version. Thanks for finding the lead!
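
Note: for reference, the standard surrogate-pair arithmetic is code point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00). The sketch below is a Python illustration of that formula only; it is not the committed poppler fix, and the function names and the choice to map unpaired surrogates to U+FFFD are assumptions made for the example.

```python
REPLACEMENT = 0xFFFD  # assumption: substitute U+FFFD for unpaired surrogates


def utf16_units_to_code_points(units):
    """Combine UTF-16 code units into Unicode code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Each surrogate carries 10 bits; the result is offset by 0x10000.
            high = unit - 0xD800
            low = units[i + 1] - 0xDC00
            code_points.append(0x10000 + (high << 10) + low)
            i += 2
        elif 0xD800 <= unit <= 0xDFFF:
            # Lone high or low surrogate: not a valid character by itself.
            code_points.append(REPLACEMENT)
            i += 1
        else:
            code_points.append(unit)
            i += 1
    return code_points


def utf16_units_to_utf8(units):
    """Encode the combined code points as UTF-8 bytes."""
    text = "".join(chr(cp) for cp in utf16_units_to_code_points(units))
    return text.encode("utf-8")


print(utf16_units_to_utf8([0xD834, 0xDD1E]).hex(" "))
```

With the pair combined first, the character is emitted as a single 4-byte sequence that strict decoders such as iconv accept.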
