Bug 46521

Summary: pdftohtml outputs UTF-8 encoded surrogates (UTF-16) in violation of RFC 3629.
Product: poppler
Reporter: Henrik Grubbström <grubba>
Component: pdftohtml
Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED FIXED
QA Contact:
Severity: normal    
Priority: medium    
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Attachments: PDF where the formulas on page 6 cause the problem.

Description Henrik Grubbström 2012-02-23 06:26:20 UTC
The output from pdftohtml for one of our pdfs contained the byte sequences:

  \355\240\265\355\261\203 and \355\240\265\355\261\210

They seem to correspond to UTF-8 encoded surrogates for U+1D443 and U+1D448 (MATHEMATICAL ITALIC CAPITAL P and MATHEMATICAL ITALIC CAPITAL U). The proper UTF-8 encoding for these characters is \360\235\221\203 and \360\235\221\210.
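
The arithmetic behind that reading can be checked with a short Python sketch. It is only an illustration, not poppler code: the helper names `decode_3byte` and `combine_surrogates` are made up for this example. It decodes each three-byte group without the surrogate check a conforming decoder would apply, merges the resulting high/low pair into one code point, and encodes that code point as proper UTF-8.

```python
def decode_3byte(seq: bytes) -> int:
    """Decode one 3-byte UTF-8-style group, deliberately allowing surrogates."""
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def combine_surrogates(high: int, low: int) -> int:
    """Merge a UTF-16 high/low surrogate pair into a single code point."""
    assert 0xD800 <= high <= 0xDBFF, "not a high surrogate"
    assert 0xDC00 <= low <= 0xDFFF, "not a low surrogate"
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# First byte sequence from the report, written with the same octal escapes.
bad = b"\355\240\265\355\261\203"

high = decode_3byte(bad[:3])   # 0xD835
low = decode_3byte(bad[3:])    # 0xDC43
cp = combine_surrogates(high, low)
good = chr(cp).encode("utf-8")

print(f"U+{high:04X} U+{low:04X} -> U+{cp:05X} -> {good!r}")
# prints: U+D835 U+DC43 -> U+1D443 -> b'\xf0\x9d\x91\x83'
```

In hex, the six bytes ED A0 B5 ED B1 83 collapse to the four bytes F0 9D 91 83.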

Many UTF-8 decoders and validators follow RFC 3629 and will reject UTF-8 encoded surrogates. From RFC 3629 section 3:

  The definition of UTF-8 prohibits encoding character numbers between U+D800
  and U+DFFF, which are reserved for use with the UTF-16 encoding form (as
  surrogate pairs) and do not directly represent characters. When encoding in
  UTF-8 from UTF-16 data, it is necessary to first decode the UTF-16 data to
  obtain character numbers, which are then encoded in UTF-8 as described above.

and:

  Implementations of the decoding algorithm above MUST protect against decoding
  invalid sequences. For instance, a naive implementation may decode the overlong
  UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C
  ED BE B4 into U+233B4. Decoding invalid sequences may have security
  consequences or cause other problems. See Security Considerations (Section 10)
  below.
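
For consumers stuck with output from an affected pdftohtml version, the following Python sketch shows both the rejection and one possible repair. It is an assumption-laden workaround, not the fix that went into poppler: `repair_surrogate_utf8` is a hypothetical helper name, and it relies on Python's standard `surrogatepass` error handler to let the encoded surrogates through before a UTF-16 round trip merges each pair.

```python
def repair_surrogate_utf8(data: bytes) -> bytes:
    """Re-encode UTF-8-encoded surrogate pairs as well-formed UTF-8.

    Raises UnicodeDecodeError if the input contains an unpaired surrogate.
    """
    # "surrogatepass" makes the decoder accept ED A0..BF xx sequences.
    text = data.decode("utf-8", errors="surrogatepass")
    # Writing the lone surrogates out as UTF-16 code units and decoding
    # strictly merges each high/low pair into one supplementary character.
    merged = text.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")
    return merged.encode("utf-8")


# Second byte sequence from the report (ED A0 B5 ED B1 88).
bad = b"\xed\xa0\xb5\xed\xb1\x88"

try:
    bad.decode("utf-8")          # strict decoding, as RFC 3629 requires
    rejected = False
except UnicodeDecodeError:
    rejected = True

fixed = repair_surrogate_utf8(bad)
print(rejected, fixed)
# prints: True b'\xf0\x9d\x91\x88'
```

Well-formed UTF-8 passes through the helper unchanged, so it can be applied to a whole output file rather than only to the damaged spans.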
Comment 1 Albert Astals Cid 2012-02-23 09:28:20 UTC
Please attach a file that reproduces the problem.
Comment 2 Henrik Grubbström 2012-02-23 09:39:51 UTC
Created attachment 57539
PDF where the formulas on page 6 cause the problem.
Comment 3 Albert Astals Cid 2012-02-23 14:48:29 UTC
Fixed, thanks for the report
Comment 4 Henrik Grubbström 2012-02-27 06:51:55 UTC
Fix confirmed working.
