Bug 46521 - pdftohtml outputs UTF-8 encoded surrogates (UTF-16) in violation of RFC 3629.
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: pdftohtml
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: poppler-bugs
Reported: 2012-02-23 06:26 UTC by Henrik Grubbström
Modified: 2012-02-27 06:51 UTC

Attachments
PDF where the formulas on page 6 cause the problem. (489.08 KB, application/pdf)
2012-02-23 09:39 UTC, Henrik Grubbström

Description Henrik Grubbström 2012-02-23 06:26:20 UTC
The output from pdftohtml for one of our PDFs contained the byte sequences:

  \355\240\265\355\261\203 and \355\240\265\355\261\210

They seem to correspond to UTF-8 encoded surrogates for U+1D443 and U+1D448 (MATHEMATICAL ITALIC CAPITAL P and MATHEMATICAL ITALIC CAPITAL U). The proper UTF-8 encodings for these characters are \360\235\221\203 and \360\235\221\210, respectively.
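
The correspondence can be checked by hand. The following is a minimal Python sketch, not code from poppler: it decodes each 3-byte group without any surrogate check, recombines the two UTF-16 surrogates into a code point, and re-encodes that code point as UTF-8. The function names (decode_three_byte, combine_surrogates, repair_pair) are hypothetical and exist only for this illustration.

```python
def decode_three_byte(seq: bytes) -> int:
    """Decode one 3-byte UTF-8-style sequence, deliberately without
    rejecting surrogates (this mimics a naive decoder)."""
    if len(seq) != 3 or (seq[0] & 0xF0) != 0xE0:
        raise ValueError("expected a 3-byte sequence starting with 1110xxxx")
    if any((b & 0xC0) != 0x80 for b in seq[1:]):
        raise ValueError("expected continuation bytes of the form 10xxxxxx")
    return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF):
        raise ValueError("not a high surrogate: U+%04X" % high)
    if not (0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a low surrogate: U+%04X" % low)
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def repair_pair(seq: bytes) -> bytes:
    """Turn a 6-byte UTF-8 encoded surrogate pair into proper 4-byte UTF-8."""
    if len(seq) != 6:
        raise ValueError("expected exactly 6 bytes")
    high = decode_three_byte(seq[0:3])
    low = decode_three_byte(seq[3:6])
    return chr(combine_surrogates(high, low)).encode("utf-8")


if __name__ == "__main__":
    # The two byte sequences quoted in this report (octal escapes).
    for raw in (b"\355\240\265\355\261\203", b"\355\240\265\355\261\210"):
        fixed = repair_pair(raw)
        code_point = ord(fixed.decode("utf-8"))
        print(raw.hex(" "), "-> U+%04X ->" % code_point, fixed.hex(" "))
```

For the two sequences above this gives U+1D443 as F0 9D 91 83 and U+1D448 as F0 9D 91 88, that is \360\235\221\203 and \360\235\221\210 in octal.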

Many UTF-8 decoders and validators follow RFC 3629 and will reject UTF-8 encoded surrogates. From RFC 3629 section 3:

  The definition of UTF-8 prohibits encoding character numbers between U+D800
  and U+DFFF, which are reserved for use with the UTF-16 encoding form (as
  surrogate pairs) and do not directly represent characters. When encoding in
  UTF-8 from UTF-16 data, it is necessary to first decode the UTF-16 data to
  obtain character numbers, which are then encoded in UTF-8 as described above.

and:

  Implementations of the decoding algorithm above MUST protect against decoding
  invalid sequences. For instance, a naive implementation may decode the overlong
  UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C
  ED BE B4 into U+233B4. Decoding invalid sequences may have security
  consequences or cause other problems. See Security Considerations (Section 10)
  below.
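
As an illustration of the rejection behaviour described above, here is a small Python sketch. It relies on the fact that Python 3's built-in strict "utf-8" codec refuses encoded surrogates and overlong forms; the helper names (is_strict_utf8, encoded_surrogate_offsets) are hypothetical, and the offset scan assumes the input is otherwise well-formed UTF-8.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if data decodes with Python's strict UTF-8 codec,
    which rejects encoded surrogates and overlong sequences."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def encoded_surrogate_offsets(data: bytes) -> list:
    """Return the offsets of 3-byte groups that encode U+D800..U+DFFF.

    In UTF-8, a lead byte 0xED followed by a byte in 0xA0..0xBF can only
    mean a surrogate (0xED 0x80..0x9F covers the valid U+D000..U+D7FF).
    """
    return [
        i
        for i in range(len(data) - 1)
        if data[i] == 0xED and 0xA0 <= data[i + 1] <= 0xBF
    ]


if __name__ == "__main__":
    samples = {
        "pdftohtml output from this report": b"\355\240\265\355\261\203",
        "RFC 3629 surrogate pair example": bytes([0xED, 0xA1, 0x8C, 0xED, 0xBE, 0xB4]),
        "RFC 3629 overlong example": bytes([0xC0, 0x80]),
        "proper encoding of U+1D443": bytes([0xF0, 0x9D, 0x91, 0x83]),
    }
    for label, data in samples.items():
        print(label, is_strict_utf8(data), encoded_surrogate_offsets(data))
```

A consumer of pdftohtml output that validates in this way would reject the first three samples and accept only the last one.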

Comment 1 Albert Astals Cid 2012-02-23 09:28:20 UTC
Attach a file to reproduce the problem

Comment 2 Henrik Grubbström 2012-02-23 09:39:51 UTC
Created attachment 57539
PDF where the formulas on page 6 cause the problem.

Comment 3 Albert Astals Cid 2012-02-23 14:48:29 UTC
Fixed, thanks for the report

Comment 4 Henrik Grubbström 2012-02-27 06:51:55 UTC
Fix confirmed working.