46521 – pdftohtml outputs UTF-8 encoded surrogates (UTF-16) in violation of RFC 3629.

Status:	RESOLVED FIXED

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	pdftohtml
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	poppler-bugs
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2012-02-23 06:26 UTC by Henrik Grubbström
Modified:	2012-02-27 06:51 UTC
CC List:	0 users

Attachments
PDF where the formulas on page 6 cause the problem. (489.08 KB, application/pdf) 2012-02-23 09:39 UTC, Henrik Grubbström

Description Henrik Grubbström 2012-02-23 06:26:20 UTC

The output from pdftohtml for one of our PDFs contained the following byte sequences (octal escapes):

  \355\240\265\355\261\203 and \355\240\265\355\261\210

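They correspond to UTF-8 encoded surrogate pairs for U+1D443 and U+1D448 (MATHEMATICAL ITALIC CAPITAL P and MATHEMATICAL ITALIC CAPITAL U). Each six-byte sequence is the two UTF-16 code units (0xD835 0xDC43 and 0xD835 0xDC48) encoded individually as three-byte UTF-8 sequences. The proper UTF-8 encoding for these characters is \360\235\221\203 and \360\235\221\210.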
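
To make the arithmetic concrete, here is a minimal Python sketch of how such a sequence maps back to a code point and to its well-formed UTF-8 encoding. The function names are made up for illustration and have nothing to do with the poppler sources or the actual fix.

```python
def decode_three_byte(seq: bytes) -> int:
    """Decode one 3-byte UTF-8-style sequence (1110xxxx 10xxxxxx 10xxxxxx)
    into a 16-bit value, without rejecting surrogates."""
    if len(seq) != 3 or (seq[0] & 0xF0) != 0xE0:
        raise ValueError("not a 3-byte sequence")
    if (seq[1] & 0xC0) != 0x80 or (seq[2] & 0xC0) != 0x80:
        raise ValueError("bad continuation byte")
    return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def repair_surrogate_pair(seq: bytes) -> bytes:
    """Turn a 6-byte encoded surrogate pair into well-formed UTF-8."""
    if len(seq) != 6:
        raise ValueError("expected exactly 6 bytes")
    high = decode_three_byte(seq[:3])
    low = decode_three_byte(seq[3:])
    return chr(combine_surrogates(high, low)).encode("utf-8")


# The two sequences from this report, written with the same octal escapes.
for bad in (b"\355\240\265\355\261\203", b"\355\240\265\355\261\210"):
    good = repair_surrogate_pair(bad)
    print(bad.hex(), "->", good.hex(), good.decode("utf-8"))
```

This only handles one isolated pair; a real converter would have to walk the whole UTF-16 string and decide what to do with unpaired surrogates.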

Many UTF-8 decoders and validators follow RFC 3629 and will reject UTF-8 encoded surrogates. From RFC 3629 section 3:

  The definition of UTF-8 prohibits encoding character numbers between U+D800
  and U+DFFF, which are reserved for use with the UTF-16 encoding form (as
  surrogate pairs) and do not directly represent characters. When encoding in
  UTF-8 from UTF-16 data, it is necessary to first decode the UTF-16 data to
  obtain character numbers, which are then encoded in UTF-8 as described above.

and:

  Implementations of the decoding algorithm above MUST protect against decoding
  invalid sequences. For instance, a naive implementation may decode the overlong
  UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C
  ED BE B4 into U+233B4. Decoding invalid sequences may have security
  consequences or cause other problems. See Security Considerations (Section 10)
  below.
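
As an illustration of the difference between a strict and a naive decoder, the following Python sketch uses the two invalid sequences from the RFC text. It relies on Python's built-in codecs (the default strict handler and the `surrogatepass` error handler); the helper names are hypothetical, and the lenient path only imitates what a naive decoder might do, it does not describe any particular implementation.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True only if Python's strict UTF-8 decoder accepts the data."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


def lenient_decode(data: bytes) -> str:
    """Imitate a naive decoder: let encoded surrogates through, then
    pair them up by round-tripping through UTF-16."""
    with_surrogates = data.decode("utf-8", errors="surrogatepass")
    utf16 = with_surrogates.encode("utf-16-le", errors="surrogatepass")
    return utf16.decode("utf-16-le")


overlong_nul = bytes.fromhex("C080")             # overlong form of U+0000
encoded_pair = bytes.fromhex("EDA18CEDBEB4")     # surrogate pair from the RFC
well_formed = "\U000233B4".encode("utf-8")       # F0 A3 8E B4

print(is_strict_utf8(overlong_nul))   # False
print(is_strict_utf8(encoded_pair))   # False
print(is_strict_utf8(well_formed))    # True
print(hex(ord(lenient_decode(encoded_pair))))  # 0x233b4
```

A consumer that validates its input the strict way will therefore reject the pdftohtml output described above, while a lenient one will silently accept it.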

Comment 1 Albert Astals Cid 2012-02-23 09:28:20 UTC

Please attach a file that reproduces the problem.

Comment 2 Henrik Grubbström 2012-02-23 09:39:51 UTC

Created attachment 57539
PDF where the formulas on page 6 cause the problem.

Comment 3 Albert Astals Cid 2012-02-23 14:48:29 UTC

Fixed, thanks for the report

Comment 4 Henrik Grubbström 2012-02-27 06:51:55 UTC

Fix confirmed working.
