The output from pdftohtml for one of our PDFs contained the byte sequences \355\240\265\355\261\203 and \355\240\265\355\261\210.

They seem to be UTF-8 encoded surrogate pairs for U+1D443 and U+1D448 (MATHEMATICAL ITALIC CAPITAL P and MATHEMATICAL ITALIC CAPITAL U). The proper UTF-8 encodings for these characters are \360\235\221\203 and \360\235\221\210.

Many UTF-8 decoders and validators follow RFC 3629 and will reject UTF-8 encoded surrogates. From RFC 3629, section 3:

    The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters. When encoding in UTF-8 from UTF-16 data, it is necessary to first decode the UTF-16 data to obtain character numbers, which are then encoded in UTF-8 as described above.

and:

    Implementations of the decoding algorithm above MUST protect against decoding invalid sequences. For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding invalid sequences may have security consequences or cause other problems. See Security Considerations (Section 10) below.
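For anyone who needs to check or work around affected output, here is a minimal Python sketch of the mismatch described above. It is not the pdftohtml fix; the helper name repair_surrogates is hypothetical, and it assumes the input consists only of well-formed UTF-8 plus correctly paired encoded surrogates.

```python
def repair_surrogates(data: bytes) -> bytes:
    """Re-encode UTF-8 containing encoded surrogate pairs as well-formed UTF-8.

    Hypothetical helper for illustration only.
    """
    # 'surrogatepass' lets the decoder accept the encoded surrogates that a
    # strict decoder rejects, yielding surrogate code points in the string.
    text = data.decode("utf-8", "surrogatepass")
    # Round-tripping through UTF-16 joins each high/low surrogate pair into
    # the supplementary-plane character it stands for. An unpaired surrogate
    # makes the strict UTF-16 decode raise UnicodeDecodeError.
    joined = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return joined.encode("utf-8")


# The two byte sequences from the report, written with the same octal escapes.
bad_p = b"\355\240\265\355\261\203"
bad_u = b"\355\240\265\355\261\210"

# A strict (RFC 3629 conforming) decode rejects the encoded surrogates.
try:
    bad_p.decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True

fixed_p = repair_surrogates(bad_p)  # bytes F0 9D 91 83, i.e. U+1D443
fixed_u = repair_surrogates(bad_u)  # bytes F0 9D 91 88, i.e. U+1D448
```

The round trip through UTF-16 is just a compact way to do the step RFC 3629 asks for: recover the character number from the surrogate pair first, then encode that number in UTF-8.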
Attach a file to reproduce the problem
Created attachment 57539: PDF where the formulas on page 6 cause the problem.
Fixed, thanks for the report
Fix confirmed working.