Summary: | poppler feeds invalid UTF-8 to cairo | ||
---|---|---|---|
Product: | poppler | Reporter: | Christian Persch (GNOME) <chpe> |
Component: | cairo backend | Assignee: | poppler-bugs <poppler-bugs> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | major | ||
Priority: | medium | ||
Version: | unspecified | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
Attachments: |
increase tolerance for overlapping glyphs
move text to unicode conversion to a separate function |
Description
Christian Persch (GNOME)
2012-08-22 12:10:51 UTC
The problem is that surrogate pairs are not decoded before converting to UTF-8.

The patch https://bugs.freedesktop.org/attachment.cgi?id=58178 (bug 46603, "convert utf-16 to ucs-4 when reading ToUnicode") fixes this issue by moving all instances of the surrogate pair handling to where the UTF-16 characters are read, ensuring that the internal Unicode type contains only UTF-32 values.

I'll have a look to see if integrating the patch Adrian mentioned breaks something, "soon" by some definition of "soon" :D

Regression in pdftotext output in https://bugs.freedesktop.org/attachment.cgi?id=58045

    -In our disk model ̃
    -𝑃 of the projective plane, we have obtained four bundles of half
    +̃ of the projective plane, we have obtained four bundles of half
    +In our disk model 𝑃

It is true that the original is not perfect, but at least it is in the correct order; your new one swaps the order of the text (i.e. "In our disk model" has to come before "of the projective plane", not after).

Created attachment 66222
increase tolerance for overlapping glyphs

This patch fixes the regression.

Created attachment 66223
move text to unicode conversion to a separate function

As a result of the first patch, ActualText also needs to convert UTF-16 to UCS-4. This patch (from bug 46603 with a small fix) factors out the duplicated code in ActualText and pdfinfo for converting text to Unicode.

I've committed these patches, but only to master (i.e. 0.22.0), since they change pdftotext output for a lot of files (around 400 in my test suite). It is true that most of the changes are improvements, but with such a huge change I don't feel like putting it in 0.20.x.

P.S.: My eyes bleed after looking at the diffs of all those pdftotext outputs.

(In reply to comment #6)
> I've committed these patches, but only to master (i.e. 0.22.0), since they change
> pdftotext output for a lot of files (around 400 in my test suite). It is true
> that most of the changes are improvements, but with such a huge change I don't
> feel like putting it in 0.20.x.
>
> P.S.: My eyes bleed after looking at the diffs of all those pdftotext outputs.

Thanks both!
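
For readers unfamiliar with the underlying issue, the surrogate-pair decoding described at the top of this report can be sketched as follows. This is a minimal illustration in Python, not poppler's actual code: the function name `utf16_to_ucs4` is hypothetical, and replacing unpaired surrogates with U+FFFD is an assumption made for the sketch rather than something the patches are known to do. The sample input assumes the "𝑃" in the pdftotext diff is U+1D443.

```python
def utf16_to_ucs4(units):
    """Decode a sequence of UTF-16 code units into Unicode code points.

    Hypothetical helper for illustration only. Unpaired surrogates are
    replaced with U+FFFD (an assumption of this sketch).
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Valid pair: combine the two 10-bit payloads and add 0x10000.
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Lone surrogate: not a valid scalar value on its own.
            out.append(0xFFFD)
            i += 1
        else:
            # Ordinary BMP character: the code unit is the code point.
            out.append(u)
            i += 1
    return out


# Assumed sample: U+1D443 stored as the UTF-16 surrogate pair D835 DC43.
units = [0xD835, 0xDC43]
code_points = utf16_to_ucs4(units)
utf8 = "".join(chr(cp) for cp in code_points).encode("utf-8")

# Encoding each surrogate separately, without decoding the pair first,
# is what produces invalid UTF-8; Python's strict encoder rejects it.
try:
    "".join(chr(u) for u in units).encode("utf-8")
    undecoded_is_valid = True
except UnicodeEncodeError:
    undecoded_is_valid = False
```

Decoding at the point where the UTF-16 data is read, as the patch from bug 46603 does, means later stages only ever see whole code points and cannot emit a half-pair.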