Summary: | evince: nulls in PDF text cause "Invalid UTF-8 encoded text in name" warning | | |
---|---|---|---|
Product: | poppler | Reporter: | Jason Crain <jason> |
Component: | glib frontend | Assignee: | poppler-bugs <poppler-bugs> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | normal | | |
Priority: | medium | | |
Version: | unspecified | | |
Hardware: | Other | | |
OS: | All | | |
See Also: | http://bugs.debian.org/830565 | | |
Whiteboard: | | | |
i915 platform: | | i915 features: | |
Attachments: | riedinfo_kw_27_2016.pdf, Remove-null-characters-from-PDF-text.patch | | |
Created attachment 125436: Remove-null-characters-from-PDF-text.patch

The easiest way to fix this is to not allow null characters in the text. The attached patch removes null characters in TextPage::addChar.

Should this maybe be done at the glib frontend level? I've no idea/strong opinion, just random question :D

(In reply to Albert Astals Cid from comment #2)
> Should this maybe be done at the glib frontend level?
>
> I've no idea/strong opinion, just random question :D

My reasons for putting it in TextOutputDev.cc are that it was easy, it produced a small improvement for pdftotext in a few (broken) PDFs I have, and that I don't think anyone much cares about keeping null chars. It could go in glib/poppler-page.c, but that is a little more work because poppler_page_get_text, poppler_page_get_text_layout, and poppler_page_get_text_attributes all need to be kept in sync so they return the same lengths.

Let's try with this simple patch; if anybody complains we can just move it to the glib frontend.

I've just pushed it, thanks!
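For readers who want to see the idea in isolation, here is a minimal sketch in C of dropping null characters while keeping per-character data in step. It is not the pushed patch (which lives in the C++ TextOutputDev.cc); the `rect` type and the `drop_null_chars` function are hypothetical names made up for illustration, and the parallel-array layout is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bounding box for one character (not a poppler type). */
typedef struct {
    double x1, y1, x2, y2;
} rect;

/*
 * Remove U+0000 code points in place. chars[] and boxes[] are parallel
 * arrays of length n; both are compacted together so that entry i of one
 * still describes entry i of the other. Returns the new element count.
 */
static size_t drop_null_chars(uint32_t *chars, rect *boxes, size_t n)
{
    assert(n == 0 || (chars != NULL && boxes != NULL));

    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (chars[i] == 0)
            continue;               /* skip the null character and its box */
        chars[out] = chars[i];
        boxes[out] = boxes[i];
        out++;
    }
    return out;
}
```

Filtering at the point where characters are collected means every later consumer sees the same count, which is the property the three glib functions named above would otherwise have to maintain separately.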
Created attachment 125435: riedinfo_kw_27_2016.pdf

evince has problems with searching in this PDF. Searching for the letter 'a' fills the terminal with "Invalid UTF-8 encoded text in name" warning messages or, with an older version of glib, crashes.

The cause is that the PDF has embedded null characters and the glib frontend does not deal well with that. poppler_page_get_text returns a shortened string, so its length does not match the length from poppler_page_get_text_layout, and when evince tries to display search results it reads outside the buffer and tries to parse random junk as UTF-8.
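The length mismatch described above can be reproduced without poppler at all. The following C sketch assumes single-byte (ASCII) text and one layout rectangle per character; `page_text`, `visible_length`, and `lengths_agree` are hypothetical names, not part of the glib frontend.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a page's extracted text (not a poppler type). */
typedef struct {
    const char *bytes;    /* raw text, may contain embedded '\0' bytes */
    size_t      n_chars;  /* true character count, one rectangle each  */
} page_text;

/* Length seen by any consumer that treats bytes as a C string:
 * everything after the first '\0' is silently lost. */
static size_t visible_length(const page_text *t)
{
    assert(t != NULL && t->bytes != NULL);

    const char *nul = memchr(t->bytes, '\0', t->n_chars);
    return nul ? (size_t)(nul - t->bytes) : t->n_chars;
}

/* Nonzero when the string length matches the number of rectangles. */
static int lengths_agree(const page_text *t)
{
    return visible_length(t) == t->n_chars;
}
```

With the five characters `a`, `b`, NUL, `c`, `d`, the string appears to be two characters long while five rectangles exist, so a caller that uses rectangle offsets to index into the string ends up past its end.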