Bug 97144 - evince: nulls in PDF text cause Invalid UTF-8 encoded text in name warning
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: glib frontend
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
Reported: 2016-07-30 08:37 UTC by Jason Crain
Modified: 2016-09-03 07:20 UTC (History)


Attachments
riedinfo_kw_27_2016.pdf (8.27 MB, application/pdf)
2016-07-30 08:37 UTC, Jason Crain
Remove-null-characters-from-PDF-text.patch (907 bytes, patch)
2016-07-30 08:48 UTC, Jason Crain

Description Jason Crain 2016-07-30 08:37:50 UTC
Created attachment 125435
riedinfo_kw_27_2016.pdf

evince has problems searching in this PDF.  Searching for the letter 'a' fills the terminal with "Invalid UTF-8 encoded text in name" warning messages or, with an older version of glib, crashes.

The cause is that the PDF has embedded null characters, which the glib frontend does not handle well.  poppler_page_get_text returns a shortened string whose length does not match the length from poppler_page_get_text_layout, so when evince tries to display search results it reads outside the buffer and tries to parse random junk as UTF-8.
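
The length mismatch described above can be illustrated with a small sketch. This is not poppler or evince code: page_text_with_null, c_string_length, and layout_entry_count are hypothetical names, and the five-character buffer is made-up data standing in for page text that contains an embedded null.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for page text: five characters, the third is a NUL.
inline std::string page_text_with_null() {
    return std::string("ab\0cd", 5);
}

// Length seen by a caller that treats the buffer as a C string:
// counting stops at the first NUL byte.
inline std::size_t c_string_length(const std::string &text) {
    return std::strlen(text.c_str());
}

// Number of per-character layout entries in this simplified model:
// one entry per character, including the NUL.
inline std::size_t layout_entry_count(const std::string &text) {
    return text.size();
}
```

In this model the C-string view sees 2 characters while the layout side reports 5 entries, so a caller that indexes the text by layout position would read past the end of the string it was given.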
Comment 1 Jason Crain 2016-07-30 08:48:23 UTC
Created attachment 125436
Remove-null-characters-from-PDF-text.patch

Easiest way to fix this is to not allow null characters in the text.  Attached patch removes null characters in TextPage::addChar.
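
A rough sketch of the idea, assuming the text arrives as a sequence of Unicode code points. This is not the attached patch and not the real TextPage::addChar signature; drop_null_code_points is a hypothetical helper that only shows the filtering step.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, not poppler's actual code: copy the code points of one
// character, skipping any U+0000 so it never reaches the stored page text.
inline std::vector<std::uint32_t>
drop_null_code_points(const std::vector<std::uint32_t> &in) {
    std::vector<std::uint32_t> out;
    out.reserve(in.size());
    for (std::uint32_t u : in) {
        if (u != 0) {
            out.push_back(u);
        }
    }
    return out;
}
```

Filtering at this point means every consumer of the page text sees the same null-free sequence, so text and layout lengths cannot drift apart later.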
Comment 2 Albert Astals Cid 2016-07-30 15:40:11 UTC
Should this maybe be done at the glib frontend level?

I've no idea/strong opinion, just random question :D
Comment 3 Jason Crain 2016-07-30 19:41:32 UTC
(In reply to Albert Astals Cid from comment #2)
> Should this maybe be done at the glib frontend level?
> 
> I've no idea/strong opinion, just random question :D

My reasons for putting it in TextOutputDev.cc are that it was easy, it produced a small improvement for pdftotext in a few (broken) PDFs I have, and that I don't think anyone much cares about keeping null chars.

It could go in glib/poppler-page.c but it's a little more work because poppler_page_get_text, poppler_page_get_text_layout, and poppler_page_get_text_attributes all need to be kept in sync so they return the same lengths.
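
To illustrate why the frontend approach is more work, here is a simplified sketch of filtering in sync. Box, PageText, and strip_nulls_in_sync are hypothetical names, not the glib frontend's types, and the model assumes one byte and one box per character, which real UTF-8 text does not guarantee.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a per-character layout rectangle.
struct Box {
    double x1, y1, x2, y2;
};

// Simplified model: text and boxes are parallel, one box per character.
struct PageText {
    std::string text;
    std::vector<Box> boxes;
};

// Drop NUL characters and their boxes together so both sequences
// keep the same length and the same indexing.
inline PageText strip_nulls_in_sync(const PageText &in) {
    PageText out;
    for (std::size_t i = 0; i < in.text.size() && i < in.boxes.size(); ++i) {
        if (in.text[i] == '\0') {
            continue;  // skip the character and its layout entry
        }
        out.text.push_back(in.text[i]);
        out.boxes.push_back(in.boxes[i]);
    }
    return out;
}
```

Every parallel sequence (text, layout, attributes) would need the same treatment, which is the extra bookkeeping that filtering in TextOutputDev.cc avoids.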
Comment 4 Carlos Garcia Campos 2016-09-03 07:20:26 UTC
Let's try this simple patch; if anybody complains, we can just move it to the glib frontend. I've just pushed it, thanks!

