Bug 97144

Summary:	evince: nulls in PDF text cause Invalid UTF-8 encoded text in name warning
Product:	poppler	Reporter:	Jason Crain <jason>
Component:	glib frontend	Assignee:	poppler-bugs <poppler-bugs>
Status:	RESOLVED FIXED	QA Contact:
Severity:	normal
Priority:	medium
Version:	unspecified
Hardware:	Other
OS:	All
See Also:	http://bugs.debian.org/830565
Whiteboard:
Attachments:	riedinfo_kw_27_2016.pdf Remove-null-characters-from-PDF-text.patch

Description Jason Crain 2016-07-30 08:37:50 UTC

Created attachment 125435
riedinfo_kw_27_2016.pdf

evince has problems with searching in this PDF.  Searching for the letter 'a' fills the terminal with "Invalid UTF-8 encoded text in name" warning messages or, with an older version of glib, crashes.

The cause is that the PDF has embedded null characters and the glib frontend does not deal well with that.  poppler_page_get_text returns a shortened string, so its length does not match the length from poppler_page_get_text_layout.  When evince tries to display search results, it reads outside the buffer and tries to parse random junk as UTF-8.
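
The length mismatch can be illustrated with a minimal standalone sketch.  This is not poppler or evince code: pageTextWithNull and indexOutsideCString are hypothetical names, and the five-byte string merely stands in for page text with one embedded null.  It only shows how a consumer that treats the text as a C string sees fewer characters than a per-character array would describe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for page text with an embedded null: a b \0 c d
inline std::string pageTextWithNull() {
    return std::string("ab\0cd", 5);
}

// Length seen by a consumer that treats the text as a C string:
// counting stops at the first null byte.
inline std::size_t visibleLength(const std::string &text) {
    return std::strlen(text.c_str());
}

// True when an index that is valid for the full text (for example one
// taken from a per-character layout array) lies past the end of the
// C string the caller actually received.
inline bool indexOutsideCString(const std::string &text, std::size_t layoutIndex) {
    assert(layoutIndex < text.size()); // index is valid for the full text
    return layoutIndex >= visibleLength(text);
}
```

With this input the full text has 5 characters but the C-string view has only 2, so index 3 is valid for the layout yet outside the string.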

Comment 1 Jason Crain 2016-07-30 08:48:23 UTC

Created attachment 125436
Remove-null-characters-from-PDF-text.patch

Easiest way to fix this is to not allow null characters in the text.  Attached patch removes null characters in TextPage::addChar.
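
The idea of dropping nulls at the point where characters are collected could look roughly like the sketch below.  It is not the attached patch and not poppler's TextPage: SketchTextPage, its members, and the simplified addChar signature are assumptions made for illustration.  The point is only that skipping U+0000 before anything is stored keeps the text and the per-character data the same length.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a text collector; not poppler's TextPage.
struct SketchTextPage {
    std::u32string chars;            // collected code points
    std::vector<double> xPositions;  // one layout entry per kept character

    // Skip U+0000 before storing anything, so the text and the
    // per-character layout data cannot get out of step.
    void addChar(char32_t c, double x) {
        if (c == U'\0') {
            return;
        }
        chars.push_back(c);
        xPositions.push_back(x);
        assert(chars.size() == xPositions.size());
    }
};
```

Because the filter sits in one place, every consumer of the collected text sees the same character count without needing its own null handling.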

Comment 2 Albert Astals Cid 2016-07-30 15:40:11 UTC

Should this maybe be done at the glib frontend level?

I've no idea/strong opinion, just random question :D

Comment 3 Jason Crain 2016-07-30 19:41:32 UTC

(In reply to Albert Astals Cid from comment #2)
> Should this maybe be done at the glib frontend level?
> 
> I've no idea/strong opinion, just random question :D

My reasons for putting it in TextOutputDev.cc are that it was easy, it produced a small improvement for pdftotext in a few (broken) PDFs I have, and that I don't think anyone much cares about keeping null chars.

It could go in glib/poppler-page.c but it's a little more work because poppler_page_get_text, poppler_page_get_text_layout, and poppler_page_get_text_attributes all need to be kept in sync so they return the same lengths.

Comment 4 Carlos Garcia Campos 2016-09-03 07:20:26 UTC

Let's try this simple patch; if anybody complains we can just move it to the glib frontend. I've just pushed it, thanks!
