Bug 91644

Summary:	poppler-cpp text extraction drops characters
Product:	poppler	Reporter:	Hans-Peter Deifel <hpdeifel>
Component:	cpp frontend	Assignee:	poppler-bugs <poppler-bugs>
Status:	RESOLVED FIXED	QA Contact:
Severity:	normal
Priority:	medium
Version:	unspecified
Hardware:	Other
OS:	All
Whiteboard:
Attachments:	PDF containing only the string "foobar" (117703); Minimal test case (117704); Proposed patch (117705); More correct minimal test case (117706); Proposed patch (117707)

Description Hans-Peter Deifel 2015-08-15 15:04:25 UTC

Created attachment 117703
PDF containing only the string "foobar"

The text extraction in poppler-cpp silently drops characters from the PDF.

To reproduce, compile and link the attached test case against poppler-cpp and run it on the attached PDF:

  ./txtextr foobar.pdf

The output should be

  foobar

as that is the only text in the PDF, but the last character is missing and the actual output is

  fooba

Comment 1 Hans-Peter Deifel 2015-08-15 15:05:10 UTC

Created attachment 117704
Minimal test case

Comment 2 Hans-Peter Deifel 2015-08-15 15:14:30 UTC

Created attachment 117705
Proposed patch

Here is a patch. Unfortunately, the conversion code is so hairy that I'm not sure the patch is correct. It passes all of pdfgrep's tests, though.

Anyway, please review it.

Comment 3 Hans-Peter Deifel 2015-08-15 15:59:58 UTC

Created attachment 117706
More correct minimal test case

The minimal test case now handles the trailing null byte correctly, but the bug still persists.
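The attachment itself is not reproduced here. As a rough illustration of the null-byte pitfall mentioned above, the sketch below assumes the test case prints a byte buffer of the kind poppler-cpp's ustring::to_utf8() returns (a std::vector<char>, which is not guaranteed to end in '\0'). The helper name bytes_to_string is hypothetical and not part of poppler.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of poppler): build a std::string from a
// byte buffer by using the buffer's explicit length, so that no
// terminating '\0' is required and no trailing byte is lost or overread.
std::string bytes_to_string(const std::vector<char> &bytes)
{
    // The iterator-range constructor copies exactly bytes.size() bytes
    // and is well defined for an empty buffer.
    return std::string(bytes.begin(), bytes.end());
}
```

Printing the returned std::string (for example with std::cout) avoids treating the buffer's data() pointer as a C string, which would read past the end of the buffer when no terminator is present.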

Comment 4 Hans-Peter Deifel 2015-08-15 16:09:24 UTC

Created attachment 117707
Proposed patch

Removed debug output from the patch and fixed indentation.

Comment 5 Albert Astals Cid 2015-08-27 20:38:47 UTC

Pushed!
