Bug 91644

Summary: poppler-cpp text extraction drops characters
Product: poppler
Reporter: Hans-Peter Deifel <hpdeifel>
Component: cpp frontend
Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED FIXED
QA Contact:
Severity: normal    
Priority: medium    
Version: unspecified   
Hardware: Other   
OS: All   
Whiteboard:
Attachments: PDF containing only the string "foobar"
Minimal test case
Proposed patch
More correct minimal test case
Proposed patch

Description Hans-Peter Deifel 2015-08-15 15:04:25 UTC
Created attachment 117703
PDF containing only the string "foobar"

The text extraction in poppler-cpp silently drops characters from the PDF.

To reproduce, compile and link the attached test case, then run it on the attached PDF:

  ./txtextr foobar.pdf

The output should be

  foobar

as that is the only text in the PDF, but it is

  fooba
Comment 1 Hans-Peter Deifel 2015-08-15 15:05:10 UTC
Created attachment 117704
Minimal test case
Comment 2 Hans-Peter Deifel 2015-08-15 15:14:30 UTC
Created attachment 117705
Proposed patch

Here is a patch. Unfortunately the conversion code is so hairy that I'm not sure it is correct. It passes all of pdfgrep's tests, though.

Anyway, please review it.
Comment 3 Hans-Peter Deifel 2015-08-15 15:59:58 UTC
Created attachment 117706
More correct minimal test case

The minimal test case now handles the trailing null byte correctly, but the bug persists.
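
For readers without the attachment, here is a minimal sketch of one way a test case can avoid depending on a trailing null byte. It is not the attached code. It assumes the extracted text arrives as a std::vector<char> (which is what poppler-cpp's byte_array is), and bytes_to_string is a hypothetical helper name, not a poppler function.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of poppler: build a std::string from a byte
// buffer using its explicit size, so no terminating '\0' is required.
std::string bytes_to_string(const std::vector<char>& bytes) {
    std::string out(bytes.begin(), bytes.end());
    // Drop a single trailing null byte if the producer appended one.
    if (!out.empty() && out.back() == '\0') {
        out.pop_back();
    }
    return out;
}
```

Printing the returned std::string instead of passing the raw buffer pointer to a C-string API means the output length comes from the buffer size, so a missing or extra terminator in the test harness cannot be confused with the character loss reported here.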
Comment 4 Hans-Peter Deifel 2015-08-15 16:09:24 UTC
Created attachment 117707
Proposed patch

Remove debug output from patch and fix indentation
Comment 5 Albert Astals Cid 2015-08-27 20:38:47 UTC
Pushed!
