Bug 91644 - poppler-cpp text extraction drops characters
Summary: poppler-cpp text extraction drops characters
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: cpp frontend
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
Reported: 2015-08-15 15:04 UTC by Hans-Peter Deifel
Modified: 2015-08-27 20:38 UTC


Attachments
PDF containing only the string "foobar" (10.48 KB, text/plain)
2015-08-15 15:04 UTC, Hans-Peter Deifel
Minimal test case (366 bytes, text/plain)
2015-08-15 15:05 UTC, Hans-Peter Deifel
Proposed patch (2.64 KB, patch)
2015-08-15 15:14 UTC, Hans-Peter Deifel
More correct minimal test case (461 bytes, text/plain)
2015-08-15 15:59 UTC, Hans-Peter Deifel
Proposed patch (2.52 KB, patch)
2015-08-15 16:09 UTC, Hans-Peter Deifel

Description Hans-Peter Deifel 2015-08-15 15:04:25 UTC
Created attachment 117703
PDF containing only the string "foobar"

The text extraction in poppler-cpp silently drops characters from the PDF.

To reproduce, compile and link the attached testcase and run it on the attached PDF:

  ./txtextr foobar.pdf

The output should be

  foobar

as that is the only text in the PDF, but it is

  fooba
Comment 1 Hans-Peter Deifel 2015-08-15 15:05:10 UTC
Created attachment 117704
Minimal test case
Comment 2 Hans-Peter Deifel 2015-08-15 15:14:30 UTC
Created attachment 117705
Proposed patch

Here is a patch. Unfortunately, the conversion code is so hairy that I'm not sure the patch is correct. It passes all of pdfgrep's tests, though.

Anyway, please review it.
Comment 3 Hans-Peter Deifel 2015-08-15 15:59:58 UTC
Created attachment 117706
More correct minimal test case

The minimal test case now handles the trailing null byte correctly, but the bug still persists.
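
The attachment itself is not reproduced in this report, so the following is only a sketch of one way a test program can avoid depending on a terminating null byte when printing extracted text. It assumes the text arrives as a length-delimited byte buffer; poppler-cpp's byte_array is a std::vector<char>, and a local alias stands in for it here so the snippet builds without poppler. The helper names bytes_to_string and bytes_to_string_trimmed are hypothetical, and whether this matches the change made in attachment 117706 is an assumption.

```cpp
#include <string>
#include <vector>

// Stand-in for poppler::byte_array (a std::vector<char> in poppler-cpp),
// declared locally so this sketch does not need the poppler headers.
using byte_array = std::vector<char>;

// Hypothetical helper: copy the whole buffer using its size instead of
// scanning for '\0', so an unterminated buffer is never read past its end.
std::string bytes_to_string(const byte_array &bytes)
{
    return std::string(bytes.begin(), bytes.end());
}

// Hypothetical helper: same as above, but drop a single trailing '\0'
// if the producer happened to append one.
std::string bytes_to_string_trimmed(const byte_array &bytes)
{
    std::string s(bytes.begin(), bytes.end());
    if (!s.empty() && s.back() == '\0') {
        s.pop_back();
    }
    return s;
}
```

With helpers like these, a test program could print the result with std::cout instead of passing a raw char pointer to a function that expects a C string. This only rules out a bug in the test program; as noted above, the dropped character remains.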
Comment 4 Hans-Peter Deifel 2015-08-15 16:09:24 UTC
Created attachment 117707
Proposed patch

Removed the debug output from the patch and fixed the indentation.
Comment 5 Albert Astals Cid 2015-08-27 20:38:47 UTC
Pushed!