Summary: | poppler-cpp text extraction drops characters | ||
---|---|---|---|
Product: | poppler | Reporter: | Hans-Peter Deifel <hpdeifel> |
Component: | cpp frontend | Assignee: | poppler-bugs <poppler-bugs> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | normal | ||
Priority: | medium | ||
Version: | unspecified | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
Attachments: | PDF containing only the string "foobar"; Minimal test case; Proposed patch; More correct minimal test case; Proposed patch |
Created attachment 117704 [details]
Minimal test case
Created attachment 117705 [details] [review]
Proposed patch
Here is a patch. Unfortunately the conversion code is so hairy that I'm not sure it is correct. It passes all of pdfgrep's tests, though. Anyway, please review it.
Created attachment 117706 [details]
More correct minimal test case
The minimal test case now handles the trailing null byte correctly, but the bug still persists.
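The attached test case is not reproduced inline. As an illustration of what handling a missing trailing null byte can look like, here is a minimal sketch, assuming the extracted text arrives as a `std::vector<char>` (which is what `poppler::byte_array` is) that is not null-terminated. The helper name `bytes_to_string` is hypothetical and is not taken from the attachment.

```cpp
#include <string>
#include <vector>

// Stand-in for poppler::byte_array, which is a std::vector<char>.
using byte_array = std::vector<char>;

// Hypothetical helper: build a std::string from the buffer using its
// explicit length, so no terminating '\0' is needed and nothing past
// the end of the buffer is read.
std::string bytes_to_string(const byte_array &bytes)
{
    return std::string(bytes.begin(), bytes.end());
}
```

In a test program this could be used as `std::cout << bytes_to_string(text.to_utf8())`, where `text` is assumed to be the `poppler::ustring` returned by `poppler::page::text()`. Printing the buffer through a raw `char *` instead would rely on a terminator that the buffer does not guarantee.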
Created attachment 117707 [details] [review]
Proposed patch
Remove debug output from patch and fix indentation
Pushed!
Created attachment 117703 [details]
PDF containing only the string "foobar"
The text extraction in poppler-cpp silently drops characters from the PDF. To reproduce, compile and link the attached test case and run it on the attached PDF:
    ./txtextr foobar.pdf
The output should be
    foobar
as that is the only text in the PDF, but it is
    fooba