Summary: | pdftotext should filter control characters like "form feed" | ||
---|---|---|---|
Product: | poppler | Reporter: | Mike Gerber <mike> |
Component: | utils | Assignee: | poppler-bugs <poppler-bugs> |
Status: | RESOLVED MOVED | QA Contact: | |
Severity: | normal | ||
Priority: | medium | ||
Version: | unspecified | ||
Hardware: | Other | ||
OS: | All | ||
Attachments: | Example PDF; evince Screenshot; Extracted text from the example PDF |
Created attachment 129110 [details]
evince Screenshot
The problem is also visible in Evince (marked with a red rectangle). The glyph shown there is labeled 000C, i.e. U+000C, the form feed character.
Created attachment 129111 [details]
Extracted text from the example PDF
Output of pdftotext, for the example PDF
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/191.
Created attachment 129108 [details]
Example PDF
Currently, pdftotext/TextOutputDev extracts control characters such as form feeds from the PDF content. These should be filtered out, because users expect form feeds to be inserted by pdftotext alone, as page separators. In the attached PDF, a form feed character (0xC) is extracted between the word "sich" and the following formula. The form feed is, AFAICT, actually a character from the CMSY10 font.
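To illustrate the requested behavior, here is a minimal Python sketch of such a filter. It is not poppler code: the function names `strip_control_chars` and `join_pages` are hypothetical, and it assumes the text of each page is available as a separate string before any page separators are added. Control characters coming from the PDF content are dropped, and form feeds are then inserted only between pages.

```python
import unicodedata

# Whitespace controls that are legitimate in extracted text (assumption:
# tab, newline and carriage return should survive the filter).
KEEP = {"\t", "\n", "\r"}


def strip_control_chars(text: str) -> str:
    """Drop Unicode control characters (category Cc) except those in KEEP."""
    return "".join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch) != "Cc"
    )


def join_pages(pages):
    """Filter each page's text, then use form feeds only as page separators."""
    return "\f".join(strip_control_chars(page) for page in pages)


if __name__ == "__main__":
    # Hypothetical page text with a stray form feed after "sich".
    pages = ["sich \x0c x = 1", "second page"]
    print(repr(join_pages(pages)))
```

With this approach, the only form feeds left in the output are the ones marking page boundaries, which is what a consumer splitting the output on `\f` would expect.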