Bug 99506

Summary: pdftotext should filter control characters like "form feed"
Product: poppler Reporter: Mike Gerber <mike>
Component: utils Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED MOVED QA Contact:
Severity: normal    
Priority: medium    
Version: unspecified   
Hardware: Other   
OS: All   
Attachments: Example PDF
evince Screenshot
Extracted text from the example PDF

Description Mike Gerber 2017-01-23 15:17:27 UTC
Created attachment 129108 [details]
Example PDF

Currently, pdftotext/TextOutputDev extracts control characters such as form feeds from the PDF content. These should be filtered out, because users expect the only form feeds in the output to be the ones pdftotext itself inserts.

In the attached PDF, there is a form feed character (0xC) extracted between the word "sich" and the following formula. The form feed is - AFAICT - actually a character from the CMSY10 font.
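
The requested behavior can be illustrated with a small post-processing sketch. This is not poppler code: the function names `strip_control_chars` and `join_pages` are hypothetical, and the sketch assumes the text of each page is available as a separate string before any page separator is added. It drops every Unicode control character (category Cc) except newline and tab, then appends the form feed itself, so the only form feeds in the result are the ones added deliberately.

```python
import unicodedata


def strip_control_chars(text: str, keep: str = "\n\t") -> str:
    """Remove control characters (Unicode category Cc) except those in `keep`."""
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )


def join_pages(pages: list[str], page_sep: str = "\f") -> str:
    """Clean each page's text, then append the page separator ourselves.

    Hypothetical helper: assumes per-page text is available separately.
    """
    return "".join(strip_control_chars(page) + page_sep for page in pages)
```

Filtering before the separator is appended is the point of the design: once a stray form feed from a font is mixed with the page-break form feeds, the two can no longer be told apart reliably.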
Comment 1 Mike Gerber 2017-01-23 15:21:09 UTC
Created attachment 129110 [details]
evince Screenshot

The problem is also visible in Evince (marked with a red rectangle in the screenshot). U+000C = form feed
Comment 2 Mike Gerber 2017-01-23 15:24:14 UTC
Created attachment 129111 [details]
Extracted text from the example PDF

Output of pdftotext, for the example PDF
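
To locate stray form feeds in extracted text like this, a rough check along the following lines may help. It is only a heuristic sketch: the function name `find_inline_form_feeds` is hypothetical, and it assumes that a page-break form feed always appears at the start of the text or directly after a newline, which has not been verified against pdftotext's output in every mode.

```python
def find_inline_form_feeds(text: str) -> list[int]:
    """Return offsets of form feeds that do not directly follow a newline.

    Assumption: page-break form feeds sit at the start of the text or
    right after a newline, so any other form feed is treated as suspect.
    """
    return [
        i for i, ch in enumerate(text)
        if ch == "\f" and i > 0 and text[i - 1] != "\n"
    ]
```

For example, `find_inline_form_feeds(open("out.txt", encoding="utf-8").read())` would list the suspect offsets in a file named `out.txt` (a placeholder name).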
Comment 3 GitLab Migration User 2018-08-20 22:06:02 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/191.
