Bug 99506 - pdftotext should filter control characters like "form feed"
Status: RESOLVED MOVED
Alias: None
Product: poppler
Classification: Unclassified
Component: utils
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
Reported: 2017-01-23 15:17 UTC by Mike Gerber
Modified: 2018-08-20 22:06 UTC

Attachments
Example PDF (578.31 KB, application/pdf)
2017-01-23 15:17 UTC, Mike Gerber
evince Screenshot (163.67 KB, image/png)
2017-01-23 15:21 UTC, Mike Gerber
Extracted text from the example PDF (2.50 KB, text/plain)
2017-01-23 15:24 UTC, Mike Gerber

Description Mike Gerber 2017-01-23 15:17:27 UTC
Created attachment 129108
Example PDF

Currently, pdftotext/TextOutputDev passes control characters such as form feeds from the PDF content through to its output. These should be filtered, because users expect form feeds in the output to be inserted by pdftotext alone.

In the attached PDF, a form feed character (0x0C) is extracted between the word "sich" and the formula that follows it. AFAICT, the form feed is actually a character from the CMSY10 font.
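
To illustrate the kind of filtering requested, here is a minimal Python sketch that post-processes text already extracted for a single page. It is not how pdftotext or TextOutputDev work internally; `strip_control_chars` and its `keep` parameter are hypothetical names chosen for this example, and the sample string only imitates the "sich" case from the attached PDF.

```python
import unicodedata


def strip_control_chars(text: str, keep: str = "\n\t") -> str:
    """Drop Unicode control characters (general category Cc), such as
    form feed (U+000C), except the ones listed in `keep`.

    Hypothetical helper for illustration; not part of poppler.
    """
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )


# Made-up sample: a stray form feed inside the text of one page.
page_text = "sich \x0c x = y\n"
cleaned = strip_control_chars(page_text)
```

Applied to the output for a whole document, this would also remove the form feeds that pdftotext itself inserts, so under these assumptions it only makes sense per page (for example, on text extracted one page at a time with the `-f` and `-l` options).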
Comment 1 Mike Gerber 2017-01-23 15:21:09 UTC
Created attachment 129110
evince Screenshot

The problem is also visible in Evince (marked with a red rectangle in the screenshot). 000C = form feed.
Comment 2 Mike Gerber 2017-01-23 15:24:14 UTC
Created attachment 129111
Extracted text from the example PDF

Output of pdftotext for the example PDF.
Comment 3 GitLab Migration User 2018-08-20 22:06:02 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/191.

