Bug 102651 - pdftotext converts all non-breaking spaces U+A0 and U+202F into U+20
Status: RESOLVED MOVED
Alias: None
Product: poppler
Classification: Unclassified
Component: utils
Version: unspecified
Hardware: All All
Importance: medium normal
Assignee: poppler-bugs
 
Reported: 2017-09-11 09:31 UTC by Daniel Flipo
Modified: 2018-08-20 22:06 UTC



Attachments
PDF file with non-breaking spaces to be preserved (4.69 KB, application/pdf)
2017-09-11 09:31 UTC, Daniel Flipo

Description Daniel Flipo 2017-09-11 09:31:29 UTC
Created attachment 134154
PDF file with non-breaking spaces to be preserved

The fix for bug #97399 added the non-breaking spaces U+A0 and U+202F to the function UnicodeIsWhitespace, which holds the list of all space characters used to break lines into words.

As a result, pdftotext converts these non-breaking spaces into breakable U+20 spaces. In some cases (ties such as Mr Bean, high punctuation in French, etc.) these non-breaking spaces are added intentionally and should be preserved as such in the text or HTML output.
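
To make the effect concrete, here is a simplified model of the behaviour described above. It is not poppler's code: the WHITESPACE set, split_words and extract are hypothetical names, and the only assumption taken from this report is that U+A0 and U+202F are treated as word separators and that words are rejoined with U+20.

```python
# Simplified model of the reported behaviour; NOT poppler's actual code.
NBSP = "\u00a0"   # NO-BREAK SPACE (written U+A0 in this report)
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE

# Assumed separator set: ordinary whitespace plus the two non-breaking spaces.
WHITESPACE = {" ", "\t", "\n", NBSP, NNBSP}


def split_words(text):
    """Split text into words at every character found in WHITESPACE."""
    words, current = [], []
    for ch in text:
        if ch in WHITESPACE:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


def extract(text):
    """Rejoin the words with U+20, which discards the original separator."""
    return " ".join(split_words(text))


source = "Mr" + NBSP + "Bean said: Quoi" + NNBSP + "?"
result = extract(source)
# Every separator in result is now a plain U+20; both non-breaking spaces are gone.
```

Because the separator itself is never stored, the output cannot distinguish a tie from an ordinary space once the character has been classified as whitespace.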

A pdftotext option that removes these two characters from UnicodeIsWhitespace would solve the issue.
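
A minimal sketch of how such an option could behave, under stated assumptions: the preserve_nbsp flag, is_whitespace and extract are hypothetical names used for illustration only, and they do not correspond to any existing pdftotext option or poppler function.

```python
# Hypothetical sketch of the requested option; not an existing pdftotext flag.
BREAKING = {" ", "\t", "\n", "\r", "\f"}
NON_BREAKING = {"\u00a0", "\u202f"}  # NO-BREAK SPACE, NARROW NO-BREAK SPACE


def is_whitespace(ch, preserve_nbsp=False):
    """Return True if ch should act as a word separator.

    With preserve_nbsp=True the two non-breaking spaces are no longer
    treated as separators, so they stay inside the surrounding word.
    """
    if ch in BREAKING:
        return True
    return ch in NON_BREAKING and not preserve_nbsp


def extract(text, preserve_nbsp=False):
    """Split on separators and rejoin the words with U+20."""
    words, current = [], []
    for ch in text:
        if is_whitespace(ch, preserve_nbsp):
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return " ".join(words)


sample = "Mr\u00a0Bean a dit"
default_output = extract(sample)                        # non-breaking space replaced
preserved_output = extract(sample, preserve_nbsp=True)  # non-breaking space kept
```

Keeping the default unchanged means the behaviour introduced for bug #97399 is unaffected unless the option is requested explicitly.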

I attach a small PDF file containing these non-breaking spaces for testing.
Comment 1 GitLab Migration User 2018-08-20 22:06:16 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe to and participate in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/194.

