Bug 97399

Summary: No word splitting for pdfs produced by Chrome
Product: poppler Reporter: buktop <buktop999>
Component: utils Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium CC: buktop999
Version: unspecified   
Hardware: Other   
OS: All   
Attachments: pdf produced by Chrome
[patch] Break words on all whitespace characters

Description buktop 2016-08-18 17:34:18 UTC
Created attachment 125884
pdf produced by Chrome

When using "pdftotext -bbox" on PDFs produced by Chrome's page print, the sentences are not split into words. In pdftotext's output, 0xA0 characters appear between words instead of spaces (0x20). That might be why the sentences are not being split.
Comment 1 Jason Crain 2016-08-18 19:33:35 UTC
I was mistaken on IRC when I called this a linefeed character.  I confused 0xA0 and 0x0A.  Chrome is for some reason sometimes using 0xA0 (no-break space) between words.  poppler only breaks words on regular 0x20 space so these stay grouped together in the same word.  To work around this, we could possibly implement something like icu's u_isUWhiteSpace to check for characters to split on.
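To illustrate the symptom described above, here is a minimal Python sketch (not poppler code; poppler itself is C++). The function name `split_on_space_only` and the sample string are made up for illustration. It shows that splitting only on U+0020 leaves two words joined when a no-break space (U+00A0) sits between them.

```python
NBSP = "\u00a0"  # no-break space (0xA0), not to be confused with line feed (0x0A)


def split_on_space_only(text):
    """Split on the regular space U+0020 only, dropping empty pieces."""
    return [piece for piece in text.split(" ") if piece]


# Hypothetical sample: NBSP between the first two words, a regular space after.
sample = "Hello" + NBSP + "world again"
words = split_on_space_only(sample)

# "Hello" and "world" remain one token because U+00A0 is not U+0020.
print(len(words))
print([word.encode("unicode_escape").decode("ascii") for word in words])
```

The second print escapes non-ASCII characters so the no-break space is visible in the output instead of looking like an ordinary space.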
Comment 2 Jason Crain 2016-09-28 16:05:34 UTC
Created attachment 126831
[patch] Break words on all whitespace characters

Some PDF creators like Chrome use no-break spaces or other whitespace characters between words, causing pdftotext -bbox to not break words as expected.  Fix this by breaking words on any character with the Unicode whitespace property.
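The approach described in this comment can be sketched in Python as follows. This is an illustration of the idea only, not the attached patch: the names `is_unicode_whitespace` and `split_words` are hypothetical, and the table of code points is assumed to be the set carrying the Unicode White_Space property.

```python
# Inclusive code point ranges assumed to carry the Unicode White_Space property.
WHITE_SPACE_RANGES = (
    (0x0009, 0x000D),  # tab, line feed, vertical tab, form feed, carriage return
    (0x0020, 0x0020),  # space
    (0x0085, 0x0085),  # next line
    (0x00A0, 0x00A0),  # no-break space
    (0x1680, 0x1680),  # ogham space mark
    (0x2000, 0x200A),  # en quad through hair space
    (0x2028, 0x2029),  # line separator, paragraph separator
    (0x202F, 0x202F),  # narrow no-break space
    (0x205F, 0x205F),  # medium mathematical space
    (0x3000, 0x3000),  # ideographic space
)


def is_unicode_whitespace(ch):
    """Return True if the single character ch falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in WHITE_SPACE_RANGES)


def split_words(text):
    """Break text into words at every whitespace character, dropping empty words."""
    words = []
    current = []
    for ch in text:
        if is_unicode_whitespace(ch):
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


print(split_words("Hello\u00a0world again"))
```

An explicit table is used here instead of Python's `str.isspace()` because `isspace()` also accepts a few control characters (U+001C to U+001F) that do not have the White_Space property.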
Comment 3 Carlos Garcia Campos 2016-10-03 16:00:21 UTC
LGTM, I've just pushed it, thanks!
Comment 4 Albert Astals Cid 2016-10-03 20:13:41 UTC
This has caused 271 "regressions" in popplebot text tests.

Carlos, have you checked them all?
Comment 5 Carlos Garcia Campos 2016-10-06 15:08:16 UTC
(In reply to Albert Astals Cid from comment #4)
> This has caused 271 "regressions" in popplebot text tests.
> 
> Carlos have you checked them all?

Nope, I'll take a look, thanks for the heads up.
