Bug 97399 - No word splitting for pdfs produced by Chrome
Summary: No word splitting for pdfs produced by Chrome
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: utils
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-08-18 17:34 UTC by buktop
Modified: 2016-10-06 15:08 UTC (History)


Attachments
pdf produced by Chrome (57.82 KB, application/pdf)
2016-08-18 17:34 UTC, buktop
[patch] Break words on all whitespace characters (2.22 KB, patch)
2016-09-28 16:05 UTC, Jason Crain

Description buktop 2016-08-18 17:34:18 UTC
Created attachment 125884
pdf produced by Chrome

When using "pdftotext -bbox" on PDFs produced by Chrome's page print, sentences are not split into words. In pdftotext's output, 0xA0 characters appear between words instead of spaces (0x20). That might be the reason the sentences are not being split.
Comment 1 Jason Crain 2016-08-18 19:33:35 UTC
I was mistaken on IRC when I called this a linefeed character.  I confused 0xA0 and 0x0A.  Chrome is for some reason sometimes using 0xA0 (no-break space) between words.  poppler only breaks words on the regular space (0x20), so these stay grouped together in the same word.  To work around this, we could possibly implement something like ICU's u_isUWhiteSpace to check for characters to split on.
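
For illustration, a check of this kind could look roughly like the sketch below. It is not poppler's code: the function name isUnicodeWhiteSpace is hypothetical, and the table is simply the set of code points that carry the Unicode White_Space property, written out by hand instead of calling ICU.

```cpp
// Hypothetical helper (not taken from poppler): returns true for code points
// that carry the Unicode White_Space property.
bool isUnicodeWhiteSpace(char32_t c)
{
    return (c >= 0x0009 && c <= 0x000D) // tab, line feed, vertical tab, form feed, carriage return
        || c == 0x0020                  // space
        || c == 0x0085                  // next line
        || c == 0x00A0                  // no-break space (the character Chrome emits)
        || c == 0x1680                  // ogham space mark
        || (c >= 0x2000 && c <= 0x200A) // en quad through hair space
        || c == 0x2028                  // line separator
        || c == 0x2029                  // paragraph separator
        || c == 0x202F                  // narrow no-break space
        || c == 0x205F                  // medium mathematical space
        || c == 0x3000;                 // ideographic space
}
```

Both 0x0A and 0xA0 pass this check, while U+200B (zero width space) does not, because it lacks the White_Space property.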
Comment 2 Jason Crain 2016-09-28 16:05:34 UTC
Created attachment 126831
[patch] Break words on all whitespace characters

Some PDF creators like Chrome use no-break spaces or other whitespace characters between words, causing pdftotext -bbox to not break words as expected.  Fix this by breaking words on any character with the Unicode whitespace property.
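
The effect of the change can be sketched with a small stand-alone splitter. This is an assumption-laden illustration, not the attached patch: splitWords, spaceOnly and spaceOrNoBreakSpace are hypothetical names, and the second predicate is reduced to the two characters relevant to this report, where a real check would cover every whitespace code point.

```cpp
#include <functional>
#include <string>
#include <vector>

// Illustrative splitter (hypothetical, not poppler's text extraction code):
// cuts `text` into words wherever `isBreak` returns true and drops empty words.
std::vector<std::u32string> splitWords(const std::u32string &text,
                                       const std::function<bool(char32_t)> &isBreak)
{
    std::vector<std::u32string> words;
    std::u32string current;
    for (char32_t c : text) {
        if (isBreak(c)) {
            // A separator ends the current word, if there is one.
            if (!current.empty()) {
                words.push_back(current);
                current.clear();
            }
        } else {
            current.push_back(c);
        }
    }
    if (!current.empty()) {
        words.push_back(current);
    }
    return words;
}

// Behaviour described in the report: only U+0020 ends a word.
bool spaceOnly(char32_t c) { return c == 0x0020; }

// Broader check, reduced here to the two characters relevant to this bug.
bool spaceOrNoBreakSpace(char32_t c) { return c == 0x0020 || c == 0x00A0; }
```

For the input U"No\u00A0word splitting", splitWords with spaceOnly yields two words (the first still contains the no-break space), while spaceOrNoBreakSpace yields three.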
Comment 3 Carlos Garcia Campos 2016-10-03 16:00:21 UTC
LGTM, I've just pushed it, thanks!
Comment 4 Albert Astals Cid 2016-10-03 20:13:41 UTC
This has caused 271 "regressions" in popplebot text tests.

Carlos, have you checked them all?
Comment 5 Carlos Garcia Campos 2016-10-06 15:08:16 UTC
(In reply to Albert Astals Cid from comment #4)
> This has caused 271 "regressions" in popplebot text tests.
> 
> Carlos, have you checked them all?

Nope, I'll take a look, thanks for the heads up.

