Created attachment 125884 [details] PDF produced by Chrome When using "pdftotext -bbox" on PDFs produced by Chrome's page print, sentences are not split into words. In pdftotext's output, 0xA0 characters appear between words instead of spaces (0x20). That might be the reason the sentences are not being split.
I was mistaken on IRC when I called this a linefeed character. I confused 0xA0 and 0x0A. For some reason, Chrome sometimes uses 0xA0 (no-break space) between words. Poppler only breaks words on the regular 0x20 space, so these stay grouped together in the same word. To fix this, we could implement something like ICU's u_isUWhiteSpace to check for characters to split on.
Created attachment 126831 [details] [review] [patch] Break words on all whitespace characters Some PDF creators like Chrome use no-break spaces or other whitespace characters between words, causing pdftotext -bbox to not break words as expected. Fix this by breaking words on any character with the Unicode whitespace property.
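A minimal sketch of the idea behind the patch, written in Python rather than poppler's C++. It is not the code from the attachment: the names `WHITE_SPACE`, `is_unicode_whitespace` and `split_words` are made up for illustration. The table lists the code points that carry the Unicode White_Space property, which is the property ICU's u_isUWhiteSpace tests.

```python
# Illustrative only: not poppler's implementation.
# Code points with the Unicode White_Space property.
WHITE_SPACE = frozenset(
    [*range(0x0009, 0x000E), 0x0020, 0x0085, 0x00A0, 0x1680,
     *range(0x2000, 0x200B), 0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)


def is_unicode_whitespace(ch):
    """Return True if ch has the Unicode White_Space property."""
    return ord(ch) in WHITE_SPACE


def split_words(text, break_on=is_unicode_whitespace):
    """Split text into words wherever break_on(ch) is true."""
    words, current = [], []
    for ch in text:
        if break_on(ch):
            # A break character ends the current word, if there is one.
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


# A no-break space (0xA0) between the first two words, as in Chrome's PDFs.
sample = "Hello\u00a0Chrome world"

# Old behaviour: break only on 0x20, so the first two words stay glued.
print(split_words(sample, break_on=lambda ch: ch == " "))
# ['Hello\xa0Chrome', 'world']

# New behaviour: break on any White_Space character.
print(split_words(sample))
# ['Hello', 'Chrome', 'world']
```

Breaking on the whole property rather than special-casing 0xA0 also covers the "other whitespace characters" mentioned in the patch description, such as U+2009 (thin space) or U+3000 (ideographic space).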
LGTM, I've just pushed it, thanks!
This has caused 271 "regressions" in popplebot text tests. Carlos, have you checked them all?
(In reply to Albert Astals Cid from comment #4)
> This has caused 271 "regressions" in popplebot text tests.
>
> Carlos have you checked them all?

Nope, I'll take a look, thanks for the heads up.