97399 – No word splitting for pdfs produced by Chrome

Status:	RESOLVED FIXED

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	utils
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	poppler-bugs

Reported:	2016-08-18 17:34 UTC by buktop
Modified:	2016-10-06 15:08 UTC

Attachments
pdf produced by Chrome (57.82 KB, application/pdf) 2016-08-18 17:34 UTC, buktop
[patch] Break words on all whitespace characters (2.22 KB, patch) 2016-09-28 16:05 UTC, Jason Crain

Description buktop 2016-08-18 17:34:18 UTC

Created attachment 125884
pdf produced by Chrome

When using "pdftotext -bbox" on PDFs produced by printing a page from Chrome, the sentences are not split into words. In pdftotext's output, 0xA0 characters appear between words instead of spaces (0x20). That might be why the sentences are not being split.

Comment 1 Jason Crain 2016-08-18 19:33:35 UTC

I was mistaken on IRC when I called this a linefeed character. I confused 0xA0 and 0x0A. For some reason, Chrome sometimes uses 0xA0 (no-break space) between words. poppler only breaks words on a regular 0x20 space, so these stay grouped together in the same word. To work around this, we could possibly implement something like ICU's u_isUWhiteSpace to check for characters to split on.

Comment 2 Jason Crain 2016-09-28 16:05:34 UTC

Created attachment 126831
[patch] Break words on all whitespace characters

Some PDF creators like Chrome use no-break spaces or other whitespace characters between words, causing pdftotext -bbox to not break words as expected.  Fix this by breaking words on any character with the Unicode whitespace property.
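
The difference in behaviour can be sketched in Python (not poppler's C++, and not the patch itself). The function names and the sample string are made up for illustration. Note that Python's str.split() with no arguments uses Python's own notion of whitespace, which is close to, but not identical to, the Unicode White_Space property.

```python
def split_on_space_only(text: str) -> list[str]:
    """Old behaviour (sketch): only U+0020 ends a word."""
    return [word for word in text.split("\x20") if word]


def split_on_any_whitespace(text: str) -> list[str]:
    """New behaviour (sketch): any whitespace character ends a word.

    str.split() without arguments splits on runs of characters that
    Python considers whitespace, which includes U+00A0.
    """
    return text.split()


# Hypothetical sample mixing a regular space with no-break spaces.
sample = "No\u00a0word splitting\u00a0here"

print(split_on_space_only(sample))
print(split_on_any_whitespace(sample))
```

Splitting on 0x20 alone leaves the words joined by 0xA0 glued together, which matches the reported pdftotext -bbox output; splitting on any whitespace separates all four words.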

Comment 3 Carlos Garcia Campos 2016-10-03 16:00:21 UTC

LGTM, I've just pushed it, thanks!

Comment 4 Albert Astals Cid 2016-10-03 20:13:41 UTC

This has caused 271 "regressions" in popplebot text tests.

Carlos, have you checked them all?

Comment 5 Carlos Garcia Campos 2016-10-06 15:08:16 UTC

(In reply to Albert Astals Cid from comment #4)
> This has caused 271 "regressions" in popplebot text tests.
> 
> Carlos, have you checked them all?

Nope, I'll take a look, thanks for the heads up.
