Bug 47022

Summary:	pdftohtml: control over word breaks
Product:	poppler	Reporter:	Ihar Filipau <thephilips>
Component:	pdftohtml	Assignee:	poppler-bugs <poppler-bugs>
Status:	RESOLVED FIXED	QA Contact:
Severity:	minor
Priority:	medium
Version:	unspecified
Hardware:	All
OS:	All
Whiteboard:
Attachments:	the patch, v1

Description Ihar Filipau 2012-03-06 14:45:36 UTC

At the moment poppler's pdftohtml, as inherited from Xpdf, uses the following formula to identify a word break (in HtmlOutputDev.cc, search for "0.1"):

: fabs(x1 - curStr->xRight[n-1]) > 0.1 * (curStr->yMax - curStr->yMin)

I recently had to convert a PDF (produced by WinWord, obviously) where the kerning or something similar went really wrong, and lots of words (about 2.9K of them) were split in the pdftohtml output. E.g. (German text) "auf" became "au f", "rechte" became "recht e", and so on.

Please provide a command line option to control the behavior of the word breaking.

As far as I understand, not much can be done except allowing the factor used (0.1) to be adjusted. If I have understood its meaning correctly, a word break is inserted when the distance between characters is more than 10% of the character height. In my case, setting it higher, e.g. to 0.15 or 0.2, could have allowed me to work around the bad kerning and reduce the amount of manual editing needed later.
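A minimal sketch of the check with the factor turned into a parameter. The `GlyphRun` struct and the `isWordBreak` name are hypothetical stand-ins for illustration, not poppler's actual types or functions; only the comparison itself is taken from the formula quoted above.

```cpp
#include <cmath>

// Hypothetical stand-in for the fields of the current string that the
// formula reads; this is not poppler's HtmlString.
struct GlyphRun {
    double lastXRight; // right edge of the previous character (xRight[n-1])
    double yMin;       // top of the line box
    double yMax;       // bottom of the line box
};

// Returns true when the horizontal gap between the previous character and
// the new character starting at x1 exceeds coefficient * line height.
// With coefficient = 0.1 this is the comparison quoted in the description.
bool isWordBreak(const GlyphRun &run, double x1, double coefficient) {
    return std::fabs(x1 - run.lastXRight) > coefficient * (run.yMax - run.yMin);
}
```

For a line 10 units high and a gap of 1.5 units, the check reports a break at 0.1 (limit 1.0) but not at 0.2 (limit 2.0), which is the kind of adjustment requested here.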

Comment 1 Ihar Filipau 2012-03-11 07:11:10 UTC

Created attachment 58283 [details] [review]
the patch, v1

Add control over the word break threshold (the best name I could think of).

1. Add a new global variable `double wordBreakThreshold` in pdftohtml.cc.
   Default value: 10 percent.
   Later converted to an internal coefficient by dividing by 100.

2. Add new command line parameter: -wbt <fp>
   Value stored in the wordBreakThreshold variable.

3. After the command line is parsed, convert the percentage into a coefficient.

4. HtmlOutputDev.cc, HtmlPage::addChar(): replace the hardcoded `0.1` with
   the variable.

5. HtmlOutputDev.cc, HtmlPage::coalesce(): replace the hardcoded `0.1` with
   the variable.

6. Document the parameter in the man page.

I was tempted to introduce a new bool function for the word break check, yet:

- the functionality is duplicated (as I have understood, the results of word-breaking in addChar() are post-processed and largely overridden by the ::coalesce() method)

- there is a TODO in ::addChar() whose validity and applicability I'm not sure of.

Comment 2 Albert Astals Cid 2012-03-11 16:00:14 UTC

Question: does pdftotext correctly extract the text of the PDFs you need to "tweak"?
If it does, instead of doing this hack, should you try to use the same algorithm pdftotext uses?

Comment 3 Ihar Filipau 2012-03-11 17:21:38 UTC

(In reply to comment #2)
> Question: does pdftotext correctly extract the text of the PDFs you need to
> "tweak"?
> If it does, instead of doing this hack, should you try to use the same
> algorithm pdftotext uses?

No, it doesn't extract the text correctly.

pdftotext uses the same 0.1 coefficient. (See TextOutputDev.cc, the minWordBreakSpace define.)

It seems pretty much everything else uses the same coefficient too. E.g. I can't search for the split words in Adobe Reader, FoxIt, or Okular.

Worth repeating: the PDF I have is effectively broken, but repairing it manually is literally impossible. With the switch I have added, `pdftohtml -xml -wbt 30` repairs literally all words. OK, it also incorrectly glued together a few words, but in the main body of the book's text I couldn't find any oddities anymore.
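To illustrate the trade-off described above, here is a small sketch that joins glyphs into words at different thresholds. The `Glyph` struct, the `joinGlyphs` function, and all coordinates are made up for illustration; they are not taken from the PDF in question or from poppler's code.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical glyph record with made-up coordinates; not a poppler type.
struct Glyph {
    char c;
    double xLeft;
    double xRight;
};

// Joins glyphs into space-separated words: a space is inserted whenever the
// gap to the previous glyph exceeds (percent / 100) * lineHeight.
std::string joinGlyphs(const std::vector<Glyph> &glyphs, double lineHeight,
                       double percent) {
    const double limit = percent / 100.0 * lineHeight;
    std::string out;
    for (std::size_t i = 0; i < glyphs.size(); ++i) {
        if (i > 0 && std::fabs(glyphs[i].xLeft - glyphs[i - 1].xRight) > limit) {
            out += ' ';
        }
        out += glyphs[i].c;
    }
    return out;
}

// Illustrative line: "auf" with an oversized gap before 'f', then "die".
std::vector<Glyph> sampleLine() {
    return {
        {'a', 0.0, 5.0},   {'u', 5.0, 10.0},  {'f', 12.0, 15.0}, // gap of 2.0 inside the word
        {'d', 19.0, 24.0}, {'i', 24.0, 26.0}, {'e', 26.0, 31.0}  // gap of 4.0 before 'd' is a real space
    };
}
```

With a line height of 10, a threshold of 10 percent splits the sample into "au f die", 30 percent yields "auf die", and 50 percent glues everything into "aufdie". A larger value repairs bad kerning at the cost of occasionally merging real words.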

Comment 4 Albert Astals Cid 2012-03-12 15:14:38 UTC

To be honest, I'm not sure it makes sense to add extra code complexity to fix broken files that no other viewer supports either.

On the other hand it is true that it is not that much new code either.

Can you please ask on the mailing list and see what other people think?

Comment 5 Albert Astals Cid 2012-03-13 15:55:06 UTC

Committed
