| Summary: | pdftohtml: control over word breaks | | |
|---|---|---|---|
| Product: | poppler | Reporter: | Ihar Filipau <thephilips> |
| Component: | pdftohtml | Assignee: | poppler-bugs <poppler-bugs> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | minor | | |
| Priority: | medium | | |
| Version: | unspecified | | |
| Hardware: | All | | |
| OS: | All | | |
| Attachments: | the patch, v1 | | |
Description
Ihar Filipau
2012-03-06 14:45:36 UTC
Created attachment 58283 [details] [review] the patch, v1

Add control over the word-break threshold (the best name I could think up).

1. Add a new global variable `double wordBreakThreshold` in pdftohtml.cc. The default value is 10 percent; it is later converted to an internal coefficient by dividing by 100.
2. Add a new command line parameter, -wbt <fp>. The value is stored in the wordBreakThreshold variable.
3. After the command line is parsed, convert the percentage into a coefficient.
4. In HtmlOutputDev.cc, HtmlPage::addChar(): replace the hardcoded `0.1` with the variable.
5. In HtmlOutputDev.cc, HtmlPage::coalesce(): replace the hardcoded `0.1` with the variable.
6. Document the parameter in the man page.

I was tempted to introduce a new bool function for the word-break check, but:
- the functionality is duplicated (as I understand it, the word-breaking results from addChar() are post-processed and largely overridden by the ::coalesce() method);
- there is a TODO in ::addChar() whose validity and applicability I'm not sure about.

Question: does pdftotext extract the text of those PDFs you need to "tweak" correctly? If it does, instead of doing this hack, should you try to use the same algorithm pdftotext uses?

(In reply to comment #2)
> Question, does pdftotext extract the text of those pdf you need to "tweak"
> correctly?
> If it does instead of doing this hack should you try to use the same algorithm
> pdftotext uses?

No, it doesn't extract the text correctly. pdftotext uses the same 0.1 coefficient (see TextOutputDev.cc, the minWordBreakSpace define). It seems pretty much everything else uses the same coefficient too; for example, I can't search for a split word in Adobe Reader, FoxIt, or Okular.

Worth repeating: the PDF I have is effectively broken, but repairing it manually is practically impossible. With the switch I have added, `pdftohtml -xml -wbt 30` repairs essentially all words.
OK, it also incorrectly glued a few words together, but in the main body of the book's text I couldn't find any oddities anymore.

To be honest, I'm not sure it makes sense to add extra complexity to fix broken files that no other viewer supports either. On the other hand, it is true that it is not that much new code either. Can you please ask on the mailing list and see what other people think?

Committed.