Bug 107235 - Bug fixes, emit more font info in pdftohtml
Summary: Bug fixes, emit more font info in pdftohtml
Status: RESOLVED INVALID
Alias: None
Product: poppler
Classification: Unclassified
Component: pdftohtml
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: poppler-bugs
 
Reported: 2018-07-15 16:27 UTC by ulatekh
Modified: 2018-07-19 20:45 UTC

Attachments
Fix possible uninitialized variable & dangling reference in HtmlFont (1.12 KB, patch)
2018-07-15 16:27 UTC, ulatekh
Fix HtmlFont::HtmlFilter to not lose tabs (1.31 KB, patch)
2018-07-15 16:28 UTC, ulatekh
Emit more font information when pdftohtml is run with -xml (7.68 KB, patch)
2018-07-15 16:28 UTC, ulatekh

Description ulatekh 2018-07-15 16:27:39 UTC
Created attachment 140641
Fix possible uninitialized variable & dangling reference in HtmlFont

I'm about to use pdftohtml to extract information from PDFs and organize the results into a database, so I had a chance to dig through the code.

I happened to notice a possible uninitialized variable, and possible dangling reference, in HtmlFont. The first patch fixes that.

I've had a long-standing problem with qpdfview (which uses poppler) sometimes copying text out of PDFs incorrectly -- the text copies, but all of the spaces are missing. After reproducing it with a PDF, I tracked the problem down to the PDF using tabs where it probably should have used spaces. The second patch fixes HtmlFont::HtmlFilter() to convert incoming tabs to spaces, instead of removing the whitespace completely.
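
To illustrate the idea behind the second patch, here is a minimal sketch of mapping tabs to spaces rather than discarding them. The function name `tabsToSpaces` is hypothetical and this is not the code in HtmlFont::HtmlFilter(); it only shows the intended behavior on a plain `std::string`.

```cpp
#include <string>

// Hypothetical helper for illustration only; NOT poppler's HtmlFont::HtmlFilter().
// Every tab becomes a single space, so word boundaries survive the filter.
// All other characters are copied through unchanged.
std::string tabsToSpaces(const std::string &in)
{
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        out.push_back(c == '\t' ? ' ' : c);
    }
    return out;
}
```

With this approach, a tab-separated input such as "Hello\tworld" keeps a separator between the two words instead of running them together.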

The third patch merely emits more information in the <fontspec> elements when pdftohtml is run with -xml. The PDFs I'm trying to analyze appear to be pretty consistent with their font usage, to the point where I can use them to infer the text's meaning. But I needed more information in the <fontspec> to do that, and this patch does that for me.
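
As a rough sketch of how the <fontspec> data might be consumed downstream, the code below collects every <fontspec> element from the XML text into a table keyed by its id. It assumes each element carries an `id` attribute and that attributes are written as name="value" pairs; the function names `parseAttributes` and `collectFontSpecs` are hypothetical, and the regex-based scan stands in for a real XML parser. Because it stores whatever attributes it finds, it would also pick up any extra attributes the patch adds, without needing to know their names.

```cpp
#include <map>
#include <regex>
#include <string>

using Attrs = std::map<std::string, std::string>;

// Hypothetical helper: extract name="value" pairs from one element's text.
// Assumes double-quoted values with no embedded quotes.
Attrs parseAttributes(const std::string &element)
{
    static const std::regex attrRe(R"re(([A-Za-z_][A-Za-z0-9_-]*)="([^"]*)")re");
    Attrs attrs;
    for (std::sregex_iterator it(element.begin(), element.end(), attrRe), end;
         it != end; ++it) {
        attrs[(*it)[1].str()] = (*it)[2].str();
    }
    return attrs;
}

// Hypothetical helper: map each fontspec id to all of its attributes.
// Elements without an id attribute are skipped.
std::map<std::string, Attrs> collectFontSpecs(const std::string &xml)
{
    static const std::regex specRe(R"re(<fontspec\b[^>]*>)re");
    std::map<std::string, Attrs> fonts;
    for (std::sregex_iterator it(xml.begin(), xml.end(), specRe), end;
         it != end; ++it) {
        Attrs attrs = parseAttributes(it->str());
        auto id = attrs.find("id");
        if (id != attrs.end()) {
            fonts[id->second] = attrs;
        }
    }
    return fonts;
}
```

The resulting table can then be joined against whatever font reference each piece of text carries, which is how consistent font usage could be turned into a label for the text's role.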

Please consider these for inclusion into the project.
Comment 1 ulatekh 2018-07-15 16:28:22 UTC
Created attachment 140642
Fix HtmlFont::HtmlFilter to not lose tabs
Comment 2 ulatekh 2018-07-15 16:28:49 UTC
Created attachment 140643
Emit more font information when pdftohtml is run with -xml
Comment 3 Albert Astals Cid 2018-07-19 20:45:21 UTC
Can you please post the three patches as separate bugs?

That way we can close each bug as soon as its patch lands, even if the other patches have not landed yet because there are still questions about them.

It makes tracking what is done and what is left to do much easier.
