Created attachment 140641
Fix possible uninitialized variable & dangling reference in HtmlFont

I'm about to use pdftohtml to extract information from PDFs and organize the results into a database, so I had a chance to dig through the code. I happened to notice a possible uninitialized variable, and a possible dangling reference, in HtmlFont. The first patch fixes that.

I've had a long-standing problem with qpdfview (which uses poppler) sometimes copying text out of PDFs incorrectly -- the text copies, but all of the spaces are missing. After reproducing it with a PDF, I tracked the problem down to the PDF using tabs where it probably should have used spaces. The second patch fixes HtmlFont::HtmlFilter() to convert incoming tabs to spaces, instead of removing the whitespace completely.

The third patch merely emits more information in the <fontspec> elements when pdftohtml is run with -xml. The PDFs I'm trying to analyze appear to be pretty consistent in their font usage, to the point where I can use the fonts to infer the text's meaning. But I needed more information in the <fontspec> to do that, and this patch provides it.

Please consider these for inclusion into the project.
Created attachment 140642
Fix HtmlFont::HtmlFilter to not lose tabs
Created attachment 140643
Emit more font information when pdftohtml is run with -xml
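For readers following along without the patch open, the tab handling described in the first comment can be illustrated roughly as below. This is only a sketch of the idea (map each tab to a space instead of dropping it), not the attached patch and not poppler's code: `tabsToSpaces` is a hypothetical helper that works on a plain `std::string`, whereas the real HtmlFont::HtmlFilter() operates on poppler's own types and does additional filtering.

```cpp
#include <string>

// Hypothetical helper for illustration only; not part of poppler.
// Returns a copy of `in` where every tab character is replaced by a
// single space, so word boundaries survive instead of being dropped.
std::string tabsToSpaces(const std::string &in)
{
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        if (c == '\t') {
            out.push_back(' ');  // keep the word separator
        } else {
            out.push_back(c);    // everything else passes through unchanged
        }
    }
    return out;
}
```

With this helper, a tab-separated input such as "copy\tthis\ttext" keeps its word separators as spaces, whereas a filter that discards tabs would run the three words together.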
Can you please post the three patches separately? That way we can close one bug when its patch lands, even if the other patches have not landed yet because there is some question about them. It makes tracking what is left to do much easier.