Summary: | pdftohtml is unable to extract the text in some PDF files | | |
---|---|---|---|
Product: | poppler | Reporter: | Jaime <jaime> |
Component: | general | Assignee: | poppler-bugs <poppler-bugs> |
Status: | RESOLVED INVALID | QA Contact: | |
Severity: | normal | | |
Priority: | medium | CC: | info |
Version: | unspecified | | |
Hardware: | x86 (IA32) | | |
OS: | Linux (All) | | |
Whiteboard: | | | |
Description
Jaime
2010-05-27 07:49:03 UTC
I reproduced this with pdftohtml version 0.18.4 from Kubuntu 12.04 beta amd64. pdftohtml extracts the text overlaying the header image on every page:

GOBIERNO de CANTABRIA B O L E T Í N O F I C I A L D E C A N TA B R I A

but the rest of the page text (e.g. "sumario 1. DISPOCIONES GENERALES") is missing. In Okular you can't copy it as text either. (BTW, turning on all Okular flags in kdebugsettings doesn't seem to output any relevant warnings.)

There are a few exceptions, like the table starting on page 43, where the column heading text (EXPEDIENTE SANCIONADO/A ...) appears OK but the column entries are garbled text like

6$081$6+9,/, =85$% 6$081$6+9,/, =85$% ;< ;< %(1,'250 %(1,'250

Then on page 115 most of the text, starting at "ANEXO", appears, and so on.

I also noticed that the image on page 147 turns into 1,347 1-pixel-high PNGs, but pdftohtml doesn't force them to stack (e.g. using <div style="line-height: 1px; font-size: 1px;">), so there's whitespace between each scanline.

There's crazy stuff in them pdfs ;-)

The linked file is 404.

I'm closing this bug report, as the markup makes Bugzilla choke and generate invalid XML when trying to export this bug. Given that the initial PDF is gone, there's probably no point in migrating this bug to GitLab.
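For what it's worth, the garbled column entries quoted in this report look like a constant code-point shift rather than random noise: adding 29 to each character of `%(1,'250` gives `BENIDORM`. This is an observation about that one sample only. Why the PDF would produce it (for example, a subset font without a usable Unicode mapping) is a guess, and since the file is gone it cannot be checked. The helper name `shift_decode` and the default offset are made up for illustration; this is not something pdftohtml does.

```python
def shift_decode(garbled: str, offset: int = 29) -> str:
    """Shift every non-space character up by `offset` code points.

    The offset of 29 is an assumption read off the sample in this report;
    other fonts or files could use a different mapping or none at all.
    """
    return "".join(
        ch if ch == " " else chr(ord(ch) + offset)
        for ch in garbled
    )


if __name__ == "__main__":
    # One of the garbled table entries from page 43, as quoted in the report.
    print(shift_decode("%(1,'250"))
```

If the decoded output reads as plain words, the extracted characters are probably raw glyph codes at a fixed distance from the real text, which would also explain why copying from Okular fails in the same way.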
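The stacking workaround suggested for the page 147 image can be sketched as a small post-processing step. This is only a rough illustration, assuming the scanline images are available as an ordered list of file names. The function name `stack_scanlines` and the file names are hypothetical, the `display: block` rule is an extra assumption on top of the `line-height`/`font-size` values from the report, and none of this reflects how pdftohtml actually writes its output.

```python
from html import escape


def stack_scanlines(image_names):
    """Wrap 1-pixel-high scanline images so they render without gaps.

    The wrapper style is the one suggested in the report; `display: block`
    on each image is an added assumption to avoid inline-box whitespace.
    """
    rows = "\n".join(
        f'<img src="{escape(name, quote=True)}" alt="" style="display: block;">'
        for name in image_names
    )
    return (
        '<div style="line-height: 1px; font-size: 1px;">\n'
        f"{rows}\n"
        "</div>"
    )


if __name__ == "__main__":
    # Hypothetical file names standing in for the scanline PNGs.
    names = [f"page147-{i:04d}.png" for i in range(1, 4)]
    print(stack_scanlines(names))
```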