(As discussed in http://lists.freedesktop.org/archives/poppler/2010-May/005791.html)

Poppler seems unable to extract text from some PDF files. A sample file (not attached to this bug report because of its size):

http://iteisa.com/tmp/poppler-sample.pdf (11 MB)

pdftohtml from poppler 0.12.4 and 0.12.2 is not able to extract the text. evince shows the document correctly but is unable to select its text. However, acroread shows and selects the text correctly (so it's normal, editable text and not an image).

Everything seems ok with the file:

$ pdfinfo poppler-sample.pdf
> Title: untitled
> Creator: Adobe InDesign CS4 (6.0.4)
> Producer: Acrobat Distiller 9.0.0 (Windows)
> CreationDate: Wed May 5 09:35:12 2010
> ModDate: Wed May 5 09:35:12 2010
> Tagged: no
> Pages: 208
> Encrypted: no
> Page size: 595.276 x 841.89 pts (A4)
> File size: 10536602 bytes
> Optimized: no
> PDF version: 1.4
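To narrow down which pages are affected, one option is to extract each page separately and compare how much text comes back. The following is only a rough sketch, not something from the original report: it assumes the `pdfinfo` and `pdftotext` command-line tools from poppler are on the PATH, and the helper names (`parse_page_count`, `page_text`, `text_length_per_page`) are made up for illustration.

```python
import re
import subprocess


def parse_page_count(pdfinfo_output: str) -> int:
    """Read the 'Pages:' field from pdfinfo's output."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line in pdfinfo output")
    return int(match.group(1))


def page_text(pdf_path: str, page: int) -> str:
    """Extract the text of one page; '-' sends pdftotext output to stdout."""
    result = subprocess.run(
        ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def text_length_per_page(pdf_path: str) -> dict:
    """Map each page number to its count of non-whitespace characters."""
    info = subprocess.run(
        ["pdfinfo", pdf_path], capture_output=True, text=True, check=True
    ).stdout
    return {
        page: len("".join(page_text(pdf_path, page).split()))
        for page in range(1, parse_page_count(info) + 1)
    }
```

Calling `text_length_per_page("poppler-sample.pdf")` would give one count per page; pages where only a few characters come back are the ones worth opening in a viewer.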
I reproduced this with pdftohtml version 0.18.4 from Kubuntu 12.04 beta amd64.

pdftohtml extracts the text overlaying the header image on every page:

GOBIERNO de CANTABRIA B O L E T Í N O F I C I A L D E C A N TA B R I A

but the rest of the page text (e.g. "sumario 1. DISPOCIONES GENERALES") is missing. In Okular you can't copy it as text either. (BTW, turning on all Okular flags in kdebugsettings doesn't seem to output any relevant warnings.)

There are a few exceptions, like the table starting on page 43, where the column heading text (EXPEDIENTE SANCIONADO/A ...) appears OK but the column entries are garbled text like

6$081$6+9,/, =85$% 6$081$6+9,/, =85$% ;< ;< %(1,'250 %(1,'250

Then on page 115 most of the text, starting at "ANEXO", appears, etc.

I also noticed that the image on page 147 turns into 1,347 1-pixel-high PNGs, but pdftohtml doesn't force them to stack, e.g. using <div style="line-height: 1px; font-size: 1px;">, so there's whitespace between each scanline.

There's crazy stuff in them pdfs ;-)
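A side note on the garbled column entries: at least one of the sample strings looks like the real text with every character code shifted down by a constant. Adding 29 to each code in %(1,'250 gives a plausible place name. This is only an observation on the quoted sample, sketched below; the cause (for instance a subset font whose glyph codes are emitted without a usable Unicode mapping) is an assumption, not something confirmed against the file. The helper name `shift_codes` is made up for illustration.

```python
def shift_codes(text: str, offset: int) -> str:
    """Add a fixed offset to every character code, leaving spaces alone."""
    return "".join(
        ch if ch == " " else chr(ord(ch) + offset)
        for ch in text
    )


# One of the garbled entries quoted from the table on page 43.
print(shift_codes("%(1,'250", 29))  # prints BENIDORM
```

If the same offset held for a whole font, the extracted text could be repaired after the fact, but that would need checking per font and per page.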
The linked file is 404. I'm closing this bug report, as the markup makes Bugzilla choke and generate invalid XML when trying to export this bug. Given that the initial PDF is gone, there's probably no point in migrating this bug to GitLab.