Bug 28282 - pdftohtml is unable to extract the text in some PDF files
Status: RESOLVED INVALID
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: x86 (IA32) Linux (All)
Importance: medium normal
Assignee: poppler-bugs
 
Reported: 2010-05-27 07:49 UTC by Jaime
Modified: 2018-08-20 21:51 UTC

Description Jaime 2010-05-27 07:49:03 UTC
(As discussed in http://lists.freedesktop.org/archives/poppler/2010-May/005791.html)

It seems poppler is unable to extract text from some PDF files (I'm not attaching the file to this bug report due to its length):

http://iteisa.com/tmp/poppler-sample.pdf (11 Mb)

pdftohtml from poppler 0.12.4 and 0.12.2 is not able to extract the
text, and evince shows the document correctly but is unable to select
its text. However, acroread shows and selects the text correctly (so
it's normal, editable text and not an image).

Everything seems ok with the file:

$ pdfinfo poppler-sample.pdf
> Title:          untitled
> Creator:        Adobe InDesign CS4 (6.0.4)
> Producer:       Acrobat Distiller 9.0.0 (Windows)
> CreationDate:   Wed May  5 09:35:12 2010
> ModDate:        Wed May  5 09:35:12 2010
> Tagged:         no
> Pages:          208
> Encrypted:      no
> Page size:      595.276 x 841.89 pts (A4)
> File size:      10536602 bytes
> Optimized:      no
> PDF version:    1.4

Comment 1 skierpage 2012-04-03 15:25:38 UTC
I reproduced this with pdftohtml version 0.18.4 from Kubuntu 12.04 beta amd64.

pdftohtml extracts the text overlaying the header image on every page:
  GOBIERNO
  de
  CANTABRIA
  B O L E T Í N O F I C I A L D E C A N TA B R I A
but the rest of the page text (e.g. "sumario 1. DISPOCIONES GENERALES") is missing. And in Okular you also can't copy it as text. (BTW turning on all Okular flags in kdebugsettings doesn't seem to output any relevant warnings.)

There are a few exceptions, like the table starting on page 43, where the column heading text (EXPEDIENTE SANCIONADO/A ...) appears OK but the column entries are garbled text like
6$081$6+9,/,  =85$%
6$081$6+9,/,  =85$%
;<
;<
%(1,'250
%(1,'250

Then on page 115 most of the text, starting at "ANEXO", appears.
etc.
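
The garbled column entries quoted above look like ordinary Latin text with every character code lowered by a constant. The sketch below is an observation drawn from those samples only, not something confirmed in this thread: adding 29 to each non-whitespace character turns the samples into readable names. That would be consistent with raw glyph indices being emitted for a font that lacks a usable ToUnicode mapping, but this is an assumption. The helper name `unshift` and the offset are illustrative, not part of poppler.

```python
# Assumption: each garbled character equals the intended character's code
# minus 29. The offset is inferred from the samples in this report only.
OFFSET = 29


def unshift(garbled: str, offset: int = OFFSET) -> str:
    """Add `offset` to the code of every non-whitespace character."""
    return "".join(
        ch if ch.isspace() else chr(ord(ch) + offset) for ch in garbled
    )


# Samples copied from the pdftohtml output quoted in this comment.
for sample in ["6$081$6+9,/,  =85$%", ";<", "%(1,'250"]:
    print(repr(sample), "->", repr(unshift(sample)))
```

Whitespace is left untouched because the spaces in the samples already separate plausible words; whether that holds for the rest of the document is untested.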

I also noticed that the image on page 147 turns into 1,347 1-pixel-high PNGs, but pdftohtml doesn't force them to stack (e.g. using <div style="line-height: 1px; font-size: 1px;">), so there's whitespace between each scanline.  There's crazy stuff in them PDFs ;-)
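
As a rough illustration of the stacking idea above, the following sketch builds an HTML fragment that wraps a list of scanline images in the suggested `<div>`. The function name `stack_scanlines`, the file names, and the extra `display: block` on each image are assumptions made for the example; pdftohtml's actual output naming and markup may differ.

```python
from html import escape


def stack_scanlines(png_names):
    """Return an HTML fragment that stacks 1-pixel-high images without gaps."""
    imgs = "\n".join(
        # display: block is an added assumption; it avoids inline-image gaps.
        f'  <img src="{escape(name, quote=True)}" alt="" style="display: block;">'
        for name in png_names
    )
    return f'<div style="line-height: 1px; font-size: 1px;">\n{imgs}\n</div>'


# Hypothetical file names, used only to show the shape of the output.
names = [f"page147-{i:04d}.png" for i in range(1, 4)]
print(stack_scanlines(names))
```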

Comment 2 Daniel Stone 2018-08-20 21:51:14 UTC
The linked file is 404. I'm closing this bug report, as the markup makes Bugzilla choke and generate invalid XML when trying to export this bug. Given that the initial PDF is gone, there's probably no point in migrating this bug to GitLab.

