Bug 97276 - Can't extract text/html from PDF
Summary: Can't extract text/html from PDF
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: pdftohtml
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: poppler-bugs
 
Reported: 2016-08-10 09:52 UTC by clark
Modified: 2016-08-19 00:43 UTC (History)

Attachments
027-html.html from pdftohtml version 0.44.0 (25.71 KB, text/html)
2016-08-10 15:17 UTC, Jason Crain

Description clark 2016-08-10 09:52:33 UTC
pdftohtml doesn't extract the footer in this PDF:

http://docdro.id/ms8RyMC

pdftohtml -s -i input.pdf /output

None of the small-font text at the bottom of the page, below the thick black horizontal line, is extracted.

The lowest part extracted is:

Forfaldsdato . . . . . . . . . . . . . . . . . . . . . . . . . . . . : 10/08-2016
Comment 1 Jason Crain 2016-08-10 15:02:50 UTC
It's working fine for me.  pdftohtml extracts all of the text on the page.
Comment 2 clark 2016-08-10 15:06:43 UTC
Could you please copy/paste the output here? :)
Comment 3 Jason Crain 2016-08-10 15:17:01 UTC
Created attachment 125669
027-html.html from pdftohtml version 0.44.0

I ran the command:

    pdftohtml -s -i /media/sf_jason/Desktop/027.pdf 027

Attached is the 027-html.html file it created.  It worked with both version 0.44.0 and version 0.26.5.
Comment 4 Albert Astals Cid 2016-08-15 20:03:17 UTC
Not major by any account; it also seems fixed, judging by Jason's output.
Comment 5 clark 2016-08-16 21:43:54 UTC
Is it possible to detect whether a PDF is broken and returns unreadable text like this?

I just need to check PDF files and mark as broken the ones where the extracted text is garbage:

+HDGDXGLR$S6
FR-HVSHU$JHQWRIW
)\UUHEDNNHQ- JHUVSULV
’HQPDUN

.XQGHQU. 
0RPVQU. 
5HNYLVLWLRQVQU. 
’HUHVUHI. 
2UGUHQU. 
:HEEHVWLOOLQJVQU.
Comment 6 Jason Crain 2016-08-18 08:59:20 UTC
(In reply to clark from comment #5)
> Is it possible to detect whether a PDF is broken and returns unreadable
> text like this?
> 
> I just need to check PDF files and mark as broken the ones where the
> extracted text is garbage:

My usual way of checking if a PDF is broken is to try it in a few different viewers and manually inspect the results.  I don't have an automated way of doing this, and nothing in a PDF lets us reliably predict that the output is going to be garbage.

Maybe you could put something together using aspell and say that it's bad if more than half of the words are misspelled or malformed (not tested):

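    # Number of words aspell reports as misspelled (one per output line)
    bad_count=$(aspell list < file.txt | wc -l)
    # Total number of words in the extracted text
    total_count=$(wc -w < file.txt)
    # expr prints 1 if more than half of the words are misspelled, else 0
    is_bad=$(expr $bad_count \> \( $total_count / 2 \))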
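
If no aspell dictionary is available for the document's language, a cruder check that needs only awk might serve as a first filter. The sketch below is an untested-in-production heuristic, not part of poppler: the function name `garbled_percent`, the list of "suspicious" characters, and any threshold applied to its result are all assumptions made for illustration. It counts whitespace-separated tokens that mix letters with digits or ASCII symbols, which is what the garbage shown in comment #5 mostly looks like.

```shell
#!/bin/sh
# Hypothetical helper (not a poppler tool): prints the percentage of
# whitespace-separated tokens in file $1 that look garbled.
garbled_percent() {
    awk '
    BEGIN {
        # Tokens made only of digits and date/number separators are fine.
        numeric = "^[0-9][0-9.,/:-]*$"
        # Assumed list of characters that should not occur inside a word.
        suspicious = "[0-9$+()*#%&=<>@^_|~.:;,!?]"
    }
    {
        for (i = 1; i <= NF; i++) {
            tok = $i
            total++
            sub(/[.,:;!?]$/, "", tok)      # allow one trailing punctuation mark
            if (tok ~ numeric) continue    # e.g. 10/08-2016
            if (tok ~ suspicious) bad++    # e.g. +HDGDXGLR$S6
        }
    }
    END {
        if (total == 0) print 0
        else print int(100 * bad / total)
    }
    ' "$1"
}
```

It could be combined with a threshold of your choosing, for example `[ "$(garbled_percent file.txt)" -gt 50 ] && echo "probably garbage"`. Garbled words that happen to consist only of letters are not caught, so a dictionary-based check such as aspell remains the more reliable option where it is available.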
Comment 7 clark 2016-08-19 00:43:21 UTC
Thanks a lot :)

