pdftohtml doesn't extract the footer in this PDF: http://docdro.id/ms8RyMC

pdftohtml -s -i input.pdf /output

All the text at the bottom, in a small font size under the thick black horizontal line, is not extracted. The lowest part that is extracted is:

Forfaldsdato . . . . . . . . . . . . . . . . . . . . . . . . . . . . : 10/08-2016
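(A quick way to check whether the footer text is present in the PDF's text layer at all would be to dump it with pdftotext, which ships in the same poppler utilities, and look near the end of the output; "input.pdf" is the file above:

pdftotext -layout input.pdf -

If the footer is missing there too, the problem would sit in poppler's text extraction itself rather than in pdftohtml's HTML output.)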
It's working fine for me. pdftohtml extracts all of the text on the page.
Could you please copy/paste the output here? :)
Created attachment 125669 [details]
027-html.html from pdftohtml version 0.44.0

I ran the command:

pdftohtml -s -i /media/sf_jason/Desktop/027.pdf 027

Attached is the 027-html.html file it created. It worked with both versions 0.44.0 and 0.26.5.
Not major by any account; it also seems fixed, judging by Jason's output.
Is it possible to detect/check whether a PDF is broken and will return something unreadable like this? I just need to check PDF files and mark the broken ones where the extracted text is garbage:

+HDGDXGLR$S6 FR-HVSHU$JHQWRIW )\UUHEDNNHQ- JHUVSULV ’HQPDUN .XQGHQU. 0RPVQU. 5HNYLVLWLRQVQU. ’HUHVUHI. 2UGUHQU. :HEEHVWLOOLQJVQU.
(In reply to clark from comment #5)
> Is it possible to detect/check whether a PDF is broken and will return
> something unreadable like this?
>
> I just need to check PDF files and mark the broken ones where the
> extracted text is garbage

My usual way of checking whether a PDF is broken is to try it in a few different viewers and manually inspect the results. I don't have an automated way of doing this, and there's nothing in a PDF that will let us reliably predict that the output is going to be garbage.

Maybe you could put something together using aspell and say that a file is bad if more than half of its words are misspelled or malformed (not tested):

bad_count=$(aspell list < file.txt | wc -l)
clean_count=$(aspell clean < file.txt | wc -l)
is_bad=$(expr $bad_count \> \( $clean_count / 2 \))
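To make that heuristic self-contained, here is a minimal sketch of the same idea (also untested; the 50% threshold is arbitrary and "check-garbage.sh" is just a made-up name). It compares the number of words aspell rejects against the total word count:

#!/bin/sh
# check-garbage.sh -- flag a text file as garbage when more than half
# of its words are unknown to aspell.
# Usage: ./check-garbage.sh file.txt
file="$1"
total=$(wc -w < "$file")                # total number of words
bad=$(aspell list < "$file" | wc -l)    # words aspell doesn't recognize
# bad/total > 1/2  <=>  2*bad > total (keeps everything in integers)
if [ "$total" -gt 0 ] && [ $((2 * bad)) -gt "$total" ]; then
    echo "$file: probably garbage"
    exit 1
fi
echo "$file: probably fine"

Note that for Danish invoices like the one quoted above you would want aspell's Danish dictionary (aspell -l da list, if it is installed); with an English dictionary, perfectly good Danish text would also get flagged.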
Thanks a lot :)