96932 – Improper text extraction from this pdf

Status:	RESOLVED NOTOURBUG

Product:	poppler
Classification:	Unclassified
Component:	utils
Version:	unspecified
Hardware:	All All

Importance:	medium blocker
Assignee:	poppler-bugs

Reported:	2016-07-14 15:02 UTC by Domingo Alvarez Duarte
Modified:	2016-08-05 15:19 UTC
CC List:	0 users

Attachments
A pdf with tables (68.22 KB, application/pdf) 2016-07-14 15:02 UTC, Domingo Alvarez Duarte

Description Domingo Alvarez Duarte 2016-07-14 15:02:20 UTC

Created attachment 125069
A pdf with tables

Hello !
I'm testing pdftotext with PDFs from http://www.docidadesp.imprensaoficial.com.br and there are several of them that seem to have mixed encodings (I guess), so pdftotext outputs garbage for some of their content (PDFxStream does the same).
See the attached pdf for test.

I hope the attached example can help improve poppler.

Cheers !

Comment 1 Albert Astals Cid 2016-07-14 21:17:53 UTC

I'd vote for the document being broken, have you found any viewer that can extract the text correctly?

Comment 2 Domingo Alvarez Duarte 2016-07-15 13:29:50 UTC

The thing is that when the file is viewed in evince or another PDF viewer we can see the information, but when we try to select/copy/paste we get garbage.
It seems that some people use this to hide information from scrapers.
It would be nice to have a way to extract the text properly. That's why I posted this bug (?), hoping someone with more knowledge could find a way to do it.

Cheers !

Comment 3 Jason Crain 2016-07-15 15:55:29 UTC

I doubt that anyone is intentionally trying to hide information.  It's just that PDF is primarily a display format and unless the PDF creator does the extra work to include some encoding tables and dictionaries, it's easy to create a PDF that displays the correct glyphs, but can't be converted to text.

I haven't taken a close look at this PDF, but if other viewers are also not able to extract the text, it's a good sign that the PDF was made without support for text extraction.  There are heuristics in poppler that try to deal with that situation by guessing what the characters should be, but it's never going to be completely accurate.
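
To illustrate the point about display versus extraction, here is a toy model in Python. It is not poppler's code: the tables, the function names and the fallback rule are all hypothetical, chosen only to show why a file can render correctly and still extract as garbage.

```python
# Toy model only: NOT poppler's implementation. All names are hypothetical.

# What an embedded font provides: character code -> glyph outline.
# Each glyph is named here by the letter it draws.
GLYPHS = {0x0003: "O", 0x0011: "l", 0x0024: "a"}

# What a PDF made with text extraction in mind would also carry:
# character code -> Unicode text.
TO_UNICODE = {0x0003: "O", 0x0011: "l", 0x0024: "a"}


def render(codes):
    """A viewer only needs the glyph outlines, so display works either way."""
    return "".join(GLYPHS[c] for c in codes)


def extract(codes, to_unicode=None):
    """Extraction needs a code -> Unicode table.

    Without one, this sketch falls back to treating each code as a
    Unicode code point, which is one possible guess and often a wrong one.
    """
    if to_unicode is not None:
        return "".join(to_unicode[c] for c in codes)
    return "".join(chr(c) for c in codes)


codes = [0x0003, 0x0011, 0x0024]
print(render(codes))                      # what the reader sees on screen
print(repr(extract(codes, TO_UNICODE)))   # extraction with a mapping table
print(repr(extract(codes)))               # extraction without one
```

In this sketch the rendered string and the mapped extraction agree, while the unmapped extraction returns control characters and an unrelated symbol, because the character codes were never meant to be Unicode values.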

Comment 4 Jason Crain 2016-08-05 15:19:18 UTC

I've taken a closer look at this document and I don't see a way to fix the text extraction.  The document is using embedded TrueType fonts with identity mapping / UTF-16 encoding, but the encoding is nonsensical and there's no ToUnicode map.  It's up to the PDF creator to fix their broken file generation.
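
For anyone who wants to check other files for the same problem, the sketch below wraps poppler's `pdffonts` utility, which reports for each font whether a ToUnicode map is present (the `uni` column). The parser assumes the usual column layout of `pdffonts` output and font names without spaces. The sample table is made up for illustration and is not the output for the attached PDF, and the function names are hypothetical.

```python
import subprocess

# Made-up sample in the layout pdffonts normally prints; not from the attachment.
SAMPLE = """\
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ABCDEF+ExampleSans                   CID TrueType      Identity-H       yes yes no       8  0
GHIJKL+ExampleSerif                  Type 1C           WinAnsi          yes yes yes     12  0
"""


def parse_pdffonts(output):
    """Parse the table printed by pdffonts into a list of dicts.

    Rows are read from the right, because the font type may contain
    spaces (for example "CID TrueType") while the trailing columns do not.
    """
    fonts = []
    for line in output.splitlines()[2:]:  # skip header and dashed separator
        tokens = line.split()
        if len(tokens) < 8:  # name, type, encoding, emb, sub, uni, object number, generation
            continue
        fonts.append({
            "name": tokens[0],
            "type": " ".join(tokens[1:-6]),
            "encoding": tokens[-6],
            "embedded": tokens[-5] == "yes",
            "subset": tokens[-4] == "yes",
            "to_unicode": tokens[-3] == "yes",
        })
    return fonts


def fonts_without_to_unicode(pdf_path):
    """Run pdffonts on a file and return the fonts lacking a ToUnicode map."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return [f for f in parse_pdffonts(result.stdout) if not f["to_unicode"]]


if __name__ == "__main__":
    for font in parse_pdffonts(SAMPLE):
        print(font["name"], font["encoding"], font["to_unicode"])
```

A font that is listed with an Identity-H encoding and no ToUnicode map is a likely source of unreadable extracted text, although a missing map does not by itself prove that extraction will fail.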
