Summary: | Improper text extraction from this pdf | | |
---|---|---|---|
Product: | poppler | Reporter: | Domingo Alvarez Duarte <mingodad> |
Component: | utils | Assignee: | poppler-bugs <poppler-bugs> |
Status: | RESOLVED NOTOURBUG | QA Contact: | |
Severity: | blocker | | |
Priority: | medium | | |
Version: | unspecified | | |
Hardware: | All | | |
OS: | All | | |
Whiteboard: | | | |
i915 platform: | | i915 features: | |
Attachments: | A pdf with tables | | |
Description
Domingo Alvarez Duarte
2016-07-14 15:02:20 UTC
I'd vote for the document being broken. Have you found any viewer that can extract the text correctly?

The thing is that when the file is viewed through evince or another PDF viewer we can see the information, but when we try to select/copy/paste we get garbage. It seems that some people use this to hide information from scrapers. It would be nice to have a way to extract the text properly; that's why I posted this bug (?), hoping someone with more knowledge could find a way to do it. Cheers!

I doubt that anyone is intentionally trying to hide information. It's just that PDF is primarily a display format, and unless the PDF creator does the extra work to include some encoding tables and dictionaries, it's easy to create a PDF that displays the correct glyphs but can't be converted to text. I haven't taken a close look at this PDF, but if other viewers are also not able to extract the text, it's a good sign that the PDF was made without support for text extraction. There are heuristics in poppler that try to deal with that situation by guessing what the characters should be, but they are never going to be completely accurate.

I've taken a closer look at this document and I don't see a way to fix the text extraction. The document is using embedded TrueType fonts with identity mapping / UTF-16 encoding, but the encoding is nonsensical and there's no ToUnicode map. It's up to the PDF creator to fix their broken file generation.
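
To illustrate why such a file can display correctly and still yield garbage on copy/paste, here is a toy sketch in Python. It is not poppler code and does not parse PDFs. All function names are hypothetical, the "glyph shape" is stood in for by the character itself, and the fallback of reading each code as a UTF-16 code unit is only an assumed simplification of what an extractor might do when no ToUnicode map is present.

```python
# Toy model of a subset font with arbitrary glyph codes (hypothetical names,
# not poppler's implementation).

def encode_with_subset_font(text):
    """Assign each distinct character a glyph code in order of first use.

    Returns the code sequence stored in the page content and the glyph
    table (code -> shape) stored in the embedded font.
    """
    code_for = {}
    for ch in text:
        # Code 0 is skipped because glyph 0 is conventionally ".notdef".
        code_for.setdefault(ch, len(code_for) + 1)
    glyphs = {code: ch for ch, code in code_for.items()}
    codes = [code_for[ch] for ch in text]
    return codes, glyphs


def render(codes, glyphs):
    """What a viewer draws: look up each code's shape in the font."""
    return "".join(glyphs[c] for c in codes)


def extract_text(codes, to_unicode=None):
    """What copy/paste yields: needs a code -> Unicode map to be correct."""
    if to_unicode is not None:
        return "".join(to_unicode[c] for c in codes)
    # Assumed fallback: interpret each code directly as a UTF-16 code unit.
    return "".join(chr(c) for c in codes)


codes, glyphs = encode_with_subset_font("Total: 42")
print(render(codes, glyphs))                    # the page looks fine
print(repr(extract_text(codes)))                # control characters, not text
print(extract_text(codes, to_unicode=glyphs))   # correct once a map exists
```

In this toy the glyph table doubles as the ToUnicode map only for convenience. In a real file the font holds outlines, so when the producer omits the map there is nothing reliable to recover the characters from, which is why the fix has to happen on the producer's side.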