Bug 106406 - pdftotext cannot extract text correctly from specific pdf
Status: RESOLVED NOTABUG
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: x86-64 (AMD64) Windows (All)
Importance: medium normal
Assignee: poppler-bugs
Reported: 2018-05-05 07:07 UTC by Derrick
Modified: 2018-08-20 22:11 UTC

Attachments
source pdf file (311.09 KB, application/pdf)
2018-05-05 07:07 UTC, Derrick
data (126.08 KB, image/png)
2018-05-05 09:31 UTC, Derrick
debug information of pdf file by poppler (18.21 KB, text/plain)
2018-05-06 13:52 UTC, Derrick

Description Derrick 2018-05-05 07:07:07 UTC
Created attachment 139361 [details]
source pdf file

Hi,

  pdftotext fails to extract the text correctly from a specific PDF (see attachment).
The exit status is 0 and no warnings or errors are reported.

  command:
  pdftotext.exe  -bbox-layout q7.pdf

  product version:
  poppler      0.62.0     x86_64
  poppler-data 0.4.8  

best regards
Derrick
Comment 1 Albert Astals Cid 2018-05-05 09:26:08 UTC
As far as I can see, there's no correct text info in that PDF (at least Acrobat can't seem to extract the text either).
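
One way to check for missing text info, assuming poppler's pdffonts utility is available: its listing has a "uni" column that says whether each font carries a ToUnicode map. The sketch below only parses such a listing. The helper name and the sample rows are hypothetical and are not output from the attached file.

```python
def fonts_without_unicode_map(pdffonts_output):
    """Return the names of fonts whose 'uni' column reads 'no'.

    Assumes the usual pdffonts layout: a header line, a dashed line,
    then one row per font ending in: emb sub uni <object number> <generation>.
    """
    missing = []
    for line in pdffonts_output.splitlines()[2:]:  # skip header and dashes
        tokens = line.split()
        if len(tokens) < 8:
            continue  # not a complete font row
        # Count from the right, because the type column may contain spaces.
        uni = tokens[-3]
        if uni == "no":
            missing.append(tokens[0])
    return missing


# Hypothetical listing, made up for illustration only.
SAMPLE = """\
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ABCDEF+ExampleCJK                    CID TrueType      Identity-H       yes yes no      12  0
GHIJKL+ExampleLatin                  Type 1C           WinAnsi          yes yes yes     15  0
"""

print(fonts_without_unicode_map(SAMPLE))
```

On a real file the listing could be obtained with something like `subprocess.run(["pdffonts", "q7.pdf"], capture_output=True, text=True).stdout`. A font flagged here is only a hint that extraction may fail, not proof.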
Comment 2 Derrick 2018-05-05 09:31:31 UTC
Created attachment 139365 [details]
data
Comment 3 Derrick 2018-05-05 09:32:49 UTC
The data is Chinese, and I can view the PDF data in the Microsoft Edge browser.
My computer's system language is Chinese.
Comment 4 Albert Astals Cid 2018-05-05 09:52:35 UTC
You can view the PDF too using a poppler-based viewer (okular, evince, pdftoppm).

That doesn't mean you can extract the text. Those are two completely different things.
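
A minimal sketch of that distinction, using a simplified toy model rather than poppler's actual code: drawing only needs a mapping from character codes to glyphs, while extraction needs a separate mapping from character codes to Unicode (in PDF terms, typically a ToUnicode map). All names and numbers below are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ToyFont:
    # Character code -> glyph index: enough to draw the page.
    code_to_glyph: Dict[int, int]
    # Character code -> Unicode text: needed to extract text, may be absent.
    to_unicode: Optional[Dict[int, str]] = None


def glyphs_to_draw(font, codes):
    """What a viewer needs: one glyph index per character code."""
    return [font.code_to_glyph[code] for code in codes]


def text_to_extract(font, codes):
    """What a text extractor needs: Unicode for each code, if available."""
    if font.to_unicode is None:
        return None  # nothing reliable to map the codes to
    return "".join(font.to_unicode.get(code, "\ufffd") for code in codes)


codes = [1, 2]
display_only = ToyFont(code_to_glyph={1: 10, 2: 11})
with_text = ToyFont(code_to_glyph={1: 10, 2: 11}, to_unicode={1: "中", 2: "文"})

print(glyphs_to_draw(display_only, codes))   # both fonts can be drawn
print(text_to_extract(display_only, codes))  # but this one yields no text
print(text_to_extract(with_text, codes))
```

In this model both fonts render identically, yet only the second one gives an extractor anything to work with.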
Comment 5 Derrick 2018-05-06 13:51:11 UTC
I downloaded evince, and the file can be viewed with it.
Could you tell me why the text cannot be extracted by pdftotext?

I know that text can't be extracted in the following cases:
1. The words are drawn as lines, not as text.
2. The font type is fontTrueType or fontTrueTypeOT and the character set is the same as the system character set.

But in this case, the font type is "fontCIDType2 Identity-H BFAIEF+SimSun-GBK-EUC-H", so I think it should be extracted correctly.
====================================================
gs /GS1
exec op gs
  gfx state dict: << /OP false /OPM 1 /SA false /SM 0.02 /Type /ExtGState /op false >>
BT
exec op BT
Tf /TT2 1
exec op Tf
 opSetFont args[0].getName=TT2
  font: tag=TT2 name='BFAIEF+SimSun-GBK-EUC-H' 1
Tm 16.0751 0 0 16.0751 167.949 734.655
exec op Tm
cs /Cs6
exec op cs
scn 0.019608 0.003922 0
exec op scn
Tc 0
exec op Tc
Tw 0
exec op Tw
Tj (ɩ
exec op Tj
font doShowText key=TT2 name=10 Identity-H BFAIEF+SimSun-GBK-EUC-H
ET
exec op ET
q
exec op q
i 1
exec op i
===============================================================

By default, poppler seems to treat the characters in the Tj operation as UCS-4. Maybe in this case the characters use some other encoding, which causes the incorrect extraction.

The attachment is debug information of the pdf file by poppler.
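
To illustrate how the interpretation of the string operand matters, here is a small sketch. It assumes only that under Identity-H the string is split into 2-byte big-endian codes and each code is used directly as the CID. The operand bytes are hypothetical, not taken from q7.pdf, and the "naive" reading is shown purely as a contrast, not as a description of what poppler does.

```python
import struct


def identity_h_cids(operand):
    """Split a string operand into 2-byte big-endian codes (code == CID)."""
    if len(operand) % 2:
        raise ValueError("expected an even number of bytes")
    return [code for (code,) in struct.iter_unpack(">H", operand)]


def naive_text(cids):
    """Treat each CID as if it were a Unicode code point (usually wrong)."""
    return "".join(chr(cid) for cid in cids)


# Hypothetical operand bytes, chosen only for illustration.
operand = bytes([0x02, 0x69, 0x0A, 0x3C])
cids = identity_h_cids(operand)

print(cids)
print(naive_text(cids))
```

If the CIDs are really glyph indices of an embedded font subset, the characters produced by the naive reading have nothing to do with what is drawn on the page. Only a mapping stored in the file can recover the intended text.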
Comment 6 Derrick 2018-05-06 13:52:25 UTC
Created attachment 139392 [details]
debug information of pdf file by poppler
Comment 7 Albert Astals Cid 2018-05-06 15:14:25 UTC
Sorry, I don't really have time to explain to you how PDF files work.

I've not been able to find any other PDF viewer out there (not even the one from the people that invented PDF itself) that can extract the text from that file, so I'd say it's just the file being broken.

If you can extract the text with some PDF tool or, even better, provide a patch, I'll be happy to review it.
Comment 8 Daniel Stone 2018-08-20 22:11:26 UTC
This bug report contains markup which cannot be properly exported by Bugzilla, making it generate invalid XML. As it seems the result is that this bug will not be taken forward, I'm closing it now rather than migrating it to GitLab.