Bug 103309 - pdftotext: UTF-16 text without BOM not properly extracted
Status: RESOLVED MOVED
Alias: None
Product: poppler
Classification: Unclassified
Component: utils
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: poppler-bugs
Reported: 2017-10-17 10:22 UTC by ralf.stubner
Modified: 2018-08-21 10:41 UTC

Attachments
Sample file (166.82 KB, application/pdf)
2017-10-17 10:22 UTC, ralf.stubner

Description ralf.stubner 2017-10-17 10:22:13 UTC
Created attachment 134881
Sample file

Running pdftotext on the attached sample file produces no usable text. Looking at the file in a hex editor, I can see that the text is stored as UTF-16BE *without* a BOM. xpdf displays the file correctly.

Tested with version 0.48.0 (Debian Stable) and 0.57.0 (Debian Testing).
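
To illustrate the encoding issue described above, here is a minimal Python sketch of decoding UTF-16 data that may or may not carry a BOM. It is not poppler's code and says nothing about how pdftotext handles such strings internally; the function name `decode_utf16` and the choice of big-endian as the fallback byte order are assumptions made for this example only.

```python
import codecs


def decode_utf16(raw: bytes, default: str = "utf-16-be") -> str:
    """Decode UTF-16 bytes (hypothetical helper, for illustration only).

    A leading BOM decides the byte order; without one, fall back to
    `default` (assumed big-endian here, matching the sample file).
    """
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    return raw.decode(default)


# UTF-16BE bytes without a BOM, as seen in the hex editor.
sample = "Text".encode("utf-16-be")

print(decode_utf16(sample))                        # Text
print(decode_utf16(codecs.BOM_UTF16_BE + sample))  # Text
```

Decoding the same BOM-less bytes with the wrong byte order (for example `sample.decode("utf-16-le")`) raises no error but yields unrelated characters, which is one way BOM-less UTF-16BE data can end up as unusable output.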
Comment 1 ralf.stubner 2017-10-17 10:38:01 UTC
Additional note:

$ java -jar pdfbox-app-2.0.7.jar ExtractText 2004.pdf

Extracts the text but issues some warnings:

Oct 17, 2017 12:34:44 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font JRLFSC+Segoe UI,Bold-Identity-H
Oct 17, 2017 12:34:44 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font EUPBOV+Arial Unicode MS-Identity-H
Oct 17, 2017 12:34:44 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font VRSAOT+Arial Unicode MS,Bold-Identity-H
Oct 17, 2017 12:34:44 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font FAMOVB+Segoe UI-Identity-H
Comment 2 GitLab Migration User 2018-08-21 10:41:59 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/332.

