Bug 38739 - Polish characters incorrectly extracted
Status: RESOLVED MOVED
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
Reported: 2011-06-28 02:37 UTC by Urmas
Modified: 2018-08-20 22:09 UTC

Description Urmas 2011-06-28 02:37:18 UTC
Poppler extracts ż as Ŝ and Ż as ś, which makes Ż indistinguishable from a real ś and results in data loss.

The file contains crap like this:

1648 0 obj
<</Type/Encoding/BaseEncoding/WinAnsiEncoding/Differences[
1/eogonek/sacute/nacute/zdot/cacute/aogonek/Zdot/zacute/Sacute]>>
endobj
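
For readers unfamiliar with the /Differences syntax: a number sets the next character code, and each glyph name after it takes the next consecutive code. A minimal Python sketch that expands the array above into a code-to-glyph-name table (the function name `parse_differences` is hypothetical and is not part of poppler):

```python
def parse_differences(tokens):
    """Expand a PDF /Differences array into {character code: glyph name}."""
    mapping = {}
    code = None
    for tok in tokens:
        if isinstance(tok, int):
            # An integer resets the running character code.
            code = tok
        else:
            if code is None:
                raise ValueError("glyph name appears before any starting code")
            mapping[code] = tok
            code += 1
    return mapping


# The array from object 1648 above, already split into tokens.
differences = [1, "eogonek", "sacute", "nacute", "zdot", "cacute",
               "aogonek", "Zdot", "zacute", "Sacute"]
codes = parse_differences(differences)

for code, name in sorted(codes.items()):
    print(f"{code:#04x} -> /{name}")
```

In this object, code 4 is /zdot and code 7 is /Zdot.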
Comment 1 Albert Astals Cid 2011-06-28 05:12:10 UTC
Please attach a file showing the problem.
Comment 2 Albert Astals Cid 2011-07-06 14:18:57 UTC
You marked it as critical and then do not even bother answering questions?
Comment 3 Urmas 2011-07-07 11:26:28 UTC
I cannot attach the documents unscrambled, and I have no idea how to scramble them.
Also, the resulting file is irrevocably corrupted; this is not just an ordinary bug for a text extraction application.
Comment 4 Albert Astals Cid 2011-07-08 12:58:00 UTC
What do you mean by "scramble"?
Comment 5 Urmas 2011-07-17 04:11:10 UTC
I.e., replace the sensitive document text with junk.
Comment 6 Albert Astals Cid 2011-07-18 03:58:35 UTC
As said, without a file we won't have a look at this. We have lots of bugs we can easily reproduce, so working on one that we cannot is a bad idea. (Note that I'm not saying that we will work on it if we get the file; it just improves its chances.)
Comment 7 Urmas 2015-01-31 20:02:41 UTC
The font in the document has the following /Encoding:

/BaseEncoding /WinAnsiEncoding /Differences [ 1 /eogonek /zdot /zacute /cacute
/sacute /aogonek /Zdot /nacute /Sacute /Eogonek /Zacute ] /Type /Encoding

but it has the following /ToUnicode:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
/CMapName/R1642 def
1 begincodespacerange
<00><ff>
endcodespacerange
11 beginbfrange
<01><01><0119>
<02><02><015c>
<03><03><017a>
<04><04><0107>
<05><05><015b>
<06><06><0105>
<07><07><015b>
<08><08><0144>
<09><09><015a>
<0a><0a><0118>
<0b><0b><0179>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end

So why does your software interpret character #7 not as EXPLICITLY DEFINED \Zdot, but as 0x015B?
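
The disagreement between the two tables can be checked mechanically. The sketch below is only an illustration in Python, not poppler code: `parse_bfrange` and `find_conflicts` are hypothetical helpers, and the glyph-name table is hand-written for the eleven names in this font, with the code points assumed from the names.

```python
import re

# Hand-written glyph-name table for this font only (an assumption for
# illustration; a real tool would consult a complete glyph list).
GLYPH_TO_UNICODE = {
    "eogonek": 0x0119, "zdot": 0x017C, "zacute": 0x017A, "cacute": 0x0107,
    "sacute": 0x015B, "aogonek": 0x0105, "Zdot": 0x017B, "nacute": 0x0144,
    "Sacute": 0x015A, "Eogonek": 0x0118, "Zacute": 0x0179,
}

# Glyph names from the /Differences array above, starting at code 1.
DIFFERENCES = ("eogonek zdot zacute cacute sacute aogonek "
               "Zdot nacute Sacute Eogonek Zacute").split()

# The bfrange entries from the /ToUnicode CMap above.
BFRANGE = """
<01><01><0119> <02><02><015c> <03><03><017a> <04><04><0107>
<05><05><015b> <06><06><0105> <07><07><015b> <08><08><0144>
<09><09><015a> <0a><0a><0118> <0b><0b><0179>
"""


def parse_bfrange(text):
    """Return {code: code point} for <lo><hi><dst> bfrange entries."""
    mapping = {}
    pattern = r"<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>"
    for lo, hi, dst in re.findall(pattern, text):
        lo, hi, dst = int(lo, 16), int(hi, 16), int(dst, 16)
        for offset, code in enumerate(range(lo, hi + 1)):
            mapping[code] = dst + offset
    return mapping


def find_conflicts(names, first_code, to_unicode):
    """List codes where ToUnicode disagrees with the glyph name."""
    conflicts = []
    for code, name in enumerate(names, start=first_code):
        expected = GLYPH_TO_UNICODE[name]
        actual = to_unicode.get(code)
        if actual is not None and actual != expected:
            conflicts.append((code, name, chr(expected), chr(actual)))
    return conflicts


conflicts = find_conflicts(DIFFERENCES, 1, parse_bfrange(BFRANGE))
for code, name, want, got in conflicts:
    print(f"code {code}: /{name} suggests {want!r}, ToUnicode gives {got!r}")
```

On this data the check reports codes 2 and 7, which matches the ż → Ŝ and Ż → ś symptoms in the description.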
Comment 8 Urmas 2015-01-31 20:27:24 UTC
P.S. If you need a sample document, search the web for "waŜne". That should return plenty of public domain documents.
Comment 9 Jason Crain 2015-02-01 08:52:29 UTC
(In reply to Urmas from comment #7)
> So why does your software interpret character #7 not as EXPLICITLY DEFINED
> \Zdot, but as 0x015B?

Because it's what the specification says to do.  The ToUnicode CMap takes precedence over the Differences array.
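
A minimal Python sketch of the precedence rule described in this comment, assuming a simple font with one-byte codes. It is not poppler's actual implementation; `code_to_unicode` and its parameters are hypothetical names chosen for illustration.

```python
def code_to_unicode(code, to_unicode, encoding_names, glyph_table):
    """Resolve a character code to text, consulting ToUnicode first."""
    # 1. A ToUnicode entry wins whenever one exists for this code.
    if code in to_unicode:
        return chr(to_unicode[code])
    # 2. Otherwise fall back to the glyph name from /Encoding /Differences.
    name = encoding_names.get(code)
    if name in glyph_table:
        return chr(glyph_table[name])
    # 3. Nothing usable: the caller decides how to handle unmapped codes.
    return None


# Values for code 7 taken from comment 7.
to_unicode = {0x07: 0x015B}       # <07><07><015b>
encoding_names = {0x07: "Zdot"}   # seventh name in /Differences
glyph_table = {"Zdot": 0x017B}    # assumed code point for /Zdot

with_cmap = code_to_unicode(0x07, to_unicode, encoding_names, glyph_table)
without_cmap = code_to_unicode(0x07, {}, encoding_names, glyph_table)
print(with_cmap, without_cmap)
```

With the CMap present, code 7 resolves to ś; only when the ToUnicode entry is absent does the /Zdot name from the Differences array come into play.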
Comment 10 GitLab Migration User 2018-08-20 22:09:13 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/220.

