Poppler extracts ż as Ŝ and Ż as ś. The latter cannot be told apart from a real ś, so the extracted text loses data.
File contains crap like this:
1648 0 obj
Please attach a file showing the problem.
You marked it as critical and then do not even bother answering questions?
I cannot attach documents unscrambled, and have no idea how to scramble it.
Also, the resulting file is irrevocably corrupted; this is not just an ordinary bug in a text extraction application.
What do you mean with "scramble"?
I.e., replace sensitive document text with junk.
As I said, without a file we won't have a look at this. We have lots of bugs we can easily reproduce, so working on one we cannot reproduce is a bad idea. (Note that I'm not saying that we will work on it if we get the file; it just improves its chances.)
The font in the document has the following /Encoding:
/BaseEncoding /WinAnsiEncoding /Differences [ 1 /eogonek /zdot /zacute /cacute
/sacute /aogonek /Zdot /nacute /Sacute /Eogonek /Zacute ] /Type /Encoding
but it has the following /ToUnicode:
/CIDInit /ProcSet findresource begin
12 dict begin
/CMapType 2 def
CMapName currentdict /CMap defineresource pop
So why does your software interpret character #7 not as EXPLICITLY DEFINED \Zdot, but as 0x015B?
P.S. If you need a sample document, search the web for "waŜne". That should return plenty of public domain documents.
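For reference, here is a minimal Python sketch of how the /Differences array quoted above assigns glyph names to character codes: the leading integer sets the starting code, and each following name takes the next code in turn. The function name `parse_differences` and the `GLYPH_TO_UNICODE` table are illustrative assumptions (the table is a hand-written subset of the Adobe Glyph List), not poppler code.

```python
# Hand-written subset of the Adobe Glyph List covering only the glyph names
# that appear in the /Differences array quoted in this report (assumption:
# these are the standard AGL values for those names).
GLYPH_TO_UNICODE = {
    "eogonek": 0x0119, "zdot": 0x017C, "zacute": 0x017A, "cacute": 0x0107,
    "sacute": 0x015B, "aogonek": 0x0105, "Zdot": 0x017B, "nacute": 0x0144,
    "Sacute": 0x015A, "Eogonek": 0x0118, "Zacute": 0x0179,
}

# The /Differences array from the report, as a token list.
DIFFERENCES = [
    1, "/eogonek", "/zdot", "/zacute", "/cacute", "/sacute", "/aogonek",
    "/Zdot", "/nacute", "/Sacute", "/Eogonek", "/Zacute",
]


def parse_differences(tokens):
    """Map character codes to glyph names.

    An integer token sets the current code; every name token is assigned to
    the current code, which is then incremented.
    """
    mapping = {}
    code = None
    for tok in tokens:
        if isinstance(tok, int):
            code = tok
        else:
            if code is None:
                raise ValueError("glyph name before any starting code")
            mapping[code] = tok.lstrip("/")
            code += 1
    return mapping


code_to_name = parse_differences(DIFFERENCES)
name = code_to_name[7]                      # glyph name for character #7
print(name, hex(GLYPH_TO_UNICODE[name]))    # the Differences-based reading
```

Under these assumptions, character #7 resolves to the glyph name Zdot, which is the reading the question above expects.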
(In reply to Urmas from comment #7)
> So why does your software interpret character #7 not as EXPLICITLY DEFINED
> \Zdot, but as 0x015B?
Because it's what the specification says to do. The ToUnicode CMap takes precedence over the Differences array.
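The precedence rule can be illustrated with a small Python sketch. It is not poppler's implementation: the function `code_to_unicode` is hypothetical, and the `TO_UNICODE` entries are reconstructed from the symptoms reported in this thread (ż extracted as Ŝ, Ż as ś), not read from the document's actual CMap, which is only partially quoted above.

```python
# Glyph names that /Differences assigns to the two codes discussed in this
# thread (code 2 = zdot, code 7 = Zdot), with their Unicode values.
DIFFERENCES = {2: "zdot", 7: "Zdot"}
GLYPH_TO_UNICODE = {"zdot": "\u017c", "Zdot": "\u017b"}   # ż, Ż

# Assumed ToUnicode entries, inferred from the reported output only:
# code 2 comes out as U+015C (Ŝ) and code 7 as U+015B (ś).
TO_UNICODE = {2: "\u015c", 7: "\u015b"}


def code_to_unicode(code, to_unicode, differences, glyph_to_unicode):
    """Resolve a character code to text, letting ToUnicode win.

    If a ToUnicode mapping exists for the code, it is used as-is.
    Only otherwise is the glyph name from /Differences consulted.
    """
    if to_unicode is not None and code in to_unicode:
        return to_unicode[code]
    name = differences.get(code)
    if name is None:
        return None
    return glyph_to_unicode.get(name)


# With the (assumed) ToUnicode CMap present, the CMap's value is returned,
# even though /Differences names a different glyph for the same code.
with_cmap = code_to_unicode(7, TO_UNICODE, DIFFERENCES, GLYPH_TO_UNICODE)

# Without a ToUnicode CMap, the /Differences glyph name decides.
without_cmap = code_to_unicode(7, None, DIFFERENCES, GLYPH_TO_UNICODE)

print(with_cmap, without_cmap)
```

In this sketch the extractor never compares the two sources, so a ToUnicode CMap that disagrees with the /Differences array is taken at face value. That is consistent with the reported output if the CMap embedded in the file is itself wrong.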
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/220.