Bug 22909

Summary: Poppler fails to extract Turkish characters correctly
Product: poppler
Reporter: İsmail Dönmez <ismail>
Component: general
Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED INVALID
Severity: normal
Priority: medium
Version: unspecified
Hardware: x86 (IA32)
OS: All
Attachments: Sample pdf file extracted from a longer file

Description İsmail Dönmez 2009-07-23 06:22:24 UTC
Using pdftotext on the attached file results in Turkish characters (ı, ş, ğ and such) becoming garbled. Using the Splash API results in the same problem, so I guess it's an internal Poppler issue.
Comment 1 İsmail Dönmez 2009-07-23 06:22:57 UTC
Created attachment 27949
Sample pdf file extracted from a longer file
Comment 2 Albert Astals Cid 2009-07-23 06:56:40 UTC
Adobe can't extract the text correctly either, so I'm leaning toward the file being faulty.
Comment 3 İsmail Dönmez 2009-07-23 06:58:00 UTC
How do you extract with Adobe, btw? The file may well be faulty; is there any way to debug what might be wrong with it?

Thanks!
Comment 4 Albert Astals Cid 2009-07-23 14:01:09 UTC
File -> Save as Text ;-) That was easy

The font mapping/encoding is probably not set correctly.
Comment 5 İsmail Dönmez 2009-07-24 00:43:56 UTC
Yeah, looks like they didn't use CP1254 but some other Latin variant. Interesting bug (on the PDF creator side) :-)
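
For illustration, here is a minimal Python sketch of how this kind of encoding mismatch garbles Turkish letters. It assumes the bytes were written as CP1254 and then read as Latin-1. That pairing is only an assumption chosen for the example, since the thread does not establish which encoding the PDF's fonts actually declared. The helper name `misdecode` is hypothetical.

```python
def misdecode(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one codec and decode the bytes with another.

    Models a producer and a consumer that disagree about the encoding.
    Both codec names here are assumptions, not facts about the attached PDF.
    """
    return text.encode(written_as).decode(read_as)


turkish = "ışğ"

# In CP1254 these three letters occupy bytes 0xFD, 0xFE and 0xF0.
assert turkish.encode("cp1254") == b"\xfd\xfe\xf0"

# Read back with the same codec, the text survives.
print(misdecode(turkish, "cp1254", "cp1254"))   # ışğ

# Read back as Latin-1, the same bytes map to different letters.
print(misdecode(turkish, "cp1254", "latin-1"))  # ýþð
```

The bytes themselves are unchanged in both cases; only the table used to interpret them differs, which is why the extracted text looks garbled rather than missing.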