Bug 22909 - Poppler fails to extract Turkish characters correctly
Status: RESOLVED INVALID
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: x86 (IA32) All
Importance: medium normal
Assignee: poppler-bugs
Reported: 2009-07-23 06:22 UTC by İsmail Dönmez
Modified: 2009-07-24 00:43 UTC (History)

Attachments
Sample pdf file extracted from a longer file (127.21 KB, application/pdf)
2009-07-23 06:22 UTC, İsmail Dönmez

Description İsmail Dönmez 2009-07-23 06:22:24 UTC
Running pdftotext on the attached file garbles the Turkish characters (ı, ş, ğ and so on). Going through the Splash API gives the same result, so I guess it's an internal Poppler issue.
Comment 1 İsmail Dönmez 2009-07-23 06:22:57 UTC
Created attachment 27949
Sample pdf file extracted from a longer file
Comment 2 Albert Astals Cid 2009-07-23 06:56:40 UTC
Adobe can't extract the text correctly either, so I'm leaning towards the file being faulty.
Comment 3 İsmail Dönmez 2009-07-23 06:58:00 UTC
How do you extract text with Adobe, by the way? The file may well be faulty; is there any way to debug what is wrong with it?

Thanks!
Comment 4 Albert Astals Cid 2009-07-23 14:01:09 UTC
File -> Save as Text ;-) That was easy

The font mapping/encoding in the file is probably not set correctly.
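
One possible way to check the font mapping is Poppler's pdffonts tool, which prints one row per font, including its encoding and whether it carries a ToUnicode map (the "uni" column). The sketch below is only an illustration: the helper names `font_report` and `fonts_without_tounicode` are made up, and the parsing assumes the usual column layout ending in `emb sub uni object ID`, which may differ between versions.

```python
import shutil
import subprocess


def font_report(pdf_path):
    """Return the text printed by pdffonts, or None if pdffonts is not installed."""
    if shutil.which("pdffonts") is None:
        return None
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=False
    )
    return result.stdout


def fonts_without_tounicode(report):
    """List font names whose 'uni' column reads 'no'.

    Assumes two header lines, then rows ending in: emb sub uni <obj num> <obj gen>.
    Rows are read from the right because the 'type' column may contain spaces.
    """
    names = []
    for line in report.splitlines()[2:]:
        fields = line.split()
        if len(fields) < 8:
            continue  # not a font row in the assumed layout
        if fields[-3] == "no":
            names.append(fields[0])
    return names
```

A font listed here has no ToUnicode map, so the extractor has to rely on the declared encoding alone; if that encoding is wrong, the extracted text will be wrong too.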
Comment 5 İsmail Dönmez 2009-07-24 00:43:56 UTC
Yeah, looks like they didn't use CP1254 but some other Latin variant. Interesting bug (on the PDF creator side) :-)
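
To illustrate the kind of mismatch meant here, a minimal sketch follows. It assumes, purely for the example, that the "other Latin variant" is Latin-1 (ISO-8859-1); the thread does not say which encoding the file actually declares. The function name `misread` is hypothetical.

```python
def misread(text, real="cp1254", assumed="latin-1"):
    """Encode text with its real code page, then decode it with the wrong one.

    The default 'assumed' encoding (Latin-1) is an assumption for illustration only.
    """
    return text.encode(real).decode(assumed)


# CP1254 and Latin-1 differ in only a few positions, and the Turkish letters
# sit in exactly those: 0xFD, 0xFE, 0xF0 and 0xDD, 0xDE, 0xD0.
print(misread("ışğ"))  # prints "ýþð"
print(misread("İŞĞ"))  # prints "ÝÞÐ"
```

Plain ASCII text and most accented letters survive such a mix-up unchanged, which would explain why only the Turkish-specific characters come out garbled.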

