Bug 43488 - PDF file with Arabic text content is not converted well
Summary: PDF file with Arabic text content is not converted well
Status: RESOLVED NOTABUG
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: x86 (IA32) Linux (All)
Importance: medium normal
Assignee: poppler-bugs
 
Reported: 2011-12-03 07:00 UTC by Said Bakr
Modified: 2016-03-20 16:02 UTC

Attachments
The pdf file with Arabic content text. (69.42 KB, application/pdf)
2011-12-03 07:00 UTC, Said Bakr
Screenshot for Acrobat reader plugin render of the file (131.14 KB, image/png)
2011-12-07 14:03 UTC, Said Bakr

Description Said Bakr 2011-12-03 07:00:54 UTC
Created attachment 54089
The pdf file with Arabic content text.

The attached file is an example of a PDF file with Arabic text content that pdftotext and pdftohtml are not able to convert into text at all.

The output is only parentheses and some integers.
The following is a partial copy of the produced text file:

("

"

)

:
/

:
:
(

/
)

( )
1:

("

"

)

:
/

:
:
(

/
)

( )
1:

("

"
Comment 1 Albert Astals Cid 2011-12-03 07:16:35 UTC
Not critical
Comment 2 Albert Astals Cid 2011-12-03 07:16:51 UTC
Not a pdftohtml only bug.
Comment 3 Albert Astals Cid 2011-12-03 07:17:55 UTC
Are you sure the file is not simply broken? Does this file open correctly in any PDF viewer? Adobe Reader 9.4.6 on Linux is not able to render it correctly either.
Comment 4 Said Bakr 2011-12-03 09:08:57 UTC
I'm sure that both Adobe Reader and the Document Viewer of Ubuntu 11.10 are able to open and read this file correctly. If you are able to download the attached file, you will see this.
Comment 5 Albert Astals Cid 2011-12-03 14:24:19 UTC
I am using evince (which I guess is what you mean by "Document Viewer") in Ubuntu 11.10 and it does not work.
Comment 6 Said Bakr 2011-12-03 14:41:32 UTC
(In reply to comment #5)
> I am using evince (with i guess is what you mean with "Document Viewer") in
> Ubuntu 11.10 and it does not work.
I installed some MS fonts some time ago; I think it was through ttf-mscorefonts-installer. That is why evince is able to open it. By the way, my system is able to write Arabic, i.e. I have an Arabic keyboard layout.
Comment 7 Said Bakr 2011-12-07 14:03:32 UTC
Created attachment 54200
Screenshot for Acrobat reader plugin render of the file

This PNG file is a screenshot of the Acrobat Reader plugin in the Google Chrome browser rendering the attached PDF file.
Comment 8 Jason Crain 2016-03-20 16:02:47 UTC
I would not consider this a bug in poppler, pdftohtml, or pdftotext.  The document uses glyph IDs instead of a real character encoding and does not embed fonts.  Since glyph IDs are only meaningful for one particular font, this means that this document can only be viewed correctly if you have the correct font installed (Microsoft's Arial font, in this case).  And since it doesn't use a real character encoding, poppler can't get the text out of the document and pdftohtml and pdftotext will not work.  Note that Adobe Reader and other PDF viewers can't get the text either.
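
To illustrate why glyph IDs alone are not enough, here is a minimal Python sketch. It is not taken from the attached file or from poppler: the glyph numbers and both glyph tables are invented, and characters stand in for glyph shapes. It only shows that the same sequence of glyph IDs yields different results depending on which font's table is used, and nothing usable when no table is available.

```python
# Hypothetical glyph tables (glyph ID -> character standing in for a shape).
# These numbers are made up for illustration; real fonts differ.
INTENDED_FONT = {3: "س", 4: "ل", 5: "ا", 6: "م"}
SUBSTITUTE_FONT = {3: "(", 4: ")", 5: ":", 6: "/"}


def decode(glyph_ids, glyph_table):
    """Map each glyph ID through a table; unknown IDs become U+FFFD."""
    return "".join(glyph_table.get(g, "\ufffd") for g in glyph_ids)


# What such a document stores: bare glyph IDs, no character codes.
content = [3, 4, 5, 6]

with_intended = decode(content, INTENDED_FONT)      # readable text
with_substitute = decode(content, SUBSTITUTE_FONT)  # unrelated symbols
without_table = decode(content, {})                 # nothing recoverable

print(with_intended, with_substitute, without_table)
```

In this toy model, only the table of the font the document was written against gives the intended result, which matches the observation that the file looks right only where that font is installed.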

