32522 – Some letters are in wrong order in the output of pdftotext

Status:	RESOLVED MOVED

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	general
Version:	unspecified
Hardware:	All Linux (All)

Importance:	medium normal
Assignee:	poppler-bugs

Reported:	2010-12-20 04:50 UTC by Bassem JARKAS
Modified:	2018-08-21 10:41 UTC

Attachments
simple test case using one word (8.15 KB, application/pdf) 2010-12-20 06:15 UTC, Bassem JARKAS
don't reverse chars that map to a single glyph (11.52 KB, patch) 2010-12-23 00:25 UTC, Adrian Johnson
pdftotext crashes with the patch and this pdf (352.54 KB, application/pdf) 2010-12-26 11:50 UTC, Albert Astals Cid
updated patch to fix crash (11.53 KB, patch) 2010-12-26 16:04 UTC, Adrian Johnson

Description Bassem JARKAS 2010-12-20 04:50:21 UTC

I have a PDF file created in Adobe InDesign CS3 (5.0.4) with an embedded Arabic font called AXtManal. This font was created to work around the limitations of publishing software when creating Arabic documents.

pdftotext v0.15.3 (and older versions) outputs some letters in the wrong order. For example, the word "abcd" appears as "acbd". This error is repeated with many pairs of letters, such as "l" and "a", "m" and "j", "r" and "y", etc.

Evince displays the file correctly, with the correct order and the correct layout. The problem is only in the text extraction.

Any idea how to fix that?


You can find the pdf sample here: https://sites.google.com/site/jarkas/Home/049.pdf?attredirects=0&d=1
and the text output: https://sites.google.com/site/jarkas/Home/049_0.15.3.txt?attredirects=0&d=1

Best Regards

Comment 1 Bassem JARKAS 2010-12-20 06:15:47 UTC

Created attachment 41293
simple test case using one word

Simple test case using one word. The error is swapped characters: the 2nd with the 3rd, and the 7th with the 8th.

Comment 2 Adrian Johnson 2010-12-20 14:00:32 UTC

Looking at the PDF, the string is printed with this operation:

  <00CD 00A3 0095 0070 00B4 002A >Tj

I added the spaces for readability.

The ToUnicode map is:

  6 beginbfchar
  <002A> <0627>
  <0070> <062D>
  <0095> <0644>
  <00A3> <064A>
  <00B4> <06440645>
  <00CD> <064A0646>
  endbfchar

So when the text is extracted, the sequence of Unicode characters is:

  064A0646 064A 0644 062D 06440645 0627

The output from

  pdftotext -enc UCS-2 049.pdf - | hexdump -C

is

  00000000  20 2b 06 27 06 45 06 44  06 2d 06 44 06 4a 06 46  | +.'.E.D.-.D.J.F|
  00000010  06 4a 20 2c 00 0a 00 0a  00 0c                    |.J ,......|

pdftotext has output the Unicode characters in reverse order, as you
would expect for an RTL script. It looks like the glyphs that map to
two Unicode characters have had their characters reversed as well.
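
The hexdump above can be decoded to make the swapped pairs easier to see. What follows is only a minimal Python sketch, not part of poppler. It assumes the dump is big-endian UCS-2 and leaves out the trailing line feeds and form feed; the variable names are illustrative.

```python
import unicodedata

# Bytes copied from the hexdump above, minus the trailing 00 0a 00 0a 00 0c.
raw = bytes.fromhex("202b 0627 0645 0644 062d 0644 064a 0646 064a 202c")

# Assumption: the "-enc UCS-2" output is big-endian, as the dump suggests.
text = raw.decode("utf-16-be")

# Print each code point with its Unicode character name.
for ch in text:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch, '<unnamed>')}")

# Plain hex list, convenient for comparing against the ToUnicode map.
code_points = [f"{ord(ch):04X}" for ch in text]
```

The first and last code points are U+202B (RIGHT-TO-LEFT EMBEDDING) and U+202C (POP DIRECTIONAL FORMATTING), which wrap the word. Between them, 0645 0644 and 0646 064A are the two pairs that come from a single glyph each, and both appear swapped relative to the ToUnicode map.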

Comment 3 Adrian Johnson 2010-12-20 14:03:31 UTC

Behdad, you may be interested in this bug. I have no idea how RTL text extraction is supposed to work.

Comment 4 Behdad Esfahbod 2010-12-20 16:50:34 UTC

(In reply to comment #3)
> Behdad, you may be interested in this bug. I have no idea how RTL text
> extraction is supposed to work.

This is a poppler bug.  I mentioned this in my design doc back in 2007, but never followed up.  Adrian, maybe you can look into those?

http://lists.freedesktop.org/archives/poppler/2007-September/002897.html

Specifically:

"""
    o Instead of reversing the glyphs and then extracting text
      from them and append, it extracts text first and then
      reverse.  So if a glyph maps to two or more characters,
      those come out backward in the extracted text, which is
      wrong.
"""
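
The difference between the two orders of operation can be sketched in a few lines of Python, using the Tj string and ToUnicode map from comment 2. This is an illustration of the idea only and not poppler's actual code; the function names are hypothetical, and the whole string is assumed to be a single RTL run.

```python
from typing import Dict, List

# ToUnicode map from comment 2: glyph code -> one or more Unicode characters.
TO_UNICODE: Dict[int, str] = {
    0x002A: "\u0627",
    0x0070: "\u062D",
    0x0095: "\u0644",
    0x00A3: "\u064A",
    0x00B4: "\u0644\u0645",
    0x00CD: "\u064A\u0646",
}

# Glyph codes in the order they appear in the Tj string (visual order).
TJ_CODES: List[int] = [0x00CD, 0x00A3, 0x0095, 0x0070, 0x00B4, 0x002A]


def extract_then_reverse(codes: List[int]) -> str:
    """Map glyphs to text first, then reverse character by character."""
    text = "".join(TO_UNICODE[code] for code in codes)
    return text[::-1]


def reverse_then_extract(codes: List[int]) -> str:
    """Reverse the glyph order first, then map each glyph to its text."""
    return "".join(TO_UNICODE[code] for code in reversed(codes))


def as_hex(text: str) -> List[str]:
    """Render a string as a list of 4-digit hex code points."""
    return [f"{ord(ch):04X}" for ch in text]


print(as_hex(extract_then_reverse(TJ_CODES)))
print(as_hex(reverse_then_extract(TJ_CODES)))
```

The first function reproduces the sequence seen in the hexdump in comment 2 (0627 0645 0644 062D 0644 064A 0646 064A). The second keeps the characters of each multi-character glyph in map order (0627 0644 0645 062D 0644 064A 064A 0646), so the two results differ only at the 2nd/3rd and 7th/8th positions.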

Comment 5 Bassem JARKAS 2010-12-21 01:23:36 UTC

Exactly, that's the problem! If a glyph maps to two or more characters, they will appear in reverse order.


So the solution might be what Behdad mentioned before, or applying a new ToUnicode map. For example:
instead of  <00B4> <06440645>
we could write <00B4> <06450644>
Is that possible with poppler?

I've tried that with the old pdftotext from xpdf-3.02, which accepts the -cfg option, but it seems that it didn't read it.
Is there any way to do it with poppler?

Comment 6 Adrian Johnson 2010-12-21 01:31:16 UTC

I'll have a look at the code next week and see if I can come up with a patch to fix it.

Does Adobe Reader extract the text correctly?

Comment 7 Bassem JARKAS 2010-12-21 09:47:58 UTC

(In reply to comment #6)

> Does Adobe Reader extract the text correctly?

With copy and paste it's ok. and with text extraction it gives but extra U+FFFD characters with some words.

Comment 8 Bassem JARKAS 2010-12-21 09:51:08 UTC

Hi Adrian,
Could you please tell me where you got this:

> Looking at the PDF, the string is printed with this operation:
> 
>   <00CD 00A3 0095 0070 00B4 002A >Tj
> 

And this too:

> the toUnicode map is:
> 
>   6 beginbfchar
>   <002A> <0627>
>   <0070> <062D>
>   <0095> <0644>
>   <00A3> <064A>
>   <00B4> <06440645>
>   <00CD> <064A0646>
>   endbfchar

Thank you

Comment 9 Bassem JARKAS 2010-12-21 09:52:59 UTC

(In reply to comment #7)
> (In reply to comment #6)
> 
> > Does Adobe Reader extract the text correctly?
> 
> With copy and paste it's ok. and with text extraction it gives but extra U+FFFD
> characters with some words.

Sorry for the typo:
With copy and paste it's OK, and with text extraction it gives extra U+FFFD characters in some words.

Comment 10 Adrian Johnson 2010-12-21 13:33:44 UTC

(In reply to comment #8)
> Hi Adrian,
> Could you please tell me where you got this:
> 
> > Looking at the PDF, the string is printed with this operation:
> > 
> >   <00CD 00A3 0095 0070 00B4 002A >Tj
> > 

Uncompress the PDF:

  pdftk one_word_arabic.pdf output one_word_arabic-u.pdf uncompress

then open the output with a text editor.
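
Once the PDF is uncompressed, the same information can also be pulled out programmatically. The following Python sketch is only an illustration: the function names are hypothetical, it handles `bfchar` entries only (not `bfrange`), and it assumes 2-byte glyph codes as in this test case. The fragments are the ones quoted in comment 2.

```python
import re
from typing import Dict, List

# Fragments as they appear in the uncompressed PDF (copied from comment 2).
CMAP_FRAGMENT = """
6 beginbfchar
<002A> <0627>
<0070> <062D>
<0095> <0644>
<00A3> <064A>
<00B4> <06440645>
<00CD> <064A0646>
endbfchar
"""
TJ_OPERATION = "<00CD 00A3 0095 0070 00B4 002A >Tj"


def parse_bfchar(cmap_text: str) -> Dict[int, str]:
    """Collect glyph code -> Unicode string pairs from bfchar blocks."""
    mapping: Dict[int, str] = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # Destination values are UTF-16BE encoded.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def parse_tj_codes(operation: str, code_bytes: int = 2) -> List[int]:
    """Split the hex string of a Tj operation into fixed-width glyph codes."""
    match = re.search(r"<([0-9A-Fa-f\s]*)>\s*Tj", operation)
    if match is None:
        return []
    digits = re.sub(r"\s+", "", match.group(1))
    step = code_bytes * 2  # two hex digits per byte
    return [int(digits[i:i + step], 16) for i in range(0, len(digits), step)]


to_unicode = parse_bfchar(CMAP_FRAGMENT)
codes = parse_tj_codes(TJ_OPERATION)
for code in codes:
    print(f"{code:04X} ->", " ".join(f"{ord(ch):04X}" for ch in to_unicode[code]))
```

Printing the mapping per glyph makes it easy to spot which glyphs expand to more than one character, which are the ones affected by this bug.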

Comment 11 Adrian Johnson 2010-12-23 00:25:27 UTC

Created attachment 41395
don't reverse chars that map to a single glyph

The attached patch avoids reversing unicode chars that are mapped to a single glyph. Testing with the arabic_one_word.pdf test case works for me.

Comment 12 Adrian Johnson 2010-12-23 00:32:22 UTC

(In reply to comment #9)
> sorry for the typo:
> With copy and paste it's ok. and with text extraction it gives an extra U+FFFD
> characters with some words.

The sample PDF in comment 0 has ActualText around some of the glyphs that maps the glyphs to U+FFFD, i.e.

  /Span<</ActualText<FEFFFFFD>>> BDC 
  0 0.02 TD
  <0067>Tj
  EMC 

If the U+FFFD character should not appear in the extracted text, it is a problem with the application that created the PDF.
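
As a quick check of the ActualText value shown above, the hex string can be decoded in Python. This is a minimal sketch; it relies on the fact that a PDF text string starting with FE FF is UTF-16 big-endian with a byte order mark.

```python
# Hex string taken from the /ActualText entry above.
actual_text = bytes.fromhex("FEFFFFFD")

# FE FF is the UTF-16 big-endian byte order mark; the "utf-16" codec
# uses it to pick the byte order and does not include it in the result.
decoded = actual_text.decode("utf-16")

print([f"U+{ord(ch):04X}" for ch in decoded])
```

After the byte order mark is consumed, the only character left is U+FFFD (REPLACEMENT CHARACTER), which is what ends up in the extracted text.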

Comment 13 Albert Astals Cid 2010-12-25 04:25:30 UTC

Hi Adrian, to regtest your patch, I understand that I can use pdftotext and the new output should be the same as the old one, except in files using RTL characters?

Comment 14 Adrian Johnson 2010-12-25 16:21:52 UTC

(In reply to comment #13)
> Hi Adrian, to regtest your patch i understand that I can use pdftotext and the
> new output should be the same than the old except in files using RTL
> characters?

That's correct. There should be no change in the pdftotext output from PDFs using LTR text.

Comment 15 Bassem JARKAS 2010-12-26 06:10:23 UTC

(In reply to comment #11)
> Created an attachment (id=41395) [details]
> don't reverse chars that map to a single glyph
> 
> The attached patch avoids reversing unicode chars that are mapped to a single
> glyph. Testing with the arabic_one_word.pdf test case works for me.

Hello Adrian,
I've tried the patch with poppler-0.15.3.tar.gz downloaded from the website.
The patch corrected some words, left many uncorrected, and corrupted many others.

Comment 16 Bassem JARKAS 2010-12-26 06:28:09 UTC

During the last few days I've written a simple shell script as a workaround to solve my problem for the time being.
The script inverts the characters of glyphs which map to two or three characters.
The output is very good, but it still has the annoying U+FFFD character, which I removed with another script.
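
The shell script itself is not attached to this bug. The idea of such a post-processing workaround can be sketched in Python as follows; this is a hypothetical reconstruction, not the script described above. The function names are made up, and the list of multi-character targets is assumed to come from the ToUnicode map of the PDF being processed.

```python
from typing import Iterable


def fix_reversed_sequences(text: str, multi_char_targets: Iterable[str]) -> str:
    """Swap back sequences that pdftotext emitted in reversed order.

    Caveat: a reversed sequence that legitimately occurs in the text
    would be swapped as well, so this is a heuristic, not a real fix.
    """
    # Handle longer sequences first so they are not broken up by shorter ones.
    for target in sorted(multi_char_targets, key=len, reverse=True):
        text = text.replace(target[::-1], target)
    return text


def strip_replacement_chars(text: str) -> str:
    """Drop U+FFFD characters left over from ActualText entries."""
    return text.replace("\ufffd", "")


# Multi-character ToUnicode targets from the one-word test case (comment 2).
targets = ["\u0644\u0645", "\u064A\u0646"]

# Text as extracted by the unpatched pdftotext (without direction marks).
broken = "\u0627\u0645\u0644\u062D\u0644\u064A\u0646\u064A"

fixed = strip_replacement_chars(fix_reversed_sequences(broken, targets))
print([f"{ord(ch):04X}" for ch in fixed])
```

For the one-word test case this yields 0627 0644 0645 062D 0644 064A 064A 0646, which matches the order obtained when each glyph's characters are kept in map order.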

Comment 17 Albert Astals Cid 2010-12-26 11:50:16 UTC

Created attachment 41459
pdftotext crashes with the patch and this pdf

Can you have a look at the crash produced by your patch in this pdf?

Comment 18 Adrian Johnson 2010-12-26 16:04:15 UTC

Created attachment 41466
updated patch to fix crash

Updated patch to fix crash with pdf in comment 17.

Comment 19 Adrian Johnson 2010-12-26 16:09:00 UTC

(In reply to comment #15)
> Hello Adrian,
> I've tried the patch with poppler-0.15.3.tar.gz downloaded from the website.
> The patch corrected some words, and didn't correct many, and corrupted many
> words.

Please test with my updated patch. If you are still seeing the same problem, please provide the simplest possible test case and explain exactly which characters are wrong. I can't read Arabic, so I need a simple test case where I can check the hex values of the text output.

Comment 20 GitLab Migration User 2018-08-21 10:41:04 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/322.
