I have a PDF file created in Adobe InDesign CS3 (5.0.4) with an embedded Arabic font called AXtManal. This font was created to work around the limitations of publishing software when creating Arabic documents. pdftotext v0.15.3 (and older versions) outputs some letters in the wrong order. For example, the word "abcd" appears as "acbd", and this error is repeated with many pairs of letters, like "l" and "a", "m" and "j", "r" and "y", etc. Evince displays the file correctly, with the correct order and the correct layout; the problem is only in text extraction. Any idea how to fix that? You can find the PDF sample here: https://sites.google.com/site/jarkas/Home/049.pdf?attredirects=0&d=1 and the text output: https://sites.google.com/site/jarkas/Home/049_0.15.3.txt?attredirects=0&d=1 Best regards
Created attachment 41293: simple test case using one word. The error is swapped characters: the 2nd with the 3rd, and the 7th with the 8th.
Looking at the PDF, the string is printed with this operation:
    <00CD 00A3 0095 0070 00B4 002A >Tj
I added the spaces for readability. The ToUnicode map is:
    6 beginbfchar
    <002A> <0627>
    <0070> <062D>
    <0095> <0644>
    <00A3> <064A>
    <00B4> <06440645>
    <00CD> <064A0646>
    endbfchar
So when the text is extracted, the sequence of Unicode values is:
    064A0646 064A 0644 062D 06440645 0627
The output from
    pdftotext -enc UCS-2 049.pdf - | hexdump -C
is
    00000000  20 2b 06 27 06 45 06 44  06 2d 06 44 06 4a 06 46  | +.'.E.D.-.D.J.F|
    00000010  06 4a 20 2c 00 0a 00 0a  00 0c                    |.J ,......|
pdftotext has output the Unicode characters in reverse order, as you would expect for an RTL script. However, it looks like the glyphs that map to two Unicode characters have had their characters reversed as well.
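To make the mapping in this comment easier to check by hand, here is a minimal Python sketch (not poppler code). It applies the ToUnicode entries quoted above to the `Tj` operand and decodes the first 20 bytes of the hexdump. The names `TO_UNICODE` and `decode_tj` are hypothetical helpers introduced only for this illustration.

```python
# ToUnicode entries copied from the comment (glyph code -> UTF-16BE hex).
TO_UNICODE = {
    0x002A: "0627",
    0x0070: "062D",
    0x0095: "0644",
    0x00A3: "064A",
    0x00B4: "06440645",
    0x00CD: "064A0646",
}


def decode_tj(hex_operand):
    """Map each 2-byte glyph code of a Tj hex string to its Unicode text."""
    digits = "".join(hex_operand.split())
    codes = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
    return [bytes.fromhex(TO_UNICODE[code]).decode("utf-16-be") for code in codes]


def code_points(text):
    """Format a string as space-separated hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


# One entry per glyph, in the order the glyphs are painted.
glyph_texts = decode_tj("00CD 00A3 0095 0070 00B4 002A")

# First 20 bytes of the hexdump (the trailing newlines and form feed are omitted).
dump = bytes.fromhex("20 2b 06 27 06 45 06 44 06 2d 06 44 06 4a 06 46 06 4a 20 2c")
# U+202B and U+202C are the directional embedding marks around the word.
extracted = dump.decode("utf-16-be").strip("\u202b\u202c")

if __name__ == "__main__":
    print([code_points(t) for t in glyph_texts])
    print(code_points(extracted))
```

Comparing the two printed sequences shows the same swaps described for the one-word test case: the pairs that come from `<00B4>` and `<00CD>` appear back to front in the extracted text.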
Behdad, you may be interested in this bug. I have no idea how RTL text extraction is supposed to work.
(In reply to comment #3)
> Behdad, you may be interested in this bug. I have no idea how RTL text
> extraction is supposed to work.
This is a poppler bug. I mentioned this in my design doc back in 2007, but never followed up. Adrian, maybe you can look into those? http://lists.freedesktop.org/archives/poppler/2007-September/002897.html
Specifically:
"""
o Instead of reversing the glyphs and then extracting text from them and append, it extracts text first and then reverse. So if a glyph maps to two or more characters, those come out backward in the extracted text, which is wrong.
"""
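The difference between the two orders of operation can be sketched in a few lines of Python. This is only an illustration of the idea in the quoted design note, not poppler's actual code; `GLYPHS` holds the per-glyph text of the one-word test case in painting order, and both function names are hypothetical.

```python
# Text of each glyph in the order it is painted (visual order, left to right).
GLYPHS = ["\u064A\u0646", "\u064A", "\u0644", "\u062D", "\u0644\u0645", "\u0627"]


def extract_then_reverse(glyphs):
    """Behaviour described as wrong: expand glyphs to text, then reverse every character."""
    return "".join(glyphs)[::-1]


def reverse_then_extract(glyphs):
    """Suggested order: reverse the glyph sequence, keep each glyph's own text intact."""
    return "".join(reversed(glyphs))


if __name__ == "__main__":
    for fn in (extract_then_reverse, reverse_then_extract):
        print(fn.__name__, " ".join(f"{ord(c):04X}" for c in fn(GLYPHS)))
```

The first function reproduces the sequence seen in the hexdump earlier in this thread (0627 0645 0644 062D 0644 064A 0646 064A), while the second keeps the two-character glyphs in their mapped order.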
Exactly, that's the problem! If a glyph maps to two or more characters, they appear in reverse order. So the solution might be what Behdad mentioned, or applying a new ToUnicode map. For example, instead of <00B4> <06440645> we could write <00B4> <06450644>. Is that possible with poppler? I tried it with the old pdftotext from xpdf-3.02, which accepts the -cfg option, but it doesn't seem to have read the map. Is there any way to do it with poppler?
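Why swapping the map entries would compensate for the bug can be shown with a small Python sketch. It only simulates the extract-then-reverse behaviour described in this thread; it does not show any poppler or xpdf option for overriding a ToUnicode map, and `reverse_values` and `extract_then_reverse` are hypothetical names.

```python
# Glyph code -> text, as given by the embedded ToUnicode map.
ORIGINAL_MAP = {
    0x002A: "\u0627",
    0x0070: "\u062D",
    0x0095: "\u0644",
    0x00A3: "\u064A",
    0x00B4: "\u0644\u0645",
    0x00CD: "\u064A\u0646",
}

# Glyph codes of the one-word test case, in painting order.
CODES = [0x00CD, 0x00A3, 0x0095, 0x0070, 0x00B4, 0x002A]


def reverse_values(to_unicode):
    """Build the proposed replacement map: every multi-character value reversed."""
    return {code: text[::-1] for code, text in to_unicode.items()}


def extract_then_reverse(codes, to_unicode):
    """Simulate the behaviour described in this bug: expand first, reverse after."""
    return "".join(to_unicode[code] for code in codes)[::-1]


if __name__ == "__main__":
    swapped = reverse_values(ORIGINAL_MAP)
    print(" ".join(f"{ord(c):04X}" for c in extract_then_reverse(CODES, swapped)))
```

With the swapped map the simulated extractor yields 0627 0644 0645 062D 0644 064A 064A 0646. The same swapped map would produce wrong text in any extractor that already keeps each glyph's characters together, so it is a workaround tied to the bug rather than a fix.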
I'll have a look at the code next week and see if I can come up with a patch to fix it. Does Adobe Reader extract the text correctly?
(In reply to comment #6) > Does Adobe Reader extract the text correctly? With copy and paste it's ok. and with text extraction it gives but extra U+FFFD characters with some words.
Hi Adrian,
Please can you tell me where you got this:
> Looking at the PDF, the string is printed with this operation:
>
> <00CD 00A3 0095 0070 00B4 002A >Tj
And this too:
> the toUnicode map is:
>
> 6 beginbfchar
> <002A> <0627>
> <0070> <062D>
> <0095> <0644>
> <00A3> <064A>
> <00B4> <06440645>
> <00CD> <064A0646>
> endbfchar
Thank you
(In reply to comment #7)
> (In reply to comment #6)
> > Does Adobe Reader extract the text correctly?
>
> With copy and paste it's ok. and with text extraction it gives but extra U+FFFD
> characters with some words.
Sorry for the typo. It should read: with copy and paste it's OK, and with text extraction it gives extra U+FFFD characters in some words.
(In reply to comment #8)
> Hi Adrian,
> Please can you tell me from where you got this:
>
> > Looking at the PDF, the string is printed with this operation:
> >
> > <00CD 00A3 0095 0070 00B4 002A >Tj
Uncompress the PDF:
    pdftk one_word_arabic.pdf output one_word_arabic-u.pdf uncompress
then open the output with a text editor.
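Once the PDF is uncompressed, the `beginbfchar` blocks can also be read programmatically instead of by eye. The following Python sketch is an assumption-laden illustration: `parse_bfchar` is a hypothetical helper, it ignores `bfrange` entries, and it merges the maps of all fonts into one dictionary, which is only reasonable for a single-font test file.

```python
import re

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(pdf_text):
    """Collect glyph code -> Unicode text from every bfchar block in the text."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(pdf_text):
        for src, dst in BFCHAR_ENTRY.findall(block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


# The block quoted earlier in this thread, used here as sample input.
SAMPLE = """6 beginbfchar
<002A> <0627>
<0070> <062D>
<0095> <0644>
<00A3> <064A>
<00B4> <06440645>
<00CD> <064A0646>
endbfchar"""

if __name__ == "__main__":
    for code, text in sorted(parse_bfchar(SAMPLE).items()):
        print(f"{code:04X} -> " + " ".join(f"{ord(c):04X}" for c in text))
    # For a real file, read the uncompressed PDF as bytes and decode it as
    # Latin-1 so that binary sections do not raise decoding errors, e.g.:
    #   pdf_text = open("one_word_arabic-u.pdf", "rb").read().decode("latin-1")
```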
Created attachment 41395: don't reverse chars that map to a single glyph. The attached patch avoids reversing the Unicode characters that a single glyph maps to. Testing with the arabic_one_word.pdf test case works for me.
(In reply to comment #9)
> sorry for the typo:
> With copy and paste it's ok. and with text extraction it gives an extra U+FFFD
> characters with some words.
The sample PDF in comment 0 has ActualText around some of the glyphs, which maps those glyphs to U+FFFD, i.e.:
    /Span<</ActualText<FEFFFFFD>>> BDC
    0 0.02 TD
    <0067>Tj
    EMC
If the U+FFFD character should not appear in the extracted text, it is a problem with the application that created the PDF.
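The ActualText value above can be decoded by hand: `FEFF` is the UTF-16BE byte order mark that PDF uses to flag a Unicode text string, and `FFFD` is the replacement character. A minimal Python sketch, where `decode_pdf_hex_text` is a hypothetical helper and the non-BOM branch is a simplification:

```python
def decode_pdf_hex_text(hex_string):
    """Decode the contents of a PDF hex text string such as <FEFFFFFD>."""
    raw = bytes.fromhex(hex_string)
    if raw.startswith(b"\xfe\xff"):
        # A leading FEFF marks the rest of the string as UTF-16BE.
        return raw[2:].decode("utf-16-be")
    # Without the marker the string is PDFDocEncoding; Latin-1 is only an
    # approximation of it and is used here to keep the sketch short.
    return raw.decode("latin-1")


if __name__ == "__main__":
    text = decode_pdf_hex_text("FEFFFFFD")
    print([f"U+{ord(c):04X}" for c in text])
```

So the extra character in the extracted text comes from the file itself: the span explicitly declares U+FFFD as the text of the `<0067>` glyph.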
Hi Adrian, to regression-test your patch, I understand that I can use pdftotext and that the new output should be the same as the old one, except for files using RTL characters?
(In reply to comment #13)
> Hi Adrian, to regtest your patch i understand that I can use pdftotext and the
> new output should be the same than the old except in files using RTL
> characters?
That's correct. There should be no change in the pdftotext output from PDFs using LTR text.
(In reply to comment #11)
> Created an attachment (id=41395) [details]
> don't reverse chars that map to a single glyph
>
> The attached patch avoids reversing unicode chars that are mapped to a single
> glyph. Testing with the arabic_one_word.pdf test case works for me.
Hello Adrian,
I've tried the patch with poppler-0.15.3.tar.gz downloaded from the website. The patch corrected some words, left many uncorrected, and corrupted many others.
During the last few days I've written a simple shell script as a workaround to solve my problem for the time being. The script inverts the glyphs which map to two or three characters. The output is very good, but it still has the annoying U+FFFD character, which I removed with another script.
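The script itself was not posted, so the following Python sketch is only a guess at what such a post-processing step could look like. It takes the multi-character values of the ToUnicode map, looks for their reversed forms in the pdftotext output, flips them back, and drops U+FFFD. The name `unswap` is hypothetical. The approach is heuristic: it will also flip a legitimate occurrence of a reversed sequence, and it removes every U+FFFD, including any that signal a real problem.

```python
import re


def unswap(text, multi_char_values):
    """Flip reversed multi-character glyph texts back and drop U+FFFD."""
    reversed_forms = sorted(
        {value[::-1] for value in multi_char_values if len(value) > 1},
        key=len,
        reverse=True,  # try longer sequences first
    )
    if reversed_forms:
        pattern = re.compile("|".join(re.escape(form) for form in reversed_forms))
        # A single left-to-right pass, so a fixed pair is never processed twice.
        text = pattern.sub(lambda match: match.group(0)[::-1], text)
    return text.replace("\ufffd", "")


if __name__ == "__main__":
    # pdftotext output for the one-word test case, plus a stray U+FFFD.
    broken = "\u0627\u0645\u0644\u062d\u0644\u064a\u0646\u064a\ufffd"
    fixed = unswap(broken, ["\u0644\u0645", "\u064a\u0646"])
    print(" ".join(f"{ord(c):04X}" for c in fixed))
```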
Created attachment 41459: pdftotext crashes with the patch and this PDF. Can you have a look at the crash produced by your patch with this PDF?
Created attachment 41466: updated patch to fix the crash with the PDF in comment 17.
(In reply to comment #15)
> Hello Adrian,
> I've tried the patch with poppler-0.15.3.tar.gz downloaded from the website.
> The patch corrected some words, and didn't correct many, and corrupted many
> words.
Please test with my updated patch. If you are still seeing the same problem, please provide the simplest possible test case and explain exactly which characters are wrong. I can't read Arabic, so I need a simple test case where I can check the hex values of the text output.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/322.