Bug 20013

Summary:	pdftotext doesn't support /Alt nor /ActualText with octal content
Product:	poppler	Reporter:	Ross Moore <ross>
Component:	general	Assignee:	poppler-bugs <poppler-bugs>
Status:	RESOLVED FIXED	QA Contact:
Severity:	normal
Priority:	medium
Version:	unspecified
Hardware:	All
OS:	All
URL:	http://www.maths.mq.edu.au/~ross/poppler/Big5/
Whiteboard:
Attachments:	This patch fixes just 1 line, using coding that appears elsewhere within the Poppler code-base; e.g. in poppler/CMap.cc and poppler/Outline.cc .

Description Ross Moore 2009-02-08 21:55:05 UTC

Created attachment 22702 [details] [review]
This patch fixes just 1 line, using coding that appears elsewhere within the Poppler code-base;
e.g. in  poppler/CMap.cc  and poppler/Outline.cc .

Trying to extract text from PDFs constructed using pdfTeX and containing /ActualText or /Alt tags does not give the desired results.

 1.  /Alt tagging is not supported at all.
 2.  /ActualText tagging is recognised but no content is extracted.

Here are some examples from 
     http://www.maths.mq.edu.au/~ross/poppler/Big5/

Big5-actual.pdf  (170kb) --- has /ActualText tagging
Big5-actual.txt   (97 bytes)
Big5-alt.pdf        (169kb) --- has /Alt tagging
Big5-alt.txt         (434 bytes)
Big5-notags.pdf  (157kb) --- no special tagging
Big5-notags.txt   (432 bytes)

The corresponding .txt files were obtained using pdftotext -raw
with  pdftotext/Poppler  version as follows:

[GlenMorangie:~/PDFTeX/test-PDFs] rossmoor% pdftotext --help
pdftotext version 0.10.3

Copyright 2005-2009 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC


It is clear just from the file-size of  Big5-actual.txt  that Poppler isn't extracting the /ActualText in this case.
Also, if you look at the contents of  Big5-notags.txt  you'll see the same kind of "multiple-striking" to get the bold effect.

With Big5-alt.pdf (and Big5-actual.pdf) this triple-striking
is meant to be mapped to a single Unicode character.
But Poppler has no support for /Alt tagging, which is why
Big5-alt.txt is practically the same size as Big5-notags.txt .


With these three PDFs, Adobe Reader cannot extract the Chinese characters from  Big5-notags.pdf
whereas it can do so from  Big5-actual.pdf  and  Big5-alt.pdf  due to the extra tagging.

Apple's Preview and Poppler, on the other hand, can identify the characters (presumably from information in the fonts or their encoding arrays --- a CMap is not applicable). But both extract three copies when the multiple striking occurs, so are not dealing with the /Alt or /ActualText tags.
Furthermore, Poppler gives nothing for the ideographs marked with /ActualText tagging.


Speculation:  Poppler may not be extracting the information in the tagging strings because they contain octal character codes.  For example, the tagging looks like this:

        /Span<</ActualText(\376\377\307\164)>> BDC
            ... Chinese/Korean ideograph ...
       EMC

whereas the coding in  TextOutputDev.cc  that handles this is:

         actualText = obj.getString();

and

      if (!actualText->hasUnicodeMarker()) {
        if (actualText->getLength() > 0) {
          //non-unicode string -- assume pdfDocEncoding and
          //try to convert to UTF16BE
          uniString = pdfDocEncodingToUTF16(actualText, &length);
        } else {
          length = 0;
        }
      } else {
        uniString = actualText->getCString();
        length = actualText->getLength();
      }

Shouldn't there be some use of  GooString  within this coding block, to properly handle those octal character codes?
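
As a point of reference on the octal question: in PDF literal-string syntax an escape such as \376 denotes a single byte, so (\376\377\307\164) stands for the four bytes FE FF C7 74, that is, the UTF-16BE marker followed by one 16-bit code unit. The following is a minimal sketch of that unescaping, not Poppler's own lexer. The function names unescapeOctal and hasUtf16BeMarker are hypothetical, and only \ddd escapes are handled.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: resolve \ddd octal escapes (1 to 3 digits) in the body
// of a PDF literal string. Other escapes (\n, \(, ...) are deliberately not
// handled in this sketch; every other character is copied unchanged.
std::string unescapeOctal(const std::string &in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == '\\' && i + 1 < in.size() && in[i + 1] >= '0' && in[i + 1] <= '7') {
            unsigned value = 0;
            std::size_t digits = 0;
            ++i; // skip the backslash
            while (digits < 3 && i < in.size() && in[i] >= '0' && in[i] <= '7') {
                value = value * 8 + static_cast<unsigned>(in[i] - '0');
                ++i;
                ++digits;
            }
            // Keep only the low 8 bits; anything above one byte is dropped.
            out.push_back(static_cast<char>(value & 0xff));
        } else {
            out.push_back(in[i]);
            ++i;
        }
    }
    return out;
}

// Hypothetical helper: true if the bytes start with the UTF-16BE marker FE FF.
bool hasUtf16BeMarker(const std::string &bytes) {
    return bytes.size() >= 2 &&
           static_cast<unsigned char>(bytes[0]) == 0xFE &&
           static_cast<unsigned char>(bytes[1]) == 0xFF;
}
```

Applied to the example above, the unescaped bytes do start with FE FF, so, assuming the marker test inspects the first two bytes, the quoted code would take the final `else` branch and use the string bytes as they are.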

There are some more similar examples, involving Korean fonts, at:
    http://www.maths.mq.edu.au/~ross/poppler/KS/

Comment 1 Ross Moore 2009-02-17 10:59:13 UTC

This bug is due to improper extraction of the text in the /ActualText entry.
Here is a better description of the effects observed.


I'm now creating PDFs with /ActualText strings for CJK ideographs.
These strings are given in big-endian UTF-16 format.
Using  pdftotext  to extract the text, what I find is that:

 a)  some, but not all, UTF-16 byte-pairs produce an extractable
     character.

 b)  Whenever the *first* byte of the pair is in the upper range
      128--255 then the whole character is omitted.

    For example, with the PDF string:  (˛ˇt»»tt»)
    the text extracted using Adobe Reader is  瓈존瓈
    but Poppler produces  珈珈 , which exhibits two errors.

  Firstly, ...

    the portion '»t' has been extracted as '', the empty string,
    between the Chinese ideographs.

    In alternative representations, this is:
     (<FE><FF>t<C8><C8>tt<C8>)  producing  <E7><8F><88><E7><8F><88> ,
    where  t<C8> representing  't»'  extracts to
      <E7><8F><88>  which is  珈 .

  Secondly, ...

 c) There is an error in the translation of UTF-16 characters
    into UTF-8. For example,  the above  t<C8>  should actually
    convert in UTF-8 to   <E7><93><88>   which is  瓈 ,
    as done by Adobe and other software.

    The <E7><8F><88> is what correctly comes from  s<C8> ;
    the high-order byte is being mistranslated, coming out 1 too small.
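
Symptoms (b) and (c) are both consistent with the string bytes being read through a signed char, which is what plain char is on many platforms. The sketch below reproduces the arithmetic of the expression (hi<<8) + lo under that assumption and compares it with a version that masks each byte first. Unicode here is a stand-in typedef, the names signedByte, combineSigned and combineMasked are hypothetical, and the shift is written as a multiplication by 256 so that no negative value is shifted.

```cpp
#include <cstdint>

using Unicode = std::uint32_t; // assumption: stand-in for Poppler's Unicode type

// Hypothetical helper: the value a byte 0..255 takes when read as a signed char.
int signedByte(int byte) {
    return byte >= 128 ? byte - 256 : byte;
}

// Arithmetic of (hi << 8) + lo when both bytes are sign-extended.
Unicode combineSigned(int hi, int lo) {
    return static_cast<Unicode>(signedByte(hi) * 256 + signedByte(lo));
}

// Same combination with each byte masked to the range 0..255 first.
Unicode combineMasked(int hi, int lo) {
    return static_cast<Unicode>(((hi & 0xff) << 8) | (lo & 0xff));
}
```

For the pair 74 C8 the sign-extended sum is 0x73C8 (珈) instead of 0x74C8 (瓈), which matches (c): the negative low byte borrows 1 from the high byte. For the pair C8 74 the sign-extended result lies far outside the Unicode range, which would explain why the whole character is dropped in (b).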


Further comment.

  d)  octal codes can be used, contrary to a question that I raised
     in bug report 20013 .
     There my testing was with codes which produced 1st bytes
     within the upper range, so the difficulties were the same
     as in b) above.

 e)  the example PDF  http://www.unicode.org/udhr/d/udhr_san.pdf
      used to test the /ActualText support involved only characters
      in the range  Ux0A..  so that the problem (b) with higher range
      characters did not occur; and nor does (c) for this range.


Here's a patch that fixes the problem.
The new line of coding is based upon similar methods used in 
     poppler/Outline.cc .


*** TextOutputDev-prev.cc       Wed Feb 18 04:59:28 2009
--- TextOutputDev.cc    Wed Feb 18 05:42:22 2009
*************** void TextOutputDev::endMarkedContent(Gfx
*** 4657,4663 ****
        length = length/2 - 1;
        uni = new Unicode[length];
        for (i = 0 ; i < length; i++)
!       uni[i] = (uniString[2 + i*2]<<8) + uniString[2 + i*2+1];
  
        text->addChar(state,
                    actualText_x, actualText_y,
--- 4657,4663 ----
        length = length/2 - 1;
        uni = new Unicode[length];
        for (i = 0 ; i < length; i++)
!       uni[i] = ((uniString[2 + i*2] & 0xff)<<8)|(uniString[3 + i*2] & 0xff);
  
        text->addChar(state,
                    actualText_x, actualText_y,
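
As a check of the patched expression against the example string given earlier in this comment, here is a self-contained sketch that decodes a marker-prefixed UTF-16BE byte string with the masked expression and re-encodes the result as UTF-8. The helper names decodeUtf16Be and toUtf8 are hypothetical, Unicode is a stand-in typedef, and surrogate pairs are not combined (the loop in the patch treats each 16-bit unit as one character as well).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Unicode = std::uint32_t; // assumption: stand-in for Poppler's Unicode type

// Hypothetical helper: decode the 16-bit units that follow the 2-byte marker,
// using the same masked expression as the patched line.
std::vector<Unicode> decodeUtf16Be(const std::string &s) {
    std::vector<Unicode> out;
    if (s.size() < 2) {
        return out;
    }
    const std::size_t length = s.size() / 2 - 1; // units after the marker
    for (std::size_t i = 0; i < length; ++i) {
        out.push_back(static_cast<Unicode>(
            ((s[2 + i * 2] & 0xff) << 8) | (s[3 + i * 2] & 0xff)));
    }
    return out;
}

// Hypothetical helper: encode code points as UTF-8 (1 to 4 bytes each).
std::string toUtf8(const std::vector<Unicode> &text) {
    std::string out;
    for (Unicode u : text) {
        if (u < 0x80) {
            out.push_back(static_cast<char>(u));
        } else if (u < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (u >> 6)));
            out.push_back(static_cast<char>(0x80 | (u & 0x3F)));
        } else if (u < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (u >> 12)));
            out.push_back(static_cast<char>(0x80 | ((u >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (u & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (u >> 18)));
            out.push_back(static_cast<char>(0x80 | ((u >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((u >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (u & 0x3F)));
        }
    }
    return out;
}
```

Fed the bytes FE FF 74 C8 C8 74 74 C8, this yields the code points U+74C8, U+C874, U+74C8 and the UTF-8 bytes E7 93 88 EC A1 B4 E7 93 88, that is, 瓈존瓈 as extracted by Adobe Reader.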

Comment 2 Ross Moore 2009-02-25 14:13:26 UTC

Comment 3 Ross Moore 2009-02-25 15:30:40 UTC

Having trouble with this system.
Here's the patch again (3rd attempt):

*** TextOutputDev-prev.cc       Wed Feb 18 04:59:28 2009
--- TextOutputDev.cc    Wed Feb 18 05:42:22 2009
*************** void TextOutputDev::endMarkedContent(Gfx
*** 4657,4663 ****
        length = length/2 - 1;
        uni = new Unicode[length];
        for (i = 0 ; i < length; i++)
!       uni[i] = (uniString[2 + i*2]<<8) + uniString[2 + i*2+1];
  
        text->addChar(state,
                    actualText_x, actualText_y,
--- 4657,4663 ----
        length = length/2 - 1;
        uni = new Unicode[length];
        for (i = 0 ; i < length; i++)
!       uni[i] = ((uniString[2 + i*2] & 0xff)<<8)|(uniString[3 + i*2] & 0xff);
  
        text->addChar(state,
                    actualText_x, actualText_y,

Comment 4 Ross Moore 2009-02-25 15:35:55 UTC

The supplied patch has been tested with  pdftotext  and  evince, viz.:

On Tue, Feb 24, 2009 at 09:37:46AM +0100, The Thanh Han wrote:
> Hi Ross,
>
> On Tue, Feb 24, 2009 at 05:53:35AM +1100, Ross Moore wrote:
>> evince, okular and kpdf all use Poppler, right?
>> Were they rebuilt after applying my patch?
>
> oops I forgot this :(. Will do it now. But will take some
> time, there are way many build dependencies.

This turned out to be quite tough. I tried several ways and
ended up installing a very recent Linux system (Ubuntu 9.04
alpha), and rebuilt Poppler on that system with your patch.

The good news is that it works; I can copy from evince and
paste to gvim, and get the correct characters.

Comment 5 Albert Astals Cid 2009-03-29 14:53:06 UTC

Fixed, thanks for the patch!
