Bug 94517

Summary: page.text() does not take page orientation into account
Product: poppler Reporter: Jeroen Ooms <jeroen>
Component: cpp frontendAssignee: poppler-bugs <poppler-bugs>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium CC: jalanpalmer
Version: unspecified   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments: page_text_rotate.patch
before.txt
after.txt
page_text_rotate.patch
page_text_rotate.patch

Description Jeroen Ooms 2016-03-12 20:42:43 UTC
See also: https://lists.freedesktop.org/archives/poppler/2016-March/011755.html

When extracting text from a landscape pdf file using the cpp
interface, text at the far right of the page does not get extracted .I
think the problem is that page.text() always assumes portrait
orientation and hence underestimates the width of the page:

  p->text()
  p->text(p->page_rect())

Is this expected? What is the best way to extract all text from the
page, irrespective of size and orientation?

An example landscape pdf is here:
https://github.com/ropensci/pdftools/files/161587/waurika_news_democrat.pdf
Comment 1 Jeroen Ooms 2016-08-10 14:17:37 UTC
Any updates on this? It's really annoying that the C++ API basically doesn't work for landscape pages...
Comment 2 Albert Astals Cid 2016-08-15 20:09:45 UTC
We don't have much free time, so if you could help us doing 3 test programs, one for the glib frontend, one for the Qt (4 or 5) and one for the C++ frontend and show that the glib and qt frontend behave correctly and the C++ one behaves "wrong" it would make it much easier to confirm this as a bug.

And I guess at that stage you providign a patch to fix the C++ code by looking at the glib and qt code would not be too hard.
Comment 3 Jeroen Ooms 2016-08-16 10:06:49 UTC
You can easily confirm this bug using the 'poppler-dump' utility included with the cpp interface:

  curl -OL https://github.com/ropensci/pdftools/files/161587/waurika_news_democrat.pdf
  ./cpp/tests/poppler-dump waurika_news_democrat.pdf --show-text physical

You can compare this to the output from the pdftext utility you find that the cpp interface has dropped a few columns:

  ./utils/pdftotext waurika_news_democrat.pdf -layout /dev/stdout

It would be really great if they the cpp api would give the same text as the pdftotext command line util.
Comment 4 Jeroen Ooms 2016-12-08 08:10:51 UTC
This issue is still not fixed. I am now using a workaround in my applications that doubles the width for landscape pages.

    page *p(doc->create_page(i));

    /* Workaround for bug https://github.com/ropensci/pdftools/issues/7 */
    rectf target(p->page_rect());
    if(p->orientation() == page::landscape || p->orientation() == page::seascape){
      target.set_right(target.right() * 2);
    }

    /* Extract text */
    ustring str = p->text(target, page::physical_layout);
 
This is pretty ugly but it seems to at least get the text out of it.
Comment 5 Jason Alan Palmer 2017-07-29 20:52:51 UTC
Created attachment 133128 [details] [review]
page_text_rotate.patch

Take page rotation into account when extracting text. With this patch, the text is fully extracted from the PDF in comment #3
Comment 6 Jason Alan Palmer 2017-07-29 20:56:18 UTC
Created attachment 133129 [details]
before.txt

This is the output from

  cpp/tests/poppler-dump --show-text physical waurika_news_democrat.pdf

before the patch above.
Comment 7 Jason Alan Palmer 2017-07-29 20:56:54 UTC
Created attachment 133130 [details]
after.txt

This is the output from

  cpp/tests/poppler-dump --show-text physical waurika_news_democrat.pdf

after the patch above.
Comment 8 Albert Astals Cid 2017-07-30 22:46:01 UTC
I'm a bit hesitant about that patch as since it would break the people that are passing the coordinates to make it work like it works right now.

The only problem i see is that the documentation says
   \param rect if not empty, it will be extracted the text in it; otherwise, the
               text of the whole page
so maybe the only thing we need to do is actually do a switch of coords in the 
   if (r.is_empty()) {
branch?
Comment 9 Jason Alan Palmer 2017-07-31 13:28:48 UTC
(In reply to Albert Astals Cid from comment #8)
> I'm a bit hesitant about that patch as since it would break the people that
> are passing the coordinates to make it work like it works right now.

That's a good point. Maybe best to keep it how it is to avoid breaking any users.

> so maybe the only thing we need to do is actually do a switch of coords in
> the 
>    if (r.is_empty()) {
> branch?

That would fix
    p->text()
but some users may also use
    p->text(p->page_rect())
and then they won't get all the text.

Quite a predicament!
Comment 10 Albert Astals Cid 2017-07-31 20:18:51 UTC
(In reply to jalanpalmer from comment #9)
> (In reply to Albert Astals Cid from comment #8)
> > I'm a bit hesitant about that patch as since it would break the people that
> > are passing the coordinates to make it work like it works right now.
> 
> That's a good point. Maybe best to keep it how it is to avoid breaking any
> users.
> 
> > so maybe the only thing we need to do is actually do a switch of coords in
> > the 
> >    if (r.is_empty()) {
> > branch?
> 
> That would fix
>     p->text()
> but some users may also use
>     p->text(p->page_rect())
> and then they won't get all the text.
> 
> Quite a predicament!

But the documentation doesn't say to use p->page_rect() to use the whole text, though yes you may be right it may happen, but then we can only say "you're getting what you asked for", the docu says pass no rectangle to get the whole text.
Comment 11 Jason Alan Palmer 2017-07-31 22:11:30 UTC
(In reply to Albert Astals Cid from comment #10)
> But the documentation doesn't say to use p->page_rect() to use the whole
> text, though yes you may be right it may happen, but then we can only say
> "you're getting what you asked for", the docu says pass no rectangle to get
> the whole text.

Agreed, sounds good to me. Maybe a user gets confused, but worst-case scenario they end up here and then figure it out :)
Comment 12 Jason Alan Palmer 2017-08-07 22:41:42 UTC
Created attachment 133371 [details] [review]
page_text_rotate.patch

Patch as discussed, only switching the coords when necessary.

Also had to update the poppler-dump test utility, since it was using the p->text(p->page_rect()) approach.
Comment 13 Albert Astals Cid 2017-08-08 20:05:59 UTC
wouldn't it be better for the poppler-dump patch to pass an empty rect?
Comment 14 Jason Alan Palmer 2017-08-08 22:12:54 UTC
(In reply to Albert Astals Cid from comment #13)
> wouldn't it be better for the poppler-dump patch to pass an empty rect?

Haha, yes. I was too busy thinking about swapping coordinates to do the obvious....
Comment 15 Jason Alan Palmer 2017-08-08 22:13:32 UTC
Created attachment 133398 [details] [review]
page_text_rotate.patch
Comment 16 Albert Astals Cid 2017-08-08 22:43:50 UTC
Pushed :)
Comment 17 Jason Alan Palmer 2017-08-09 10:19:07 UTC
(In reply to Albert Astals Cid from comment #16)
> Pushed :)

Just one small issue with that commit: I'm not Jeroen Ooms :)
Comment 18 Albert Astals Cid 2017-08-09 17:35:13 UTC
Very sorry about that, i fixed the commit.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.