See also: https://lists.freedesktop.org/archives/poppler/2016-March/011755.html When extracting text from a landscape pdf file using the cpp interface, text at the far right of the page does not get extracted .I think the problem is that page.text() always assumes portrait orientation and hence underestimates the width of the page: p->text() p->text(p->page_rect()) Is this expected? What is the best way to extract all text from the page, irrespective of size and orientation? An example landscape pdf is here: https://github.com/ropensci/pdftools/files/161587/waurika_news_democrat.pdf
Any updates on this? It's really annoying that the C++ API basically doesn't work for landscape pages...
We don't have much free time, so if you could help us doing 3 test programs, one for the glib frontend, one for the Qt (4 or 5) and one for the C++ frontend and show that the glib and qt frontend behave correctly and the C++ one behaves "wrong" it would make it much easier to confirm this as a bug. And I guess at that stage you providign a patch to fix the C++ code by looking at the glib and qt code would not be too hard.
You can easily confirm this bug using the 'poppler-dump' utility included with the cpp interface: curl -OL https://github.com/ropensci/pdftools/files/161587/waurika_news_democrat.pdf ./cpp/tests/poppler-dump waurika_news_democrat.pdf --show-text physical You can compare this to the output from the pdftext utility you find that the cpp interface has dropped a few columns: ./utils/pdftotext waurika_news_democrat.pdf -layout /dev/stdout It would be really great if they the cpp api would give the same text as the pdftotext command line util.
This issue is still not fixed. I am now using a workaround in my applications that doubles the width for landscape pages. page *p(doc->create_page(i)); /* Workaround for bug https://github.com/ropensci/pdftools/issues/7 */ rectf target(p->page_rect()); if(p->orientation() == page::landscape || p->orientation() == page::seascape){ target.set_right(target.right() * 2); } /* Extract text */ ustring str = p->text(target, page::physical_layout); This is pretty ugly but it seems to at least get the text out of it.
Created attachment 133128 [details] [review] page_text_rotate.patch Take page rotation into account when extracting text. With this patch, the text is fully extracted from the PDF in comment #3
Created attachment 133129 [details] before.txt This is the output from cpp/tests/poppler-dump --show-text physical waurika_news_democrat.pdf before the patch above.
Created attachment 133130 [details] after.txt This is the output from cpp/tests/poppler-dump --show-text physical waurika_news_democrat.pdf after the patch above.
I'm a bit hesitant about that patch as since it would break the people that are passing the coordinates to make it work like it works right now. The only problem i see is that the documentation says \param rect if not empty, it will be extracted the text in it; otherwise, the text of the whole page so maybe the only thing we need to do is actually do a switch of coords in the if (r.is_empty()) { branch?
(In reply to Albert Astals Cid from comment #8) > I'm a bit hesitant about that patch as since it would break the people that > are passing the coordinates to make it work like it works right now. That's a good point. Maybe best to keep it how it is to avoid breaking any users. > so maybe the only thing we need to do is actually do a switch of coords in > the > if (r.is_empty()) { > branch? That would fix p->text() but some users may also use p->text(p->page_rect()) and then they won't get all the text. Quite a predicament!
(In reply to jalanpalmer from comment #9) > (In reply to Albert Astals Cid from comment #8) > > I'm a bit hesitant about that patch as since it would break the people that > > are passing the coordinates to make it work like it works right now. > > That's a good point. Maybe best to keep it how it is to avoid breaking any > users. > > > so maybe the only thing we need to do is actually do a switch of coords in > > the > > if (r.is_empty()) { > > branch? > > That would fix > p->text() > but some users may also use > p->text(p->page_rect()) > and then they won't get all the text. > > Quite a predicament! But the documentation doesn't say to use p->page_rect() to use the whole text, though yes you may be right it may happen, but then we can only say "you're getting what you asked for", the docu says pass no rectangle to get the whole text.
(In reply to Albert Astals Cid from comment #10) > But the documentation doesn't say to use p->page_rect() to use the whole > text, though yes you may be right it may happen, but then we can only say > "you're getting what you asked for", the docu says pass no rectangle to get > the whole text. Agreed, sounds good to me. Maybe a user gets confused, but worst-case scenario they end up here and then figure it out :)
Created attachment 133371 [details] [review] page_text_rotate.patch Patch as discussed, only switching the coords when necessary. Also had to update the poppler-dump test utility, since it was using the p->text(p->page_rect()) approach.
wouldn't it be better for the poppler-dump patch to pass an empty rect?
(In reply to Albert Astals Cid from comment #13) > wouldn't it be better for the poppler-dump patch to pass an empty rect? Haha, yes. I was too busy thinking about swapping coordinates to do the obvious....
Created attachment 133398 [details] [review] page_text_rotate.patch
Pushed :)
(In reply to Albert Astals Cid from comment #16) > Pushed :) Just one small issue with that commit: I'm not Jeroen Ooms :)
Very sorry about that, i fixed the commit.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.