Bug 93344

Summary: pdftotext only outputs first page content with -bbox-layout option
Product: poppler
Reporter: Jonathan Marchand <jonathlela>
Component: utils
Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED MOVED
QA Contact:
Severity: normal
Priority: medium
CC: guiferrpereira, michaeldecerbo, urkle, ushakov, vinicius.rodrigues
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments: Get rid of ActualText class, move its functionality to TextPage

Description Jonathan Marchand 2015-12-11 08:57:46 UTC
The new -bbox-layout option introduced in 911d9fc8d85b776418039b4eebb37200a0987554 adds extra bounding box info. However, it only outputs the content of the first page; the other pages are shown empty.
The -bbox option still works as intended.
From browsing the code, my guess is that this comes from textOut->takeText() (line 528 of utils/pdftotext.cc), which gets the TextPage content only on its first invocation.
Comment 1 ivan.zderadicka 2016-02-21 08:56:05 UTC
I can confirm the same problem here, with version 0.40.
Comment 2 Vladimir Ushakov 2016-05-16 16:32:10 UTC
Created attachment 123792
Get rid of ActualText class, move its functionality to TextPage
Comment 3 Vladimir Ushakov 2016-05-16 16:37:52 UTC
The cause is a broken TextOutputDev::takeText(): it does not account for an extra reference to the page kept by the ActualText class.

The easy fix would be to stop using takeText(). The right one, in my opinion, is to remove the ActualText class altogether: its functionality is so tightly coupled to TextPage that it makes no sense to keep them apart.

The patch in the previous comment does this.
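
To illustrate the mechanism described above, here is a simplified, self-contained model. It is not poppler's code: Page, Helper, Device, takeTextBuggy() and takeTextFixed() are hypothetical stand-ins for TextPage, ActualText, TextOutputDev and takeText(), and the "fixed" variant is only one possible repair, not the attached patch (which removes the helper class entirely).

```cpp
#include <memory>
#include <string>
#include <utility>

// Toy stand-in for a page of extracted text (not poppler's TextPage).
struct Page {
    std::string text;
};

// Toy stand-in for a helper that keeps its own reference to the page,
// in the way the comment describes for ActualText.
struct Helper {
    std::shared_ptr<Page> page;
    void addChar(char c) { page->text += c; }
};

// Toy stand-in for the output device.
struct Device {
    std::shared_ptr<Page> page = std::make_shared<Page>();
    Helper helper{page};

    // Text reaches the page through the helper's cached reference.
    void addChar(char c) { helper.addChar(c); }

    // Buggy variant: swaps in a fresh page but leaves the helper
    // pointing at the page that was just handed out.
    std::shared_ptr<Page> takeTextBuggy() {
        return std::exchange(page, std::make_shared<Page>());
    }

    // Repaired variant: the helper is re-pointed at the fresh page.
    std::shared_ptr<Page> takeTextFixed() {
        auto old = std::exchange(page, std::make_shared<Page>());
        helper.page = page;
        return old;
    }
};
```

In this model, text added after the first takeTextBuggy() call still lands in the page that was already handed out, so every later page comes back empty. That matches the symptom reported here, under the stated assumptions.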
Comment 4 Edward Rudd 2017-01-31 14:13:32 UTC
Is there any update on this? This is a rather frustrating bug, as we now have to execute the program once for each page.
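
For anyone needing the same workaround, here is a rough sketch of building one pdftotext invocation per page, using the existing -f (first page) and -l (last page) options. The function name perPageCommands and the output naming scheme are hypothetical; the sketch assumes the page count is already known (for example from pdfinfo) and that the paths contain no single quotes.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: build one pdftotext command line per page.
// Assumes pageCount is known in advance and that neither path
// contains a single quote (the quoting here is deliberately naive).
std::vector<std::string> perPageCommands(const std::string &pdf,
                                         const std::string &outPrefix,
                                         int pageCount) {
    std::vector<std::string> cmds;
    for (int page = 1; page <= pageCount; ++page) {
        const std::string n = std::to_string(page);
        // -f N -l N restricts the run to the single page N.
        cmds.push_back("pdftotext -bbox-layout -f " + n + " -l " + n +
                       " '" + pdf + "' '" + outPrefix + "-" + n + ".html'");
    }
    return cmds;
}
```

Each returned string can then be handed to a shell or to std::system(); the per-page output files still have to be merged afterwards.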
Comment 5 Albert Astals Cid 2017-01-31 19:08:05 UTC
Vladimir, if you could provide a much less "intrusive" patch, it would be easier for me to integrate your fix.

Your patch touches CairoOutputDev, and I don't know much about it, so if possible I'd like your "simpler" option instead of your "in my opinion this is more correct" option.
Comment 6 Louis St-Amour 2018-05-08 04:45:33 UTC
For future Googlers who find this bug while looking for an up-to-date version of Vladimir's patch, I have a fork on GitHub with an updated version of that patch applied: https://github.com/LouisStAmour/poppler/commit/67de9fe25214c9d9134621502bb90e08db6c227a

I won't say it's perfect or well-tested, but it gets the job done for me. :)
Comment 7 GitLab Migration User 2018-08-20 21:49:01 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/88.
