Bug 103798 - libpoppler cannot recreate pdftotext output, because physical_layout is not handled correctly
Summary: libpoppler cannot recreate pdftotext output, because physical_layout is not h...
Status: RESOLVED MOVED
Alias: None
Product: poppler
Classification: Unclassified
Component: cpp frontend (show other bugs)
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
: medium normal
Assignee: poppler-bugs
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-11-17 17:02 UTC by dummydummy
Modified: 2018-08-21 11:16 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
new_poppler-page.cpp based on poppler 0.61.1 with my changes (7.60 KB, text/x-c++src)
2017-11-18 13:34 UTC, dummydummy
Details
new_poppler-page.h based on poppler 0.61.1 with my changes (2.11 KB, text/x-chdr)
2017-11-18 13:35 UTC, dummydummy
Details
output of diff -u poppler-page.cpp (from poppler 0.61.1) new_poppler-page.cpp (749 bytes, patch)
2017-11-18 13:36 UTC, dummydummy
Details | Splinter Review
output of diff -u poppler-page.h (from poppler 0.61.1) new_poppler-page.h (454 bytes, patch)
2017-11-18 13:37 UTC, dummydummy
Details | Splinter Review
output of git diff > git-diff.txt (1.27 KB, text/plain)
2017-11-26 16:47 UTC, dummydummy
Details

Description dummydummy 2017-11-17 17:02:49 UTC
Dear maintainer, this bug concerns poppler 0.48.0 up to at least 0.60.1

in file .../gcc/poppler-page.cpp

the function     
         ustring page::text(const rectf &r, text_layout_enum layout_mode) const

when called with  physical_layout  as  layout_mode  incorrectly creates a 
TextOutputDev with second parameter (supposed to be true for physical_layout) always set to gFalse, because the corresponding code in lines 272 and 273 (poppler 0.60.1) are 

    const GBool use_raw_order = (layout_mode == raw_order_layout);
    TextOutputDev td(0, gFalse, 0, use_raw_order, gFalse);


By contrast the pdftotext.cc creates TextOutputDev with second parameter set to gTrue when called with the -layout command line option.

THE EFFECT, is that the text produced inside a program using libpoppler differs from the more faithful text (which has, for example, blank lines where required) produced by invoking pdftotext with the -layout option.

Would the following be a solution?
    const GBool use_raw_order = (layout_mode == raw_order_layout);
    const GBool use_physical_layout = !use_raw_order;
    TextOutputDev td(0, use_physical_layout, 0, use_raw_order, gFalse);

I would be grateful, if this could be fixed.
The alternative I do not relish, would appear to be to compile virtually all of the poppler source code into my program, just to give it access to TextOutputDev and thus be able to call it with gTrue as second parameter. This does not appear to be what libpoppler is supposed to be for.
Comment 1 Albert Astals Cid 2017-11-17 21:52:21 UTC
as far as i see we need another enum, there's three modes posible

raw
not raw and physical layout 
not raw and not physical layout

would you be able to work on a patch?
Comment 2 dummydummy 2017-11-18 13:34:32 UTC
Created attachment 135573 [details]
new_poppler-page.cpp based on poppler 0.61.1 with my changes
Comment 3 dummydummy 2017-11-18 13:35:29 UTC
Created attachment 135574 [details]
new_poppler-page.h based on poppler 0.61.1 with my changes
Comment 4 dummydummy 2017-11-18 13:36:49 UTC
Created attachment 135575 [details] [review]
output of diff -u poppler-page.cpp (from poppler 0.61.1) new_poppler-page.cpp
Comment 5 dummydummy 2017-11-18 13:37:17 UTC
Created attachment 135576 [details] [review]
output of diff -u poppler-page.h (from poppler 0.61.1) new_poppler-page.h
Comment 6 dummydummy 2017-11-18 13:39:44 UTC
Thank you very much for the speedy reply! 

My first ever attempt at using git failed:
$ git clone https://anongit.freedesktop.org/git/poppler/poppler.git
Cloning into 'poppler'...
fatal: unable to access 'https://anongit.freedesktop.org/git/poppler/poppler.git/': gnutls_handshake() failed: Public key signature verification has failed.

This is a bug from sept. 2016 "Bug#835342: curl or git clone commands throws "gnutls_handshake() failed" on https targets" https://groups.google.com/forum/#!topic/linux.debian.bugs.dist/nUrVMIG-r0o

So I just downloaded poppler-0.61.1.tar.xz and 
used diff -u as follows to produce the attached files:
$ diff -u poppler-page.cpp new_poppler-page.cpp >fix_cpp.patch
$ diff -u poppler-page.h new_poppler-page.h >fix_h.patch

The changes are fairly small:

in poppler-page.h:
 added a further enum default_layout in line 50 and comments in the 2 lines above

in poppler-page.cpp:
 added the following line after line 272
     const GBool use_physical_layout = (physical_layout == layout_mode);
 used the new variable use_physical_layout in line 274 as second parameter to TextOutputDev
 and for cosmetic reasons, swapped the order of layout_mode and raw_order_layout in line 272 (if ever I mistype == as =, this order makes the compiler complain :-)

Is there anything else I can do to resolve this problem?
Comment 7 dummydummy 2017-11-18 17:35:29 UTC
Unfortunately the preceding patch does not resolve the problem on its own.

Comparing main() of pdftotext.cc with the page::text() in .../gcc/poppler-page.cpp there are 2 other differences:

(1) doc->displayPage(&td, d->index + 1, 72, 72, 0, FALSE, TRUE, false); vs doc->displayPages(textOut, firstPage, lastPage, resolution, resolution, 0, TRUE, FALSE, gFalse);
i.e. page::text () sets GBool useMediaBox = false, GBool crop = true and in pdftotext it is the other way round!?

I have yet to investigate what this means by reading the source...

(2) the page::text() in .../gcc/poppler-page.cpp subsequently calls td.getText() while pdftotext does not... Yet more source code to wade through...

to be continued...
Comment 8 dummydummy 2017-11-19 18:06:40 UTC
When a TextOutputDev is created with a NULL first argument for output filename, then a call to doc->displayPage(with such a TextOutputDev as first argument) will not generate any output. Instead TextOutputDev->getText () will attempt to assemble the fragments resulting from parsing the top level PDF object during doc->displayPage(...) approximately matching the correct physical layout.

This is what happens in libpoppler, but the results differ from those produced by pdftotext.
 
pdftotext creates a TextOutputDev with a (non-NULL) first argument for the output filename. In this case, a call to doc->displayPage(with such a TextOutputDev as first argument) will generate output to the filename (possibly via Gfx->display () ?).

The poppler code thus appears to have two routines which are not quite duplicates of one another for the same purpose of producing the text disposed according to the physical layout in a string variable!? 

Such a (historical?) architecture is just a recipe for problems.

Is someone attempting to fix this? (This could be a major job)
Comment 9 Albert Astals Cid 2017-11-20 21:24:37 UTC
Please send a git diff, it's much easier to read.

Use git://anongit.freedesktop.org/poppler/poppler

where did you get the https line from?


You lost me a bit in having two comments one after the other, could you please try to write/summarize it again in a single comment?
Comment 10 dummydummy 2017-11-26 16:46:38 UTC
The https line is from https://poppler.freedesktop.org/
"
   Poppler is developed using git. To clone the repository use the following command:

   git clone https://anongit.freedesktop.org/git/poppler/poppler.git
"
but it does not work with Debian 9 "Stretch" (stable).

Thank you for the address which works.

The command 
           git diff
seems to display the diff on the screen. I just redirected it into a file
with 
           git diff >git-diff.txt
If there is a better way, please let me know (I have never used git before)

The proposed modification ensures that the the function     
         ustring page::text(const rectf &r, text_layout_enum layout_mode) const
(in file .../gcc/poppler-page.cpp)
when called with  physical_layout  as  layout_mode correctly creates a 
TextOutputDev with second parameter set to true for physical_layout.


HOWEVER, even with this change, it is NOT possible to obtain the same result as the output from pdftotext using libpoppler (The layout is different, for example there are no blank lines whereas pdftotext adds them when needed).
 
The reason is that pdftotext creates a TextOutputDev with a filename (the output file name in fact) as first parameter and page::text(...) instead creates a TextOutputDev with a NULL as first parameter (as there is no output filename).

When this TextOutputDev is subsequently passed into doc->displayPage(...), the PDF-page is apparently parsed into "fragments". When the filename was provided the text is simultaneously written into the output file (respecting the physical_layout when required).
With the libpoppler function, only the parsing occurs (as there is no output filename). To obtain the text in physical layout a different function (TextOutputDev::getText (...)) is subsequently called to assemble the parsed fragments from doc->displayPage(...). However, it does so differently to doc->displayPage(...). This is why libpoppler cannot re-create the output of pdftotext.

To summarise: the poppler code has currently two different functions for providing text in physical_layout (one in doc->displayPage(...) and a different, inferior one in TextOutputDev::getText (...)) - Is this really necessary?
Comment 11 dummydummy 2017-11-26 16:47:43 UTC
Created attachment 135731 [details]
output of git diff > git-diff.txt
Comment 12 Albert Astals Cid 2018-01-04 21:46:00 UTC
Can I please have your name for for proper copyright attribution?

> To summarise: the poppler code has currently two different functions for
> providing text in physical_layout (one in doc->displayPage(...) and a 
> different, inferior one in TextOutputDev::getText (...)) - Is this really 
> necessary?

Probably not, we'll gladly accept a patch that makes it better :)
Comment 13 GitLab Migration User 2018-08-21 11:16:33 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/598.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.